[Spread-users] Leader crashing often

Jonathan Stanton jonathan at cnds.jhu.edu
Fri Nov 2 16:40:09 EDT 2007

I've been looking at this issue as well and John is correct that truncation of a 
membership list (as long as all the fields are consistent) shouldn't break anything else 
inside the daemon (as the actual membership list is still complete and that is what is 
used to deliver messages). 
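
To make the "consistent fields" condition concrete, here is a minimal sketch in C of truncating a member list so that the header still describes exactly what is sent. Everything in it is an assumption for illustration: `struct memb_header`, `NAME_LEN`, and `build_truncated` are hypothetical names, and the layout is not Spread's actual client-daemon wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN 32  /* assumed fixed slot size per member name */

/* Hypothetical header; NOT Spread's real message layout. */
struct memb_header {
    uint32_t num_members;  /* number of names that follow the header */
    uint32_t data_len;     /* number of bytes that follow the header */
};

/*
 * Write a header plus as many whole names as fit into buf (cap bytes).
 * names is a flat array of total * NAME_LEN bytes.
 * The header is filled in from what was actually written, so a reader
 * that trusts the header never waits for bytes that are not sent.
 * Returns the number of names written, or -1 if the header does not fit.
 */
static int build_truncated(char *buf, size_t cap,
                           const char *names, uint32_t total)
{
    struct memb_header hdr;
    size_t fit;
    uint32_t n;

    assert(buf != NULL && names != NULL);
    if (cap < sizeof hdr)
        return -1;

    fit = (cap - sizeof hdr) / NAME_LEN;        /* whole names only */
    n = (total < fit) ? total : (uint32_t)fit;

    hdr.num_members = n;                        /* truncated count */
    hdr.data_len = n * NAME_LEN;                /* matches bytes written */
    memcpy(buf, &hdr, sizeof hdr);
    memcpy(buf + sizeof hdr, names, (size_t)n * NAME_LEN);
    return (int)n;
}
```

The point of the sketch is only that the count and length fields are derived from the truncated list, not from the full one. Which fields a real patch has to adjust depends on the actual protocol.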

However, I'm not sure that is the best solution here, because it does break the 
guarantee Spread provides that membership messages are 'correct' and complete. An 
application would no longer know the full membership of a group, since the membership 
messages would be truncated. 

So if your truncation patch appears to be working for you and you don't need the accurate 
membership messages, then that should be ok for the short term. As I discuss below, I 
think I have a better way to fix the problem, but still need to work up the actual patch.

I think the better solution is to allow the membership code to generate larger membership 
messages so that the full membership can be delivered to the client. If a client doesn't 
want or need membership messages they already can turn them off. Allowing larger messages 
in this case would not change the limits on how big a Spread message sent by an 
application could be -- that would still be limited, but the membership messages generated 
by a daemon and delivered to clients connected to it could be larger. 

I'm working on a patch that first enforces the message and scatter limits (the current 
code, until it hits the case you triggered, actually allows messages larger than the 
buffer to be generated) and then allows the larger membership messages. I don't 
believe this buffer problem has any security consequences, since the data that 
overflows the buffer is lists of member names generated by the daemon itself, and the 
names are already checked to be well formed before being stored in the daemon.
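
As a rough illustration of what "enforcing the limits" could look like, the sketch below appends to a scatter list only while both the element count and the total byte count stay within bounds, and reports failure to the caller instead of aborting. The names `struct scatter`, `scatter_add`, and `MAX_ELEMENTS` are hypothetical, the value 100 is taken from the assertion in the original report, and none of this is the actual patch or the daemon's real code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ELEMENTS 100  /* assumed limit, from the reported assertion */

struct element {
    const char *buf;
    size_t len;
};

struct scatter {
    size_t num_elements;
    size_t total_len;                       /* sum of all element lengths */
    struct element elements[MAX_ELEMENTS];
};

/*
 * Append one element to the scatter list.
 * Returns 0 on success, -1 if adding it would exceed either the element
 * limit or max_total bytes. On failure the scatter is left unchanged,
 * so the caller can decide what to do (split, truncate, or give up).
 */
static int scatter_add(struct scatter *s, const char *buf, size_t len,
                       size_t max_total)
{
    assert(s != NULL);  /* a NULL scatter is a programming error */

    if (s->num_elements >= MAX_ELEMENTS)
        return -1;      /* element limit reached */
    if (s->total_len > max_total || len > max_total - s->total_len)
        return -1;      /* byte limit would be exceeded */

    s->elements[s->num_elements].buf = buf;
    s->elements[s->num_elements].len = len;
    s->num_elements++;
    s->total_len += len;
    return 0;
}
```

Checking before writing means an oversized membership list shows up as an error return at the point where it is built, rather than as an assertion failure or a write past the end of the buffer.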


On Fri, Nov 02, 2007 at 02:43:27PM -0400, John Schultz wrote:
> Ed,
> I think you can truncate a too-big membership message w/o harming the 
> daemon's proper functioning.
> However, your applications are certainly not going to get the full 
> membership info and they need to be resilient to that.  In addition, you 
> will probably need to examine the client-daemon protocol for transmitting 
> membership messages and make sure that your truncation matches whatever 
> header calculations the client does upon receipt.
> For example, I think that if you simply stop writing to a membership 
> message when you are about to exceed its size, then that won't work. If 
> you did that, then I imagine the client-daemon protocol would get out of 
> sync because the client would be expecting more data from the daemon 
> (according to its header calculations) than the daemon will ever send.
> So you also need to figure out what header fields need to be modified so 
> that the client-daemon protocol doesn't get thrown out of sync.
> Cheers!
> ---
> John Schultz
> Spread Concepts
> Phn: 443 838 2200
> On Fri, 2 Nov 2007, Ed Holyat wrote:
> >John in my case, I have my MAX_GROUP_NAME==255 which is a requirement
> >for our app.
> >All daemons are at this size.  Everything works fine until this buffer
> >is blown, which is < 200 members.  John, I understand why it is
> >happening, I just would like to know if the truncation is ok.
> >
> >
> >
> >-----Original Message-----
> >From: spread-users-bounces at lists.spread.org
> >[mailto:spread-users-bounces at lists.spread.org] On Behalf Of John Schultz
> >Sent: Friday, November 02, 2007 11:54 AM
> >To: spread-users at lists.spread.org
> >Subject: RE: [Spread-users] Leader crashing often
> >
> >I believe Ed is correct.  This problem can occur when the daemon tries to
> >construct a message that is too large to fit in a single message.  This
> >can occur if the number of members in a group is too large to fit in a
> >message -- but, by default (MAX_GROUP_NAME=32), this should take something
> >like having a group with at least 1500 members in it (worst case).
> >
> >The only other way I can think this could happen is if you changed the
> >default MAX_GROUP_NAME, as Ed pointed out, or if the client and daemon
> >were compiled with different ideas of what MAX_MESSAGE_SIZE is.
> >Normally, clients shouldn't be able to inject any messages that would
> >cause the daemon to run out of space in allocating a user message.
> >
> >So, if you have messages that you send to lots and lots of private
> >groups, this could be the problem if the client and daemon have different
> >conceptions of MAX_GROUP_NAME.
> >
> >Cheers!
> >John
> >
> >---
> >John Schultz
> >Spread Concepts
> >Phn: 443 838 2200
> >
> >On Fri, 2 Nov 2007, Ed Holyat wrote:
> >
> >>It is not necessarily the number of members in the group.  The crash is
> >>related to the number of members * the size of the private names
> >>associated with that group.  This cannot exceed the vector space used to
> >>send this information out, which holds at most
> >>
> >>100 ( num_elements ) * MAX_PACKET_SIZE
> >>
> >>-----Original Message-----
> >>From: Nico Meyer [mailto:nmeyer at virtualminds.de]
> >>Sent: Friday, November 02, 2007 10:31 AM
> >>To: Ed Holyat
> >>Cc: spread-users at lists.spread.org
> >>Subject: Re: [Spread-users] Leader crashing often
> >>
> >>But does this relate to the total number of members summed over all
> >>groups, or only the number of groups a certain message is directed to?
> >>Because we might have a high number of clients which are in no group
> >>(aside from their private group of course), but we certainly have no
> >>group with 100 members.
> >>
> >>
> >>On Friday 02 November 2007 15:13:30 Ed Holyat wrote:
> >>>This seems to be related to the issue I am also experiencing.
> >>>
> >>>See subject RE: [Spread-users] To protect against SEGV in scatter on the
> >>>mailing list.
> >>>I attached a fix; the fix truncates the member list when the number of
> >>>elements is exceeded.
> >>>I am waiting to find out what the side effects of truncating the member
> >>>list are.  The correct fix may be to send two messages so the member list
> >>>is not truncated.
> >>>
> >>>-----Original Message-----
> >>>From: spread-users-bounces at lists.spread.org
> >>>[mailto:spread-users-bounces at lists.spread.org] On Behalf Of Nico Meyer
> >>>Sent: Friday, November 02, 2007 4:45 AM
> >>>To: spread-users at lists.spread.org
> >>>Subject: [Spread-users] Leader crashing often
> >>>
> >>>Hello,
> >>>
> >>>lately one of our spread daemons started to crash very often with the
> >>>following log message:
> >>>
> >>>spread: message.c:398: Message_add_scat_element: Assertion
> >>>`msg->num_elements < 100' failed.
> >>>
> >>>The only thing special about this one daemon is that it normally is the
> >>>leader.
> >>>Does anybody know what can cause this?
> >>>
> >>>Thank you,
> >>>
> >>>Nico
> >>>
> >>>_______________________________________________
> >>>Spread-users mailing list
> >>>Spread-users at lists.spread.org
> >>>http://lists.spread.org/mailman/listinfo/spread-users
> >>
> >

Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
