[Spread-users] Events not getting handled properly

Theo Schlossnagle jesus at omniti.com
Wed Nov 22 14:43:56 EST 2006

On Nov 20, 2006, at 4:14 PM, John Lane Schultz wrote:

> Alec H. Peterson wrote:
>> Hi all,
>> I've got an issue with a Spread deployment.  The ring is very  
>> simple; 1 segment, 3 nodes, all running Spread 3.17.3, running  
>> Solaris 10.  Under circumstances that I'm not quite sure of,  
>> Spread stops handling user events (existing connections, accepting  
>> new connections).  Based on some preliminary debugging, it appears  
>> that the Protocol_threshold is getting set to 1 (MEDIUM_PRIORITY)  
>> and never set back down.
>> Is this a known issue or something somebody may have a good way to  
>> deal with?

> AFAIK the most common time the daemon raises the threshold is  
> during a daemon membership change.  If the setting is never coming  
> back down it could be that your membership is continually thrashing  
> for some reason.  You could turn on membership logging and see what  
> is happening there and whether or not it is stuck in the membership  
> algorithm.
> We have had one similar report of the daemons getting stuck during  
> membership, in particular in the EVS state.  We have a patch that  
> works around that condition and will be available in the new 3.17.4  
> and 4.0.0 releases.  The patch is currently available through SVN  
> checkout for 4.0.0.  I have attached the work around to this email  
> as well.  The official releases are imminent as well.

To follow up on Alec's mail and in response to yours...

I don't think that is the situation here.  The ring in this case is in
stage: 1 gstate:1.  It is stable and not transitioning.  However, the
system will not accept new client connections because the
Protocol_threshold has been raised to MEDIUM (thus cutting off all
LOW prio events).  As I see it, this is only possible when the down
queue in protocol.c is full.  As the token is still spinning around
the ring, I am left wondering why it doesn't recover.

A super simple test of starting spread on three nodes, running
spuser on each, and testing a full mesh of messages works fine
(everyone can see everyone and can talk to everyone; group and
regular messages both work).


if( Down_queue_ptr->num_mess >= WATER_MARK )

is stuck.  num_mess == WATER_MARK when the ring gets hosed (I've
confirmed this in a debugger against a live core in the broken
system), and it never recovers, despite the event system still
successfully firing on >= MEDIUM prio events.  So, to me, it looks
like a flow control issue.  I'm inclined to believe that this is a
bug in spread, but one induced by an uncommon networking "condition".
What that condition may be, I do not know.  The ring is _very_ low
traffic, but is sending messages of up to 120k.  Any idea what type
of network issue or system issue could cause these effects?  We're
still puzzled.
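
To make the suspected failure mode concrete, here is a minimal toy model of it.  This is not Spread's protocol.c; only WATER_MARK, num_mess, and the LOW/MEDIUM priority idea come from the real code.  The value of WATER_MARK, the down_queue struct, and the functions update_threshold(), client_send(), and token_visit() are all made up for illustration.  It assumes the threshold is recomputed from the queue depth each time the queue changes, and that a token visit may send anywhere from zero messages up to whatever flow control allows.

```c
#include <assert.h>

/* Toy model only. WATER_MARK's value and every function here are
 * assumptions for illustration, not taken from Spread's source. */
#define WATER_MARK 500

enum { LOW_PRIORITY = 0, MEDIUM_PRIORITY = 1 };

typedef struct {
    int num_mess;               /* messages waiting to go down to the ring */
} down_queue;

static int protocol_threshold = LOW_PRIORITY;

/* Recompute the event threshold from the current queue depth. */
static void update_threshold(const down_queue *q)
{
    protocol_threshold =
        (q->num_mess >= WATER_MARK) ? MEDIUM_PRIORITY : LOW_PRIORITY;
}

/* A client message is a LOW priority event: it is only handled while
 * the threshold is LOW. Returns 1 if the message was queued, 0 if not. */
static int client_send(down_queue *q)
{
    if (protocol_threshold > LOW_PRIORITY)
        return 0;
    q->num_mess++;
    update_threshold(q);
    return 1;
}

/* A token visit sends at most `allowed` queued messages, where `allowed`
 * stands in for whatever flow control grants. Returns the number sent. */
static int token_visit(down_queue *q, int allowed)
{
    int sent = (allowed < q->num_mess) ? allowed : q->num_mess;
    if (sent < 0)
        sent = 0;
    q->num_mess -= sent;
    assert(q->num_mess >= 0);
    update_threshold(q);
    return sent;
}
```

In this model, calling token_visit(&q, 0) leaves the threshold at MEDIUM no matter how many times the token comes around, while a single visit that sends even one message drops it back to LOW.  That matches the shape of what we see; whether Spread's flow control can actually end up granting nothing on a stable ring is the open question.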

Best regards,

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
