[Spread-users] Events not getting handled properly

Wed Nov 22 14:43:56 EST 2006

On Nov 20, 2006, at 4:14 PM, John Lane Schultz wrote:

> Alec H. Peterson wrote:
>> Hi all,
>> I've got an issue with a Spread deployment.  The ring is very  
>> simple; 1 segment, 3 nodes, all running Spread 3.17.3, running  
>> Solaris 10.  Under circumstances that I'm not quite sure of,  
>> Spread stops handling user events (existing connections, accepting  
>> new connections).  Based on some preliminary debugging, it appears  
>> that the Protocol_threshold is getting set to 1 (MEDIUM_PRIORITY)  
>> and never set back down.
>> Is this a known issue or something somebody may have a good way to  
>> deal with?

> AFAIK the most common time the daemon raises the threshold is  
> during a daemon membership change.  If the setting is never coming  
> back down it could be that your membership is continually thrashing  
> for some reason.  You could turn on membership logging and see what  
> is happening there and whether or not it is stuck in the membership  
> algorithm.
>
> We have had one similar report of the daemons getting stuck during  
> membership, in particular in the EVS state.  We have a patch that  
> works around that condition and will be available in the new 3.17.4  
> and 4.0.0 releases.  The patch is currently available through SVN  
> checkout for 4.0.0.  I have attached the work around to this email  
> as well.  The official releases are imminent as well.

To follow up on Alec's mail and in response to yours...

I don't think that is the situation here.  The ring in this case is in
stage: 1 gstate:1.  It is stable and not transitioning.  However, the
system will not accept new client connections because the
Protocol_threshold has been raised to MEDIUM (thus cutting off all
LOW priority events).  As I see it, this is only possible when the down
queue in protocol.c is full.  As the token is still circulating around
the ring, I am left wondering why it doesn't recover.
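
To make the effect of a raised threshold concrete, here is a minimal
sketch of threshold-gated event dispatch.  It is not Spread's event
code: the struct, the dispatch_pass() function and the priority values
are assumptions made for illustration (the only value taken from this
thread is MEDIUM_PRIORITY == 1).

```c
#include <stddef.h>

/* Hypothetical priority levels; a larger value means more urgent.
 * Only MEDIUM_PRIORITY == 1 is taken from the thread. */
enum { LOW_PRIORITY = 0, MEDIUM_PRIORITY = 1, HIGH_PRIORITY = 2 };

/* Hypothetical pending event, e.g. "accept a new client connection". */
struct event {
    int priority;
    int handled;   /* number of times this event has been dispatched */
};

/* One pass of a toy event loop: dispatch only the events whose priority
 * is at or above the current threshold.  Returns how many were run. */
static int dispatch_pass(struct event *events, size_t n, int threshold)
{
    int dispatched = 0;
    for (size_t i = 0; i < n; i++) {
        if (events[i].priority < threshold)
            continue;   /* starved for as long as the threshold stays raised */
        events[i].handled++;
        dispatched++;
    }
    return dispatched;
}
```

With the threshold at MEDIUM_PRIORITY, a LOW_PRIORITY event is skipped
on every pass, while MEDIUM and HIGH events keep firing.  That matches
the symptom of a daemon that still runs protocol events but ignores
client sessions.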

A very simple test of starting spread on three nodes, running spuser
on each, and testing a full mesh of messages works fine (everyone can
see everyone and can talk to everyone; group and regular messages
both work).

However:

    if( Down_queue_ptr->num_mess >= WATER_MARK )
        Sess_block_users_level();

is stuck.  num_mess == WATER_MARK when the ring gets hosed (I've
confirmed this in a debugger against a live core in the broken
system).  And it never recovers, despite the event system still
successfully firing on >= MEDIUM priority events.  So, to me, it seems
it may be a flow control issue.  I'm inclined to believe that this is a
bug in spread, but one induced by an uncommon networking "condition".
What that condition may be, I do not know.  The ring is _very_ low
traffic, but is sending messages of up to 120k.  Any idea what type
of network issue or system issue could cause these effects?  We're
still puzzled.
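
As a way of framing the question, here is a toy model of watermark-based
blocking.  It is only a sketch under stated assumptions, not the logic
in protocol.c: toy_daemon, toy_enqueue(), toy_token_visit(), the
TOY_WATER_MARK value and the "allowed" send window are all hypothetical.
It shows one way the threshold could stay raised while the token keeps
circulating: if each token visit is allowed to send zero messages, the
down queue never drops below the mark.

```c
/* Hypothetical values chosen for the demo; not Spread's constants. */
#define TOY_WATER_MARK 4
enum { TOY_LOW = 0, TOY_MEDIUM = 1 };

/* Hypothetical daemon state: a down-queue depth and an event threshold. */
struct toy_daemon {
    int num_mess;    /* messages waiting in the down queue */
    int threshold;   /* current event priority threshold */
};

/* A client submits one message.  Once the queue reaches the water mark,
 * raise the threshold so that LOW priority (session) events are blocked. */
static void toy_enqueue(struct toy_daemon *d)
{
    d->num_mess++;
    if (d->num_mess >= TOY_WATER_MARK)
        d->threshold = TOY_MEDIUM;
}

/* The token arrives.  Send at most `allowed` queued messages (the assumed
 * flow-control window), then unblock users if the queue fell below the
 * mark.  Returns the number of messages sent. */
static int toy_token_visit(struct toy_daemon *d, int allowed)
{
    int sent = allowed < d->num_mess ? allowed : d->num_mess;
    if (sent < 0)
        sent = 0;
    d->num_mess -= sent;
    if (d->num_mess < TOY_WATER_MARK)
        d->threshold = TOY_LOW;
    return sent;
}
```

In this model, any number of token visits with allowed == 0 leaves
num_mess at the water mark and the threshold at TOY_MEDIUM; a single
visit with allowed >= 1 releases it.  Whether something comparable
happens in the real daemon is exactly the open question.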

Best regards,
Theo

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/