[Spread-users] Events not getting handled properly
jesus at omniti.com
Wed Nov 22 14:43:56 EST 2006
On Nov 20, 2006, at 4:14 PM, John Lane Schultz wrote:
> Alec H. Peterson wrote:
>> Hi all,
>> I've got an issue with a Spread deployment. The ring is very
>> simple; 1 segment, 3 nodes, all running Spread 3.17.3, running
>> Solaris 10. Under circumstances that I'm not quite sure of,
>> Spread stops handling user events (existing connections, accepting
>> new connections). Based on some preliminary debugging, it appears
>> that the Protocol_threshold is getting set to 1 (MEDIUM_PRIORITY)
>> and never set back down.
>> Is this a known issue or something somebody may have a good way to
>> deal with?
> AFAIK the most common time the daemon raises the threshold is
> during a daemon membership change. If the setting is never coming
> back down it could be that your membership is continually thrashing
> for some reason. You could turn on membership logging and see what
> is happening there and whether or not it is stuck in the membership
> We have had one similar report of the daemons getting stuck during
> membership, in particular in the EVS state. We have a patch that
> works around that condition and will be available in the new 3.17.4
> and 4.0.0 releases. The patch is currently available through SVN
> checkout for 4.0.0. I have attached the work around to this email
> as well. The official releases are imminent as well.
To follow up on Alec's mail and in response to yours...
I don't think that is the situation here. The ring in this case is in stage:
1 gstate:1. It is stable and not transitioning. However, the system
will not accept new client connections because the
Protocol_threshold has been raised to MEDIUM (thus cutting off all
LOW priority events). As I see it, this is only possible when the down
queue in protocol.c is full. As the token is still circulating around
the ring, I am left wondering why it doesn't recover.
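
To make the mechanism described above concrete, here is a toy model in C of
a priority threshold driven by queue depth. It is only a sketch and is not
taken from the Spread source: the struct, the function names, the value of
WATER_MARK, and the idea that a token visit drains up to `allowed` messages
are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names echo the thread, values and logic are assumptions. */
enum priority { LOW_PRIORITY = 0, MEDIUM_PRIORITY = 1, HIGH_PRIORITY = 2 };

#define WATER_MARK 4  /* assumed for illustration, not Spread's real value */

struct down_queue_model {
    int num_mess;             /* messages queued, waiting for the token */
    enum priority threshold;  /* events below this priority are not run */
};

/* Re-evaluate the threshold from the current queue depth. */
static void update_threshold(struct down_queue_model *q)
{
    q->threshold = (q->num_mess >= WATER_MARK) ? MEDIUM_PRIORITY
                                               : LOW_PRIORITY;
}

/* A client submits a message; refused once the queue is at the mark. */
static bool client_send(struct down_queue_model *q)
{
    if (q->num_mess >= WATER_MARK)
        return false;
    q->num_mess++;
    update_threshold(q);
    return true;
}

/* The token arrives; flow control permits sending up to `allowed`. */
static void token_arrives(struct down_queue_model *q, int allowed)
{
    assert(allowed >= 0);
    int sent = (allowed < q->num_mess) ? allowed : q->num_mess;
    q->num_mess -= sent;
    update_threshold(q);
}

/* Would the event loop dispatch an event of this priority right now? */
static bool event_runs(const struct down_queue_model *q, enum priority p)
{
    return p >= q->threshold;
}
```

In this model, once num_mess reaches WATER_MARK, LOW priority events (such
as accepting a new client) stop running, and they stay stopped for as long
as every token visit is allowed to send zero messages. A single visit that
sends even one message drops the queue below the mark and lowers the
threshold again. Whether the real daemon behaves this way is exactly the
open question in this thread.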
A super simple test of starting spread on three nodes and running
spuser on each and testing a full mesh of messages works fine.
(everyone can see everyone and can talk to everyone -- group and
regular messages work)
if( Down_queue_ptr->num_mess >= WATER_MARK )
is stuck. num_mess == WATER_MARK when the ring gets hosed (I've
confirmed this in a debugger against a live core in the broken
system). And it never recovers despite the event system still
successfully firing on >= MEDIUM priority events. So, to me it seems it
may be a flow control issue. I'm inclined to believe that this is a
bug in Spread, but induced by an uncommon networking "condition".
What that condition may be, I do not know. The ring is _very_ low
traffic, but is sending messages of up to 120k. Any idea what type
of network issue or system issue could cause these effects? We're
// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/