[Spread-users] Events not getting handled properly
Theo Schlossnagle
jesus at omniti.com
Wed Nov 22 14:43:56 EST 2006
On Nov 20, 2006, at 4:14 PM, John Lane Schultz wrote:
> Alec H. Peterson wrote:
>> Hi all,
>> I've got an issue with a Spread deployment. The ring is very
>> simple; 1 segment, 3 nodes, all running Spread 3.17.3, running
>> Solaris 10. Under circumstances that I'm not quite sure of,
>> Spread stops handling user events (existing connections, accepting
>> new connections). Based on some preliminary debugging, it appears
>> that the Protocol_threshold is getting set to 1 (MEDIUM_PRIORITY)
>> and never set back down.
>> Is this a known issue or something somebody may have a good way to
>> deal with?
> AFAIK the most common time the daemon raises the threshold is
> during a daemon membership change. If the setting is never coming
> back down it could be that your membership is continually thrashing
> for some reason. You could turn on membership logging and see what
> is happening there and whether or not it is stuck in the membership
> algorithm.
>
> We have had one similar report of the daemons getting stuck during
> membership, in particular in the EVS state. We have a patch that
> works around that condition and will be available in the new 3.17.4
> and 4.0.0 releases. The patch is currently available through SVN
> checkout for 4.0.0. I have attached the work around to this email
> as well. The official releases are imminent as well.
To follow up on Alec's mail and in response to yours...
I think this is not the situation. The ring in this case is in stage:
1 gstate:1. It is stable and not transitioning. However, the system
will not accept new client connections because the
Protocol_threshold has been raised to MEDIUM (thus cutting off all
LOW priority events). As I see it, this is only possible when the down
queue in protocol.c is full. As the token is still spinning around
the ring, I am left wondering why it doesn't recover.
A super simple test of starting spread on three nodes and running
spuser on each and testing a full mesh of messages works fine.
(everyone can see everyone and can talk to everyone -- group and
regular messages work)
However:
    if( Down_queue_ptr->num_mess >= WATER_MARK )
        Sess_block_users_level();
is stuck. num_mess == WATER_MARK when the ring gets hosed (I've
confirmed this in a debugger against a live core in the broken
system). And it never recovers, despite the event system still
successfully firing on >= MEDIUM priority events. So, to me it seems it
may be a flow control issue. I'm inclined to believe that this is a
bug in Spread, but induced by an uncommon networking "condition".
What that condition may be, I do not know. The ring is _very_ low
traffic, but is sending messages of up to 120k. Any idea what type
of network issue or system issue could cause these effects? We're
still puzzled.
Best regards,
Theo
// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/