[Spread-users] weird spread issue

Jonathan Stanton jonathan at cnds.jhu.edu
Wed Oct 16 16:08:39 EDT 2002


This sounds similar to another hang problem I know about. Can you check
the spmonitor output when this happens and check if the following are
true:

1) rounds is increasing (from one monitor output to the next)
2) ARU is less than Highest seq (likely by a large margin, around 1000
or so)
3) ARU is NOT increasing.

This means that the system thinks a message is missing. It has sent
about 1000 messages past that message and finally hit a flow control
limit that says you can't keep introducing new messages into the system
until the old ones are delivered and discarded (meaning ARU has increased
to be larger than the message's sequence number). However, the message
REALLY isn't missing; rather, a VERY old token was received and it
caused Spread to change a value in a way that was incorrect. Since
"VERY old" here means something like > 500 ms or 16 rounds of the token,
it hasn't happened much (one case was a flaky switch, another was unknown
but assumed to be a buffering switch). But it would cause exactly these
symptoms.
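
The three checks above can be written down as a small comparison of two
consecutive monitor readings. This is only a rough sketch: it does not
parse real spmonitor output, and the class, field, and function names
(MonitorSample, rounds, aru, highest_seq, looks_like_stale_token_hang)
are made up for illustration. You would copy the numbers in by hand from
two status reports taken a few seconds apart.

```python
from dataclasses import dataclass


@dataclass
class MonitorSample:
    """One daemon's counters, copied by hand from a monitor status report.

    The field names are hypothetical; they are not spmonitor's own labels.
    """
    rounds: int        # token rounds seen so far
    aru: int           # ARU value
    highest_seq: int   # highest sequence number seen


def looks_like_stale_token_hang(earlier: MonitorSample,
                                later: MonitorSample) -> bool:
    """Return True when all three symptoms from the list above hold."""
    rounds_increasing = later.rounds > earlier.rounds   # check 1
    aru_behind = later.aru < later.highest_seq          # check 2
    aru_stuck = later.aru <= earlier.aru                # check 3
    return rounds_increasing and aru_behind and aru_stuck


def seq_gap(sample: MonitorSample) -> int:
    """How far ARU trails the highest sequence number."""
    return sample.highest_seq - sample.aru


if __name__ == "__main__":
    # Invented numbers, purely to show the shape of the comparison.
    first = MonitorSample(rounds=5000, aru=12000, highest_seq=13000)
    second = MonitorSample(rounds=5400, aru=12000, highest_seq=13000)
    print(looks_like_stale_token_hang(first, second), seq_gap(second))
```

If the function returns True and the gap is in the neighborhood of 1000,
that matches the hang described here; if ARU is still moving, it is
probably something else.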

3.17.0 has what I think is a fix for this hang (if it is the one I
describe above). However, as I couldn't duplicate the hang, I also
couldn't test the fix :-( 

So if you can verify whether 3.17.0 fixes the problem, I'd be quite happy :-)

Jonathan

On Wed, Oct 16, 2002 at 01:36:32PM -0400, George Schlossnagle wrote:
> Weird hang issues just started popping up.  My ring of 8 machines will
> hang, in the sense that no events or messages are passed.  spmonitor
> shows everything in gstate 1/state 1, but when I sign on with spuser
> and try to join a group, I never get a join confirmation.  Also,
> 'send's are never received either.
>
> This problem is sporadic.  If I terminate all the daemons and restart
> them, the ring starts up normally, works for a bit (anywhere from a
> couple minutes to a day), then craps out again.  This is spread 3.16.2.
> No recent network topology or configuration changes have occurred
> coincident with this.
> 
> Weird.
> 
> George

-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------



