[Spread-users] weird spread issue

George Schlossnagle george at omniti.com
Wed Oct 16 16:28:40 EDT 2002


Cool.  Will check this next time it happens.  Also we are in the process 
of checking out 3.17.0 - very happy to hear it may fix this as well.

George

Jonathan Stanton wrote:

>This sounds similar to another hang problem I know about. Can you check
>the spmonitor output when this happens and check if the following are
>true:
>
>1) rounds is increasing (from one monitor output to the next)
>2) ARU is less than Highest seq (likely by a bunch, like 1000
>or so)
>3) ARU is NOT increasing.
>
>This means that the system thinks that a message is missing and it has
>sent about 1000 messages past that message and finally hit a flow control
>limit that says you can't keep introducing new messages into the system
>until the old ones are delivered and discarded (meaning ARU has increased
>to be larger than the message's sequence number). However the message
>REALLY isn't missing but rather a VERY old token was received and it
>caused Spread to change a value in a way that was incorrect. Since
>"VERY old" means something like > 500 ms or 16 rounds of the token, it hasn't
>happened much (one case was a flaky switch, another was unknown but
>assumed to be a buffering switch). But it would cause exactly these
>symptoms.
>
>3.17.0 has what I think is a fix for this hang (if it is the one I
>describe above). However, as I couldn't duplicate the hang, I also
>couldn't test the fix :-( 
>
>So if you can verify if 3.17 fixes the problem I'd be quite happy :-)
>
>Jonathan
>
>On Wed, Oct 16, 2002 at 01:36:32PM -0400, George Schlossnagle wrote:
>
>>Weird hang issues just started popping up.   My ring of 8 machines will 
>>hang, in the sense that no events or messages are passed.  spmonitor 
>>shows everything in gstate 1/state 1, but when I sign on with spuser 
>>and try to join a group, I never get a join confirmation.  Also, 
>>'send's are never received either.
>>
>>This problem is sporadic.  If I terminate all the daemons and restart 
>>them, the ring starts up normally, works for a bit (anywhere from a 
>>couple minutes to a day), then craps out again.  This is spread 3.16.2. 
>> No recent network topology or configuration changes have occurred 
>>coincident with this.
>>
>>Weird.
>>
>>George
>>
>>
>>
>>
>>_______________________________________________
>>Spread-users mailing list
>>Spread-users at lists.spread.org
>>http://lists.spread.org/mailman/listinfo/spread-users
>>
>
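
For anyone following the thread: the three checks Jonathan lists can be written down as a small sketch. It assumes you have already copied rounds, ARU and Highest seq by hand from two successive spmonitor readings. The Snapshot type and the function name are made up for illustration, and nothing here parses real spmonitor output or talks to a daemon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Values read by hand from one spmonitor status display (hypothetical type)."""
    rounds: int
    aru: int
    highest_seq: int


def matches_hang_symptoms(earlier: Snapshot, later: Snapshot) -> bool:
    """Return True when two successive readings show all three symptoms."""
    # 1) rounds is increasing from one monitor output to the next
    rounds_increasing = later.rounds > earlier.rounds
    # 2) ARU is less than Highest seq
    aru_behind = later.aru < later.highest_seq
    # 3) ARU is NOT increasing
    aru_stuck = later.aru <= earlier.aru
    return rounds_increasing and aru_behind and aru_stuck


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the function.
    first = Snapshot(rounds=5000, aru=12000, highest_seq=13000)
    second = Snapshot(rounds=5400, aru=12000, highest_seq=13000)
    print(matches_hang_symptoms(first, second))
```

The size of the gap in check 2 (around 1000 in Jonathan's description) is deliberately not hard-coded, since the exact flow control limit is not stated in the thread.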
