[Spread-users] weird spread issue
George Schlossnagle
george at omniti.com
Wed Oct 16 16:28:40 EDT 2002
Cool. Will check this next time it happens. Also we are in the process
of checking out 3.17.0 - very happy to hear it may fix this as well.
George
Jonathan Stanton wrote:
>This sounds similar to another hang problem I know about. Can you check
>the spmonitor output when this happens and check if the following are
>true:
>
>1) rounds is increasing (from one monitor output to the next)
>2) ARU is less than Highest seq (likely by a lot, something like 1000
>or so)
>3) ARU is NOT increasing.
>
>This means that the system thinks that a message is missing and it has
>sent about 1000 messages past that message and finally hit a flow control
>limit that says you can't keep introducing new messages into the system
>until the old ones are delivered and discarded (meaning ARU has increased
>to be larger than the message's sequence number). However, the message
>REALLY isn't missing; rather, a VERY old token was received and it
>caused Spread to change a value in a way that was incorrect. Since "VERY
>old" means something like > 500 ms or 16 rounds of the token, it hasn't
>happened much (one case was a flaky switch, another was unknown but
>assumed to be a buffering switch). But it would cause exactly these
>symptoms.
>
>3.17.0 has what I think is a fix for this hang (if it is the one I
>describe above). However, as I couldn't duplicate the hang, I also
>couldn't test the fix :-(
>
>So if you can verify whether 3.17 fixes the problem I'd be quite happy :-)
>
>Jonathan
>
>On Wed, Oct 16, 2002 at 01:36:32PM -0400, George Schlossnagle wrote:
>
>>Weird hang issues just started popping up. My ring of 8 machines will
>>hang, in the sense that no events or messages are passed. spmonitor
>>shows everything in gstate 1/state 1, but when I sign on with spuser
>>and try to join a group, I never get a join confirmation. Also,
>>sends are never received either.
>>
>>This problem is sporadic. If I terminate all the daemons and restart
>>them, the ring starts up normally, works for a bit (anywhere from a
>>couple minutes to a day), then craps out again. This is spread 3.16.2.
>>No recent network topology or configuration changes have occurred
>>coincident with this.
>>
>>Weird.
>>
>>George
>>
>
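For anyone wanting to apply Jonathan's three checks mechanically, here is a minimal sketch in Python. It assumes you copy the rounds, ARU, and Highest seq values by hand from two successive spmonitor reports for the same daemon; it does not parse spmonitor output. The names Snapshot, looks_like_stale_token_hang, and min_gap are hypothetical and are not part of Spread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Counters for one daemon, copied by hand from one spmonitor report.

    Hypothetical container: the field names are illustrative, not Spread's.
    """
    rounds: int
    aru: int
    highest_seq: int


def looks_like_stale_token_hang(earlier: Snapshot, later: Snapshot,
                                min_gap: int = 1) -> bool:
    """Return True if two successive snapshots match all three symptoms.

    min_gap is an assumed threshold for "ARU is less than Highest seq";
    the thread suggests the real gap is likely around 1000.
    """
    # 1) rounds is increasing from one monitor output to the next
    rounds_increasing = later.rounds > earlier.rounds
    # 2) ARU is less than Highest seq
    aru_behind = (later.highest_seq - later.aru) >= min_gap
    # 3) ARU is NOT increasing
    aru_stuck = later.aru <= earlier.aru
    return rounds_increasing and aru_behind and aru_stuck
```

With made-up numbers, looks_like_stale_token_hang(Snapshot(5000, 12000, 13000), Snapshot(5400, 12000, 13000)) returns True: the token is still circulating, but ARU is stuck 1000 behind Highest seq. If ARU moves forward between the two reports, the function returns False and the hang is probably something else.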