[Spread-users] Total freeze of all daemons after long running

Thu Mar 13 16:13:57 EDT 2008

Hi John,

Thanks for the quick answer. Unfortunately your suggestion will not work:
as I wrote in my first post, client connections are also closed immediately
when this happens, so spmonitor will not work either.

I hope you can come up with a solution soon.

Nico

> same issue as Chanh Hua's or not.
>
> Looking at your report, I would say the Aru and various sequence numbers
> look extremely fishy.  They are all right around +/- 2^31, which could
> indicate some kind of bug in handling rollover of those counters.
> Furthermore, as I understand it, Aru should never be ahead of Highest_seq
> as it seems to be here.  It looks like Aru rolled over before Highest_seq
> and the other sequence numbers, which I believe shouldn't happen.
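
The rollover reading above can be checked against the logged values with a
minimal C sketch. It assumes the counters are signed 32-bit integers that wrap
modulo 2^32. The helper names seq_add and seq_diff are made up for
illustration and are not taken from the Spread source; the numbers come from
the logs quoted further down.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a sequence number by n, wrapping modulo 2^32.
 * The arithmetic is done on uint32_t because signed overflow is undefined
 * in C; the conversion back to int32_t is implementation-defined but wraps
 * on the usual two's-complement platforms. */
int32_t seq_add(int32_t seq, uint32_t n)
{
    return (int32_t)((uint32_t)seq + n);
}

/* Wrap-aware distance from b to a: positive means a is ahead of b. */
int32_t seq_diff(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a - (uint32_t)b);
}

/* Sanity checks using the values from the logs in this thread. */
void check_logged_values(void)
{
    const int32_t aru = -2147481909;
    const int32_t highest_seq = 2147482054;

    /* A plain signed comparison puts Aru far behind Highest_seq ... */
    assert(aru < highest_seq);
    /* ... but modulo 2^32 it is only 3333 ahead of it. */
    assert(seq_diff(aru, highest_seq) == 3333);
    /* Equivalently, 3333 increments past Highest_seq wrap to the logged Aru. */
    assert(seq_add(highest_seq, 3333u) == aru);
}
```

If that assumption holds, Aru is 3333 ahead of Highest_seq, which matches the
Aru value of 3333 in the second log excerpt. Any code comparing the two with a
plain < would order them the other way around.
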
>
> The print out was an attempt to work around this freezing bug by
> swallowing the token and forcing a restart of the synchronization between
> the daemons.  Apparently, that isn't enough to handle your situation.
>
> My guess is that there is probably a bug with the synchronization protocol
> when the counters are right around the rollover point that daemons simply
> can't get past.
>
> Another potential work around would be to segment each of the daemons into
> its own partition using spmonitor when this happens.  They should be able
> to finish their protocol and install a membership by themselves.  Then
> unpartition them.  Your way is probably more automated and has lower
> latency, though.
>
> We'll look into this.
>
> Cheers!
>
> ---
> John Schultz
> Spread Concepts
> Phn: 443 838 2200
>
> On Thu, 13 Mar 2008, Nico Meyer wrote:
> > Hi,
> >
> > I reported the exact same problem two months ago, but never got any
> > answer. Please see
> > http://commedia.cnds.jhu.edu/pipermail/spread-users/2008-January/003653.html
> > for the original post.
> >
> > Today it happened again (exactly 2 months later, but this is most likely
> > a coincidence). The logs also show the same numbers (compare with my
> > original post):
> > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
> > [Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
> > [Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
> > [Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 21344
> > [Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
> > [Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
> > [Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
> > [Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
> > [Thu 13 Mar 2008 15:04:28] Last Token:
> > [Thu 13 Mar 2008 15:04:28]      type:             0x80050080
> > [Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731508
> > [Thu 13 Mar 2008 15:04:28]      seq:              0
> > [Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731508
> > [Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
> > [Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
> > [Thu 13 Mar 2008 15:04:28]      flow_control:     0
> > [Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
> > [Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
> >
> > This is repeated every few seconds, and a little later:
> >
> > [Thu 13 Mar 2008 15:06:19] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:06:19]      Aru:              3333
> > [Thu 13 Mar 2008 15:06:19]      My_aru:           3333
> > [Thu 13 Mar 2008 15:06:19]      Highest_seq:      2147482054
> > [Thu 13 Mar 2008 15:06:19]      Highest_fifo_seq: 21344
> > [Thu 13 Mar 2008 15:06:19]      Last_discarded:   2147482054
> > [Thu 13 Mar 2008 15:06:19]      Last_delivered:   2147482054
> > [Thu 13 Mar 2008 15:06:19]      Last_seq:         3333
> > [Thu 13 Mar 2008 15:06:19]      Token_rounds:     501
> > [Thu 13 Mar 2008 15:06:19] Last Token:
> > [Thu 13 Mar 2008 15:06:19]      type:             0x80050080
> > [Thu 13 Mar 2008 15:06:19]      transmiter_id:    -1062731508
> > [Thu 13 Mar 2008 15:06:19]      seq:              1
> > [Thu 13 Mar 2008 15:06:19]      proc_id:          -1062731508
> > [Thu 13 Mar 2008 15:06:19]      aru:              3333
> > [Thu 13 Mar 2008 15:06:19]      aru_last_id:      -1062731505
> > [Thu 13 Mar 2008 15:06:19]      flow_control:     0
> > [Thu 13 Mar 2008 15:06:19]      rtr_len:          1440
> > [Thu 13 Mar 2008 15:06:19]      conf_hash:        -2002019299
> >
> > This is also repeated.
> >
> > And on another server in the same Spread segment:
> >
> > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
> > [Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
> > [Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
> > [Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 746858665
> > [Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
> > [Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
> > [Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
> > [Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
> > [Thu 13 Mar 2008 15:04:28] Last Token:
> > [Thu 13 Mar 2008 15:04:28]      type:             0x80050080
> > [Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731497
> > [Thu 13 Mar 2008 15:04:28]      seq:              0
> > [Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731497
> > [Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
> > [Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
> > [Thu 13 Mar 2008 15:04:28]      flow_control:     0
> > [Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
> > [Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
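
One way to read the negative transmiter_id, proc_id and aru_last_id values in
these logs: reinterpreted as unsigned 32-bit numbers, they look like IPv4
addresses in a private range. That the ids are daemon addresses is an
assumption here, not something the logs state, and the helper name
id_to_dotted is invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a signed 32-bit id as a dotted quad, most significant byte first.
 * Assumes the id holds an IPv4 address that was printed with %d. */
void id_to_dotted(int32_t id, char *buf, size_t len)
{
    uint32_t u = (uint32_t)id;   /* well-defined: wraps modulo 2^32 */
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)(u >> 24), (unsigned)((u >> 16) & 0xFF),
             (unsigned)((u >> 8) & 0xFF), (unsigned)(u & 0xFF));
}

/* Decode the three ids that appear in the logs in this thread. */
void check_logged_ids(void)
{
    char buf[16];   /* "255.255.255.255" plus the terminating NUL */

    id_to_dotted(-1062731508, buf, sizeof buf);
    assert(strcmp(buf, "192.168.1.12") == 0);

    id_to_dotted(-1062731497, buf, sizeof buf);
    assert(strcmp(buf, "192.168.1.23") == 0);

    id_to_dotted(-1062731505, buf, sizeof buf);
    assert(strcmp(buf, "192.168.1.15") == 0);
}
```

Read this way, the two log excerpts come from two different hosts, and
aru_last_id in the later excerpt names a third one.
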
> >
> >
> > After the last incident I raised the timeout in membership.c a little, so
> > my assumption that the low timeouts are the culprit seems to be wrong.
> >
> > Please have a look this time, as this is a really serious problem.
> > My workaround for now will be to put an exit(0) in the code where the
> > message is generated, and to restart the daemons after 5s. Hopefully this
> > will at least recover the spread ring.
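
The restart half of that workaround could be sketched as a small supervisor
like the one below. This is an assumption about how one might do it, not the
setup actually used: the function name supervise, its parameters and the
bounded run count are all made up for illustration, and the command to run is
left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] up to max_runs times, waiting delay_s seconds between the
 * exit of one run and the start of the next.
 * Returns the number of runs started, or -1 if fork/waitpid fails. */
int supervise(char *const argv[], unsigned delay_s, int max_runs)
{
    int runs = 0;

    assert(argv != NULL && argv[0] != NULL);

    while (runs < max_runs) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            /* Child: become the daemon process. */
            execvp(argv[0], argv);
            perror("execvp");      /* only reached if exec failed */
            _exit(127);
        }
        runs++;

        /* Parent: block until this run of the daemon exits. */
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return -1;
        }
        if (runs < max_runs)
            sleep(delay_s);        /* e.g. 5 for the delay mentioned above */
    }
    return runs;
}
```

Called with a 5 second delay and the daemon's usual command line, this would
restart the daemon each time the patched exit(0) fires. max_runs only keeps
the sketch from looping forever; an init system or a shell loop would do the
same job.
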
> >
> > Let me know if you need any additional info.
> >
> > Bye,
> >
> > Nico