[Spread-users] Total freeze of all daemons after long running
Nico Meyer
nmeyer at virtualminds.de
Thu Mar 13 16:13:57 EDT 2008
Hi John,
thanks for the quick answer. Unfortunately your suggestion will not work: as I
wrote in my first post, client connections are also closed immediately when
this happens, so spmonitor will not work either.
I hope you can come up with a solution soon.
Nico
> same issue as Chanh Hua's or not.
>
> Looking at your report, I would say the Aru and various sequence numbers
> look extremely fishy. They are all right around +/- 2^31, which could
> indicate some kind of bug in handling rollover of those counters.
> Furthermore, as I understand it, Aru should never be ahead of Highest_seq
> as it seems to be here. It looks like Aru rolled over before Highest_seq
> and the other sequence numbers, which I believe shouldn't happen.
>
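A minimal sketch of the rollover arithmetic described above, using the Aru and Highest_seq values from the logs quoted further down. It assumes the counters are signed 32-bit integers that wrap around; the helper names are illustrative and are not taken from the Spread source.

```python
INT32_MOD = 1 << 32   # number of distinct 32-bit values
INT32_HALF = 1 << 31  # 2^31, the signed rollover point


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range, like a wrapping counter."""
    return (value + INT32_HALF) % INT32_MOD - INT32_HALF


def serial_diff(a: int, b: int) -> int:
    """Signed distance from b to a, computed modulo 2^32."""
    return wrap_int32(a - b)


highest_seq = 2147482054  # Highest_seq from the first state dump
aru = -2147481909         # Aru from the first state dump

# A plain signed comparison puts Aru far behind Highest_seq ...
naive_behind = aru < highest_seq

# ... but modulo 2^32 Aru is a short distance ahead of it.
ahead_by = serial_diff(aru, highest_seq)

print(naive_behind, ahead_by)
```

Under that assumption Aru sits 3333 positions past Highest_seq once the wraparound is taken into account, which is consistent with Aru having rolled over while Highest_seq did not.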
> The printout was an attempt to work around this freezing bug by
> swallowing the token and forcing a restart of the synchronization between
> the daemons. Apparently, that isn't enough to handle your situation.
>
> My guess is that there is probably a bug with the synchronization protocol
> when the counters are right around the rollover point that daemons simply
> can't get past.
>
> Another potential workaround would be to segment each of the daemons into
> its own partition using spmonitor when this happens. They should be able
> to finish their protocol and install a membership by themselves. Then
> unpartition them. Your way is probably more automated and has less latency,
> though.
>
> We'll look into this.
>
> Cheers!
>
> ---
> John Schultz
> Spread Concepts
> Phn: 443 838 2200
>
> On Thu, 13 Mar 2008, Nico Meyer wrote:
> > Hi,
> >
> > I reported the exact same problem two months ago, but never got any
> > answer. Please see
> > http://commedia.cnds.jhu.edu/pipermail/spread-users/2008-January/003653.html
> > for the original post.
> >
> > Today it happened again (exactly 2 months later, but this is most likely
> > a coincidence). The logs also show the same numbers (compare with my
> > original post):
> > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:04:28] Aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] My_aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] Highest_seq: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Highest_fifo_seq: 21344
> > [Thu 13 Mar 2008 15:04:28] Last_discarded: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Last_delivered: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Last_seq: -2147481909
> > [Thu 13 Mar 2008 15:04:28] Token_rounds: 501
> > [Thu 13 Mar 2008 15:04:28] Last Token:
> > [Thu 13 Mar 2008 15:04:28] type: 0x80050080
> > [Thu 13 Mar 2008 15:04:28] transmiter_id: -1062731508
> > [Thu 13 Mar 2008 15:04:28] seq: 0
> > [Thu 13 Mar 2008 15:04:28] proc_id: -1062731508
> > [Thu 13 Mar 2008 15:04:28] aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] aru_last_id: 0
> > [Thu 13 Mar 2008 15:04:28] flow_control: 0
> > [Thu 13 Mar 2008 15:04:28] rtr_len: 1440
> > [Thu 13 Mar 2008 15:04:28] conf_hash: -2002019299
> >
> > repeated every few seconds.
> > and a little later:
> >
> > [Thu 13 Mar 2008 15:06:19] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:06:19] Aru: 3333
> > [Thu 13 Mar 2008 15:06:19] My_aru: 3333
> > [Thu 13 Mar 2008 15:06:19] Highest_seq: 2147482054
> > [Thu 13 Mar 2008 15:06:19] Highest_fifo_seq: 21344
> > [Thu 13 Mar 2008 15:06:19] Last_discarded: 2147482054
> > [Thu 13 Mar 2008 15:06:19] Last_delivered: 2147482054
> > [Thu 13 Mar 2008 15:06:19] Last_seq: 3333
> > [Thu 13 Mar 2008 15:06:19] Token_rounds: 501
> > [Thu 13 Mar 2008 15:06:19] Last Token:
> > [Thu 13 Mar 2008 15:06:19] type: 0x80050080
> > [Thu 13 Mar 2008 15:06:19] transmiter_id: -1062731508
> > [Thu 13 Mar 2008 15:06:19] seq: 1
> > [Thu 13 Mar 2008 15:06:19] proc_id: -1062731508
> > [Thu 13 Mar 2008 15:06:19] aru: 3333
> > [Thu 13 Mar 2008 15:06:19] aru_last_id: -1062731505
> > [Thu 13 Mar 2008 15:06:19] flow_control: 0
> > [Thu 13 Mar 2008 15:06:19] rtr_len: 1440
> > [Thu 13 Mar 2008 15:06:19] conf_hash: -2002019299
> >
> > also repeatedly.
> >
> > and on another server in the same spread segment:
> >
> > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:04:28] Aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] My_aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] Highest_seq: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Highest_fifo_seq: 746858665
> > [Thu 13 Mar 2008 15:04:28] Last_discarded: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Last_delivered: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Last_seq: -2147481909
> > [Thu 13 Mar 2008 15:04:28] Token_rounds: 501
> > [Thu 13 Mar 2008 15:04:28] Last Token:
> > [Thu 13 Mar 2008 15:04:28] type: 0x80050080
> > [Thu 13 Mar 2008 15:04:28] transmiter_id: -1062731497
> > [Thu 13 Mar 2008 15:04:28] seq: 0
> > [Thu 13 Mar 2008 15:04:28] proc_id: -1062731497
> > [Thu 13 Mar 2008 15:04:28] aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] aru_last_id: 0
> > [Thu 13 Mar 2008 15:04:28] flow_control: 0
> > [Thu 13 Mar 2008 15:04:28] rtr_len: 1440
> > [Thu 13 Mar 2008 15:04:28] conf_hash: -2002019299
> >
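To compare dumps like the ones above without reading them by eye, the numeric fields can be pulled out with a small parser. This is only a sketch: the regular expression is fitted to the log lines as they appear in this post (including the `> > ` quote prefix), the function names are made up for illustration, and the 32-bit wraparound is an assumption about how the counters behave.

```python
import re

# Matches "[timestamp] Name: 123" and captures the name and the decimal value.
# Hex values such as "type: 0x80050080" and free-text lines are skipped.
FIELD_RE = re.compile(r"\]\s*([A-Za-z_]+):\s*(-?\d+)\s*$")


def parse_state_dump(text: str) -> dict:
    """Collect the integer fields of one state dump, keyed by field name."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            # Keep the first occurrence of each name within a dump.
            fields.setdefault(match.group(1), int(match.group(2)))
    return fields


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range (assumed counter width)."""
    return (value + (1 << 31)) % (1 << 32) - (1 << 31)


def aru_lead(fields: dict) -> int:
    """Distance of Aru ahead of Highest_seq modulo 2^32 (negative if behind)."""
    return wrap_int32(fields["Aru"] - fields["Highest_seq"])


SAMPLE = """\
> > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > rounds in EVS state; swallowing token; state:
> > [Thu 13 Mar 2008 15:04:28] Aru: -2147481909
> > [Thu 13 Mar 2008 15:04:28] Highest_seq: 2147482054
> > [Thu 13 Mar 2008 15:04:28] Token_rounds: 501
> > [Thu 13 Mar 2008 15:04:28] type: 0x80050080
"""

sample_fields = parse_state_dump(SAMPLE)
print(sample_fields, aru_lead(sample_fields))
```

A positive result from `aru_lead` means Aru is ahead of Highest_seq, which is the condition John describes as something that should never happen.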
> >
> > After the last incident I raised the timeout in membership.c a little, so
> > my assumption that the low timeouts are the culprit seems to be wrong.
> >
> > Please have a look this time, as this is a really serious problem.
> > My workaround for now will be to put an exit(0) in the code where the
> > message is generated and restart the daemons after 5 s. Hopefully this
> > will at least recover the spread ring.
> >
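The restart half of that workaround could look roughly like the wrapper below. It is a sketch under assumptions: the post does not give the daemon's command line, so `command` is a placeholder, and `EXAMPLE_COMMAND` is only a stand-in process that exits immediately with status 0, the way the patched daemon would after calling exit(0).

```python
import subprocess
import sys
import time

# Stand-in for the real daemon command line, which is not given in the post.
EXAMPLE_COMMAND = [sys.executable, "-c", "raise SystemExit(0)"]


def supervise(command, restart_delay=5.0, max_runs=None, sleep=time.sleep):
    """Run command, and start it again restart_delay seconds after each exit.

    max_runs limits the total number of starts (None means run forever).
    Returns the list of exit codes that were observed.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        if exit_codes:
            # Pause before every restart, but not before the first start.
            sleep(restart_delay)
        exit_codes.append(subprocess.run(command).returncode)
    return exit_codes


if __name__ == "__main__":
    # Three starts of the stand-in process, 5 s apart.
    print(supervise(EXAMPLE_COMMAND, max_runs=3))
```

The `sleep` parameter is injectable only so the delay can be replaced in a dry run; in normal use the default `time.sleep` gives the 5 s pause mentioned above.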
> > Let me know if you need any additional info.
> >
> > Bye,
> >
> > Nico
> >
> >
> >