[Spread-users] Total freeze of all daemons after long running

John Schultz jschultz at spreadconcepts.com
Thu Mar 13 16:06:15 EDT 2008


Sorry we never got back to you on this issue.  I'm not sure if this is the 
same issue as Chanh Hua's or not.

Looking at your report, I would say the Aru and various sequence numbers 
look extremely fishy.  They are all right around +/- 2^31, which could 
indicate some kind of bug in handling rollover of those counters. 
Furthermore, as I understand it, Aru should never be ahead of Highest_seq 
as it seems to be here.  It looks like Aru rolled over before Highest_seq 
and the other sequence numbers, which I believe shouldn't happen.
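
To make the rollover point concrete, here is a minimal sketch of a
wraparound-aware comparison for 32-bit sequence numbers.  It is not taken
from the Spread source; the helper names seq_distance and seq_after are
hypothetical, and it assumes the counters are 32-bit values that wrap
modulo 2^32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not part of Spread: compare 32-bit sequence
 * numbers modulo 2^32 so that a counter which has just wrapped is still
 * treated as "after" one that has not. */

/* Forward distance from b to a, modulo 2^32. */
static uint32_t seq_distance(int32_t a, int32_t b)
{
    /* Unsigned subtraction wraps by definition, so this is well defined
     * even when a and b sit on opposite sides of the signed rollover. */
    return (uint32_t)a - (uint32_t)b;
}

/* True if a is logically after b, i.e. the forward distance from b to a
 * is non-zero and less than half of the sequence space. */
static bool seq_after(int32_t a, int32_t b)
{
    uint32_t d = seq_distance(a, b);
    return d != 0 && d < UINT32_C(0x80000000);
}
```

With the values from the report, a plain signed comparison puts Aru
(-2147481909) far behind Highest_seq (2147482054), while
seq_distance(Aru, Highest_seq) is 3333 and seq_after(Aru, Highest_seq) is
true.  In other words, Aru sits 3333 past Highest_seq across the rollover.
That equals the Aru of 3333 in the later printout, although the logs alone
don't show whether that is more than a coincidence.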

The printout comes from an attempt to work around this freezing bug by 
swallowing the token and forcing a restart of the synchronization between 
the daemons.  Apparently, that isn't enough to handle your situation.

My guess is that the synchronization protocol has a bug when the counters 
are right around the rollover point, and the daemons simply can't get past 
it.

Another potential workaround would be to segment each of the daemons into 
its own partition using spmonitor when this happens.  They should be able 
to finish their protocol and install a membership by themselves.  Then 
unpartition them.  Your way is probably more automated and has lower 
latency, though.

We'll look into this.

Cheers!

---
John Schultz
Spread Concepts
Phn: 443 838 2200

On Thu, 13 Mar 2008, Nico Meyer wrote:

> Hi,
>
> I reported the exact same problem two months ago, but never got any answer.
> Please see
> http://commedia.cnds.jhu.edu/pipermail/spread-users/2008-January/003653.html
> for the original post.
>
> Today it happened again (exactly 2 months later, but this is most likely a
> coincidence). The logs also show the same numbers (compare with my original
> post):
> [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many rounds
> in EVS state; swallowing token; state:
> [Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
> [Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
> [Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
> [Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 21344
> [Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
> [Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
> [Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
> [Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
> [Thu 13 Mar 2008 15:04:28] Last Token:
> [Thu 13 Mar 2008 15:04:28]      type:             0x80050080
> [Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731508
> [Thu 13 Mar 2008 15:04:28]      seq:              0
> [Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731508
> [Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
> [Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
> [Thu 13 Mar 2008 15:04:28]      flow_control:     0
> [Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
> [Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
>
> repeated every few seconds.
> and a little later:
>
> [Thu 13 Mar 2008 15:06:19] Prot_handle_token: BUG WORKAROUND: Too many rounds
> in EVS state; swallowing token; state:
> [Thu 13 Mar 2008 15:06:19]      Aru:              3333
> [Thu 13 Mar 2008 15:06:19]      My_aru:           3333
> [Thu 13 Mar 2008 15:06:19]      Highest_seq:      2147482054
> [Thu 13 Mar 2008 15:06:19]      Highest_fifo_seq: 21344
> [Thu 13 Mar 2008 15:06:19]      Last_discarded:   2147482054
> [Thu 13 Mar 2008 15:06:19]      Last_delivered:   2147482054
> [Thu 13 Mar 2008 15:06:19]      Last_seq:         3333
> [Thu 13 Mar 2008 15:06:19]      Token_rounds:     501
> [Thu 13 Mar 2008 15:06:19] Last Token:
> [Thu 13 Mar 2008 15:06:19]      type:             0x80050080
> [Thu 13 Mar 2008 15:06:19]      transmiter_id:    -1062731508
> [Thu 13 Mar 2008 15:06:19]      seq:              1
> [Thu 13 Mar 2008 15:06:19]      proc_id:          -1062731508
> [Thu 13 Mar 2008 15:06:19]      aru:              3333
> [Thu 13 Mar 2008 15:06:19]      aru_last_id:      -1062731505
> [Thu 13 Mar 2008 15:06:19]      flow_control:     0
> [Thu 13 Mar 2008 15:06:19]      rtr_len:          1440
> [Thu 13 Mar 2008 15:06:19]      conf_hash:        -2002019299
>
> also repeatedly.
>
> and on another server in the same spread segment:
>
> [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many rounds
> in EVS state; swallowing token; state:
> [Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
> [Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
> [Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
> [Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 746858665
> [Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
> [Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
> [Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
> [Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
> [Thu 13 Mar 2008 15:04:28] Last Token:
> [Thu 13 Mar 2008 15:04:28]      type:             0x80050080
> [Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731497
> [Thu 13 Mar 2008 15:04:28]      seq:              0
> [Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731497
> [Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
> [Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
> [Thu 13 Mar 2008 15:04:28]      flow_control:     0
> [Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
> [Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
>
>
> After the last incident I raised the timeout in membership.c a little, so my
> assumption that the low timeouts are the culprit seems to be wrong.
>
> Please have a look this time, as this is a really serious problem.
> My workaround for now will be putting an exit(0) in the code where the message
> is generated, and restarting the daemons after 5s. Hopefully this will at least
> recover the spread ring.
>
> Let me know if you need any additional info.
>
> Bye,
>
> Nico



