[Spread-users] Total freeze of all daemons after long running

Nico Meyer nmeyer at virtualminds.de
Thu Mar 13 15:38:30 EDT 2008


Hi,

I reported exactly the same problem two months ago but never got an answer.
The original post is at
http://commedia.cnds.jhu.edu/pipermail/spread-users/2008-January/003653.html

Today it happened again (exactly two months later, though that is most likely
a coincidence). The logs also show the same numbers (compare with my original
post):
[Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
[Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
[Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
[Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 21344
[Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
[Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
[Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
[Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
[Thu 13 Mar 2008 15:04:28] Last Token:
[Thu 13 Mar 2008 15:04:28]      type:             0x80050080
[Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731508
[Thu 13 Mar 2008 15:04:28]      seq:              0
[Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731508
[Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
[Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
[Thu 13 Mar 2008 15:04:28]      flow_control:     0
[Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
[Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299

This is repeated every few seconds. A little later the output changes to:

[Thu 13 Mar 2008 15:06:19] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Thu 13 Mar 2008 15:06:19]      Aru:              3333
[Thu 13 Mar 2008 15:06:19]      My_aru:           3333
[Thu 13 Mar 2008 15:06:19]      Highest_seq:      2147482054
[Thu 13 Mar 2008 15:06:19]      Highest_fifo_seq: 21344
[Thu 13 Mar 2008 15:06:19]      Last_discarded:   2147482054
[Thu 13 Mar 2008 15:06:19]      Last_delivered:   2147482054
[Thu 13 Mar 2008 15:06:19]      Last_seq:         3333
[Thu 13 Mar 2008 15:06:19]      Token_rounds:     501
[Thu 13 Mar 2008 15:06:19] Last Token:
[Thu 13 Mar 2008 15:06:19]      type:             0x80050080
[Thu 13 Mar 2008 15:06:19]      transmiter_id:    -1062731508
[Thu 13 Mar 2008 15:06:19]      seq:              1
[Thu 13 Mar 2008 15:06:19]      proc_id:          -1062731508
[Thu 13 Mar 2008 15:06:19]      aru:              3333
[Thu 13 Mar 2008 15:06:19]      aru_last_id:      -1062731505
[Thu 13 Mar 2008 15:06:19]      flow_control:     0
[Thu 13 Mar 2008 15:06:19]      rtr_len:          1440
[Thu 13 Mar 2008 15:06:19]      conf_hash:        -2002019299

This is also repeated every few seconds.
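
For anyone comparing the numbers in the two dumps above, here is a small sketch of the arithmetic. It assumes the sequence fields are 32-bit values printed with a signed format, which the logs suggest but do not prove. The helper names are made up for illustration and are not Spread functions. Under that assumption, Highest_seq (2147482054) sits 1593 below the signed 32-bit maximum, and the negative Aru (-2147481909) read as unsigned is exactly 3333 past Highest_seq, the same 3333 that shows up as Aru in the later dump. That would be consistent with a sequence counter crossing the signed 32-bit limit, although nothing here confirms that this is the cause.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a value that was printed with a signed format as unsigned 32-bit. */
static uint32_t as_unsigned32(int32_t logged)
{
    return (uint32_t)logged;
}

/* Forward distance from sequence number `from` to `to`, modulo 2^32. */
static uint32_t seq_distance(int32_t from, int32_t to)
{
    return as_unsigned32(to) - as_unsigned32(from);
}

/* Increments left before a non-negative signed 32-bit counter overflows. */
static int32_t headroom_before_overflow(int32_t seq)
{
    assert(seq >= 0);
    return INT32_MAX - seq;
}
```

With the logged values, seq_distance(2147482054, -2147481909) is 3333 and headroom_before_overflow(2147482054) is 1593.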

On another server in the same Spread segment:

[Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
[Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
[Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
[Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 746858665
[Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
[Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
[Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
[Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
[Thu 13 Mar 2008 15:04:28] Last Token:
[Thu 13 Mar 2008 15:04:28]      type:             0x80050080
[Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731497
[Thu 13 Mar 2008 15:04:28]      seq:              0
[Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731497
[Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
[Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
[Thu 13 Mar 2008 15:04:28]      flow_control:     0
[Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
[Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
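
To tell which machines the negative transmiter_id, proc_id and aru_last_id values refer to, the sketch below converts them to dotted-quad form. It assumes the ids are IPv4 addresses held in a 32-bit integer and printed as signed, which matches the values in these dumps. The function names are hypothetical, not part of Spread.

```c
#include <stdint.h>
#include <stdio.h>

/* Extract octet `index` (0 = most significant) from a logged 32-bit id. */
static unsigned id_octet(int32_t id, int index)
{
    uint32_t u = (uint32_t)id;            /* undo the signed print format */
    return (u >> (24 - 8 * index)) & 0xFFu;
}

/* Write the id as a dotted quad into buf; returns the snprintf result. */
static int id_to_dotted_quad(int32_t id, char *buf, size_t len)
{
    return snprintf(buf, len, "%u.%u.%u.%u",
                    id_octet(id, 0), id_octet(id, 1),
                    id_octet(id, 2), id_octet(id, 3));
}
```

Under that assumption, -1062731508 is 192.168.1.12, -1062731497 is 192.168.1.23, and the aru_last_id of -1062731505 is 192.168.1.15.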


After the last incident I raised the timeout in membership.c a little, so my
assumption that the low timeouts are the culprit seems to be wrong.

Please have a look this time, as this is a really serious problem.
My workaround for now will be to put an exit(0) in the code where the message
is generated and to restart the daemons after 5 seconds. Hopefully this will at
least let the Spread ring recover.
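
The restart half of that workaround could look roughly like the wrapper below. This is a minimal sketch for a POSIX system, not something taken from the Spread sources: run_once and supervise are hypothetical names, and the daemon's command line is left to the caller. It runs the command, waits for it to exit (for example through the added exit(0)), sleeps, and starts it again.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the command once. Returns its exit status, or -1 if it could not be
 * started or waited for, or if it was killed by a signal. */
static int run_once(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed in the child */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Restart the command each time it exits, sleeping delay_seconds in between.
 * max_runs < 0 means run forever. Returns the number of runs performed. */
static int supervise(char *const argv[], unsigned delay_seconds, int max_runs)
{
    int runs = 0;
    while (max_runs < 0 || runs < max_runs) {
        run_once(argv);
        runs++;
        if (max_runs < 0 || runs < max_runs)
            sleep(delay_seconds);
    }
    return runs;
}
```

Calling supervise(argv, 5, -1) with the daemon's own command line in argv would give the 5 second restart delay. Note that all daemons in the segment exit independently, so this only helps if each one is wrapped the same way.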

Let me know if you need any additional info.

Bye,

Nico





