[Spread-users] Error condition stopping the whole ring for good

Sun Jan 13 12:52:42 EST 2008

Hello everyone,

We just had a nightmarish incident with our Spread setup.
It consists of about 15 Spread daemons, all in one gigabit Ethernet segment.

First, one of the daemons started to print the following messages, identical each time, about once per second until it was killed:
[Sun 13 Jan 2008 15:41:26] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Sun 13 Jan 2008 15:41:26]      Aru:              3334
[Sun 13 Jan 2008 15:41:26]      My_aru:           3334
[Sun 13 Jan 2008 15:41:26]      Highest_seq:      2147482050
[Sun 13 Jan 2008 15:41:26]      Highest_fifo_seq: 23401
[Sun 13 Jan 2008 15:41:26]      Last_discarded:   2147482050
[Sun 13 Jan 2008 15:41:26]      Last_delivered:   2147482050
[Sun 13 Jan 2008 15:41:26]      Last_seq:         3334
[Sun 13 Jan 2008 15:41:26]      Token_rounds:     501
[Sun 13 Jan 2008 15:41:26] Last Token:
[Sun 13 Jan 2008 15:41:26]      type:             0x80050080
[Sun 13 Jan 2008 15:41:26]      transmiter_id:    -1062731508
[Sun 13 Jan 2008 15:41:26]      seq:              1
[Sun 13 Jan 2008 15:41:26]      proc_id:          -1062731508
[Sun 13 Jan 2008 15:41:26]      aru:              3334
[Sun 13 Jan 2008 15:41:26]      aru_last_id:      -1062731505
[Sun 13 Jan 2008 15:41:26]      flow_control:     0
[Sun 13 Jan 2008 15:41:26]      rtr_len:          1440
[Sun 13 Jan 2008 15:41:26]      conf_hash:        -2002019299
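
A note on reading the dump above: the negative transmiter_id, proc_id and aru_last_id values look like IPv4 addresses printed as signed 32-bit integers. That is an assumption about how the daemon prints its IDs, not something this post confirms. Under that assumption, a minimal Python sketch (the function names are hypothetical) decodes them, and also reports how far Highest_seq is from the signed 32-bit maximum. The second part is plain arithmetic on the logged value, with no claim that it is related to the fault.

```python
import ipaddress

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def signed32_to_ipv4(value: int) -> str:
    """Reinterpret a signed 32-bit integer as a dotted-quad IPv4 address.

    Assumption: the daemon ID is an IPv4 address stored in a 32-bit int
    and printed as a signed decimal number.
    """
    return str(ipaddress.IPv4Address(value & 0xFFFFFFFF))


def headroom_to_int32_max(seq: int) -> int:
    """Return how many increments remain before seq reaches INT32_MAX."""
    return INT32_MAX - seq


if __name__ == "__main__":
    # Values copied from the log dump.
    ids = [
        ("transmiter_id", -1062731508),
        ("proc_id", -1062731508),
        ("aru_last_id", -1062731505),
    ]
    for name, raw in ids:
        print(f"{name}: {raw} -> {signed32_to_ipv4(raw)}")
    print("Highest_seq headroom:", headroom_to_int32_max(2147482050))
```

If the assumption holds, the token came from 192.168.1.12 and aru_last_id refers to 192.168.1.15, which may help to identify the daemons involved.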

After killing this daemon, all of the other daemons stopped processing messages and client connections. New client connections were immediately closed.

We then stopped all Spread clients, so no messages would be injected into the Spread ring. Still the same.
Then we restarted one Spread daemon after the other, and still nothing would work. The only thing that helped was killing all daemons but one, effectively doing a 'cold boot' of the whole Spread segment.
It seems like something was totally out of sync, and would even upset a daemon after restarting it.

I believe the reason for this problem is that we lowered the timeouts in membership.c too much. In the beginning, when we only had 3 daemons and far fewer messages per second, I divided everything by roughly 10, because back then everything seemed to stop every Hurry_timeout (2 s) for a few hundred ms (enough to be noticed by our application). Lowering the Hurry_timeout made these 'hiccups' appear more often, but they were also shorter. Another workaround was to generate messages with sp_flood, which made the hiccups go away (but I didn't like this very much).
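
To make the arithmetic of the timeout change concrete, here is a small Python sketch. Only the Hurry_timeout value of 2 s comes from this post. The helper names are hypothetical, and the sketch assumes one stall per Hurry_timeout expiry, as described in the paragraph above. It does not model the Spread membership protocol.

```python
def scale_timeouts(timeouts_s: dict, divisor: float) -> dict:
    """Return a new dict with every timeout (in seconds) divided by divisor."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return {name: value / divisor for name, value in timeouts_s.items()}


def expiries_per_minute(period_s: float) -> float:
    """Return how often a timer with the given period fires per minute."""
    return 60.0 / period_s


if __name__ == "__main__":
    # Only Hurry_timeout = 2 s is taken from the post; other timeouts in
    # membership.c are deliberately left out because their values are not given.
    original = {"Hurry_timeout": 2.0}
    lowered = scale_timeouts(original, 10)
    for label, table in (("original", original), ("lowered", lowered)):
        period = table["Hurry_timeout"]
        rate = expiries_per_minute(period)
        print(f"{label}: Hurry_timeout={period} s, {rate:.0f} expiries/min")
```

Under that assumption, dividing by 10 turns 30 potential stalls per minute into 300, which matches the observation that the stalls became more frequent.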

So my questions:

1) Is my guess correct that this error might be triggered by timeouts that are too small under heavy load?
2) Can you explain this very nasty behaviour where all daemons are stuck, and seem to get 'infected' even after restarting?
3) Does anyone have an explanation for the original problem with the Hurry_timeout?

Sorry, I don't have a lot of information to go on, but our first priority was to get the system up again, and the log didn't contain much information.

Have a nice weekend,

Nico Meyer