<html><head><meta name="qrichtext" content="1" /></head><body style="font-size:9pt;font-family:Sans Serif">
<p>Hello everyone,</p>
<p></p>
<p>We just had a nightmarish incident with our Spread setup.</p>
<p>It consists of about 15 Spread daemons, all in one Gigabit Ethernet segment.</p>
<p></p>
<p>First, one of the daemons started to emit the following block of messages about once per second until it was killed:</p>
<p>[Sun 13 Jan 2008 15:41:26] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:</p>
<p>[Sun 13 Jan 2008 15:41:26] Aru: 3334</p>
<p>[Sun 13 Jan 2008 15:41:26] My_aru: 3334</p>
<p>[Sun 13 Jan 2008 15:41:26] Highest_seq: 2147482050</p>
<p>[Sun 13 Jan 2008 15:41:26] Highest_fifo_seq: 23401</p>
<p>[Sun 13 Jan 2008 15:41:26] Last_discarded: 2147482050</p>
<p>[Sun 13 Jan 2008 15:41:26] Last_delivered: 2147482050</p>
<p>[Sun 13 Jan 2008 15:41:26] Last_seq: 3334</p>
<p>[Sun 13 Jan 2008 15:41:26] Token_rounds: 501</p>
<p>[Sun 13 Jan 2008 15:41:26] Last Token:</p>
<p>[Sun 13 Jan 2008 15:41:26] type: 0x80050080</p>
<p>[Sun 13 Jan 2008 15:41:26] transmiter_id: -1062731508</p>
<p>[Sun 13 Jan 2008 15:41:26] seq: 1</p>
<p>[Sun 13 Jan 2008 15:41:26] proc_id: -1062731508</p>
<p>[Sun 13 Jan 2008 15:41:26] aru: 3334</p>
<p>[Sun 13 Jan 2008 15:41:26] aru_last_id: -1062731505</p>
<p>[Sun 13 Jan 2008 15:41:26] flow_control: 0</p>
<p>[Sun 13 Jan 2008 15:41:26] rtr_len: 1440</p>
<p>[Sun 13 Jan 2008 15:41:26] conf_hash: -2002019299</p>
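<p>As a reading aid for the dump above: the negative values in transmiter_id, proc_id and aru_last_id look like IPv4 addresses printed as signed 32-bit integers. The following is only a minimal sketch under that assumption (the helper name decode_proc_id is hypothetical, and the big-endian byte order is a guess that happens to yield plausible private addresses):</p>
<pre>
```python
import socket
import struct


def decode_proc_id(value):
    """Interpret a signed 32-bit integer as a dotted-quad IPv4 address.

    Assumption: the daemon logs the address as a signed int whose
    big-endian byte layout is the IPv4 address.
    """
    packed = struct.pack("!i", value)  # signed 32-bit, network byte order
    return socket.inet_ntoa(packed)


if __name__ == "__main__":
    # Values copied from the token dump above.
    for name, raw in [("proc_id", -1062731508), ("aru_last_id", -1062731505)]:
        print(name, decode_proc_id(raw))
```
</pre>
<p>If the assumption holds, this identifies which daemon sent the last token and which one last updated the aru.</p>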
<p></p>
<p>After we killed this daemon, all of the other daemons stopped processing messages and client connections. New client connections were immediately closed.</p>
<p></p>
<p>We then stopped all Spread clients so that no messages would be injected into the Spread ring. The daemons remained stuck.</p>
<p>Then we restarted one Spread daemon after the other, and still nothing would work. The only thing that helped was killing all daemons but one, effectively doing a 'cold boot' of the whole Spread segment.</p>
<p>It seems like something was totally out of sync, and that this state would even upset a daemon after it had been restarted.</p>
<p></p>
<p>I believe the reason for this problem is that we lowered the timeouts in membership.c too much. I roughly divided everything by 10 in the beginning, when we only had 3 daemons and far fewer messages/s, because back then we had the problem that everything seemed to stop every Hurry_timeout (2s) for a few 100 ms (enough to be noticed by our application). Lowering the Hurry_timeout made these 'hiccups' appear more often, but they were also shorter. Another workaround was producing messages with sp_flood, which made the hiccups go away (but I didn't like this very much).</p>
<p></p>
<p>So my questions:</p>
<p></p>
<p>1) Is my guess correct that this error might be triggered by too-small timeouts under heavy load?</p>
<p>2) Can you explain this very nasty behaviour where all daemons are stuck and seem to get 'infected' even after restarting?</p>
<p>3) Does anyone have an explanation for the original problem with the Hurry_timeout?</p>
<p></p>
<p></p>
<p>Sorry that I don't have more information to go on, but our first priority was to get the system up again, and the log didn't contain much else.</p>
<p></p>
<p>Have a nice weekend,</p>
<p></p>
<p>Nico Meyer</p>
<p></p>
</body></html>