[Spread-users] Member token loss at regular-ish 5 minute intervals

Barry Abrahamson barry at automattic.com
Thu May 10 22:56:55 EDT 2007


We are using spread + wackamole + pound to achieve an HA solution for
our application.  We have 5 pairs of servers deployed across the
country (1 pair per datacenter), and most of them work just fine.
One pair of servers, however, is having problems.
Roughly every 5 minutes, one of two things happens:

Case 1:

(on segment leader)

[Fri 11 May 2007 02:33:43] Send_join: State is 4
[Fri 11 May 2007 02:33:44] Send_join: State is 4
[Fri 11 May 2007 02:33:45] Send_join: State is 4
[Fri 11 May 2007 02:33:46] Send_join: State is 4
[Fri 11 May 2007 02:33:47] Send_join: State is 4

(on segment member)

[Fri 11 May 2007 02:33:43] Memb_handle_message: handling join message from -1408237407, State is 1
[Fri 11 May 2007 02:33:43] Handle_join in OP
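
The sender in the join message is printed as a negative number. A minimal sketch for decoding it, assuming (this is not confirmed anywhere in the logs) that the value is the daemon's IPv4 address printed as a signed 32-bit integer; the helper name member_id_to_ip is made up for illustration:

```python
import ipaddress


def member_id_to_ip(member_id: int) -> str:
    """Interpret a signed 32-bit integer as a dotted-quad IPv4 address.

    Assumption: the ID in the log is the daemon's IPv4 address,
    most significant octet first, printed as a signed int.
    """
    unsigned = member_id & 0xFFFFFFFF  # undo the signed 32-bit wraparound
    return str(ipaddress.IPv4Address(unsigned))


# The two IDs that appear in the join messages in this post.
for member_id in (-1408237407, -1408237406):
    print(member_id, "->", member_id_to_ip(member_id))
```

If that assumption holds, the two IDs in this post decode to 172.16.0.161 and 172.16.0.162, which can be checked against the addresses configured for the pair.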

Case 2:

(on segment leader)
[Fri 11 May 2007 02:39:27] Memb_token_loss: I lost my token, state is 1
[Fri 11 May 2007 02:39:27] Scast_alive: State is 2
[Fri 11 May 2007 02:39:27] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:27] Handle_alive in SEG
[Fri 11 May 2007 02:39:28] Scast_alive: State is 2
[Fri 11 May 2007 02:39:28] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:28] Handle_alive in SEG
[Fri 11 May 2007 02:39:32] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:32] Handle_alive in SEG
[Fri 11 May 2007 02:39:33] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:33] Handle_alive in SEG
[Fri 11 May 2007 02:39:34] Memb_handle_message: handling join message from -1408237406, State is 2
[Fri 11 May 2007 02:39:34] Scast_alive: State is 3
[Fri 11 May 2007 02:39:35] Memb_handle_message: handling join message from -1408237406, State is 3
[Fri 11 May 2007 02:39:35] Scast_alive: State is 3
[Fri 11 May 2007 02:39:36] Memb_handle_message: handling join message from -1408237406, State is 3
[Fri 11 May 2007 02:39:36] Scast_alive: State is 3
[Fri 11 May 2007 02:39:37] Memb_handle_message: handling join message from -1408237406, State is 3
[Fri 11 May 2007 02:39:37] Scast_alive: State is 3
[Fri 11 May 2007 02:39:38] Memb_handle_message: handling join message from -1408237406, State is 3
[Fri 11 May 2007 02:39:38] Scast_alive: State is 3
[Fri 11 May 2007 02:39:39] Memb_handle_token: handling form1 token
[Fri 11 May 2007 02:39:39] Handle_form1 in REPRESENTED
[Fri 11 May 2007 02:39:39] Memb_handle_token: handling form1 token
[Fri 11 May 2007 02:39:39] Handle_form1 in FORM
[Fri 11 May 2007 02:39:39] Memb_handle_token: handling form2 token
[Fri 11 May 2007 02:39:39] Handle_form2 in FORM
[Fri 11 May 2007 02:39:39] Memb_handle_token: handling form2 token
[Fri 11 May 2007 02:39:39] Handle_form2 in EVS
[Fri 11 May 2007 02:39:39] Memb_transitional
[Fri 11 May 2007 02:39:39] Memb_regular

(on segment member)
[Fri 11 May 2007 02:39:27] Memb_token_loss: I lost my token, state is 1
[Fri 11 May 2007 02:39:27] Scast_alive: State is 2
[Fri 11 May 2007 02:39:28] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:28] Handle_alive in SEG
[Fri 11 May 2007 02:39:28] Scast_alive: State is 2
[Fri 11 May 2007 02:39:32] Scast_alive: State is 2
[Fri 11 May 2007 02:39:33] Scast_alive: State is 2
[Fri 11 May 2007 02:39:34] Send_join: State is 4
[Fri 11 May 2007 02:39:34] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:34] Handle_alive in GATHER
[Fri 11 May 2007 02:39:35] Send_join: State is 4
[Fri 11 May 2007 02:39:35] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:35] Handle_alive in GATHER
[Fri 11 May 2007 02:39:36] Send_join: State is 4
[Fri 11 May 2007 02:39:36] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:36] Handle_alive in GATHER
[Fri 11 May 2007 02:39:37] Send_join: State is 4
[Fri 11 May 2007 02:39:37] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:37] Handle_alive in GATHER
[Fri 11 May 2007 02:39:38] Send_join: State is 4
[Fri 11 May 2007 02:39:38] Memb_handle_message: handling alive message
[Fri 11 May 2007 02:39:38] Handle_alive in GATHER
[Fri 11 May 2007 02:39:39] Memb_handle_token: handling form2 token
[Fri 11 May 2007 02:39:39] Handle_form2 in FORM
[Fri 11 May 2007 02:39:39] Memb_handle_token: handling form2 token
[Fri 11 May 2007 02:39:39] Handle_form2 in EVS
[Fri 11 May 2007 02:39:39] Memb_transitional
[Fri 11 May 2007 02:39:39] Memb_regular
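
To make the numeric states easier to follow, here is a rough sketch that annotates "State is N" with a state name. The mapping for 1 through 4 is inferred from adjacent lines in the logs above (for example "State is 4" next to "Handle_alive in GATHER"); the values for FORM and EVS do not appear as numbers in these logs and are an assumption. The function name annotate is made up for illustration.

```python
import re

# 1-4 are inferred from paired lines in the logs in this post.
# 5 and 6 are an ASSUMPTION: the logs never print them as numbers.
STATE_NAMES = {
    1: "OP",
    2: "SEG",
    3: "REPRESENTED",
    4: "GATHER",
    5: "FORM",  # assumed
    6: "EVS",   # assumed
}

STATE_RE = re.compile(r"([Ss]tate is )(\d+)")


def annotate(line: str) -> str:
    """Append the state name after every 'State is N' in a log line."""
    def repl(match: re.Match) -> str:
        number = int(match.group(2))
        name = STATE_NAMES.get(number, "unknown")
        return f"{match.group(1)}{number} ({name})"
    return STATE_RE.sub(repl, line)


print(annotate("[Fri 11 May 2007 02:39:27] Memb_token_loss: I lost my token, state is 1"))
print(annotate("[Fri 11 May 2007 02:39:34] Send_join: State is 4"))
```

Read this way, Case 2 shows both daemons leaving OP on token loss at the same second and re-forming the membership about 12 seconds later.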

Sometimes when Case 2 happens, the segment comes back as 1 member  
(which causes wackamole to initiate the rebalance process) and then  
adds the second member within seconds (which causes wackamole to  
rebalance again).  Most of the time, however, after the token loss is  
initiated on both machines, they both come back as a part of the same  
spread segment and there is no visible effect on wackamole or our  
application(s).

We are using spread 3.17.03 from the sarge AMD64 repo across all
servers, with the default timeouts in membership.c.

Any ideas on what is causing these events at seemingly regular
5-minute intervals would be most helpful.
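
For anyone who wants to check how regular the interval really is, a small sketch that finds the start of each burst of Send_join / Memb_token_loss lines in a daemon log and prints the gaps between bursts. The 60-second quiet period used to separate bursts and the function names are my own choices, not anything from spread; the sample lines are taken from the leader logs above.

```python
import re
from datetime import datetime

# Matches "[Fri 11 May 2007 02:33:43] rest of line"; the leading "[" is optional.
LINE_RE = re.compile(r"^\[?(\w{3} \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2})\] (.*)$")
EVENT_MARKERS = ("Memb_token_loss", "Send_join")


def event_starts(lines, quiet_seconds=60):
    """Return the timestamp of the first matching line of each burst.

    A new burst starts when more than quiet_seconds have passed since
    the previous matching line (an arbitrary threshold for this sketch).
    """
    starts = []
    last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or not match.group(2).startswith(EVENT_MARKERS):
            continue
        ts = datetime.strptime(match.group(1), "%a %d %b %Y %H:%M:%S")
        if last is None or (ts - last).total_seconds() > quiet_seconds:
            starts.append(ts)
        last = ts
    return starts


def gaps_in_seconds(starts):
    """Seconds between consecutive burst starts."""
    return [int((b - a).total_seconds()) for a, b in zip(starts, starts[1:])]


sample = [
    "[Fri 11 May 2007 02:33:43] Send_join: State is 4",
    "[Fri 11 May 2007 02:33:47] Send_join: State is 4",
    "[Fri 11 May 2007 02:39:27] Memb_token_loss: I lost my token, state is 1",
]
print(gaps_in_seconds(event_starts(sample)))
```

On the two events quoted in this post the gap is 344 seconds, so a little under 6 minutes rather than exactly 5. Running this over a full day of logs would show whether the spacing is constant or drifts.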

Thanks,

Barry




