[Spread-users] Stuck Spread Daemons

Melissa Jenkins melissa-spread at temeletry.co.uk
Tue Nov 15 11:02:12 EST 2011


 
Over time we've noticed if we get a burst of high packet loss we often end up with Spread daemons in a stuck stage where they are constantly retransmitting.  Today, the logs indicated that the packet loss coincided with a few joins + leaves, though I don't have enough information to indicate that this anything more than a coincidence.
 
I’m a bit at a loss as to how to trouble shoot this, all suggestions gratefully received!  (I have tried turning up logging detail but it was just exhausting the disks and not showing anything more helpful)
 
Thanks,
Mel
 
Basic configuration is:
 
Segment 1: 4 spread daemons, many users, all on a metro ethernet L2 lan.  Configured to use broadcast
Segment 2: 1 spread daemon, several 'users', connected via a VPN to Segment 1.
 
It looks like the packet loss was in Segment 1, however, to unstick the Spread cluster we had to restart the daemon in Segment 2 (192.168.0.1).
 
This has happened several times always when other monitoring indicates high packet loss somewhere in the network, though often just when it is between segment1 & 2. 
 
spmonitor statistics show a HUGE burst of retransmissions – 14 million per second according to my monitoring, and we have a large spike in the bandwidth between services.  This burst/spike continues until the stuck daemon is restarted.  Our normal load peaks at ~60 messages a second.
 
Some messages were still being passed, during this period, though I'm not sure if all hosts had received them.
 
Here are some excerpts from the log, I’m not 100% sure they are totally relevant but it is important to note that message passing started to fail at about 09:12, the massive burst in retransmissions didn’t occur till closer to 09:30.  Two daemons in segment 1 transmit, 4 receive – one of the receivers show a big increase in the number of messages ~3000/second, the other shows a decrease.  The two transmitting show the massive spike in retransmissions.  The daemon in segment one both transmits & receives and it shows the same large spike.  According to spmonitor all three active transmitters have a large number of sent, received and retransmitted messages.
 
Log from 192.168.0.1 (seg2a1):
 
[Tue 15 Nov 2011 09:30:10] Stat_handle_message: sent status to Monitor at seg2a1
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling join message from 192.168.0.139, State is 1
[Tue 15 Nov 2011 09:31:54] Handle_join in OP
[Tue 15 Nov 2011 09:31:54] Memb_token_loss: I lost my token, state is 1
[Tue 15 Nov 2011 09:31:54] Scast_alive: State is 2
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling join message from 192.168.0.203, State is 2
[Tue 15 Nov 2011 09:31:55] Scast_alive: State is 2
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.139, State is 2
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.203, State is 2
[Tue 15 Nov 2011 09:31:56] Send_join: State is 4
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:31:57] Send_join: State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:31:58] Send_join: State is 4
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:31:59] Send_join: State is 4
[Tue 15 Nov 2011 09:32:00] Send_join: State is 4
[Tue 15 Nov 2011 09:32:01] Send_join: State is 4
[Tue 15 Nov 2011 09:32:02] Send_join: State is 4
[Tue 15 Nov 2011 09:32:03] Form_or_fail: failed to gather
[Tue 15 Nov 2011 09:32:03] Send_join: State is 4
[Tue 15 Nov 2011 09:32:04] Send_join: State is 4
[Tue 15 Nov 2011 09:32:05] Send_join: State is 4
[Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:32:06] Send_join: State is 4
[Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:32:06] Handle_refer in GATHER
[Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:32:07] Send_join: State is 4
[Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:32:07] Handle_refer in GATHER
[Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:32:08] Send_join: State is 4
[Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:32:08] Handle_refer in GATHER
[Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:32:09] Send_join: State is 4
[Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:32:09] Handle_refer in GATHER
[Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling join message from 192.168.0.203, State is 4
[Tue 15 Nov 2011 09:32:10] Send_join: State is 4
[Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:32:10] Handle_refer in GATHER
[Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form1 token
[Tue 15 Nov 2011 09:32:11] Handle_form1 in GATHER
[Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form1 token
[Tue 15 Nov 2011 09:32:11] Handle_form1 in FORM
[Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
[Tue 15 Nov 2011 09:32:11] Handle_form2 in FORM
[Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
[Tue 15 Nov 2011 09:32:11] Handle_form2 in EVS
[Tue 15 Nov 2011 09:32:11] Memb_transitional
[Tue 15 Nov 2011 09:32:11] G_handle_trans_memb:  with (192.168.0.203, 1321349531) id
[Tue 15 Nov 2011 09:32:11] G_handle_trans_memb in GOP
 
 
Log from the leader, which is in segment 1 (192.168.0.203):
[Tue 15 Nov 2011 09:30:25] Send_join: State is 4
[Tue 15 Nov 2011 09:31:26] Send_join: State is 4
[Tue 15 Nov 2011 09:31:27] Send_join: State is 4
[Tue 15 Nov 2011 09:31:28] Send_join: State is 4
[Tue 15 Nov 2011 09:31:29] Send_join: State is 4
[Tue 15 Nov 2011 09:31:30] Send_join: State is 4
[Tue 15 Nov 2011 09:31:52] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:52] Handle_alive in OP
[Tue 15 Nov 2011 09:31:52] Memb_token_loss: I lost my token, state is 1
[Tue 15 Nov 2011 09:31:52] Scast_alive: State is 2
[Tue 15 Nov 2011 09:31:52] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:52] Handle_alive in SEG
[Tue 15 Nov 2011 09:31:52] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:52] Handle_alive in SEG
[Tue 15 Nov 2011 09:31:53] Scast_alive: State is 2
[Tue 15 Nov 2011 09:31:53] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:53] Handle_alive in SEG
[Tue 15 Nov 2011 09:31:53] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:53] Handle_alive in SEG
[Tue 15 Nov 2011 09:31:53] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:53] Handle_alive in SEG
[Tue 15 Nov 2011 09:31:54] Send_join: State is 4
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:55] Send_join: State is 4
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:56] Send_join: State is 4
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:31:57] Send_join: State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:31:58] Send_join: State is 4
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.139, State is 4
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
[Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:31:59] Memb_handle_token: handling form1 token
[Tue 15 Nov 2011 09:31:59] Handle_form1 in FORM
[Tue 15 Nov 2011 09:31:59] Memb_handle_token: handling form1 token
[Tue 15 Nov 2011 09:31:59] Handle_form1 in FORM
[Tue 15 Nov 2011 09:31:59] Memb_handle_message: handling join message from 192.168.0.1, State is 5
[Tue 15 Nov 2011 09:32:00] Memb_handle_message: handling join message from 192.168.0.1, State is 5
[Tue 15 Nov 2011 09:32:01] Memb_handle_message: handling join message from 192.168.0.1, State is 5
[Tue 15 Nov 2011 09:32:02] Memb_handle_message: handling join message from 192.168.0.1, State is 5
[Tue 15 Nov 2011 09:32:03] Memb_handle_message: handling join message from 192.168.0.1, State is 5
[Tue 15 Nov 2011 09:32:04] Memb_token_loss: I lost my token, state is 5
[Tue 15 Nov 2011 09:32:04] Scast_alive: State is 2
[Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:04] Handle_alive in SEG
[Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:04] Handle_alive in SEG
[Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:04] Handle_alive in SEG
[Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling join message from 192.168.0.1, State is 2
[Tue 15 Nov 2011 09:32:05] Scast_alive: State is 2
[Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:05] Handle_alive in SEG
[Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:05] Handle_alive in SEG
[Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:05] Handle_alive in SEG
[Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling join message from 192.168.0.1, State is 2
[Tue 15 Nov 2011 09:32:06] Send_join: State is 4
[Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:06] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:06] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:06] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:32:07] Send_join: State is 4
[Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:07] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:07] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:07] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:32:08] Send_join: State is 4
[Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:08] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:08] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:08] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:32:09] Send_join: State is 4
[Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:09] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:09] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:09] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:32:10] Send_join: State is 4
[Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:10] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:10] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling alive message
[Tue 15 Nov 2011 09:32:10] Handle_alive in GATHER
[Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling join message from 192.168.0.1, State is 4
[Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
[Tue 15 Nov 2011 09:32:11] Handle_form2 in FORM
[Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
[Tue 15 Nov 2011 09:32:11] Handle_form2 in EVS
[Tue 15 Nov 2011 09:32:11] Memb_transitional
[Tue 15 Nov 2011 09:32:11] G_handle_trans_memb:  with (192.168.0.203, 1321349531) id
[Tue 15 Nov 2011 09:32:11] G_handle_trans_memb in GOP
[Tue 15 Nov 2011 09:32:11] G_handle_trans_memb: Synced Set (with 4 members):
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.spread.org/pipermail/spread-users/attachments/20111115/d999bfba/attachment-0001.html 


More information about the Spread-users mailing list