[Spread-users] Stuck Spread Daemons

Melissa Jenkins melissa-spread at temeletry.co.uk
Thu Nov 17 09:34:07 EST 2011


Well, it turns out that it isn't solely membership related.

It happened several times yesterday, and again this morning.  I tried removing the spread daemon from the second segment, and getting the processes on that box to connect directly to a daemon in the main segment.

Unfortunately this has had no effect on it whatsoever :(	

Behaviour was as mentioned before - large spike in retransmissions, with some daemons not receiving any messages.  The sender gets stuck and stops sending messages and seems to hang for reasonably long period when I restart the spread daemon.

spmonitor says it's sretrans that are incrementing.

If the daemon that has a high number of retransmits is killed and restarted everything seems to return to normal.  Sometimes you get a couple of daemons stuck retransmitting.

One time there was nothing in the log at all, just spmonitor checks.  It may be the logging happens after it has been stuck for a while, rather than at the beginning.

I'll have a look through the code and see if I can figure out where to start looking.

Mel

On 15 Nov 2011, at 17:12, John Schultz wrote:

> Hi Melissa,
> 
> I believe this is likely a bug.  There is at least one bug in the core membership + EVS protocol of Spread that rarely bites people when they get into extreme situations.  Unfortunately, I don't see any easy way to debug this as it is likely a logic bug in the membership + EVS protocol of Spread, which is quite complex.  I will put this on the list of currently reported bugs and try to take at the core logic of the protocol to see if I can spot the issue(s) that people have been reporting.
> 
> Cheers!
> 
> -----
> John Lane Schultz
> Spread Concepts LLC
> Phn: 301 830 8100
> Cell: 443 838 2200
> 
> On Nov 15, 2011, at 11:02 AM, Melissa Jenkins wrote:
> 
> 
> Over time we've noticed if we get a burst of high packet loss we often end up with Spread daemons in a stuck stage where they are constantly retransmitting.  Today, the logs indicated that the packet loss coincided with a few joins + leaves, though I don't have enough information to indicate that this anything more than a coincidence.
> 
> I’m a bit at a loss as to how to trouble shoot this, all suggestions gratefully received!  (I have tried turning up logging detail but it was just exhausting the disks and not showing anything more helpful)
> 
> Thanks,
> Mel
> 
> Basic configuration is:
> 
> Segment 1: 4 spread daemons, many users, all on a metro ethernet L2 lan.  Configured to use broadcast
> Segment 2: 1 spread daemon, several 'users', connected via a VPN to Segment 1.
> 
> It looks like the packet loss was in Segment 1, however, to unstick the Spread cluster we had to restart the daemon in Segment 2 (192.168.0.1).
> 
> This has happened several times always when other monitoring indicates high packet loss somewhere in the network, though often just when it is between segment1 & 2. 
> 
> spmonitor statistics show a HUGE burst of retransmissions – 14 million per second according to my monitoring, and we have a large spike in the bandwidth between services.  This burst/spike continues until the stuck daemon is restarted.  Our normal load peaks at ~60 messages a second.
> 
> Some messages were still being passed, during this period, though I'm not sure if all hosts had received them.
> 
> Here are some excerpts from the log, I’m not 100% sure they are totally relevant but it is important to note that message passing started to fail at about 09:12, the massive burst in retransmissions didn’t occur till closer to 09:30.  Two daemons in segment 1 transmit, 4 receive – one of the receivers show a big increase in the number of messages ~3000/second, the other shows a decrease.  The two transmitting show the massive spike in retransmissions.  The daemon in segment one both transmits & receives and it shows the same large spike.  According to spmonitor all three active transmitters have a large number of sent, received and retransmitted messages.
> 
> Log from 192.168.0.1 (seg2a1):
> 
> [Tue 15 Nov 2011 09:30:10] Stat_handle_message: sent status to Monitor at seg2a1
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling join message from 192.168.0.139, State is 1
> [Tue 15 Nov 2011 09:31:54] Handle_join in OP
> [Tue 15 Nov 2011 09:31:54] Memb_token_loss: I lost my token, state is 1
> [Tue 15 Nov 2011 09:31:54] Scast_alive: State is 2
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling join message from 192.168.0.203, State is 2
> [Tue 15 Nov 2011 09:31:55] Scast_alive: State is 2
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.139, State is 2
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.203, State is 2
> [Tue 15 Nov 2011 09:31:56] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:31:57] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:31:58] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:31:59] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:00] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:01] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:02] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:03] Form_or_fail: failed to gather
> [Tue 15 Nov 2011 09:32:03] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:04] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:05] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:32:06] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:32:06] Handle_refer in GATHER
> [Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:32:07] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:32:07] Handle_refer in GATHER
> [Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:32:08] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:32:08] Handle_refer in GATHER
> [Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:32:09] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:32:09] Handle_refer in GATHER
> [Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling join message from 192.168.0.203, State is 4
> [Tue 15 Nov 2011 09:32:10] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling refer message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:32:10] Handle_refer in GATHER
> [Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form1 token
> [Tue 15 Nov 2011 09:32:11] Handle_form1 in GATHER
> [Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form1 token
> [Tue 15 Nov 2011 09:32:11] Handle_form1 in FORM
> [Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
> [Tue 15 Nov 2011 09:32:11] Handle_form2 in FORM
> [Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
> [Tue 15 Nov 2011 09:32:11] Handle_form2 in EVS
> [Tue 15 Nov 2011 09:32:11] Memb_transitional
> [Tue 15 Nov 2011 09:32:11] G_handle_trans_memb:  with (192.168.0.203, 1321349531) id
> [Tue 15 Nov 2011 09:32:11] G_handle_trans_memb in GOP
> 
> 
> Log from the leader, which is in segment 1 (192.168.0.203):
> [Tue 15 Nov 2011 09:30:25] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:26] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:27] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:28] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:29] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:30] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:52] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:52] Handle_alive in OP
> [Tue 15 Nov 2011 09:31:52] Memb_token_loss: I lost my token, state is 1
> [Tue 15 Nov 2011 09:31:52] Scast_alive: State is 2
> [Tue 15 Nov 2011 09:31:52] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:52] Handle_alive in SEG
> [Tue 15 Nov 2011 09:31:52] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:52] Handle_alive in SEG
> [Tue 15 Nov 2011 09:31:53] Scast_alive: State is 2
> [Tue 15 Nov 2011 09:31:53] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:53] Handle_alive in SEG
> [Tue 15 Nov 2011 09:31:53] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:53] Handle_alive in SEG
> [Tue 15 Nov 2011 09:31:53] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:53] Handle_alive in SEG
> [Tue 15 Nov 2011 09:31:54] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:54] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:54] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:55] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:55] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:55] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:56] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:56] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:56] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:31:57] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:57] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:57] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:31:58] Send_join: State is 4
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.139, State is 4
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:31:58] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:31:58] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:31:59] Memb_handle_token: handling form1 token
> [Tue 15 Nov 2011 09:31:59] Handle_form1 in FORM
> [Tue 15 Nov 2011 09:31:59] Memb_handle_token: handling form1 token
> [Tue 15 Nov 2011 09:31:59] Handle_form1 in FORM
> [Tue 15 Nov 2011 09:31:59] Memb_handle_message: handling join message from 192.168.0.1, State is 5
> [Tue 15 Nov 2011 09:32:00] Memb_handle_message: handling join message from 192.168.0.1, State is 5
> [Tue 15 Nov 2011 09:32:01] Memb_handle_message: handling join message from 192.168.0.1, State is 5
> [Tue 15 Nov 2011 09:32:02] Memb_handle_message: handling join message from 192.168.0.1, State is 5
> [Tue 15 Nov 2011 09:32:03] Memb_handle_message: handling join message from 192.168.0.1, State is 5
> [Tue 15 Nov 2011 09:32:04] Memb_token_loss: I lost my token, state is 5
> [Tue 15 Nov 2011 09:32:04] Scast_alive: State is 2
> [Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:04] Handle_alive in SEG
> [Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:04] Handle_alive in SEG
> [Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:04] Handle_alive in SEG
> [Tue 15 Nov 2011 09:32:04] Memb_handle_message: handling join message from 192.168.0.1, State is 2
> [Tue 15 Nov 2011 09:32:05] Scast_alive: State is 2
> [Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:05] Handle_alive in SEG
> [Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:05] Handle_alive in SEG
> [Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:05] Handle_alive in SEG
> [Tue 15 Nov 2011 09:32:05] Memb_handle_message: handling join message from 192.168.0.1, State is 2
> [Tue 15 Nov 2011 09:32:06] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:06] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:06] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:06] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:06] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:32:07] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:07] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:07] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:07] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:07] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:32:08] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:08] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:08] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:08] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:08] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:32:09] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:09] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:09] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:09] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:09] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:32:10] Send_join: State is 4
> [Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:10] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:10] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling alive message
> [Tue 15 Nov 2011 09:32:10] Handle_alive in GATHER
> [Tue 15 Nov 2011 09:32:10] Memb_handle_message: handling join message from 192.168.0.1, State is 4
> [Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
> [Tue 15 Nov 2011 09:32:11] Handle_form2 in FORM
> [Tue 15 Nov 2011 09:32:11] Memb_handle_token: handling form2 token
> [Tue 15 Nov 2011 09:32:11] Handle_form2 in EVS
> [Tue 15 Nov 2011 09:32:11] Memb_transitional
> [Tue 15 Nov 2011 09:32:11] G_handle_trans_memb:  with (192.168.0.203, 1321349531) id
> [Tue 15 Nov 2011 09:32:11] G_handle_trans_memb in GOP
> [Tue 15 Nov 2011 09:32:11] G_handle_trans_memb: Synced Set (with 4 members):
> 
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users




More information about the Spread-users mailing list