[Spread-users] Link failure causes locking in spread daemon

John Schultz jschultz at spreadconcepts.com
Thu Feb 28 15:55:49 EST 2013


Looking at your error report, it does look possible that you had some kind of asymmetric communication within your segment.

If the two daemons that didn't reform could hear the daemons that did, but not vice versa, then you could get the kind of behavior you describe here.  The non-forming daemons would keep hearing from the formed daemons and trying to form with them fruitlessly.  The system would appear frozen from the POV of any clients attached to the non-forming daemons.
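
To make the asymmetry concrete, here is a toy sketch in Python. It is not Spread's membership algorithm; it only finds daemons that receive packets from peers that cannot hear them back. The daemon names A to D and the function name one_way_links are hypothetical, chosen for illustration.

```python
# Toy model only: this is NOT how Spread forms memberships.
# hears[x] is the set of daemons whose packets daemon x receives.

def one_way_links(hears):
    """Return {daemon: peers it hears that do not hear it back}."""
    result = {}
    for daemon, heard in hears.items():
        deaf_peers = {p for p in heard if daemon not in hears.get(p, set())}
        if deaf_peers:
            result[daemon] = deaf_peers
    return result

# Hypothetical 4-daemon segment: A and B sit at one end of the link,
# C and D at the other. C and D hear everyone; A and B hear only each other.
hears = {
    "A": {"B"},
    "B": {"A"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
}

stuck = one_way_links(hears)
for daemon in sorted(stuck):
    print(daemon, "hears but is not heard by", sorted(stuck[daemon]))
```

In this model C and D are the daemons that would keep hearing the formed pair and keep trying to join it, while A and B see nothing wrong.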

Cheers!

-----
John Lane Schultz
Spread Concepts LLC
Phn: 301 830 8100
Cell: 443 838 2200

On Feb 28, 2013, at 1:08 PM, Melissa Jenkins wrote:

Hello,

Earlier today we had a failure on a transit link that appears to have caused our 4-machine Spread segment to fail quite catastrophically.

Each end of the link has two Spread daemons, and only one Spread segment is active (the config does define two).

On 2 of the boxes the log files are filled with messages like:
[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message
[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG
[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message
[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG
[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message
[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG

In fact there were over 700,000 of them before we got the daemons manually restarted.   
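
For anyone wanting to size a similar flood, here is a minimal Python sketch that tallies the alive messages per timestamp. It assumes the log line format shown above; the function name count_alive_per_second and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above: "[<timestamp>] <message>"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")
ALIVE_MSG = "Memb_handle_message: handling alive message"

def count_alive_per_second(lines):
    """Count 'handling alive message' entries, keyed by timestamp string."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("msg") == ALIVE_MSG:
            counts[m.group("ts")] += 1
    return counts

# The six lines quoted above, used as sample input.
sample = [
    "[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message",
    "[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG",
] * 3

counts = count_alive_per_second(sample)
for ts, n in counts.items():
    print(ts, n)
```

Against a real log you would pass the open file object instead of the sample list.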

These boxes were both at the far end of the link from the leader.  Looking at the Spread logs, the leader and its neighbour (which remained in comms) look like they synchronised together and carried on.

Applications connected to the two machines that logged the above messages saw the Spread daemon cease to respond.

There was high packet loss briefly before the links actually failed.  The logging and monitoring we have don't show a large number of retransmits.

This is running Spread 4.0 still :(  Is it likely related to the asymmetric link issues we have seen before (which show up as huge numbers of retransmits)?  We are planning on upgrading to 4.1 in the next couple of weeks once we've had it through test.

Mel