[Spread-users] Link failure causes locking in spread daemon

Melissa Jenkins melissa-spread at temeletry.co.uk
Thu Feb 28 13:08:19 EST 2013


Hello,

Earlier today we had a failure on a transit link that resulted in our 4-machine spread segment failing quite catastrophically.

Each end of the link has two spread daemons, and only one spread segment is active (the config does define two).

On 2 of the boxes the log files are filled with messages like:
[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message
[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG
[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message
[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG
[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message
[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG

In fact, there were over 700,000 of them before we manually restarted the daemons.
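
For anyone wanting to put numbers on a flood like this, here is a minimal Python sketch that counts "Handle_alive in SEG" lines per timestamp. It assumes the log format is exactly as in the excerpt above (a bracketed timestamp followed by the message); the function name and the log path in the usage note are hypothetical and not part of spread.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
#   [Thu 28 Feb 2013 03:19:24] Handle_alive in SEG
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def count_alive_per_second(lines):
    """Return a Counter mapping timestamp string -> number of
    'Handle_alive in SEG' lines logged with that timestamp."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg") == "Handle_alive in SEG":
            counts[match.group("ts")] += 1
    return counts


# Sample input copied from the log excerpt above.
sample = [
    "[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message",
    "[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG",
    "[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message",
    "[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG",
    "[Thu 28 Feb 2013 03:19:24] Memb_handle_message: handling alive message",
    "[Thu 28 Feb 2013 03:19:24] Handle_alive in SEG",
]
sample_counts = count_alive_per_second(sample)
```

To run it against a real log, pass an open file instead of the sample list, e.g. `count_alive_per_second(open("spread.log"))` (path hypothetical), then use `.most_common(10)` on the result to see the busiest seconds.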

These boxes were both at the far end of the link from the leader.  Looking at the spread logs, the leader and its neighbour (which remained in comms) look like they synchronised together and carried on.

Applications connected to the two machines that logged the above messages saw the spread daemon cease to respond.

There was high packet loss briefly before the links actually failed.  The logging and monitoring we have don't show a large number of retransmits.

This is running Spread 4.0 still :(  Is it likely related to the asymmetric link issues we have seen before (which show up as huge numbers of retransmits)?  We are planning on upgrading to 4.1 in the next couple of weeks once we've had it through test.

Mel

