[Spread-users] Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sat Nov 6 17:49:10 EDT 2010


Hi all,

I've got a very strange issue with Spread going "silent" (not even a
self-join message with spuser "j foo") after adding a third node to a
network of two.

The problem does not occur if all three Spread daemons are launched
simultaneously. It only happens if I launch two nodes, wait a few
seconds (until they've announced the group memberships) and then add the
third node.

Here is the spread config (everything else is stock 4.1.0):

        Spread_Segment 178.22.65.249:4803 {
            f497c15415a34ba8 178.22.65.249
        }
        Spread_Segment 178.22.65.74:4803 {
            2f8919e6ea14416a 178.22.65.74
        }
        Spread_Segment 178.22.67.120:4803 {
            a816c9ebce424d8b 178.22.67.120
        }

For some background, these nodes are running on cloud infrastructure in
the same data centre but without a local broadcast address, hence the
three distinct Spread segments.

And here's the output of spmonitor with only the first two nodes
connected (working):

============================
Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88
seconds :
Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
rounds   :     988      tok_hurry :     225     memb change:       1
sent pack:     136      recv pack :     136     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :     299      Aru       :     299     Highest seq:     299
Sessions :       1      Groups    :       1     Window     :      60
Deliver M:     295      Deliver Pk:     299     Pers Window:      15
Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
==================================
 
Monitor> 
============================
Status at f497c15415a34ba8 V 4.01. 0 (state 1, gstate 1) after 93
seconds :
Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
rounds   :     988      tok_hurry :     238     memb change:       1
sent pack:     136      recv pack :     136     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :     299      Aru       :     299     Highest seq:     299
Sessions :       1      Groups    :       1     Window     :      60
Deliver M:     295      Deliver Pk:     299     Pers Window:      15
Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
==================================
 
Then when I start spread on the third node, Bad Things Happen:

Monitor> Monitor: send status query
 
============================
Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
seconds :
Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
rounds   :    1465      tok_hurry :     335     memb change:       1
sent pack:     199      recv pack :     199     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :     426      Aru       :     426     Highest seq:     426
Sessions :       1      Groups    :       1     Window     :      60
Deliver M:     422      Deliver Pk:     426     Pers Window:      15
Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
==================================
 
Monitor> 
============================
Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
seconds :
Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
rounds   :    1465      tok_hurry :     357     memb change:       1
sent pack:     199      recv pack :     199     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :     426      Aru       :     426     Highest seq:     426
Sessions :       1      Groups    :       1     Window     :      60
Deliver M:     422      Deliver Pk:     426     Pers Window:      15
Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
==================================
 
Monitor> 
============================
Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
seconds :
Membership  :  0  procs in 0 segments, leader is 0
rounds   :       0      tok_hurry :       0     memb change:       0
sent pack:       0      recv pack :       0     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :       0      Aru       :       0     Highest seq:       0
Sessions :       1      Groups    :       0     Window     :      60
Deliver M:       0      Deliver Pk:       0     Pers Window:      15
Delta Mes:    -422      Delta Pack:    -426     Delta sec  :    -131
==================================
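
In case it helps anyone compare the three dumps above side by side, here
is a rough Python sketch that pulls the daemon name, state, gstate and
membership line out of pasted spmonitor output and reports the fields on
which the daemons disagree. It is not part of Spread: parse_status and
disagreements are made-up helper names, the regular expressions assume
only the line layout shown in the dumps above, and it makes no attempt to
interpret what the state numbers mean.

```python
import re

# Header line of a status block, e.g.
# "Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88"
HEADER = re.compile(r"Status at (\S+) V .*?\(state (\d+), gstate (\d+)\)")
# e.g. "Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8"
MEMBERSHIP = re.compile(
    r"Membership\s*:\s*(\d+)\s+procs in (\d+) segments, leader is (\S+)"
)


def parse_status(text):
    """Return one dict per status block found in pasted spmonitor output."""
    blocks = []
    for line in text.splitlines():
        m = HEADER.search(line)
        if m:
            blocks.append({
                "daemon": m.group(1),
                "state": int(m.group(2)),
                "gstate": int(m.group(3)),
            })
            continue
        m = MEMBERSHIP.search(line)
        if m and blocks:
            # Attach the membership line to the most recent header.
            blocks[-1].update(
                procs=int(m.group(1)),
                segments=int(m.group(2)),
                leader=m.group(3),
            )
    return blocks


def disagreements(blocks):
    """Return {field: {daemon: value}} for fields that differ across daemons."""
    fields = ("state", "procs", "segments", "leader")
    return {
        f: {b["daemon"]: b.get(f) for b in blocks}
        for f in fields
        if len({b.get(f) for b in blocks}) > 1
    }


# Header and membership lines copied from the three-node dumps above.
SAMPLE = """\
Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
seconds :
Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
seconds :
Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
seconds :
Membership  :  0  procs in 0 segments, leader is 0
"""

blocks = parse_status(SAMPLE)
for field, values in disagreements(blocks).items():
    print(field, values)
```

Run against the dumps above, it flags that the three daemons do not
report the same state or the same membership view.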
 
After the issue occurs, spuser will no longer connect to Spread on any
node:

hybrid at f497c15415a34ba8:~$ spuser
Spread library version is 4.1.0
recv_nointr_timeout: Timed out
SP_error: (-8) Connection closed by spread

Any insight would be very much appreciated, as we're about to launch a
major product which relies on this!

The environment is FreeBSD 8.1 with Spread 4.1.0 on CloudSigma (Linux
KVM) infrastructure. I can provide detailed log output; please tell me
which flags you would like.

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420







