[Spread-users] Issue with Spread going silent

Yair Amir yairamir at cs.jhu.edu
Sat Nov 6 18:12:56 EDT 2010


Hi,

This is possibly a connectivity issue between the different computers:
it may not be possible to send and receive packets between every pair of
computers. You can check this by building the spsend and sprecv programs
and running them between each pair of machines to verify whether this
hypothesis is correct.
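
As a rough illustration of what such a check involves, here is a minimal
sketch of a generic UDP probe in Python. It is not the spsend/sprecv
programs and does not use the Spread protocol; the function names and the
choice of test port are assumptions made for illustration. To test the
configured port 4803 itself, the daemon would have to be stopped first so
that the port is free to bind.

```python
import socket
import threading  # only needed if receiver and sender run in one process


def udp_receive_once(bind_host, port, timeout=5.0, ready=None):
    """Wait for one UDP datagram.

    Returns (payload, sender_ip), or None if nothing arrives in time.
    `ready` is an optional threading.Event set once the socket is bound.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_host, port))
        sock.settimeout(timeout)
        if ready is not None:
            ready.set()
        try:
            payload, (sender_ip, _sender_port) = sock.recvfrom(2048)
        except socket.timeout:
            return None
        return payload, sender_ip


def udp_send(dest_host, port, payload=b"probe"):
    """Send a single UDP datagram to dest_host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (dest_host, port))


def ordered_pairs(hosts):
    """List every (sender, receiver) direction that needs to be checked."""
    return [(src, dst) for src in hosts for dst in hosts if src != dst]
```

With the three addresses from the configuration quoted below,
ordered_pairs() yields six directions. For each one, udp_receive_once()
would run on the receiving machine and udp_send() on the sending machine;
a None result for any direction would support the connectivity
hypothesis.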

It would also help if you let the monitor run for another 40-50 seconds
beyond what you sent, so that it prints a few more reports.

Cheers,

	:) Yair.

On 11/6/10 5:49 PM, Luke Marsden wrote:
> Hi all,
> 
> I've got a very strange issue with Spread going "silent" (not even a
> self-join message with spuser "j foo") after adding a third node to a
> network of two.
> 
> The problem does not occur if all three Spread daemons are launched
> simultaneously. It only happens if I launch two nodes, wait a few
> seconds (until they've announced the group memberships) and then add the
> third node.
> 
> Here is the spread config (everything else is stock 4.1.0):
> 
>         Spread_Segment 178.22.65.249:4803 {
>             f497c15415a34ba8 178.22.65.249
>         }
>         Spread_Segment 178.22.65.74:4803 {
>             2f8919e6ea14416a 178.22.65.74
>         }
>         Spread_Segment 178.22.67.120:4803 {
>             a816c9ebce424d8b 178.22.67.120
>         }
> 
> For some background, these nodes are running on cloud infrastructure in
> the same data centre but without a local broadcast address, hence the
> three distinct Spread segments.
> 
> And here's the output of spmonitor with only the first two nodes
> connected (working):
> 
> ============================
> Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88
> seconds :
> Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> rounds   :     988      tok_hurry :     225     memb change:       1
> sent pack:     136      recv pack :     136     retrans    :       0
> u retrans:       0      s retrans :       0     b retrans  :       0
> My_aru   :     299      Aru       :     299     Highest seq:     299
> Sessions :       1      Groups    :       1     Window     :      60
> Deliver M:     295      Deliver Pk:     299     Pers Window:      15
> Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
> ==================================
>  
> Monitor> 
> ============================
> Status at f497c15415a34ba8 V 4.01. 0 (state 1, gstate 1) after 93
> seconds :
> Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> rounds   :     988      tok_hurry :     238     memb change:       1
> sent pack:     136      recv pack :     136     retrans    :       0
> u retrans:       0      s retrans :       0     b retrans  :       0
> My_aru   :     299      Aru       :     299     Highest seq:     299
> Sessions :       1      Groups    :       1     Window     :      60
> Deliver M:     295      Deliver Pk:     299     Pers Window:      15
> Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
> ==================================
>  
> Then when I start spread on the third node, Bad Things Happen:
> 
> Monitor> Monitor: send status query
>  
> ============================
> Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
> seconds :
> Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> rounds   :    1465      tok_hurry :     335     memb change:       1
> sent pack:     199      recv pack :     199     retrans    :       0
> u retrans:       0      s retrans :       0     b retrans  :       0
> My_aru   :     426      Aru       :     426     Highest seq:     426
> Sessions :       1      Groups    :       1     Window     :      60
> Deliver M:     422      Deliver Pk:     426     Pers Window:      15
> Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
> ==================================
>  
> Monitor> 
> ============================
> Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
> seconds :
> Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> rounds   :    1465      tok_hurry :     357     memb change:       1
> sent pack:     199      recv pack :     199     retrans    :       0
> u retrans:       0      s retrans :       0     b retrans  :       0
> My_aru   :     426      Aru       :     426     Highest seq:     426
> Sessions :       1      Groups    :       1     Window     :      60
> Deliver M:     422      Deliver Pk:     426     Pers Window:      15
> Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
> ==================================
>  
> Monitor> 
> ============================
> Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
> seconds :
> Membership  :  0  procs in 0 segments, leader is 0
> rounds   :       0      tok_hurry :       0     memb change:       0
> sent pack:       0      recv pack :       0     retrans    :       0
> u retrans:       0      s retrans :       0     b retrans  :       0
> My_aru   :       0      Aru       :       0     Highest seq:       0
> Sessions :       1      Groups    :       0     Window     :      60
> Deliver M:       0      Deliver Pk:       0     Pers Window:      15
> Delta Mes:    -422      Delta Pack:    -426     Delta sec  :    -131
> ==================================
>  
> After the issue occurs, spuser will no longer connect to Spread on any
> node:
> 
> hybrid@f497c15415a34ba8:~$ spuser
> Spread library version is 4.1.0
> recv_nointr_timeout: Timed out
> SP_error: (-8) Connection closed by spread
> 
> Any insight would be very much appreciated, as we're about to launch a
> major product which relies on this!
> 
> The environment is FreeBSD 8.1 with Spread 4.1.0 on CloudSigma (Linux
> KVM) infrastructure. I can provide detailed log output, please tell me
> which flags you would like.
> 



