[Spread-users] Issue with Spread going silent
Yair Amir
yairamir at cs.jhu.edu
Sat Nov 6 18:12:56 EDT 2010
Hi,
This is possibly a connectivity issue between the different computers:
it may not be possible to send a packet from each computer to every
other computer and have it received there. You can check this by
building the spsend and sprecv programs and running them between each
pair of machines to verify whether this hypothesis is correct.
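If building spsend and sprecv is inconvenient, the same pairwise check
can be roughed out with plain UDP sockets. The sketch below is only a
stand-in and is not part of Spread: the function names are made up for
illustration, and the port you probe is your choice (the quoted config
uses 4803, which is busy while the daemon runs, and a filter may treat
other ports differently).

```python
import socket


def open_receiver(bind_ip, port):
    """Bind a UDP socket on this host; run this on the receiving machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_ip, port))
    return sock


def send_probe(host, port, payload=b"probe"):
    """Send one UDP datagram to host:port; run this on the sending machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def recv_one(sock, timeout=2.0):
    """Wait for one datagram. Returns (payload, sender) or None on timeout."""
    sock.settimeout(timeout)
    try:
        return sock.recvfrom(1024)
    except socket.timeout:
        return None
```

On one machine call open_receiver("0.0.0.0", port) followed by
recv_one(...), and on another call send_probe(receiver_ip, port).
Repeat for every ordered pair of the three hosts; a None result in any
direction points to the kind of one-way connectivity gap described
above.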
It would also help if you let the monitor run for another 40-50 seconds
beyond what you sent, so that it prints a few more reports.
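When collecting several reports, it can be easier to compare them after
pulling out the few fields that matter. The following is a rough sketch
that assumes the raw monitor text looks like the reports quoted below;
the helper names are hypothetical, and the default steady_state=1 is
taken only from the two working reports in the quoted output, not from
the Spread documentation.

```python
import re

# Matches e.g. "Status at <name> V 4.01. 0 (state 1, gstate 1) after 88 seconds"
STATUS_RE = re.compile(
    r"Status at (?P<name>\S+)\s+V[\d. ]+"
    r"\(state (?P<state>\d+), gstate (?P<gstate>\d+)\)"
    r"\s+after\s+(?P<secs>\d+)\s+seconds"
)
# Matches e.g. "Membership : 2 procs in 2 segments"
MEMB_RE = re.compile(
    r"Membership\s*:\s*(?P<procs>\d+) procs in (?P<segs>\d+) segments"
)


def parse_reports(text):
    """Return one dict per status report found in raw monitor output."""
    reports = []
    for chunk in text.split("Status at")[1:]:
        chunk = "Status at" + chunk
        status = STATUS_RE.search(chunk)
        memb = MEMB_RE.search(chunk)
        if not status or not memb:
            continue  # skip reports that do not match the assumed layout
        reports.append({
            "name": status.group("name"),
            "state": int(status.group("state")),
            "gstate": int(status.group("gstate")),
            "secs": int(status.group("secs")),
            "procs": int(memb.group("procs")),
            "segs": int(memb.group("segs")),
        })
    return reports


def unsettled(reports, expected_procs, steady_state=1):
    """Names of daemons whose latest report differs from the expected view."""
    latest = {}
    for report in reports:
        latest[report["name"]] = report  # later reports replace earlier ones
    return sorted(
        name for name, r in latest.items()
        if r["procs"] != expected_procs or r["state"] != steady_state
    )
```

Feeding the saved monitor output to parse_reports and then calling
unsettled(reports, expected_procs=3) lists the daemons that have not
yet reported a three-member view, which shows at a glance whether the
extra 40-50 seconds changed anything.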
Cheers,
:) Yair.
On 11/6/10 5:49 PM, Luke Marsden wrote:
> Hi all,
>
> I've got a very strange issue with Spread going "silent" (not even a
> self-join message with spuser "j foo") after adding a third node to a
> network of two.
>
> The problem does not occur if all three Spread daemons are launched
> simultaneously. It only happens if I launch two nodes, wait a few
> seconds (until they've announced the group memberships) and then add the
> third node.
>
> Here is the spread config (everything else is stock 4.1.0):
>
> Spread_Segment 178.22.65.249:4803 {
> f497c15415a34ba8 178.22.65.249
> }
> Spread_Segment 178.22.65.74:4803 {
> 2f8919e6ea14416a 178.22.65.74
> }
> Spread_Segment 178.22.67.120:4803 {
> a816c9ebce424d8b 178.22.67.120
> }
>
> For some background, these nodes are running on cloud infrastructure in
> the same data centre but without a local broadcast address, hence the
> three distinct Spread segments.
>
> And here's the output of spmonitor with only the first two nodes
> connected (working):
>
> ============================
> Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88
> seconds :
> Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> rounds : 988 tok_hurry : 225 memb change: 1
> sent pack: 136 recv pack : 136 retrans : 0
> u retrans: 0 s retrans : 0 b retrans : 0
> My_aru : 299 Aru : 299 Highest seq: 299
> Sessions : 1 Groups : 1 Window : 60
> Deliver M: 295 Deliver Pk: 299 Pers Window: 15
> Delta Mes: 32 Delta Pack: 32 Delta sec : 5
> ==================================
>
> Monitor>
> ============================
> Status at f497c15415a34ba8 V 4.01. 0 (state 1, gstate 1) after 93
> seconds :
> Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> rounds : 988 tok_hurry : 238 memb change: 1
> sent pack: 136 recv pack : 136 retrans : 0
> u retrans: 0 s retrans : 0 b retrans : 0
> My_aru : 299 Aru : 299 Highest seq: 299
> Sessions : 1 Groups : 1 Window : 60
> Deliver M: 295 Deliver Pk: 299 Pers Window: 15
> Delta Mes: 0 Delta Pack: 0 Delta sec : 5
> ==================================
>
> Then when I start spread on the third node, Bad Things Happen:
>
> Monitor> Monitor: send status query
>
> ============================
> Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
> seconds :
> Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> rounds : 1465 tok_hurry : 335 memb change: 1
> sent pack: 199 recv pack : 199 retrans : 0
> u retrans: 0 s retrans : 0 b retrans : 0
> My_aru : 426 Aru : 426 Highest seq: 426
> Sessions : 1 Groups : 1 Window : 60
> Deliver M: 422 Deliver Pk: 426 Pers Window: 15
> Delta Mes: 32 Delta Pack: 32 Delta sec : 5
> ==================================
>
> Monitor>
> ============================
> Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
> seconds :
> Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> rounds : 1465 tok_hurry : 357 memb change: 1
> sent pack: 199 recv pack : 199 retrans : 0
> u retrans: 0 s retrans : 0 b retrans : 0
> My_aru : 426 Aru : 426 Highest seq: 426
> Sessions : 1 Groups : 1 Window : 60
> Deliver M: 422 Deliver Pk: 426 Pers Window: 15
> Delta Mes: 0 Delta Pack: 0 Delta sec : 5
> ==================================
>
> Monitor>
> ============================
> Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
> seconds :
> Membership : 0 procs in 0 segments, leader is 0
> rounds : 0 tok_hurry : 0 memb change: 0
> sent pack: 0 recv pack : 0 retrans : 0
> u retrans: 0 s retrans : 0 b retrans : 0
> My_aru : 0 Aru : 0 Highest seq: 0
> Sessions : 1 Groups : 0 Window : 60
> Deliver M: 0 Deliver Pk: 0 Pers Window: 15
> Delta Mes: -422 Delta Pack: -426 Delta sec : -131
> ==================================
>
> After the issue occurs, spuser will no longer connect to Spread on any
> node:
>
> hybrid@f497c15415a34ba8:~$ spuser
> Spread library version is 4.1.0
> recv_nointr_timeout: Timed out
> SP_error: (-8) Connection closed by spread
>
> Any insight would be very much appreciated, as we're about to launch a
> major product which relies on this!
>
> The environment is FreeBSD 8.1 with Spread 4.1.0 on CloudSigma (Linux
> KVM) infrastructure. I can provide detailed log output, please tell me
> which flags you would like.
>