[Spread-users] Issue with Spread going silent
Luke Marsden
luke-lists at hybrid-logic.co.uk
Sat Nov 6 20:38:26 EDT 2010
Hi all,
Now this is really strange. I'm getting the same behaviour on a private
broadcast subnet. This time fff0bbd5e0da4103 was the machine to be added
last...
Monitor> Monitor: send status query
============================
Status at 27edda570dce48bb V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership : 2 procs in 1 segments, leader is 27edda570dce48bb
rounds : 225 tok_hurry : 53 memb change: 1
sent pack: 33 recv pack : 30 retrans : 0
u retrans: 0 s retrans : 0 b retrans : 0
My_aru : 85 Aru : 85 Highest seq: 85
Sessions : 1 Groups : 1 Window : 60
Deliver M: 81 Deliver Pk: 85 Pers Window: 15
Delta Mes: 81 Delta Pack: 85 Delta sec : 27
==================================
Monitor>
============================
Status at fff0bbd5e0da4103 V 4.01. 0 (state 4, gstate 1) after 76
seconds :
Membership : 0 procs in 0 segments, leader is 0
rounds : 0 tok_hurry : 0 memb change: 0
sent pack: 0 recv pack : 0 retrans : 0
u retrans: 0 s retrans : 0 b retrans : 0
My_aru : 0 Aru : 0 Highest seq: 0
Sessions : 1 Groups : 0 Window : 60
Deliver M: 0 Deliver Pk: 0 Pers Window: 15
Delta Mes: -81 Delta Pack: -85 Delta sec : -17
==================================
Monitor>
============================
Status at 2f20196c853548e7 V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership : 2 procs in 1 segments, leader is 27edda570dce48bb
rounds : 225 tok_hurry : 49 memb change: 1
sent pack: 30 recv pack : 33 retrans : 0
u retrans: 0 s retrans : 0 b retrans : 0
My_aru : 85 Aru : 85 Highest seq: 85
Sessions : 1 Groups : 1 Window : 60
Deliver M: 81 Deliver Pk: 85 Pers Window: 15
Delta Mes: 81 Delta Pack: 85 Delta sec : 17
==================================
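For anyone comparing many of these reports, here is a minimal sketch (not part of Spread) that scans pasted spmonitor output and lists the daemons reporting "0 procs". It assumes the text layout shown in the reports above; the function name `isolated_daemons` and the regular expressions are illustrative assumptions.

```python
import re

# Assumed layout, taken from the spmonitor reports in this thread.
STATUS_RE = re.compile(r"Status at (\S+) V")
MEMB_RE = re.compile(r"Membership\s*:\s*(\d+) procs in (\d+) segments")


def isolated_daemons(monitor_text):
    """Return names of daemons whose status block reports 0 procs."""
    isolated = []
    current = None
    for line in monitor_text.splitlines():
        status = STATUS_RE.search(line)
        if status:
            current = status.group(1)  # remember which daemon this block is for
            continue
        memb = MEMB_RE.search(line)
        if memb and current is not None:
            if int(memb.group(1)) == 0:
                isolated.append(current)
            current = None
    return isolated


SAMPLE = """\
Status at 27edda570dce48bb V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership : 2 procs in 1 segments, leader is 27edda570dce48bb
Status at fff0bbd5e0da4103 V 4.01. 0 (state 4, gstate 1) after 76
seconds :
Membership : 0 procs in 0 segments, leader is 0
"""

print(isolated_daemons(SAMPLE))
```

Run against the three reports above, this would single out fff0bbd5e0da4103, the node that was added last.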
Just before I added the third node, everything was happy:
--------------------
Configuration at 2f20196c853548e7 is:
Num Segments 1
2 10.255.255.255 4803
27edda570dce48bb 10.0.0.2
2f20196c853548e7 10.0.0.1
====================
--------------------
Configuration at 27edda570dce48bb is:
Num Segments 1
2 10.255.255.255 4803
27edda570dce48bb 10.0.0.2
2f20196c853548e7 10.0.0.1
====================
And my spread.conf:
Spread_Segment 10.255.255.255:4803 {
fff0bbd5e0da4103 10.0.0.3
27edda570dce48bb 10.0.0.2
2f20196c853548e7 10.0.0.1
}
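As a rough cross-check, the sketch below parses a spread.conf fragment like the one above and counts the configured daemons, so the figure can be compared with the "procs" value spmonitor reports. It is only an illustration that handles the simple `Spread_Segment ... { name address }` form shown here; `parse_segments` is a hypothetical helper, not a Spread tool.

```python
import re


def parse_segments(conf_text):
    """Map each segment 'address:port' to its list of (name, ip) daemons."""
    segments = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        header = re.match(r"Spread_Segment\s+(\S+)\s*\{", line)
        if header:
            current = header.group(1)
            segments[current] = []
        elif line == "}":
            current = None
        elif current is not None and line and not line.startswith("#"):
            name, ip = line.split()[:2]
            segments[current].append((name, ip))
    return segments


CONF = """\
Spread_Segment 10.255.255.255:4803 {
fff0bbd5e0da4103 10.0.0.3
27edda570dce48bb 10.0.0.2
2f20196c853548e7 10.0.0.1
}
"""

segments = parse_segments(CONF)
configured = sum(len(daemons) for daemons in segments.values())
print(configured, "daemons configured in", len(segments), "segment(s)")
```

Here three daemons are configured in one segment, while the two working nodes report "2 procs in 1 segments".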
The problem doesn't go away if you leave it for a long time. But
fff0bbd5e0da4103 "recovers" if you turn off Spread entirely on the other
two nodes. I.e., it starts showing you your own join messages again.
Could this possibly be an issue with Spread on FreeBSD 8.1?
Once again, happy to provide a debugging platform. Please let me know
what else I can do to help.
--
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.
Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting
Mobile: +447791750420
On Sat, 2010-11-06 at 23:30 +0000, Luke Marsden wrote:
> Hi Yair,
>
> Thank you for your swift response.
>
> Unfortunately there seems to be nothing wrong with the UDP connectivity
> between these servers.
>
> Using spsend and sprecv with the default options, I tried every
> sender/receiver pair:
>
> A -> B
> A -> C
> B -> A
> B -> C
> C -> A
> C -> B
>
> All were successful, with about 150 missed packets out of 10,000 in each
> run (which I presume is fairly normal).
>
> Example output of one of these six runs is below.
>
>
> hybrid at f497c15415a34ba8:~$ ./spsend -a 178.22.67.120
> Checking (178.22.67.120, 4444). Each burst has 100 packets, 1024 bytes
> each with 10 msec delay in between, for a total of 10000 packets
> sent 1000 packets of 1024 bytes
> sent 2000 packets of 1024 bytes
> sent 3000 packets of 1024 bytes
> sent 4000 packets of 1024 bytes
> sent 5000 packets of 1024 bytes
> sent 6000 packets of 1024 bytes
> sent 7000 packets of 1024 bytes
> sent 8000 packets of 1024 bytes
> sent 9000 packets of 1024 bytes
> sent 10000 packets of 1024 bytes
> total time is (2,138010), with 0 problems
>
> hybrid at a816c9ebce424d8b:~$ ./sprecv
> -------
> Report: total packets 10000, total missed 140, total corrupted 0
> -------
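A small sketch of the arithmetic behind these runs: it enumerates the ordered sender/receiver pairs for three hosts and computes the loss rate from the sprecv report. The host labels A, B and C are the placeholders used above, and `loss_rate` is a hypothetical helper.

```python
from itertools import permutations

hosts = ["A", "B", "C"]

# Every ordered (sender, receiver) pair; direction matters for a UDP test.
pairs = list(permutations(hosts, 2))


def loss_rate(total_packets, missed):
    """Fraction of packets the receiver reported as missed."""
    return missed / total_packets


for sender, receiver in pairs:
    print(f"{sender} -> {receiver}")

# Figures from the sprecv report above: 140 missed out of 10000.
print(f"loss: {loss_rate(10000, 140):.1%}")
```

With 140 missed packets out of 10,000, the loss works out to 1.4% for this run.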
>
>
> By the way, I had to modify the Makefile in daemon/ to get sprecv to
> build, adding events.o and memory.o like this:
>
> sprecv$(EXEEXT): r.o alarm.o data_link.o events.o memory.o
> $(LD) -o $@ r.o alarm.o data_link.o events.o memory.o $(LDFLAGS) $(LIBS)
>
>
> Any idea how I can proceed from here? I'm going to try running Spread
> over a VLAN so we can fall back to broadcast, but we really need
> arbitrary point-to-point connectivity. One of our use-cases is one
> server in each data centre, which would be logically equivalent to the
> problematic case here.
>
> I'm happy to give you access to one of our clusters to debug it :-)
>
> --
> Best Regards,
> Luke Marsden
> CTO, Hybrid Logic Ltd.
>
> Web: http://www.hybrid-cluster.com/
> Hybrid Web Cluster - cloud web hosting
>
> Mobile: +447791750420
>
>
> On Sat, 2010-11-06 at 18:12 -0400, Yair Amir wrote:
> > Hi,
> >
> > It is possibly a connectivity issue between the different computers. This
> > means that it may not be possible to send and receive a packet from each
> > computer to each other computer. You can check this by building the spsend
> > and sprecv programs and running them to verify whether this hypothesis is
> > correct.
> >
> > It would also help if you let the monitor run for another 40-50 seconds
> > beyond what you sent, so that it produces a few more reports.
> >
> > Cheers,
> >
> > :) Yair.
> >
> > On 11/6/10 5:49 PM, Luke Marsden wrote:
> > > Hi all,
> > >
> > > I've got a very strange issue with Spread going "silent" (not even a
> > > self-join message with spuser "j foo") after adding a third node to a
> > > network of two.
> > >
> > > The problem does not occur if all three Spread daemons are launched
> > > simultaneously. It only happens if I launch two nodes, wait a few
> > > seconds (until they've announced the group memberships) and then add the
> > > third node.
> > >
> > > Here is the spread config (everything else is stock 4.1.0):
> > >
> > > Spread_Segment 178.22.65.249:4803 {
> > > f497c15415a34ba8 178.22.65.249
> > > }
> > > Spread_Segment 178.22.65.74:4803 {
> > > 2f8919e6ea14416a 178.22.65.74
> > > }
> > > Spread_Segment 178.22.67.120:4803 {
> > > a816c9ebce424d8b 178.22.67.120
> > > }
> > >
> > > For some background, these nodes are running on cloud infrastructure in
> > > the same data centre but without a local broadcast address, hence the
> > > three distinct Spread segments.
> > >
> > > And here's the output of spmonitor with only the first two nodes
> > > connected (working):
> > >
> > > ============================
> > > Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88
> > > seconds :
> > > Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> > > rounds : 988 tok_hurry : 225 memb change: 1
> > > sent pack: 136 recv pack : 136 retrans : 0
> > > u retrans: 0 s retrans : 0 b retrans : 0
> > > My_aru : 299 Aru : 299 Highest seq: 299
> > > Sessions : 1 Groups : 1 Window : 60
> > > Deliver M: 295 Deliver Pk: 299 Pers Window: 15
> > > Delta Mes: 32 Delta Pack: 32 Delta sec : 5
> > > ==================================
> > >
> > > Monitor>
> > > ============================
> > > Status at f497c15415a34ba8 V 4.01. 0 (state 1, gstate 1) after 93
> > > seconds :
> > > Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> > > rounds : 988 tok_hurry : 238 memb change: 1
> > > sent pack: 136 recv pack : 136 retrans : 0
> > > u retrans: 0 s retrans : 0 b retrans : 0
> > > My_aru : 299 Aru : 299 Highest seq: 299
> > > Sessions : 1 Groups : 1 Window : 60
> > > Deliver M: 295 Deliver Pk: 299 Pers Window: 15
> > > Delta Mes: 0 Delta Pack: 0 Delta sec : 5
> > > ==================================
> > >
> > > Then when I start spread on the third node, Bad Things Happen:
> > >
> > > Monitor> Monitor: send status query
> > >
> > > ============================
> > > Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
> > > seconds :
> > > Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> > > rounds : 1465 tok_hurry : 335 memb change: 1
> > > sent pack: 199 recv pack : 199 retrans : 0
> > > u retrans: 0 s retrans : 0 b retrans : 0
> > > My_aru : 426 Aru : 426 Highest seq: 426
> > > Sessions : 1 Groups : 1 Window : 60
> > > Deliver M: 422 Deliver Pk: 426 Pers Window: 15
> > > Delta Mes: 32 Delta Pack: 32 Delta sec : 5
> > > ==================================
> > >
> > > Monitor>
> > > ============================
> > > Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
> > > seconds :
> > > Membership : 2 procs in 2 segments, leader is f497c15415a34ba8
> > > rounds : 1465 tok_hurry : 357 memb change: 1
> > > sent pack: 199 recv pack : 199 retrans : 0
> > > u retrans: 0 s retrans : 0 b retrans : 0
> > > My_aru : 426 Aru : 426 Highest seq: 426
> > > Sessions : 1 Groups : 1 Window : 60
> > > Deliver M: 422 Deliver Pk: 426 Pers Window: 15
> > > Delta Mes: 0 Delta Pack: 0 Delta sec : 5
> > > ==================================
> > >
> > > Monitor>
> > > ============================
> > > Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
> > > seconds :
> > > Membership : 0 procs in 0 segments, leader is 0
> > > rounds : 0 tok_hurry : 0 memb change: 0
> > > sent pack: 0 recv pack : 0 retrans : 0
> > > u retrans: 0 s retrans : 0 b retrans : 0
> > > My_aru : 0 Aru : 0 Highest seq: 0
> > > Sessions : 1 Groups : 0 Window : 60
> > > Deliver M: 0 Deliver Pk: 0 Pers Window: 15
> > > Delta Mes: -422 Delta Pack: -426 Delta sec : -131
> > > ==================================
> > >
> > > After the issue occurs, spuser will no longer connect to Spread on any
> > > node:
> > >
> > > hybrid at f497c15415a34ba8:~$ spuser
> > > Spread library version is 4.1.0
> > > recv_nointr_timeout: Timed out
> > > SP_error: (-8) Connection closed by spread
> > >
> > > Any insight would be very much appreciated, as we're about to launch a
> > > major product which relies on this!
> > >
> > > The environment is FreeBSD 8.1 with Spread 4.1.0 on CloudSigma (Linux
> > > KVM) infrastructure. I can provide detailed log output, please tell me
> > > which flags you would like.
> > >
> >
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users
>
>