[Spread-users] Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sat Nov 6 19:30:03 EDT 2010


Hi Yair,

Thank you for your swift response.

Unfortunately there seems to be nothing wrong with the UDP connectivity
between these servers.

Using spsend and sprecv with the default options, I tested every
ordered pair of servers:

        A -> B
        A -> C
        B -> A
        B -> C
        C -> A
        C -> B
        
All were successful, with about 150 missed packets out of 10,000 in each
run (which I presume is fairly normal).
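
For reference, the six runs are simply every ordered pair of the three
servers. A minimal Python sketch of that test matrix follows. The labels
A, B and C are placeholders for the three hosts, and the script only
prints the pairs; it does not invoke spsend or sprecv.

```python
from itertools import permutations

# Placeholder labels for the three servers; substitute real addresses as needed.
HOSTS = ["A", "B", "C"]


def ordered_pairs(hosts):
    """Return every (sender, receiver) pair with sender != receiver."""
    return list(permutations(hosts, 2))


if __name__ == "__main__":
    for sender, receiver in ordered_pairs(HOSTS):
        print(f"{sender} -> {receiver}")
```

With three hosts this gives six pairs, in the same order as the list
above.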

Example output of one of these six runs is below.


hybrid@f497c15415a34ba8:~$ ./spsend -a 178.22.67.120
Checking (178.22.67.120, 4444). Each burst has 100 packets, 1024 bytes
each with 10 msec delay in between, for a total of 10000 packets
sent 1000 packets of 1024 bytes
sent 2000 packets of 1024 bytes
sent 3000 packets of 1024 bytes
sent 4000 packets of 1024 bytes
sent 5000 packets of 1024 bytes
sent 6000 packets of 1024 bytes
sent 7000 packets of 1024 bytes
sent 8000 packets of 1024 bytes
sent 9000 packets of 1024 bytes
sent 10000 packets of 1024 bytes
total time is (2,138010), with 0 problems 

hybrid@a816c9ebce424d8b:~$ ./sprecv
-------
Report: total packets 10000, total missed 140, total corrupted 0
-------
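
To compare the six runs, the loss rate can be pulled out of the sprecv
report line. This is only a sketch: the regular expression is written
against the single report line shown above, and it assumes that "total
packets" is the number of packets sent. For this run, 140 missed out of
10000 is 1.4%.

```python
import re

# Matches the summary line printed by sprecv in the run above.
REPORT_RE = re.compile(
    r"Report: total packets (\d+), total missed (\d+), total corrupted (\d+)"
)


def loss_percent(report_line):
    """Return missed packets as a percentage of total packets."""
    match = REPORT_RE.search(report_line)
    if match is None:
        raise ValueError(f"not a sprecv report line: {report_line!r}")
    total, missed, _corrupted = (int(g) for g in match.groups())
    if total == 0:
        raise ValueError("report shows zero packets")
    return 100.0 * missed / total


if __name__ == "__main__":
    line = "Report: total packets 10000, total missed 140, total corrupted 0"
    print(f"{loss_percent(line):.2f}% missed")
```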


By the way, I had to modify the Makefile in daemon/ to get sprecv to
build, adding events.o and memory.o like this:

sprecv$(EXEEXT): r.o alarm.o data_link.o events.o memory.o 
	$(LD) -o $@ r.o alarm.o data_link.o events.o memory.o $(LDFLAGS) $(LIBS)


Any idea how I can proceed from here? I'm going to try running Spread
over a VLAN so we can fall back to broadcast, but we really need
arbitrary point-to-point connectivity. One of our use-cases is one
server in each data centre, which would be logically equivalent to the
problematic case here.
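
To make the one-daemon-per-segment layout concrete, here is a small
Python sketch that renders the same kind of configuration we are using
(quoted further down) from a list of daemons. The function name
render_segments is made up for illustration; it is not part of Spread,
and it only reproduces the layout of our current config file.

```python
DEFAULT_PORT = 4803  # Port used in our current spread.conf.


def render_segments(daemons, port=DEFAULT_PORT):
    """Render one single-daemon Spread_Segment block per (name, address) pair."""
    blocks = []
    for name, address in daemons:
        blocks.append(
            f"Spread_Segment {address}:{port} {{\n"
            f"    {name} {address}\n"
            f"}}"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Daemon names and addresses taken from the configuration quoted below.
    daemons = [
        ("f497c15415a34ba8", "178.22.65.249"),
        ("2f8919e6ea14416a", "178.22.65.74"),
        ("a816c9ebce424d8b", "178.22.67.120"),
    ]
    print(render_segments(daemons))
```

Each segment uses the daemon's own address as the segment address, since
there is no broadcast address available to share.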

I'm happy to give you access to one of our clusters to debug it :-)

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420


On Sat, 2010-11-06 at 18:12 -0400, Yair Amir wrote:
> Hi,
> 
> It is possibly a connectivity issue between the different computers. This
> means that it may not be possible to send and receive a packet from each
> computer to each other computer. You can check this by building the spsend
> and sprecv programs and running them to verify whether this hypothesis is
> correct.
> 
> It would also help to let the monitor run for another 40-50 seconds
> beyond what you sent, so that it produces a few more reports.
> 
> Cheers,
> 
> 	:) Yair.
> 
> On 11/6/10 5:49 PM, Luke Marsden wrote:
> > Hi all,
> > 
> > I've got a very strange issue with Spread going "silent" (not even a
> > self-join message with spuser "j foo") after adding a third node to a
> > network of two.
> > 
> > The problem does not occur if all three Spread daemons are launched
> > simultaneously. It only happens if I launch two nodes, wait a few
> > seconds (until they've announced the group memberships) and then add the
> > third node.
> > 
> > Here is the spread config (everything else is stock 4.1.0):
> > 
> >         Spread_Segment 178.22.65.249:4803 {
> >             f497c15415a34ba8 178.22.65.249
> >         }
> >         Spread_Segment 178.22.65.74:4803 {
> >             2f8919e6ea14416a 178.22.65.74
> >         }
> >         Spread_Segment 178.22.67.120:4803 {
> >             a816c9ebce424d8b 178.22.67.120
> >         }
> > 
> > For some background, these nodes are running on cloud infrastructure in
> > the same data centre but without a local broadcast address, hence the
> > three distinct Spread segments.
> > 
> > And here's the output of spmonitor with only the first two nodes
> > connected (working):
> > 
> > ============================
> > Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88
> > seconds :
> > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > rounds   :     988      tok_hurry :     225     memb change:       1
> > sent pack:     136      recv pack :     136     retrans    :       0
> > u retrans:       0      s retrans :       0     b retrans  :       0
> > My_aru   :     299      Aru       :     299     Highest seq:     299
> > Sessions :       1      Groups    :       1     Window     :      60
> > Deliver M:     295      Deliver Pk:     299     Pers Window:      15
> > Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
> > ==================================
> >  
> > Monitor> 
> > ============================
> > Status at f497c15415a34ba8 V 4.01. 0 (state 1, gstate 1) after 93
> > seconds :
> > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > rounds   :     988      tok_hurry :     238     memb change:       1
> > sent pack:     136      recv pack :     136     retrans    :       0
> > u retrans:       0      s retrans :       0     b retrans  :       0
> > My_aru   :     299      Aru       :     299     Highest seq:     299
> > Sessions :       1      Groups    :       1     Window     :      60
> > Deliver M:     295      Deliver Pk:     299     Pers Window:      15
> > Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
> > ==================================
> >  
> > Then when I start spread on the third node, Bad Things Happen:
> > 
> > Monitor> Monitor: send status query
> >  
> > ============================
> > Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
> > seconds :
> > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > rounds   :    1465      tok_hurry :     335     memb change:       1
> > sent pack:     199      recv pack :     199     retrans    :       0
> > u retrans:       0      s retrans :       0     b retrans  :       0
> > My_aru   :     426      Aru       :     426     Highest seq:     426
> > Sessions :       1      Groups    :       1     Window     :      60
> > Deliver M:     422      Deliver Pk:     426     Pers Window:      15
> > Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
> > ==================================
> >  
> > Monitor> 
> > ============================
> > Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
> > seconds :
> > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > rounds   :    1465      tok_hurry :     357     memb change:       1
> > sent pack:     199      recv pack :     199     retrans    :       0
> > u retrans:       0      s retrans :       0     b retrans  :       0
> > My_aru   :     426      Aru       :     426     Highest seq:     426
> > Sessions :       1      Groups    :       1     Window     :      60
> > Deliver M:     422      Deliver Pk:     426     Pers Window:      15
> > Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
> > ==================================
> >  
> > Monitor> 
> > ============================
> > Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
> > seconds :
> > Membership  :  0  procs in 0 segments, leader is 0
> > rounds   :       0      tok_hurry :       0     memb change:       0
> > sent pack:       0      recv pack :       0     retrans    :       0
> > u retrans:       0      s retrans :       0     b retrans  :       0
> > My_aru   :       0      Aru       :       0     Highest seq:       0
> > Sessions :       1      Groups    :       0     Window     :      60
> > Deliver M:       0      Deliver Pk:       0     Pers Window:      15
> > Delta Mes:    -422      Delta Pack:    -426     Delta sec  :    -131
> > ==================================
> >  
> > After the issue occurs, spuser will no longer connect to Spread on any
> > node:
> > 
> > hybrid@f497c15415a34ba8:~$ spuser
> > Spread library version is 4.1.0
> > recv_nointr_timeout: Timed out
> > SP_error: (-8) Connection closed by spread
> > 
> > Any insight would be very much appreciated, as we're about to launch a
> > major product which relies on this!
> > 
> > The environment is FreeBSD 8.1 with Spread 4.1.0 on CloudSigma (Linux
> > KVM) infrastructure. I can provide detailed log output; please tell me
> > which flags you would like.
> > 
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users




