[Spread-users] Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sat Nov 6 20:38:26 EDT 2010


Hi all,

Now this is really strange. I'm getting the same behaviour on a private
broadcast subnet. This time fff0bbd5e0da4103 was the machine to be added
last...

Monitor> Monitor: send status query

============================
Status at 27edda570dce48bb V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership  :  2  procs in 1 segments, leader is 27edda570dce48bb
rounds   :     225      tok_hurry :      53     memb change:       1
sent pack:      33      recv pack :      30     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :      85      Aru       :      85     Highest seq:      85
Sessions :       1      Groups    :       1     Window     :      60
Deliver M:      81      Deliver Pk:      85     Pers Window:      15
Delta Mes:      81      Delta Pack:      85     Delta sec  :      27
==================================

Monitor> 
============================
Status at fff0bbd5e0da4103 V 4.01. 0 (state 4, gstate 1) after 76
seconds :
Membership  :  0  procs in 0 segments, leader is 0
rounds   :       0      tok_hurry :       0     memb change:       0
sent pack:       0      recv pack :       0     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :       0      Aru       :       0     Highest seq:       0
Sessions :       1      Groups    :       0     Window     :      60
Deliver M:       0      Deliver Pk:       0     Pers Window:      15
Delta Mes:     -81      Delta Pack:     -85     Delta sec  :     -17
==================================

Monitor> 
============================
Status at 2f20196c853548e7 V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership  :  2  procs in 1 segments, leader is 27edda570dce48bb
rounds   :     225      tok_hurry :      49     memb change:       1
sent pack:      30      recv pack :      33     retrans    :       0
u retrans:       0      s retrans :       0     b retrans  :       0
My_aru   :      85      Aru       :      85     Highest seq:      85
Sessions :       1      Groups    :       1     Window     :      60
Deliver M:      81      Deliver Pk:      85     Pers Window:      15
Delta Mes:      81      Delta Pack:      85     Delta sec  :      17
==================================
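
For anyone comparing reports like these by eye, a small script can pull the
membership line out of each status block and flag any daemon that reports an
empty membership. This is only a sketch in Python: the regular expression is
inferred from the spmonitor output pasted above, not from the Spread source,
and the names parse_status_reports and isolated_daemons are made up for
illustration.

```python
import re

# Pattern inferred from the spmonitor output in this thread (an assumption,
# not taken from the Spread source).
STATUS_RE = re.compile(
    r"Status at (?P<daemon>\S+) V .*?"
    r"\(state (?P<state>\d+), gstate (?P<gstate>\d+)\).*?"
    r"Membership\s*:\s*(?P<procs>\d+)\s+procs in (?P<segments>\d+) segments, "
    r"leader is (?P<leader>\S+)",
    re.DOTALL,
)


def parse_status_reports(text):
    """Return one dict per status block found in pasted spmonitor output."""
    reports = []
    for match in STATUS_RE.finditer(text):
        reports.append({
            "daemon": match.group("daemon"),
            "state": int(match.group("state")),
            "gstate": int(match.group("gstate")),
            "procs": int(match.group("procs")),
            "segments": int(match.group("segments")),
            "leader": match.group("leader"),
        })
    return reports


def isolated_daemons(reports):
    """Names of daemons whose own report shows a membership of zero procs."""
    return [r["daemon"] for r in reports if r["procs"] == 0]


# First lines of the three status blocks above, trimmed for brevity.
SAMPLE = """\
Status at 27edda570dce48bb V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership  :  2  procs in 1 segments, leader is 27edda570dce48bb
Status at fff0bbd5e0da4103 V 4.01. 0 (state 4, gstate 1) after 76
seconds :
Membership  :  0  procs in 0 segments, leader is 0
Status at 2f20196c853548e7 V 4.01. 0 (state 3, gstate 1) after 93
seconds :
Membership  :  2  procs in 1 segments, leader is 27edda570dce48bb
"""

if __name__ == "__main__":
    for name in isolated_daemons(parse_status_reports(SAMPLE)):
        print("daemon with empty membership:", name)
```

Run against the three reports above, this flags only fff0bbd5e0da4103, the
node that was added last.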


Just before I added the third node, everything was happy:

--------------------
Configuration at 2f20196c853548e7 is:
Num Segments 1
        2       10.255.255.255    4803
                27edda570dce48bb        10.0.0.2        
                2f20196c853548e7        10.0.0.1        
====================

--------------------
Configuration at 27edda570dce48bb is:
Num Segments 1
        2       10.255.255.255    4803
                27edda570dce48bb        10.0.0.2        
                2f20196c853548e7        10.0.0.1        
====================

And my spread.conf:


Spread_Segment 10.255.255.255:4803 {
    fff0bbd5e0da4103 10.0.0.3
    27edda570dce48bb 10.0.0.2
    2f20196c853548e7 10.0.0.1
}
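
To make the mismatch explicit, the daemons named in spread.conf can be checked
against the members that actually show up in the configuration dumps. The
sketch below is Python and rests on a few assumptions: it only handles the
simple "name ip" daemon lines used in this thread, it treats "#" as starting a
comment, and parse_segments and missing_daemons are hypothetical helper names,
not part of Spread.

```python
import re

# Matches "Spread_Segment <address>[:<port>] { ... }" blocks as they appear
# in the configs quoted in this thread.
SEGMENT_RE = re.compile(
    r"Spread_Segment\s+(?P<address>[\d.]+)(?::(?P<port>\d+))?\s*"
    r"\{(?P<body>[^}]*)\}"
)


def parse_segments(conf_text):
    """Parse Spread_Segment blocks into a list of dicts."""
    segments = []
    for match in SEGMENT_RE.finditer(conf_text):
        daemons = {}
        for line in match.group("body").splitlines():
            # Drop anything after '#' (assumed comment marker), then split.
            fields = line.split("#", 1)[0].split()
            if len(fields) >= 2:
                daemons[fields[0]] = fields[1]
        port = match.group("port")
        segments.append({
            "address": match.group("address"),
            "port": int(port) if port else None,
            "daemons": daemons,
        })
    return segments


def missing_daemons(segments, members):
    """Configured daemon names that are absent from the observed members."""
    configured = [name for seg in segments for name in seg["daemons"]]
    return [name for name in configured if name not in members]


# The spread.conf from this message.
CONF = """\
Spread_Segment 10.255.255.255:4803 {
    fff0bbd5e0da4103 10.0.0.3
    27edda570dce48bb 10.0.0.2
    2f20196c853548e7 10.0.0.1
}
"""

# Members listed in the "Configuration at ..." dumps above.
MEMBERS = {"27edda570dce48bb", "2f20196c853548e7"}

if __name__ == "__main__":
    print(missing_daemons(parse_segments(CONF), MEMBERS))
```

With the config and dumps from this message, the only configured daemon
missing from the membership is fff0bbd5e0da4103.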

The problem doesn't go away if you leave it for a long time. But
fff0bbd5e0da4103 "recovers" if you turn off Spread entirely on the other
two nodes, i.e. it starts delivering its own join messages again.

Could this possibly be an issue with Spread on FreeBSD 8.1?

Once again, happy to provide a debugging platform. Please let me know
what else I can do to help.

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420


On Sat, 2010-11-06 at 23:30 +0000, Luke Marsden wrote:
> Hi Yair,
> 
> Thank you for your swift response.
> 
> Unfortunately there seems to be nothing wrong with the UDP connectivity
> between these servers.
> 
> Using spsend and sprecv with the default options, I tried all six
> sender/receiver combinations:
> 
>         A -> B
>         A -> C
>         B -> A
>         B -> C
>         C -> A
>         C -> B
>         
> All were successful, with about 150 missed packets out of 10,000 in each
> run (which I presume is fairly normal).
> 
> Example output of one of these six runs is below.
> 
> 
> hybrid@f497c15415a34ba8:~$ ./spsend -a 178.22.67.120
> Checking (178.22.67.120, 4444). Each burst has 100 packets, 1024 bytes
> each with 10 msec delay in between, for a total of 10000 packets
> sent 1000 packets of 1024 bytes
> sent 2000 packets of 1024 bytes
> sent 3000 packets of 1024 bytes
> sent 4000 packets of 1024 bytes
> sent 5000 packets of 1024 bytes
> sent 6000 packets of 1024 bytes
> sent 7000 packets of 1024 bytes
> sent 8000 packets of 1024 bytes
> sent 9000 packets of 1024 bytes
> sent 10000 packets of 1024 bytes
> total time is (2,138010), with 0 problems 
> 
> hybrid@a816c9ebce424d8b:~$ ./sprecv
> -------
> Report: total packets 10000, total missed 140, total corrupted 0
> -------
> 
> 
> By the way, I had to modify the Makefile in daemon/ to get sprecv to
> build, adding events.o and memory.o like this:
> 
> sprecv$(EXEEXT): r.o alarm.o data_link.o events.o memory.o
> 	$(LD) -o $@ r.o alarm.o data_link.o events.o memory.o $(LDFLAGS) $(LIBS)
> 
> 
> Any idea how I can proceed from here? I'm going to try running Spread
> over a VLAN so we can fall back to broadcast, but we really need
> arbitrary point-to-point connectivity. One of our use-cases is one
> server in each data centre, which would be logically equivalent to the
> problematic case here.
> 
> I'm happy to give you access to one of our clusters to debug it :-)
> 
> 
> On Sat, 2010-11-06 at 18:12 -0400, Yair Amir wrote:
> > Hi,
> > 
> > This may be a connectivity issue between the computers: it may not be
> > possible to send and receive packets between every pair of computers.
> > You can test this hypothesis by building the spsend and sprecv programs
> > and running them.
> > 
> > It would also help to let the monitor run for another 40-50 seconds
> > beyond what you sent, so that it prints a few more reports.
> > 
> > Cheers,
> > 
> > 	:) Yair.
> > 
> > On 11/6/10 5:49 PM, Luke Marsden wrote:
> > > Hi all,
> > > 
> > > I've got a very strange issue with Spread going "silent" (not even a
> > > self-join message with spuser "j foo") after adding a third node to a
> > > network of two.
> > > 
> > > The problem does not occur if all three Spread daemons are launched
> > > simultaneously. It only happens if I launch two nodes, wait a few
> > > seconds (until they've announced the group memberships) and then add the
> > > third node.
> > > 
> > > Here is the spread config (everything else is stock 4.1.0):
> > > 
> > >         Spread_Segment 178.22.65.249:4803 {
> > >             f497c15415a34ba8 178.22.65.249
> > >         }
> > >         Spread_Segment 178.22.65.74:4803 {
> > >             2f8919e6ea14416a 178.22.65.74
> > >         }
> > >         Spread_Segment 178.22.67.120:4803 {
> > >             a816c9ebce424d8b 178.22.67.120
> > >         }
> > > 
> > > For some background, these nodes are running on cloud infrastructure in
> > > the same data centre but without a local broadcast address, hence the
> > > three distinct Spread segments.
> > > 
> > > And here's the output of spmonitor with only the first two nodes
> > > connected (working):
> > > 
> > > ============================
> > > Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 88
> > > seconds :
> > > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > > rounds   :     988      tok_hurry :     225     memb change:       1
> > > sent pack:     136      recv pack :     136     retrans    :       0
> > > u retrans:       0      s retrans :       0     b retrans  :       0
> > > My_aru   :     299      Aru       :     299     Highest seq:     299
> > > Sessions :       1      Groups    :       1     Window     :      60
> > > Deliver M:     295      Deliver Pk:     299     Pers Window:      15
> > > Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
> > > ==================================
> > >  
> > > Monitor> 
> > > ============================
> > > Status at f497c15415a34ba8 V 4.01. 0 (state 1, gstate 1) after 93
> > > seconds :
> > > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > > rounds   :     988      tok_hurry :     238     memb change:       1
> > > sent pack:     136      recv pack :     136     retrans    :       0
> > > u retrans:       0      s retrans :       0     b retrans  :       0
> > > My_aru   :     299      Aru       :     299     Highest seq:     299
> > > Sessions :       1      Groups    :       1     Window     :      60
> > > Deliver M:     295      Deliver Pk:     299     Pers Window:      15
> > > Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
> > > ==================================
> > >  
> > > Then when I start spread on the third node, Bad Things Happen:
> > > 
> > > Monitor> Monitor: send status query
> > >  
> > > ============================
> > > Status at 2f8919e6ea14416a V 4.01. 0 (state 1, gstate 1) after 128
> > > seconds :
> > > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > > rounds   :    1465      tok_hurry :     335     memb change:       1
> > > sent pack:     199      recv pack :     199     retrans    :       0
> > > u retrans:       0      s retrans :       0     b retrans  :       0
> > > My_aru   :     426      Aru       :     426     Highest seq:     426
> > > Sessions :       1      Groups    :       1     Window     :      60
> > > Deliver M:     422      Deliver Pk:     426     Pers Window:      15
> > > Delta Mes:      32      Delta Pack:      32     Delta sec  :       5
> > > ==================================
> > >  
> > > Monitor> 
> > > ============================
> > > Status at f497c15415a34ba8 V 4.01. 0 (state 4, gstate 1) after 133
> > > seconds :
> > > Membership  :  2  procs in 2 segments, leader is f497c15415a34ba8
> > > rounds   :    1465      tok_hurry :     357     memb change:       1
> > > sent pack:     199      recv pack :     199     retrans    :       0
> > > u retrans:       0      s retrans :       0     b retrans  :       0
> > > My_aru   :     426      Aru       :     426     Highest seq:     426
> > > Sessions :       1      Groups    :       1     Window     :      60
> > > Deliver M:     422      Deliver Pk:     426     Pers Window:      15
> > > Delta Mes:       0      Delta Pack:       0     Delta sec  :       5
> > > ==================================
> > >  
> > > Monitor> 
> > > ============================
> > > Status at a816c9ebce424d8b V 4.01. 0 (state 4, gstate 1) after 2
> > > seconds :
> > > Membership  :  0  procs in 0 segments, leader is 0
> > > rounds   :       0      tok_hurry :       0     memb change:       0
> > > sent pack:       0      recv pack :       0     retrans    :       0
> > > u retrans:       0      s retrans :       0     b retrans  :       0
> > > My_aru   :       0      Aru       :       0     Highest seq:       0
> > > Sessions :       1      Groups    :       0     Window     :      60
> > > Deliver M:       0      Deliver Pk:       0     Pers Window:      15
> > > Delta Mes:    -422      Delta Pack:    -426     Delta sec  :    -131
> > > ==================================
> > >  
> > > After the issue occurs, spuser will no longer connect to Spread on any
> > > node:
> > > 
> > > hybrid@f497c15415a34ba8:~$ spuser
> > > Spread library version is 4.1.0
> > > recv_nointr_timeout: Timed out
> > > SP_error: (-8) Connection closed by spread
> > > 
> > > Any insight would be very much appreciated, as we're about to launch a
> > > major product which relies on this!
> > > 
> > > The environment is FreeBSD 8.1 with Spread 4.1.0 on CloudSigma (Linux
> > > KVM) infrastructure. I can provide detailed log output, please tell me
> > > which flags you would like.
> > > 
> > 
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users
> 
> 




