[Spread-users] Issue with Spread going silent

Sun Nov 7 09:13:45 EST 2010

Hi Yair,

Thank you. I agree 4% packet loss is high. I get quite a bit of packet
loss when saturating the network interfaces (spsend/recv or ping -f),
but none at all when transmitting just a small amount of traffic. Since
in normal operation Spread shouldn't go near saturating the network
interfaces, I agree that this is unlikely to be the cause of the
problem. An interesting artefact of the virtualisation though.

I have rearranged the machines in the spread.conf. They are using their
public IPs for this test, not the 10.0.0.* addresses (although they
exhibit the same behaviour either way):

Spread_Segment 178.22.66.147:4803 {
    2f20196c853548e7 178.22.66.147
}
Spread_Segment 178.22.67.102:4803 {
    27edda570dce48bb 178.22.67.102
}
Spread_Segment 178.22.67.48:4803 {
    fff0bbd5e0da4103 178.22.67.48
}

I've added the MEMBERSHIP debug flag, and this is the output. I started
the spread daemons from left-to-right, which now corresponds to
top-to-bottom :-)

http://lukemarsden.net/debugging.png
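
For reference, the relevant parts of spread.conf with this change would
look roughly as follows. This is only a sketch: it assumes the
DebugFlags line is exactly the one suggested in the quoted message
below, and it omits any other settings in the file:

# Log MEMBERSHIP events in addition to PRINT and EXIT
# (assumed to match the suggested line exactly)
DebugFlags = { MEMBERSHIP PRINT EXIT }

Spread_Segment 178.22.66.147:4803 {
    2f20196c853548e7 178.22.66.147
}
Spread_Segment 178.22.67.102:4803 {
    27edda570dce48bb 178.22.67.102
}
Spread_Segment 178.22.67.48:4803 {
    fff0bbd5e0da4103 178.22.67.48
}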

Does this shed any light?

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420

On Sun, 2010-11-07 at 08:59 -0500, Yair Amir wrote:
> Hi Luke,
> 
> The way to see what is going on is to edit Spread's conf file on all of
> the computers so that it contains this line, uncommented:
> 
> DebugFlags = { MEMBERSHIP PRINT EXIT }
> 
> This will show us what happens more clearly.
> 
> I have a hard time believing this is a FreeBSD issue per se.
> I do think that 4% loss is extremely high for a cluster by the way,
> but I don't think this is the cause because it seems to always happen at the
> same exact message. Let's verify that.
> 
> BTW - can you change the order of them in the conf file to be
> first 10.0.0.1, then 10.0.0.2 and then 10.0.0.3?
> This will not matter but will simplify understanding what is going on.
> 
> Cheers,
> 
> 	:) Yair.
> 
> 
> On 11/7/10 8:45 AM, Luke Marsden wrote:
> > Hi John,
> > 
> > Thank you for your email.
> > 
> > The strange thing is that this happens whichever order I add the nodes in.
> > There is pretty good connectivity (max 4% packet loss) between all
> > nodes, as shown here:
> > 
> >         http://lukemarsden.net/spsendrecv.png
> > 
> > I have run the same tests on multiple sets of VMs on this cloud
> > infrastructure, and found it still happened in situations with smaller
> > levels of packet loss.
> > 
> > The weird thing is that if you start all the spread daemons
> > simultaneously, they sync up and it works fine. Then if you kill a
> > spread daemon, the other two notice and you get the expected behaviour.
> > It's only when you *add a third daemon to an established group of two*
> > (such as adding a failed node back in) that it stalls everything, and
> > you get the behaviour I posted in my spmonitor output.
> > 
> > Spread is behaving as expected on the same network with Debian 5.0. So
> > my working assumption is that it might be an issue to do with Spread on
> > FreeBSD 8.1.
> > 
> > What do you think?
> > 
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users