[Spread-users] Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sun Nov 7 08:45:49 EST 2010


Hi John,

Thank you for your email.

The strange thing is that this happens whichever order I add the nodes in.
There is pretty good connectivity (max 4% packet loss) between all
nodes, as shown here:

        http://lukemarsden.net/spsendrecv.png

I have run the same tests on multiple sets of VMs on this cloud
infrastructure, and found that it still happens even with lower
levels of packet loss.

The weird thing is that if you start all the spread daemons
simultaneously, they sync up and it works fine. Then if you kill a
spread daemon, the other two notice and you get the expected behaviour.
It's only when you *add a third daemon to an established group of two*
(such as adding a failed node back in) that it stalls everything, and
you get the behaviour I posted in my spmonitor output.
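
For anyone wanting to check their own monitor output for the same
pattern, here is a rough sketch of how the churn could be flagged. It
assumes the state codes have already been pulled out of the spmonitor
report into a list of integers, and it uses only the mapping John gives
in his reply quoted below (2, 4, 5, 6 => Segment, Gather, Form, EVS).
The function names and the five-sample threshold are my own hypothetical
choices, not anything defined by Spread.

```python
# State codes as given in John's reply quoted below. Any other code is
# not covered in this thread and is reported as "unknown".
STATE_NAMES = {2: "Segment", 4: "Gather", 5: "Form", 6: "EVS"}


def name_states(codes):
    """Translate numeric membership state codes into readable names."""
    return [STATE_NAMES.get(c, f"unknown({c})") for c in codes]


def is_churning(codes, min_samples=5):
    """Hypothetical heuristic: report churn when each of the last
    `min_samples` samples is one of the membership-forming states above,
    i.e. the daemon never left them during that window."""
    recent = codes[-min_samples:]
    return len(recent) >= min_samples and all(c in STATE_NAMES for c in recent)
```

Feeding it the codes from successive reports of one daemon, e.g.
is_churning([2, 4, 5, 6, 2]), would flag the stall described above; a
run that includes any code outside those four would not be flagged.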

Spread behaves as expected on the same network under Debian 5.0, so
my working assumption is that the issue is specific to Spread on
FreeBSD 8.1.

What do you think?

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420


On Sun, 2010-11-07 at 08:26 -0500, John Schultz wrote:
> From my read of it, it looks like the daemons are repeatedly trying to establish a membership and failing.  The membership states of the daemons continually fluctuate between (2, 4, 5, 6) => (Segment, Gather, Form, EVS).  This would cause the daemons to appear to freeze from the point of view of clients too.
> 
> So, my guess would be that there is very bad loss between the 3rd node and the others for some reason.
> 
> Cheers!
> 
> -----
> John Lane Schultz
> Spread Concepts LLC
> Phn: 301 830 8100
> Cell: 443 838 2200
> 
> On Nov 7, 2010, at 7:44 AM, Yair Amir wrote:
> 
> Luke,
> 
> If you can, please e-mail a monitor report that lasts for two minutes or so,
> so that we can see several (10 or so) reports from each daemon such that
> after a while you add the third machine and then it continues for a while
> so that we can see several reports after you added the third machine.
> 
> Cheers,
> 
> 	:) Yair.
> 
> On 11/7/10 7:36 AM, Luke Marsden wrote:
> > Hi all,
> > To pin this down as a potential FreeBSD 8.1 issue, I have now
> > demonstrated that Spread 4.1.0 works fine on Debian 5.0 in the same
> > network infrastructure (with multiple Spread segments, one for each
> > public IP).
> > I will now test it on FreeBSD 8.0 to see if it was some change in
> > FreeBSD 8.1 which is subtly interacting with Spread to cause this
> > "self-destruct-on-new-join" behaviour.
> > If so, would anyone be able to help me create a patch to Spread which
> > fixes it? I know precious little about Spread's internal protocol.
> > Full verbosity failure logs coming later, which we can hopefully compare
> > to the successful Debian run to figure out where it's going wrong!
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users
> 




