[Spread-users] Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sun Nov 7 08:58:21 EST 2010


Hi Yair,

Here is the output from a series of experiments showing that, whichever
two machines you start first, they join up successfully:

http://lukemarsden.net/exp1.png (started #2, then #3)
http://lukemarsden.net/exp2.png (started #1 and #2)
http://lukemarsden.net/exp3.png (started #1 and #3)

But here is what happens if you then add the third node to a group of
two (this corresponds to the spmonitor output), followed by a
three-minute wait:

http://lukemarsden.net/exp3-fail.png

However, if you start all the Spread daemons simultaneously, they
succeed:

http://lukemarsden.net/simultaneous-start.png
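
In case it is useful when reading the monitor reports, here is a minimal
Python sketch of the pattern John describes below, where the daemons keep
cycling through states 2, 4, 5 and 6 (Segment, Gather, Form, EVS). The
state numbers and names are taken from his message. The function names,
the window size and the sample sequences are my own assumptions, and this
does not parse real spmonitor output. Any code outside those four is
simply treated as "settled", which is also an assumption.

```python
# Membership states named in John's message: number -> name.
TRANSITIONAL = {2: "Segment", 4: "Gather", 5: "Form", 6: "EVS"}


def describe(states):
    """Translate state codes to names; unknown codes are shown as other(N)."""
    return [TRANSITIONAL.get(s, f"other({s})") for s in states]


def never_settles(states, window=5):
    """Return True if the last `window` samples are all transitional states.

    `states` is a sequence of state codes for one daemon, oldest first.
    `window` is a hypothetical threshold, not a value from Spread itself.
    """
    recent = list(states)[-window:]
    return len(recent) == window and all(s in TRANSITIONAL for s in recent)


if __name__ == "__main__":
    # Hypothetical samples: one daemon that keeps cycling, and one that
    # ends in a code outside the four above (1 is only a placeholder).
    cycling = [2, 4, 5, 6, 2, 4, 5, 6]
    settled = [2, 4, 5, 6, 1, 1, 1, 1, 1]
    print(describe(cycling), never_settles(cycling))
    print(describe(settled), never_settles(settled))
```

Feeding it the per-daemon state column from successive reports should
show whether a daemon ever leaves the membership phases after the third
node joins.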

Any ideas?

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420


On Sun, 2010-11-07 at 08:39 -0500, Yair Amir wrote:
> Yes - it seems that the node with IP address 10.0.0.1
> does not get the Form2 token from node 10.0.0.2 and hence the membership
> fails to complete. It also seems to me that it is always that specific
> message.
> 
> 
> Can you first start node 10.0.0.3 and then node 10.0.0.1, and let us see
> whether just these 2 work (without adding 10.0.0.2)?
> 
> Also, can you try only 10.0.0.3 and 10.0.0.2 and see whether they work?
> 
> Cheers,
> 
> 	:) Yair.
> 
> On 11/7/10 8:26 AM, John Schultz wrote:
> > From my read of it, it looks like the daemons are repeatedly trying to establish a membership and failing.  The membership states of the daemons continually fluctuate between (2, 4, 5, 6) => (Segment, Gather, Form, EVS).  This would cause the daemons to appear to freeze from the point of view of clients too.
> > 
> > So, my guess would be that there is very bad loss between the 3rd node and the others for some reason.
> > 
> > Cheers!
> > 
> > -----
> > John Lane Schultz
> > Spread Concepts LLC
> > Phn: 301 830 8100
> > Cell: 443 838 2200
> > 
> > On Nov 7, 2010, at 7:44 AM, Yair Amir wrote:
> > 
> > Luke,
> > 
> > If you can, please e-mail a monitor report that lasts for two minutes or so.
> > Let it run long enough to show several (10 or so) reports from each daemon,
> > then add the third machine and keep it running for a while so that we can
> > also see several reports from after the third machine was added.
> > 
> > Cheers,
> > 
> > 	:) Yair.
> > 
> > On 11/7/10 7:36 AM, Luke Marsden wrote:
> >> Hi all,
> >> To pin this down as a potential FreeBSD 8.1 issue, I have now
> >> demonstrated that Spread 4.1.0 works fine on Debian 5.0 in the same
> >> network infrastructure (with multiple Spread segments, one for each
> >> public IP).
> >> I will now test it on FreeBSD 8.0 to see if it was some change in
> >> FreeBSD 8.1 which is subtly interacting with Spread to cause this
> >> "self-destruct-on-new-join" behaviour.
> >> If so, would anyone be able to help me create a patch to Spread which
> >> fixes it? I know precious little about Spread's internal protocol.
> >> Full verbosity failure logs coming later, which we can hopefully compare
> >> to the successful Debian run to figure out where it's going wrong!
> > 
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users




