[Spread-users] Issue with Spread going silent
Luke Marsden
luke-lists at hybrid-logic.co.uk
Sun Nov 7 08:58:21 EST 2010
Hi Yair,
Here is the output from a series of experiments showing that
whichever pair of machines you start first, they join up
successfully:
http://lukemarsden.net/exp1.png (started #2, then #3)
http://lukemarsden.net/exp2.png (started #1 and #2)
http://lukemarsden.net/exp3.png (started #1 and #3)
But here is what happens if you then add the third node into a group of
two (this corresponds to the spmonitor output), with a three-minute wait
afterwards:
http://lukemarsden.net/exp3-fail.png
However, if you start all the spread daemons simultaneously, they
succeed:
http://lukemarsden.net/simultaneous-start.png
Any ideas?
--
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.
Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting
Mobile: +447791750420
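For reference when reading the monitor output: John's message quoted below names the membership states by number, (2, 4, 5, 6) => (Segment, Gather, Form, EVS). A minimal sketch of a helper that decodes such codes and flags a daemon that never leaves them might look like this. It assumes the state codes have already been copied by hand from successive monitor reports into a list; it does not parse spmonitor output, and the function names and the sample sequence are hypothetical.

```python
# Membership state codes exactly as named in John Schultz's quoted message.
# Codes not listed there are deliberately left undecoded.
MEMBERSHIP_STATES = {2: "Segment", 4: "Gather", 5: "Form", 6: "EVS"}


def describe(states):
    """Translate numeric state codes into names; unknown codes are kept as 'state N'."""
    return [MEMBERSHIP_STATES.get(s, f"state {s}") for s in states]


def never_settles(states):
    """Return True if every observed state is one of the membership-in-progress codes."""
    return bool(states) and all(s in MEMBERSHIP_STATES for s in states)


# Hypothetical sequence of codes read by hand from successive monitor reports.
observed = [2, 4, 5, 6, 2, 4, 5, 6]
print(describe(observed))
print(never_settles(observed))
```

A daemon for which never_settles() stays true across many reports matches the "repeatedly trying to establish a membership and failing" pattern described in the quoted message.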
On Sun, 2010-11-07 at 08:39 -0500, Yair Amir wrote:
> Yes - it seems that the node with IP address 10.0.0.1
> does not get the Form2 token from node 10.0.0.2 and hence the membership
> fails to complete. But it seems to me it is always that specific message.
>
>
> Can you first start node 10.0.0.3 and then node 10.0.0.1, so that we can see
> whether just these two work (without adding 10.0.0.2)?
>
> Also, can you try only 10.0.0.3 and 10.0.0.2 and see whether they work?
>
> Cheers,
>
> :) Yair.
>
> On 11/7/10 8:26 AM, John Schultz wrote:
> > From my read of it, it looks like the daemons are repeatedly trying to establish a membership and failing. The membership states of the daemons continually fluctuate between (2, 4, 5, 6) => (Segment, Gather, Form, EVS). This would cause the daemons to appear to freeze from the point of view of clients too.
> >
> > So, my guess would be that there is very bad loss between the 3rd node and the others for some reason.
> >
> > Cheers!
> >
> > -----
> > John Lane Schultz
> > Spread Concepts LLC
> > Phn: 301 830 8100
> > Cell: 443 838 2200
> >
> > On Nov 7, 2010, at 7:44 AM, Yair Amir wrote:
> >
> > Luke,
> >
> > If you can, please e-mail a monitor report that lasts for two minutes or so,
> > so that we can see several (10 or so) reports from each daemon. Add the third
> > machine partway through and let the monitor keep running, so that we can also
> > see several reports from after the third machine was added.
> >
> > Cheers,
> >
> > :) Yair.
> >
> > On 11/7/10 7:36 AM, Luke Marsden wrote:
> >> Hi all,
> >> To pin this down as a potential FreeBSD 8.1 issue, I have now
> >> demonstrated that Spread 4.1.0 works fine on Debian 5.0 in the same
> >> network infrastructure (with multiple Spread segments, one for each
> >> public IP).
> >> I will now test it on FreeBSD 8.0 to see if it was some change in
> >> FreeBSD 8.1 which is subtly interacting with Spread to cause this
> >> "self-destruct-on-new-join" behaviour.
> >> If so, would anyone be able to help me create a patch to Spread which
> >> fixes it? I know precious little about Spread's internal protocol.
> >> Full verbosity failure logs coming later, which we can hopefully compare
> >> to the successful Debian run to figure out where it's going wrong!
> >
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users