[Spread-users] Issue with Spread going silent

Yair Amir yairamir at cs.jhu.edu
Sun Nov 7 08:59:44 EST 2010


Hi Luke,

To see what is going on, edit the Spread conf file on all of the
computers so that it contains this uncommented line:

DebugFlags = { MEMBERSHIP PRINT EXIT }

This will show us what happens more clearly.
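
For reference, the relevant part of spread.conf might then look roughly
like the sketch below. The EventLogFile and EventTimeStamp lines are
optional and are assumptions here, and the log path is only a
placeholder; they just make it easier to collect and compare the output
from all three machines.

```
# Print membership changes alongside the PRINT and EXIT output.
DebugFlags = { MEMBERSHIP PRINT EXIT }

# Optional (assumption): write the output to a file instead of stdout.
# The path below is a placeholder.
EventLogFile = /var/log/spread.log

# Optional (assumption): prefix each log line with a timestamp.
EventTimeStamp
```

The daemons need to be restarted after the change for it to take effect.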

I have a hard time believing this is a FreeBSD issue per se.
I do think that 4% loss is extremely high for a cluster by the way,
but I don't think this is the cause because it seems to always happen at the
same exact message. Let's verify that.

BTW - can you change the order of the nodes in the conf file so that
10.0.0.1 comes first, then 10.0.0.2, and then 10.0.0.3?
This will not change the behavior but will make it simpler to understand
what is going on.
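
As a sketch, the reordered segment could look like this. The broadcast
address, the port (4803), and the node names are assumptions for
illustration; keep whatever your current file uses and change only the
order of the entries.

```
Spread_Segment 10.0.0.255:4803 {
	node1	10.0.0.1
	node2	10.0.0.2
	node3	10.0.0.3
}
```

The file should be identical on all three machines.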

Cheers,

	:) Yair.


On 11/7/10 8:45 AM, Luke Marsden wrote:
> Hi John,
> 
> Thank you for your email.
> 
> The strange thing is that this happens whichever order I add the nodes.
> There is pretty good connectivity (max 4% packet loss) between all
> nodes, as shown here:
> 
>         http://lukemarsden.net/spsendrecv.png
> 
> I have run the same tests on multiple sets of VMs on this cloud
> infrastructure, and found it still happened in situations with smaller
> levels of packet loss.
> 
> The weird thing is that if you start all the spread daemons
> simultaneously, they sync up and it works fine. Then if you kill a
> spread daemon, the other two notice and you get the expected behaviour.
> It's only when you *add a third daemon to an established group of two*
> (such as adding a failed node back in) that it stalls everything, and
> you get the behaviour I posted in my spmonitor output.
> 
> Spread is behaving as expected on the same network with Debian 5.0. So
> my working assumption is that it might be an issue to do with Spread on
> FreeBSD 8.1.
> 
> What do you think?
> 



