[Spread-users] cost of failures

Wed Oct 5 13:56:41 EDT 2005

Paul Rubel wrote:

>Hello,
>
>We have been working with Spread and are trying to understand the
>worst case behavior when a group member fails. In particular, I'm
>curious how long it should take for the group to recognize that a
>member has failed and to agree upon a new group membership without the
>failed member. That is, how long might a process need to wait between
>a failure and receiving a new membership message.
>
>I'm guessing that some of the factors in play here will be the type of
>network, the number of members and their locations, the spread
>timeout_* values, and the timing of the failure with respect to the
>protocol steps. Are there some aspects that are so dominant that we
>can practically ignore the others?
>
>I can run some tests but I suspect that there is a higher-level
>insight lurking here.
>  
>
A group member can't really affect the convergence time of the group.
That is managed by the Spread daemon, and a member leaving or joining
is a single event.  Many things come into play in defining the expected
turnaround on a new membership.  The most influential are your network
topology and behaviour, but it can also be affected by things like
system load.

However, if you are speaking of a _daemon_ failure and the expected and
maximal times for a new daemon membership to converge, the situation is
not so clean.  An adversarial daemon on the network can easily prevent
the Spread ring from ever converging on a membership.  Unfortunately,
this can also be caused by acute (yet subtle) networking issues, like a
VLAN applying QoS to packets differently based on multicast/unicast, or
some sort of bad packet loss between two machines.  Some of these
networking issues can be quite difficult to detect.  So the above is
really bad news, but the good news is that once you understand your
network and work out the kinks, it is often possible to _engineer_ a
safe deployment by using tight firewall policies, a dedicated switch
to handle Spread traffic, etc.
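
If you do run tests, it may help to count the trials that never
converge separately instead of folding them into an average, since the
worst case can be unbounded.  Below is a minimal sketch in Python.  It
assumes you have already recorded, for each trial, the time the daemon
was killed and the time the new membership message arrived (or None if
none arrived).  The function name, the give_up_after cutoff and its
60-second default are hypothetical and are not part of Spread.

```python
import statistics


def summarize_convergence(trials, give_up_after=60.0):
    """Summarize measured failure-to-membership delays, in seconds.

    trials: list of (failed_at, membership_at) timestamp pairs, where
            membership_at is None if no membership message was seen.
    give_up_after: hypothetical cutoff; slower trials are counted as
            "never converged" rather than as ordinary delays.
    """
    delays = []
    never_converged = 0
    for failed_at, membership_at in trials:
        if membership_at is None:
            never_converged += 1
            continue
        delay = membership_at - failed_at
        if delay > give_up_after:
            never_converged += 1
        else:
            delays.append(delay)

    summary = {"trials": len(trials), "never_converged": never_converged}
    if delays:
        summary["min"] = min(delays)
        summary["median"] = statistics.median(delays)
        summary["max"] = max(delays)
    return summary
```

For example, summarize_convergence([(0.0, 2.0), (10.0, 15.0), (20.0, None)])
reports two converged trials and one that never converged.  A nonzero
never_converged count on an otherwise healthy network is a hint to look
for the kind of subtle networking problem described above.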

-- 
// Theo Schlossnagle
// Principal Engineer -- http://www.omniti.com/~jesus/
// Ecelerity: Run with it. -- http://www.omniti.com/