[Spread-users] cost of failures

Wed Oct 5 14:39:33 EDT 2005

Hi Theo,

 Thanks for the response! The daemon is certainly the interesting
part; thanks for clarifying that.

Theo Schlossnagle writes:
 > Paul Rubel wrote:
 > 
 > >Hello,
 > >
 > >We have been working with Spread and are trying to understand the
 > >worst case behavior when a group member fails. In particular, I'm
 > >curious how long it should take for the group to recognize that a
 > >member has failed and to agree upon a new group membership without the
 > >failed member. That is, how long might a process need to wait between
 > >a failure and receiving a new membership message?
 > >
 > >I'm guessing that some of the factors in play here will be the type of
 > >network, the number of members and their locations, the spread
 > >timeout_* values, and the timing of the failure with respect to the
 > >protocol steps. Are there some aspects that are so dominant that we
 > >can practically ignore the others?
 > >
 > >I can run some tests but I suspect that there is a higher-level
 > >insight lurking here.
 > >  
 > >
 > A group member can't really affect the convergence time of the group.
 > That is managed by the Spread daemon, and the member leaving or joining
 > is a singular event.  There are many things that come into play in
 > defining an expected turnaround on a new membership, the most
 > influential being your network topology and behaviour, but it can also
 > be affected by things like system load.

What do you mean by network behavior? Utilization? 

What mechanism do the daemons use to detect membership changes? Is it a
heartbeat that times out, or something else?
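
To make the question concrete, here is a minimal sketch of the kind of
heartbeat timeout I have in mind. It is a generic detector and not a
description of how Spread actually works; the names HeartbeatDetector and
detection_delay_bounds, and the interval and timeout values, are
hypothetical and only there for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Generic timeout-based failure detector (hypothetical, not Spread's code)."""
    timeout: float                                  # silence (seconds) before a peer is suspected
    last_seen: dict = field(default_factory=dict)   # peer name -> time of last heartbeat

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time at which we last heard from this peer.
        self.last_seen[peer] = now

    def suspected(self, now: float) -> list:
        # A peer is suspected once it has been silent for longer than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


def detection_delay_bounds(interval: float, timeout: float) -> tuple:
    """Bounds on the time between a fail-stop crash and suspicion.

    Assumes heartbeats are sent every `interval` seconds, arrive instantly,
    and the detector is polled continuously, with timeout >= interval.
    """
    best = timeout - interval   # host dies just before its next heartbeat was due
    worst = timeout             # host dies right after sending a heartbeat
    return (best, worst)


if __name__ == "__main__":
    d = HeartbeatDetector(timeout=5.0)
    d.heartbeat("a", 0.0)
    d.heartbeat("b", 0.0)
    d.heartbeat("a", 4.0)       # "a" keeps talking, "b" has gone quiet
    print(d.suspected(6.0))
    print(detection_delay_bounds(interval=2.0, timeout=5.0))
```

In a scheme like this the detection delay is bounded by the timeout alone,
and the remaining cost would be whatever it takes to agree on the new
membership afterwards. Whether Spread behaves this way is exactly what I
am asking.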

 >
 > However, if you are speaking of a _daemon_ failure and the expected
 > and maximal times for a new daemon membership to converge, the
 > situation is not so clean. An adversarial daemon on the network can
 > easily prevent the Spread ring from ever converging on a
 > membership.  Unfortunately, this can also be caused by acute (yet
 > subtle) networking issues like a VLAN QoSing packets differently
 > based on multicast/unicast or some sort of bad packet loss between
 > two machines.  Some of these networking issues can be quite
 > difficult to detect.  So the above is really bad news, but the good
 > news is that once you understand your network and work out the
 > kinks, it is often possible to _engineer_ a safe solution to
 > deploy by using tight firewall policies or a dedicated switch to
 > handle Spread traffic, etc.

The case I'm curious about would be essentially a fail-stop failure of
a host, which takes down a group member and its daemon but where there
is no malicious activity taking place. 

   thanks again,
    Paul