[Spread-users] cost of failures
Theo Schlossnagle
jesus at omniti.com
Wed Oct 5 13:56:41 EDT 2005
Paul Rubel wrote:
>Hello,
>
>We have been working with Spread and are trying to understand the
>worst case behavior when a group member fails. In particular, I'm
>curious how long should it take for the group to recognize that a
>member has failed and to agree upon a new group membership without the
>failed member. That is, how long might a process need to wait between
>a failure and receiving a new membership message.
>
>I'm guessing that some of the factors in play here will be the type of
>network, the number of members and their locations, the spread
>timeout_* values, and the timing of the failure in respect to the
>protocol steps. Are there some aspects that are so dominant that we
>can practically ignore the others?
>
>I can run some tests but I suspect that there is a higher-level
>insight lurking here.
>
>
A group member can't really affect the convergence time of the group.
That is managed by the Spread daemon, and a member leaving or joining
is a singular event. Many things come into play in determining the
expected turnaround on a new membership. The most influential are your
network topology and behaviour, but things like system load can also
have an effect.

However, if you are speaking of a _daemon_ failure and the expected and
maximal times for a new daemon membership to converge, the situation is
not so clean. An adversarial daemon on the network can easily prevent
the Spread ring from ever converging on a membership. Unfortunately,
this can also be caused by acute (yet subtle) networking issues, like a
VLAN applying QoS differently to multicast and unicast packets, or some
sort of bad packet loss between two machines. Some of these networking
issues can be quite difficult to detect.

So the above is really bad news, but the good news is that once you
understand your network and work out the kinks, it is often possible to
_engineer_ a safe deployment by using tight firewall policies or a
dedicated switch to handle Spread traffic, etc.
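
One rough way to look for the multicast/unicast asymmetry or pairwise
packet loss described above is to send numbered UDP probes between two
machines, once to a unicast address and once to a multicast group, and
compare the loss. The following is only a minimal sketch under stated
assumptions: it is not part of Spread, the function names are
hypothetical, and any address or port you pass in is a placeholder you
would choose for your own network.

```python
import socket
import struct
import time


def loss_rate(received_seqs, sent_count):
    """Fraction of probes (numbered 0..sent_count-1) that never arrived."""
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    # Ignore duplicates and out-of-range values when counting arrivals.
    distinct = {s for s in received_seqs if 0 <= s < sent_count}
    return 1.0 - len(distinct) / sent_count


def send_probes(addr, port, count=1000, interval=0.001):
    """Send `count` numbered datagrams to a unicast or multicast address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps multicast probes on the local segment (assumption:
    # both machines share a segment); it has no effect on unicast sends.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    try:
        for seq in range(count):
            sock.sendto(struct.pack("!I", seq), (addr, port))
            time.sleep(interval)  # pace sends so we do not cause the loss
    finally:
        sock.close()


def receive_probes(port, group=None, idle_timeout=5.0):
    """Collect sequence numbers until nothing arrives for idle_timeout s.

    Pass `group` (a multicast address) to join that group first.
    Start the sender within idle_timeout seconds of calling this.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    if group is not None:
        mreq = struct.pack("4s4s", socket.inet_aton(group),
                           socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(idle_timeout)
    seqs = []
    try:
        while True:
            data, _ = sock.recvfrom(64)
            if len(data) >= 4:
                seqs.append(struct.unpack("!I", data[:4])[0])
    except socket.timeout:
        pass
    finally:
        sock.close()
    return seqs
```

To use it, run receive_probes() on one machine and send_probes() on the
other, first with the receiver's unicast address and then with a
multicast group that both sides agree on, and pass the collected
sequence numbers to loss_rate(). A large difference between the two
runs would suggest the network treats the two kinds of traffic
differently, though a clean result does not rule out intermittent
problems.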
--
// Theo Schlossnagle
// Principal Engineer -- http://www.omniti.com/~jesus/
// Ecelerity: Run with it. -- http://www.omniti.com/