[Spread-users] Failure detection

Wed May 28 02:23:59 EDT 2003

Anders,

The classic way to solve the majority-quorum problem with only two
hosts is to have a shared lockable resource on some other medium
that can be relied upon not to partition (at least not in the
same way); in VMS and TruCluster this was the Quorum Disk, and it
contributed a "vote" to the server that locked the disk.

This has the distinct advantage over network-based multipath
alternatives that it works even if the partitioning is due to a
bug in the OS network stack.
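
As a rough illustration of the vote arithmetic (a sketch only, not
how VMS, TruCluster or Spread implement it), assume each host has
one vote and the quorum disk has one vote. The function names and
defaults below are made up for the example:

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def has_quorum(reachable_hosts: int, holds_disk_lock: bool,
               total_hosts: int = 2, disk_votes: int = 1) -> bool:
    """Decide whether one side of a partition may keep serving requests.

    reachable_hosts counts the hosts on this side, including the caller.
    holds_disk_lock is True if this side managed to lock the shared disk.
    Assumption: every host contributes exactly one vote.
    """
    expected = total_hosts + disk_votes
    votes = reachable_hosts + (disk_votes if holds_disk_lock else 0)
    return votes >= quorum_threshold(expected)


if __name__ == "__main__":
    # A and B are partitioned; only one of them can hold the disk lock.
    print("side with the lock:", has_quorum(1, True))
    print("side without it:   ", has_quorum(1, False))
```

With two hosts plus the disk there are three expected votes, so the
threshold is two. Only the side holding the lock reaches it, and the
other side should stop answering requests until the partition heals.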

Joshua.

On Wed, May 28, 2003 at 03:29:14PM +1000, Anders.Lindstrom at ubsw.com wrote:
> I am thinking about using Spread to implement a replicated service. It is very important that this service does not accidentally process the same request more than once. It seems to me that the timeout failure detection mechanism used by Spread can lead to a situation where a machine is falsely suspected as having failed.
> 
> The specific example is this. There are two servers A and B, where A is the primary and B is the secondary. Both of these machines are members of the "server" group. There is an arbitrary number of clients that are not part of the group but nevertheless send requests to it.
> 
> At some point in time, both A and B time out with respect to one another but they do not crash. That is, they are partitioned from one another but still alive. What happens when the clients send to the "server" group? Which one of them gets the request? Is it ever possible for both A and B to get a request from a client while they are partitioned?
> 
> If yes, does this mean that I've really got to have three servers so that if partitioning ever happens, the component with 2 members becomes the primary component? Any server that finds itself isolated would then deem itself failed and not respond to requests?