[Spread-users] Failure detection

Wed May 28 09:54:12 EDT 2003

Hi,

Anders> I am thinking about using Spread to implement a replicated service. It is very important that this service does not accidentally process the same request more than once. It seems to me that
Anders> the timeout failure detection mechanism used by Spread can lead to a situation where a machine is falsely suspected as having failed.

Anders> The specific example is this. There are two servers A and B, where A is the primary and B is the secondary. Both these machines are members of the "server" group. There is an arbitrary
Anders> number of clients that are not part of the group but nevertheless send requests to it.

Anders> At some point in time, both A and B time out with respect to one another but they do not crash. That is, they are partitioned from one another but still alive. What happens when the clients
Anders> send to the "server" group? Which one of them gets the request? Is it ever possible for
Anders> both A and B to get a request from a client while they are partitioned?

Spread's strong semantics guarantees that A and B can never both get
the request while they are partitioned.

Consistent replication in a system that may partition is a pretty hard
problem. Spread's strong semantics was specifically designed to be
useful for this problem. However, Spread does not solve the persistent
consistency problem by itself. You need some sort of a replication engine
for that. You can read a bit about this here:
http://www.cnds.jhu.edu/rep.html

Replication is a main thrust for us at Spread Concepts, where we build
commercial-grade replication engines.
http://www.spreadconcepts.com/replication.html

Cheers,

       :) Yair.

Anders> If yes, does this mean that I've really got to have three servers so that if partitioning ever happens, the component with 2 members becomes the primary component? Any server that finds
Anders> itself isolated would then deem itself failed and not respond to requests?

Anders> Anders.
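
The three-server idea in the question above can be sketched as a simple static majority rule. This is only an illustration of the reasoning, not Spread's API and not how any particular replication engine decides on a primary component; the function name and the server names are hypothetical. It assumes the full set of servers is fixed and known in advance, and that each server can see the members of its current view.

```python
def is_primary_component(view_members, all_servers):
    """Return True if the current view holds a strict majority of all servers.

    view_members: servers this server can currently see, itself included
    all_servers:  the fixed, preconfigured set of servers (an assumption)
    """
    present = set(view_members) & set(all_servers)
    return 2 * len(present) > len(all_servers)


# Three servers: after a partition, at most one side can hold a majority.
three = {"A", "B", "C"}
majority_side = is_primary_component({"A", "B"}, three)   # this side keeps serving
isolated_side = is_primary_component({"C"}, three)        # this side stops responding

# Two servers: neither side of a partition holds a majority.
two = {"A", "B"}
a_alone = is_primary_component({"A"}, two)
b_alone = is_primary_component({"B"}, two)
```

With two servers, a partition leaves each side with exactly half of the configuration, so under this rule neither may process requests. With three, the side that still sees two members continues and the isolated server treats itself as failed.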