[Spread-users] Failure detection

Fri May 30 12:08:20 EDT 2003

Hi,

I actually started writing this email before Yair replied, so some of
what I say is slightly redundant now, but it may clarify your problem
a bit better.

First of all, I want to make sure that there is no confusion about
the model of the problem that you describe.

>>Anders> The specific example is this. There are two servers A and B,
>>Anders> where A is the primary and B is the secondary. Both these
>>Anders> machines are members of the "server" group. There is an
>>Anders> arbitrary number of clients that are not part of the group but
>>Anders> nevertheless send requests to it.

You talk about _servers_ A and B that are members of the group
"server". The first important thing to notice here is that A and B
are _not_ Spread daemons! From Spread's perspective they are just
client applications that join a group called "server". I assume their
_server_ status is internal to your application.

In this context, the perceived partition that you mention translates
into a membership change at the Spread daemon level, which is delivered
to A and B as a membership change notification. The clients that are
connected to A and B will be completely oblivious to this change.
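
To make this concrete, here is a rough sketch in plain Python (not
Spread code; GroupMember, on_membership_change and deliver_partition
are names I made up for illustration). It only models the idea that
each group member is handed the view of the component it ends up in,
and that nothing is pushed to the clients.

```python
from dataclasses import dataclass, field


@dataclass
class GroupMember:
    """Hypothetical application process (e.g. server A) that joined a group."""
    name: str
    view: frozenset = field(default_factory=frozenset)

    def on_membership_change(self, new_view):
        """Install the new view and return the members that disappeared."""
        new_view = frozenset(new_view)
        departed = self.view - new_view
        self.view = new_view
        return departed


def deliver_partition(members, components):
    """Simulate a daemon-level change: every member gets its own component's view."""
    for component in components:
        for name in component:
            members[name].on_membership_change(component)


members = {n: GroupMember(n) for n in ("A", "B")}
deliver_partition(members, [{"A", "B"}])    # initial view: A and B together
deliver_partition(members, [{"A"}, {"B"}])  # partition: each one is now alone
```

After the second call, A's view contains only A and B's view contains
only B, which is all the information each server has to act on.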

>>Anders> At some point in time, both A and B time out with respect to
>>Anders> one another but they do not crash. That is, they are
>>Anders> partitioned from one another but still alive. What happens when
>>Anders> the clients send to the "server" group? Which one of them gets
>>Anders> the request? Is it ever possible for both A and B to get a
>>Anders> request from a client while they are partitioned?

The situation you mention can occur only if you go to considerable
trouble to make it possible ;) (or, alternatively, write an
application that completely ignores this possibility). Either way, it
depends only on how you program your client-server architecture.
Namely, you need to have client C first connect to server A and send a
request and then, for some strange reason, connect to server B and
send _the same_ request. Obviously this can only be a voluntary retry
and should not happen unless you mean it to happen.

Note that the clients do not (normally) know about the existence of
any Spread groups (they do not communicate with Spread directly);
their universe consists of the two servers A and B, and their
communication semantics are probably completely oblivious to the
existence of Spread. It is therefore entirely their choice what
request they send to each server, so the conflict that you mention
does not depend on the partitioned state of the Spread daemons (and
therefore of the servers).
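
A small sketch of what I mean, again in plain Python with made-up
names (Server, send_with_retry and the give_up_on parameter are all
hypothetical, and I am assuming the client talks to the servers
directly). Both servers end up with the same request only because the
client chooses to resend it.

```python
class Server:
    """Hypothetical application server; records every request it is handed."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def handle(self, request_id):
        self.received.append(request_id)
        return f"{self.name}:ok"


def send_with_retry(servers, request_id, give_up_on=()):
    """Send to each server in turn.

    A reply from a server named in `give_up_on` is treated as lost, so the
    client voluntarily retries the same request on the next server.
    """
    for server in servers:
        reply = server.handle(request_id)
        if server.name not in give_up_on:
            return reply
    return None


a, b = Server("A"), Server("B")
reply = send_with_retry([a, b], "req-1", give_up_on={"A"})
```

Without the give_up_on argument the client stops after A, and B never
sees the request.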

ALuc> I have another question. In my reading about virtual synchrony
ALuc> the Fischer, Lynch and Paterson impossibility result is
ALuc> mentioned. In particular, Chandra's (et al) papers "On the
ALuc> impossibility of group membership" and "Unreliable failure
ALuc> detectors for reliable distributed systems" are cited.
ALuc> Unfortunately, I do not have access to this paper. What are the
ALuc> practical implications of this result for Spread?

Their definition of membership is under a different model than the one
used by Spread. In particular, without going into details, the Chandra
model refers to the single-view (or primary-partition) membership
model, where the system of all servers agrees on only one view. Spread
works under the partitionable group membership model, where different
sets of servers agree on different, disjoint views. This is how Spread
accommodates network partitions that cause groups to split and merge.
The details are, of course, fairly complex.
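
Very roughly, the difference between the two models looks like the
sketch below (plain Python, not taken from Spread or from the papers;
the function names are made up, and the strict-majority rule is just
one common way to pick a primary component, assumed here for
illustration).

```python
def partitionable_views(components):
    """Partitionable model: every connected component installs its own view."""
    return [frozenset(c) for c in components]


def primary_partition_view(components, total):
    """Single-view model: at most one component may install a view.

    Assumption for this sketch: the primary is the component holding a
    strict majority of all `total` servers, if there is one.
    """
    for component in components:
        if 2 * len(component) > total:
            return frozenset(component)
    return None


split = [{"d1", "d2", "d3"}, {"d4", "d5"}]
views = partitionable_views(split)            # two disjoint views coexist
primary = primary_partition_view(split, 5)    # only {d1, d2, d3} continues
```

With an even split such as [{"d1", "d2"}, {"d3", "d4"}], the majority
rule returns no primary at all, while the partitionable model still
gives each side its own view.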

Hope this helps a little,
Ciprian