[Spread-users] Failure detection

Fri May 30 09:16:35 EDT 2003

Hi Anders,

Anders> I wasn't thinking about replication in the presence of
Anders> partitioning since, as you point out, it's very difficult. I guess I
Anders> was just trying to understand what 'partition' really meant. My main
Anders> concern was to avoid a situation where the servers could not talk to
Anders> one another because their communication link had been broken but where
Anders> clients could talk to each server independently through alternative communication
Anders> channels. If the clients are not members of the group, how is it possible
Anders> to avoid this? Is the answer simply to avoid network topologies that allow
Anders> this to happen? I'm sorry if I'm belabouring the point but this scenario
Anders> is something that I have to avoid at all costs.

As the client is also a user of Spread and talks with the servers only
through Spread, its messages are covered by the guarantees of extended
virtual synchrony even though the client is not a member of the group.
In contrast to some other systems, Spread supports open-group
semantics, which allow senders to be non-members of the group while
still meeting the relevant guarantees.
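
To make "open group" concrete, here is a toy Python sketch. It is not
Spread's API: ToyGroupBus, join and multicast are hypothetical names made
up for illustration, and a single in-process object stands in for the
daemons. It only shows the shape of the idea: a sender that never joined
the group can still multicast to it, and every member sees the same
messages in the same order.

```python
class ToyGroupBus:
    """Toy stand-in for a group messaging layer with open-group semantics.

    Hypothetical illustration only; this is not how Spread is implemented.
    """

    def __init__(self):
        self.members = {}  # group name -> list of member names
        self.inbox = {}    # member name -> list of (sender, group, payload)

    def join(self, group, name):
        # Only receivers need to join; senders do not.
        self.members.setdefault(group, [])
        if name not in self.members[group]:
            self.members[group].append(name)
        self.inbox.setdefault(name, [])

    def multicast(self, sender, group, payload):
        # Open group: the sender is NOT required to be a member of `group`.
        # Each message is appended to every member's inbox before the next
        # one is handled, so all members see one common order.
        for member in self.members.get(group, []):
            self.inbox[member].append((sender, group, payload))


bus = ToyGroupBus()
bus.join("servers", "s1")
bus.join("servers", "s2")

# The clients never join "servers", yet their messages reach every member.
bus.multicast("client-a", "servers", "update x=1")
bus.multicast("client-b", "servers", "update x=2")

print(bus.inbox["s1"] == bus.inbox["s2"])
```

In the real system the client connects to a daemon and sends through it,
which is why its messages fall under the same guarantees as the members'.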

Anders> I have another question. In my reading about virtual synchrony the Fischer, Lynch and
Anders> Paterson impossibility result is mentioned. In particular, the papers by Chandra et al.,
Anders> "On the impossibility of group membership" and "Unreliable failure detectors for
Anders> reliable distributed systems", are cited. Unfortunately, I do not have access to
Anders> these papers. What are the practical implications of this result for Spread?

There are, of course, no practical implications of this result for
Spread, but it had many implications for the theory, modeling and
algorithms behind it.

Spread can absolutely meet all its extended virtual synchrony
guarantees. To completely understand why, one would need to understand
the exact model under which FLP / Chandra et al. really try to solve
the consensus problem. Spread is not solving a consensus problem, yet
it guarantees all you need to achieve complete consistency under any
scenario (ain't it cool?).

Shrinking a course to one paragraph, I would say that any practical
system (including Spread) has to solve a weaker problem - a membership
problem - which allows throwing away non-responsive members (in agreement),
possibly temporarily. This is possible and sufficient to provide all
the necessary information so that applications can be consistent
in partitionable environments.
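
As a rough illustration of that paragraph, here is a toy Python sketch.
It is not Spread's membership algorithm: agreed_views and its arguments
are hypothetical, and it cheats by looking at the whole network from
above, whereas a real protocol has to reach the same answer through
message exchange and timeouts. It only shows the outcome: every connected
component agrees on a view that drops the members it cannot reach, and
the dropped members come back once connectivity is restored.

```python
def agreed_views(members, links):
    """Return one membership view per connected component.

    members: iterable of member names.
    links:   set of frozenset({a, b}) pairs that can currently communicate.

    Hypothetical helper for illustration; it assumes link state is known.
    """
    remaining = set(members)
    views = []
    while remaining:
        start = min(remaining)
        component = {start}
        frontier = [start]
        while frontier:
            node = frontier.pop()
            for other in remaining:
                reachable = frozenset((node, other)) in links
                if other not in component and reachable:
                    component.add(other)
                    frontier.append(other)
        # Everyone in the component installs the same view.
        views.append(sorted(component))
        remaining -= component
    return sorted(views)


servers = ["s1", "s2", "s3"]
all_links = {frozenset(p) for p in [("s1", "s2"), ("s2", "s3"), ("s1", "s3")]}

before = agreed_views(servers, all_links)
# s3 stops responding to s1 and s2 (crashed, slow, or partitioned away).
during = agreed_views(servers, {frozenset(("s1", "s2"))})
# Connectivity returns, so the exclusion was only temporary.
after = agreed_views(servers, all_links)

print(before, during, after)
```

Note that s1 and s2 never need to know whether s3 crashed or is merely
unreachable; agreeing on the view without it is enough for them, and s3
on its side installs a view containing only itself.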

       Enjoy,

       :) Yair.