[Spread-users] Failure detection

Wed May 28 21:16:58 EDT 2003

>
>Hi,
>
>Anders> I am thinking about using Spread to implement a replicated
>Anders> service. It is very important that this service does not
>Anders> accidentally process the same request more than once. It seems
>Anders> to me that the timeout failure detection mechanism used by
>Anders> Spread can lead to a situation where a machine is falsely
>Anders> suspected as having failed.
>
>Anders> The specific example is this. There are two servers A and B,
>Anders> where A is the primary and B is the secondary. Both these
>Anders> machines are members of the "server" group. There is an
>Anders> arbitrary number of clients that are not part of the group but
>Anders> nevertheless send requests to it.
>
>Anders> At some point in time, both A and B time out with respect to
>Anders> one another but they do not crash. That is, they are
>Anders> partitioned from one another but still alive. What happens when
>Anders> the clients send to the "server" group? Which one of them gets
>Anders> the request? Is it ever possible for both A and B to get a
>Anders> request from a client while they are partitioned?
>
>Spread's strong semantics guarantees that it is never possible that
>both A and B will get the request while they are partitioned.
>
>Consistent replication in a system that may partition is a pretty hard
>problem. Spread's strong semantics was specifically designed to be
>useful for this problem. However, Spread does not solve the persistent
>consistency problem by itself. You need some sort of a replication
>engine for that. You can read a bit about this here:

Thanks Yair.

I wasn't thinking about replication in the presence of
partitioning since, as you point out, it's very difficult. I guess I
was just trying to understand what 'partition' really meant. My main
concern was to avoid a situation where the servers could not talk to
one another because their communication link had been broken but where
clients could still talk to each server independently through
alternative communication channels. If the clients are not members of
the group, how is it possible to avoid this? Is the answer simply to
avoid network topologies that allow this to happen? I'm sorry if I'm
belabouring the point but this scenario is something that I have to
avoid at all costs.
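
To make the scenario concrete, here is a minimal sketch of it in Python.
It is only a toy model: the Server class, the deliver() function and the
client_links set are hypothetical names made up for illustration, and
nothing here is Spread's API or describes how Spread actually routes
messages. It just shows the outcome to be avoided, where the A-B link is
down but a client can still reach both servers.

```python
# Toy model of the scenario; all names are hypothetical, not Spread's API.

class Server:
    def __init__(self, name):
        self.name = name
        self.processed = []  # request ids this server has acted on

    def handle(self, request_id):
        self.processed.append(request_id)


def deliver(request_id, servers, client_links):
    """Hand the request to every server the client can still reach."""
    reached = [s for s in servers if s.name in client_links]
    for s in reached:
        s.handle(request_id)
    return [s.name for s in reached]


a, b = Server("A"), Server("B")

# Assumption: the link between A and B is broken, so each suspects the other.
server_link_up = False

# Assumption: the client has an independent path to each server.
receivers = deliver("req-1", [a, b], client_links={"A", "B"})

# The request was processed twice if both servers received it while
# they could not coordinate with one another.
duplicated = (not server_link_up) and len(receivers) > 1
```

In this model the request ends up in the processed list of both A and B,
which is exactly the double processing I need to rule out.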

I have another question. In my reading about virtual synchrony, the
Fischer, Lynch and Paterson impossibility result is mentioned. In
particular, the papers by Chandra et al., "On the impossibility of group
membership" and "Unreliable failure detectors for reliable distributed
systems", are cited. Unfortunately, I do not have access to these
papers. What are the practical implications of this result for Spread?

Cheers,

Anders.
