[Spread-users] Re: partition detection

Jonathan Stanton jonathan at cnds.jhu.edu
Mon Apr 22 15:11:31 EDT 2002


From the point of view of a client (some program linked with libsp or
equivalent) you can receive 4 types of 'membership' events.
The type can be determined using the Is_caused_join_mess(),
Is_caused_leave_mess(), ... functions.
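
As a rough illustration of that check, here is a minimal sketch of sorting a
service type into the four causes described in this message. The EX_* flag
values and the ex_* names are stand-ins assumed for this example only, so
that it compiles without the Spread headers. A real client would include
sp.h and call Is_caused_join_mess() and its siblings on the service type
returned by SP_receive() instead.

```c
/* Stand-in flag bits, ASSUMED for illustration only. Real code should
 * include <sp.h> and use Is_caused_join_mess() and friends rather than
 * testing bits by hand. */
enum {
    EX_CAUSED_BY_JOIN       = 0x0100,
    EX_CAUSED_BY_LEAVE      = 0x0200,
    EX_CAUSED_BY_DISCONNECT = 0x0400,
    EX_CAUSED_BY_NETWORK    = 0x0800
};

/* Hypothetical result type: one value per membership cause. */
typedef enum {
    EV_NONE, EV_JOIN, EV_LEAVE, EV_DISCONNECT, EV_NETWORK
} ex_event;

/* Map a service type to the cause of the membership change.
 * Returns EV_NONE when none of the four cause bits is set. */
ex_event ex_classify(int service_type)
{
    if (service_type & EX_CAUSED_BY_JOIN)       return EV_JOIN;
    if (service_type & EX_CAUSED_BY_LEAVE)      return EV_LEAVE;
    if (service_type & EX_CAUSED_BY_DISCONNECT) return EV_DISCONNECT;
    if (service_type & EX_CAUSED_BY_NETWORK)    return EV_NETWORK;
    return EV_NONE;
}

/* Human-readable name for logging. */
const char *ex_event_name(ex_event ev)
{
    switch (ev) {
    case EV_JOIN:       return "JOIN";
    case EV_LEAVE:      return "LEAVE";
    case EV_DISCONNECT: return "DISCONNECT";
    case EV_NETWORK:    return "NETWORK";
    default:            return "NONE";
    }
}
```

In an application the interesting branch is usually EV_NETWORK, since that
is the only cause where more than one member can change at once.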

1) JOIN: This means a single member joined the group. No one failed, and the
new member received no messages sent to the group prior to the join and will
receive all messages after the join. Everyone is told which member joined.

2) LEAVE: A single member left the group (someone called SP_leave). The
leaving member received all messages prior to this leave message (although
Spread guarantees nothing about what the program DID with those messages it
received) and will receive no more messages from the group. Everyone is told
which member left.

3) DISCONNECT: A single member 'disconnected' from the daemon it had been
connected to. This could be because the client called SP_disconnect() or it
could be because the TCP or Unix Domain Socket returned a closed connection
to the daemon (something between the client and daemon failed and caused a
network reset, or the client process crashed, or something else). In this
case everyone else (other than the disconnected member) will get a message
indicating who was disconnected. It is not clear what messages the client
received prior to the disconnect event, as it is not known where the failure
occurred.

4) NETWORK: A network event indicates that more than a single member change
occurred and an actual failure, recovery, network partition, or network merge
occurred between the DAEMONS themselves. As a result of the daemons changing
who they were connected with, the clients who were connected to those
daemons also change membership. 

This event can encompass a partition and merge at the same time (some
daemons are now disconnected from this subset, and others have become
reachable at approximately the same time). In this case each client will
receive a NETWORK event with the new membership set and the VS set, which
indicates which members of the old view (the last one delivered to the
client) are still in the new view AND were together at all times between the
views. The members of this VS set are the only ones guaranteed to have seen
the same set of messages during the membership change. Other members of the
group who are not in the VS set may have seen a different series of
membership views and different data messages and will need to be reconciled
by the application.
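
To make the reconciliation step concrete, here is a minimal sketch of how an
application might pick out the members it has to reconcile with: everyone in
the new membership set who is not in the VS set. The function names and the
plain string arrays are assumptions for this example. It does not show how
the two sets are extracted from the message that Spread delivers.

```c
#include <string.h>

/* Return 1 if name appears in set[0..n-1], else 0. */
int ex_in_set(const char *name, const char *set[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (strcmp(name, set[i]) == 0)
            return 1;
    }
    return 0;
}

/* Collect the members of the new view that are NOT in the VS set.
 * These are the members whose message history may differ from ours.
 * out[] must have room for n_members entries.
 * Returns the number of entries written to out[]. */
int ex_needs_reconcile(const char *members[], int n_members,
                       const char *vs[], int n_vs,
                       const char *out[])
{
    int i, count = 0;
    for (i = 0; i < n_members; i++) {
        if (!ex_in_set(members[i], vs, n_vs))
            out[count++] = members[i];
    }
    return count;
}
```

For example, with a new membership of "#a#m1", "#b#m2", "#c#m3" and a VS set
of "#a#m1", "#b#m2", only "#c#m3" comes back as needing reconciliation.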

A network event can be triggered in several ways: some control messages
between the daemons keep being dropped (so the link appears completely
unreliable), a timeout occurs in communication among the daemons, or some
daemon hears from new daemons that are not part of the current
configuration. None of these triggers involves the clients directly, only
the daemon processes. So if a client cannot talk to a daemon for some reason,
that will cause a 'DISCONNECT' event, not a 'NETWORK' event. If the daemon
the client is connected to cannot talk to the other daemons it used to be
able to, that will be a NETWORK event, not a DISCONNECT event.

Hope this helps,

Jonathan

On Sun, Apr 21, 2002 at 03:10:23PM +0300, Yaron Weinsberg wrote:
> 
> Hi again, 
> 
> Can you please explain the exact semantics of a network
> partition? Is it the inability to communicate with a remote spread
> daemon (due to a possible crash), or is it an ICMP destination
> unreachable message that triggers the membership change?
> 
> 	best,
> 		yaron.
> 
> p.s. thanks a lot for previous help regarding partitions and quorums.
> 
> 

-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------




