[Spread-users] Network partition and re-merge

John Lane Schultz jschultz at spreadconcepts.com
Thu Aug 7 21:28:26 EDT 2014


I apologize that I forgot to respond to your earlier questions on this.  Please feel free to prod me again if I’m unresponsive in the future.

> What happens if SPREAD reports a (false) network failure, and the new view does not include one of the old view's members even though it is really operational? The other "surviving" members should deliver CAUSED_BY_NETWORK membership messages, and their high-level applications will surely no longer consider the "dead" member. They continue exchanging messages and thus changing the global state.

Yes, that’s certainly possible.

> What happens to the "false-dead" member? Because it is considered "dead", it no longer receives the token. Therefore, a token timeout should expire, and it will surely deliver a CAUSED_BY_NETWORK membership message as well.

Yes, it too will eventually see that the ring has broken and deliver such a message.  It may deliver a membership containing only its own members (a singleton daemon membership) or, depending on the exact timing of events, it might skip right to the next part ...

> How can the "false-dead" member be included in the group again? Does SPREAD report it as a member again in the membership information?

The daemons will eventually notice one another again, either by seeing each other’s traffic or through the periodic probe messages they each send out.  They will then attempt to form a new ring and, if successful, will eventually all report a new CAUSED_BY_NETWORK membership that re-merges the daemons’ group members.
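
To make the partition and re-merge sequence concrete from the application's side, here is a minimal sketch in Python.  It does not use the Spread API; the `diff_views` helper, the `ViewChange` type, and the member names are hypothetical.  It assumes only that the application remembers the member list from the previous membership message and compares it with the new one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewChange:
    """Members that disappeared from, or appeared in, a group view."""
    departed: frozenset
    joined: frozenset


def diff_views(old_view, new_view):
    """Compare two membership lists and report who left and who (re)joined.

    Both arguments are iterables of member names (hypothetical strings here).
    """
    old, new = frozenset(old_view), frozenset(new_view)
    return ViewChange(departed=old - new, joined=new - old)


# Majority side during a (possibly false) partition: "c" drops out of the view.
partition = diff_views(["a", "b", "c"], ["a", "b"])

# Later, after the daemons form a new ring: "c" shows up again.
remerge = diff_views(["a", "b"], ["a", "b", "c"])
```

On a re-merge, every member listed in `joined` may have changed its own state while it was separated, so the application would normally need some form of state reconciliation before treating it as current.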

> Suppose that the "false-dead" member acquires a resource exclusively. Because the other members consider that member "dead", one of them allocates that "false-free" resource. A difficult situation could arise.

Yes.  Typically, such system-wide exclusivity is handled through quorums.  As a simple example, a weighted majority of the potential system must be present and agree before any of those daemons, or the quorum as a whole, may use such a resource.  Two disjoint memberships cannot both hold a weighted majority at the same time, so you can ensure exclusivity this way.  It’s not simple to get 100% correct, and in some cases no quorum exists in the system for extended periods of time.
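
As a rough illustration of the weighted-majority idea, here is a small Python sketch.  It assumes a fixed, pre-configured table of weights for every potential member; the `has_quorum` function, the weights, and the member names are all hypothetical and not part of Spread.  It covers only the static check, not the harder cases mentioned above.

```python
def has_quorum(view, weights):
    """Return True if the members in `view` hold a strict weighted majority.

    `weights` maps every potential member of the system to its weight.
    Members in `view` that are not in the table contribute nothing.
    """
    total = sum(weights.values())
    present = sum(weights[m] for m in set(view) if m in weights)
    # Strict majority: an exact half is not enough, so two disjoint
    # views can never both pass this check.
    return 2 * present > total


# Hypothetical configuration: three equally weighted members.
WEIGHTS = {"a": 1, "b": 1, "c": 1}

majority_side = has_quorum(["a", "b"], WEIGHTS)  # side that kept two members
false_dead_side = has_quorum(["c"], WEIGHTS)     # the isolated member
```

With a rule like this, the "false-dead" member would stop using the exclusive resource as soon as it delivers a membership without a quorum, and only the side that still holds a quorum would be allowed to allocate it.  An even split leaves neither side with a quorum, which is one way the system can end up with no quorum at all for a while.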


John Lane Schultz
Spread Concepts LLC
Cell: 443 838 2200

On Aug 7, 2014, at 9:03 PM, Pablo Pessolani <ppessolani at hotmail.com> wrote:

Does anybody have experience handling network partitions and re-merges with Spread?
What happens if Spread reports false-positive network failures?
Thanks in advance.
