[Spread-users] Cluster Vulnerability Question

John Schultz jschultz at spreadconcepts.com
Sun Mar 3 12:36:45 EST 2013

Hi Lyric,

I've thought a bit about your problem and I think I have a potential solution.  

My basic idea is to build a user-level message recovery process that would allow failing-over clients to recover any messages they missed while they were disconnected.  The recovery process would ephemerally cache the most recent messages for the groups in which it participates.
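One way to picture the ephemeral cache described above is a bounded buffer per group that drops its oldest entry once full. This is a hypothetical sketch: `RecentMessageCache`, its methods, and the integer `msg_id` are illustrative names, not part of the Spread API, and a real recovery process would call `record()` from whatever Spread receive loop it runs.

```python
from collections import deque


class RecentMessageCache:
    """Hypothetical per-group cache holding only the most recent messages."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # group name -> deque of (msg_id, payload); maxlen evicts the oldest entry
        self._groups: dict[str, deque] = {}

    def record(self, group: str, msg_id: int, payload: bytes) -> None:
        buf = self._groups.setdefault(group, deque(maxlen=self.capacity))
        buf.append((msg_id, payload))

    def messages(self, group: str) -> list[tuple[int, bytes]]:
        """Return the cached (msg_id, payload) pairs for a group, oldest first."""
        return list(self._groups.get(group, ()))


cache = RecentMessageCache(capacity=2)
for i, body in enumerate([b"a", b"b", b"c"], start=1):
    cache.record("prices", i, body)
print(cache.messages("prices"))  # [(2, b'b'), (3, b'c')]
```

The capacity bounds memory, but it also bounds how long a client can stay disconnected and still recover everything it missed.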

When a client fails over to a new daemon, it would send the recovery process a message specifying which messages it did receive, and the recovery process would then retransmit any messages the client missed.  This could be done either through Spread communication or through a more conventional client-server TCP architecture.
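The recovery step itself reduces to a set difference over message IDs. The sketch below assumes, hypothetically, that each group's messages carry dense integer IDs, that the client reports the ID range `low..high` it cares about plus the IDs it did receive in that range, and that `cached` is whatever the recovery process still holds. `plan_retransmission` is an illustrative name, and the transport (Spread or plain TCP) is left out.

```python
def plan_retransmission(
    cached: list[tuple[int, bytes]],
    received_ids: set[int],
    low: int,
    high: int,
) -> tuple[list[tuple[int, bytes]], list[int]]:
    """Work out what to resend to a client that failed over.

    Returns (resend, unrecoverable): cached messages in [low, high] the client
    did not report receiving, and missing IDs the cache no longer holds.
    """
    cached_by_id = dict(cached)
    wanted = [i for i in range(low, high + 1) if i not in received_ids]
    resend = [(i, cached_by_id[i]) for i in wanted if i in cached_by_id]
    unrecoverable = [i for i in wanted if i not in cached_by_id]
    return resend, unrecoverable


# The client saw 1, 2 and 5; the cache has already evicted message 3.
resend, lost = plan_retransmission([(4, b"d"), (5, b"e")], {1, 2, 5}, 1, 5)
print(resend, lost)  # [(4, b'd')] [3]
```

A non-empty `unrecoverable` list means the client cannot be brought fully up to date from the cache alone, so the cache size should make that case rare.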

The main issue with this approach is being able to specify which message IDs a client did and did not receive.  You could either institute a user-level ID scheme or possibly change the daemon to reveal the agreed ordering of messages within daemon memberships.
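For the user-level ID scheme, one simple option (an assumption, not something Spread provides) is for every sender to stamp each message with its own name and a per-sender sequence number starting at 1. A receiver can then detect gaps per sender, which is the information it would hand to a recovery process. This gives FIFO ordering per sender only, not the agreed order across senders that the daemon computes. `GapTracker` is a hypothetical name.

```python
class GapTracker:
    """Tracks missing per-sender sequence numbers (hypothetical user-level IDs).

    Assumes each sender numbers its messages 1, 2, 3, ... and never resets.
    """

    def __init__(self) -> None:
        self._next: dict[str, int] = {}       # sender -> next expected seq
        self._gaps: dict[str, set[int]] = {}  # sender -> seqs not yet seen

    def on_message(self, sender: str, seq: int) -> None:
        expected = self._next.get(sender, 1)
        missing = self._gaps.setdefault(sender, set())
        if seq >= expected:
            # Everything between the last seen message and this one was skipped.
            missing.update(range(expected, seq))
            self._next[sender] = seq + 1
        else:
            # A late or retransmitted message fills a gap (duplicates are no-ops).
            missing.discard(seq)

    def missing(self, sender: str) -> list[int]:
        return sorted(self._gaps.get(sender, ()))


t = GapTracker()
for seq in (1, 2, 5):
    t.on_message("sender-a", seq)
print(t.missing("sender-a"))  # [3, 4]
t.on_message("sender-a", 3)   # recovered via retransmission
print(t.missing("sender-a"))  # [4]
```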

Alternatively, along the same lines, you could augment the Spread daemon and API to build this service into the daemon itself.

I hope that helps!  Please feel free to contact me if you have any questions or want help in building such a solution.


John Lane Schultz
Spread Concepts LLC
Phn: 301 830 8100
Cell: 443 838 2200

On Feb 25, 2013, at 3:04 PM, Lyric Doshi wrote:

hey all,

In our environment, we run a collection of Spread daemons, and multiple 
clients connect to each Spread daemon over TCP. Failure of a Spread 
daemon's host machine disconnects all of its clients, making many 
clients appear down even though only the daemon failed. We cannot 
afford to have any clients miss messages as they reconnect to a 
different Spread daemon.

We'd appreciate any thoughts or help you can provide on how we can 
mitigate the problem of a Spread host dying and causing all of its 
child Spread client nodes to fall behind the rest of the cluster. 
An inefficient approach would be to connect every child to multiple 
Spread daemon hosts so that it can buffer and cross-check every 
message it receives from each parent using some form of globally 
unique, ordered ID. However, this doubles our traffic and can be 
painful if messages are large, so we're hoping for something better 
that Spread can provide or that other customers may have engineered 
to eliminate this problem.
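The cross-checking approach described above can be sketched as a filter that sits behind all of a client's daemon connections and delivers each message exactly once, in order. This assumes, hypothetically, that the globally unique, ordered ID is a gap-free integer sequence starting at 1; `DuplicateFilter` is an illustrative name, and the sketch does nothing about the doubled network traffic.

```python
class DuplicateFilter:
    """Delivers each message once, in ID order, across redundant connections.

    Hypothetical sketch: assumes IDs form a gap-free sequence starting at first_id.
    """

    def __init__(self, first_id: int = 1) -> None:
        self._next = first_id
        self._pending: dict[int, object] = {}  # messages that arrived ahead of order

    def on_message(self, msg_id: int, payload: object) -> list[tuple[int, object]]:
        """Accept a copy from any connection; return messages now deliverable."""
        if msg_id < self._next or msg_id in self._pending:
            return []  # already delivered or already buffered: a duplicate copy
        self._pending[msg_id] = payload
        delivered = []
        while self._next in self._pending:
            delivered.append((self._next, self._pending.pop(self._next)))
            self._next += 1
        return delivered


f = DuplicateFilter()
print(f.on_message(1, "a"))  # [(1, 'a')]  from daemon A
print(f.on_message(1, "a"))  # []          same message via daemon B
print(f.on_message(3, "c"))  # []          held until 2 arrives
print(f.on_message(2, "b"))  # [(2, 'b'), (3, 'c')]
```

The `_pending` buffer only grows while some ID has not yet arrived on any connection, so as long as one daemon stays up the extra memory stays small.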

-- Lyric

Spread-users mailing list
Spread-users at lists.spread.org
