[Spread-users] Cluster Vulnerability Question

Mon Feb 25 21:58:03 EST 2013

On Mon, Feb 25, 2013 at 07:20:05PM -0500, Lyric Doshi wrote:
> Thanks for your answer! Interestingly, we actually do run a spread 
> daemon on each client machine in most cases. However, spread only 
> supports up to 128 nodes and so we've run into trouble here with larger 
> deployments. That forced us to switch to a smaller ring with clients 
> connecting to a shared spread daemon.

I've not run that many daemons so I haven't run into this issue.
As I recall, the official limit on the number of spread instances
is quite a bit less than that.

> Hmm. I'm not sure I follow how the proxy service would guarantee that no 
> client ever missed a message in the period where they (or the proxy 
> service) detect a failed node and then connect to a back up. Any 
> thoughts on how we can address this?

The proxy service allows the client to reconnect quickly to another host
without having to know about all the daemons in the network.
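
As a rough illustration of that reconnect step, here is a minimal Python
sketch of what the proxy might do when a daemon goes away. It is not the
Spread client API: `connect_with_failover`, the `hosts` list and the `connect`
callable are all hypothetical names, and `connect` stands in for whatever
call actually opens a session to a daemon.

```python
def connect_with_failover(hosts, connect):
    """Try each known daemon in order and return the first that accepts.

    `hosts` is the proxy's own list of daemons (clients never see it).
    `connect` is a hypothetical callable that opens a connection to one
    host and raises OSError (or a subclass) if the host is unreachable.
    """
    last_error = None
    for host in hosts:
        try:
            return host, connect(host)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next daemon
    raise ConnectionError(f"no daemon reachable among {hosts}") from last_error
```

The client only ever needs the proxy's address, so the daemon list can
change without touching client configuration.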

As for missing messages and catching up, that is an application function.
The standard method is for the application to (re)start in recovery mode
and synchronize (usually out of band) with a more up-to-date peer before
fully joining the group.
Depending on the application, this can range from quite doable to quite hard.
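
A minimal Python sketch of that recovery pattern follows, under some loud
assumptions: the application stamps every message with its own increasing
sequence number, and a peer can hand over the messages after a given number
through some out-of-band channel. The `Replica` class and its methods are
hypothetical and nothing here is part of Spread itself.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical application state: an ordered log of (seq, payload)."""
    log: list = field(default_factory=list)      # messages applied so far
    pending: list = field(default_factory=list)  # live messages held during recovery
    recovering: bool = True                       # a (re)started replica begins here

    @property
    def last_seq(self):
        return self.log[-1][0] if self.log else 0

    def on_group_message(self, seq, payload):
        """Called for each message the group delivers to this replica."""
        if self.recovering:
            self.pending.append((seq, payload))   # buffer until caught up
        elif seq > self.last_seq:
            self.log.append((seq, payload))

    def messages_after(self, seq):
        """Out-of-band catch-up call a peer exposes (an assumption)."""
        return [m for m in self.log if m[0] > seq]

    def recover_from(self, peer):
        """Synchronize with a more up-to-date peer, then go live."""
        for seq, payload in peer.messages_after(self.last_seq):
            self.log.append((seq, payload))
        # Apply buffered live messages, skipping those the peer already supplied.
        for seq, payload in sorted(self.pending, key=lambda m: m[0]):
            if seq > self.last_seq:
                self.log.append((seq, payload))
        self.pending.clear()
        self.recovering = False
```

The sketch does not check for gaps between what the peer supplies and what
was buffered; a real application would have to detect that case and retry
the synchronization.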

There are several papers on the spread site that deal with this issue.
They focus on recovery in the context of database replication but I have found
them amenable to other circumstances.

-Gyepi