Can you provide a bit more detail about your system: how many messages per second, how often failures occur, what you&#39;d like to happen if everything went down at once, etc.?<br><br>I ask only because we often find ourselves using a particular system because it is what we have, not because it fits our needs. Spread (at least the open source version I used a few years back) is not a durable messaging system. If you try to build durability on top of it, you might spend a lot of effort when something that meets your needs may already exist.<br>

<br><div class="gmail_quote">On Mon, Feb 25, 2013 at 7:20 PM, Lyric Doshi <span dir="ltr">&lt;<a href="mailto:ldoshi@vertica.com" target="_blank">ldoshi@vertica.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

hi Gyepi,<br>

<br>

Thanks for your answer! Interestingly, we actually do run a spread<br>

daemon on each client machine in most cases. However, spread only<br>

supports up to 128 nodes and so we&#39;ve run into trouble here with larger<br>

deployments. That forced us to switch to a smaller ring with clients<br>

connecting to a shared spread daemon.<br>

<br>

We also would like to run in environments with higher latency where<br>

having a smaller ring of spread daemon hosts with clients is preferable<br>

to a full ring of all the nodes.<br>

<br>

Hmm. I&#39;m not sure I follow how the proxy service would guarantee that no<br>

client ever missed a message in the period where they (or the proxy<br>

service) detect a failed node and then connect to a backup. Any<br>

thoughts on how we can address this?<br>

<br>

Thanks!<br>

<span class="HOEnZb"><font color="#888888">-- Lyric<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

On 02/25/2013 06:15 PM, Gyepi SAM wrote:<br>

&gt; On Mon, Feb 25, 2013 at 03:04:17PM -0500, Lyric Doshi wrote:<br>

&gt;&gt; In our environment, we run a collection of spread daemons, where<br>

&gt;&gt; multiple clients connect to each spread daemon over TCP. Failure of a<br>

&gt;&gt; spread daemon host machine disconnects all connected clients, making<br>

&gt;&gt; many clients appear down, despite only the daemon failing.<br>

&gt;&gt; We&#39;d appreciate any thoughts or help you can provide on how we can<br>

&gt;&gt; mitigate the problem of a spread-host dying and inducing all its<br>

&gt;&gt; children spread client nodes to fall behind the rest of the cluster as<br>

&gt;&gt; well. An inefficient way to do this may be to connect every child to<br>

&gt;&gt; multiple spread-daemon hosts so it may buffer and cross-check every<br>

&gt;&gt; message it receives from each parent using some form of globally unique<br>

&gt;&gt; ordered ID.<br>

&gt; Hi Lyric,<br>

&gt;<br>

&gt; The most robust solution is to run a spread daemon on each client machine.<br>

&gt;<br>

&gt; Nearly as robust would be to interpose a proxy service between the<br>

&gt; client and the spread daemons. Something like haproxy, for example,<br>

&gt; which would, upon detecting a failed spread node, connect to a backup.<br>

&gt;<br>

&gt; Of course, any existing connections will be lost, but assuming that the<br>

&gt; client nodes reconnect, they&#39;ll be connected to the next available daemon.<br>

&gt;<br>

&gt; -Gyepi<br>

&gt;<br>

&gt; _______________________________________________<br>

&gt; Spread-users mailing list<br>

&gt; <a href="mailto:Spread-users@lists.spread.org">Spread-users@lists.spread.org</a><br>

&gt; <a href="http://lists.spread.org/mailman/listinfo/spread-users" target="_blank">http://lists.spread.org/mailman/listinfo/spread-users</a><br>

<br>

<br>


</div></div></blockquote></div><br>