[Spread-users] Communicating between 130 hosts with Spread

Thu Aug 8 03:46:51 EDT 2013

On Tuesday 06 Aug 2013 John Schultz wrote:
> > I presume that client failover between daemons is something we'd need to
> > handle ourselves. If we do lose communication with a daemon, or a daemon
> > goes down, how quickly will we find out about it?
>
> Yes, you'd have to handle failover at your client application.

Roger. As expected.

> If you are connecting remotely through TCP, then the usual TCP mechanisms
> would determine when you get notice. If just the Spread process crashes,
> then usually you should quickly get notices from its host that the
> connection is dead.  If the remote machine suddenly fails (e.g. - power
> failure) or the network suddenly partitions, then you usually need to send
> some traffic from the client (e.g. - a no-op message to an empty group or
> yourself) to realize that the host is gone.
>
> Spread does offer the ability to use TCP's keep alive semantics, but for
> them to be actually useful you have to set TCP's keep alive parameters
> system wide at the OS level on both sides of the connection as the default
> is usually something like 75 minutes or two hours before TCP probes are
> sent.

I see that Spread always sets SO_KEEPALIVE on a TCP daemon connection. Both our
target systems support the per-socket TCP_KEEP* options, so rather than fiddle
with system-wide keepalive parameters I'd be tempted to set those in a custom
SP_connect build in our distribution.
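
Roughly what that could look like, as a minimal sketch rather than actual
Spread code. The helper name set_keepalive and the timing values are made up
for illustration, and the option names are guarded because not every platform
defines all of TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of Spread): enable keepalive on one socket
 * and tune it per connection instead of system wide.
 *   idle_s  - seconds of silence before the first probe is sent
 *   intvl_s - seconds between unanswered probes
 *   count   - unanswered probes before the connection is declared dead
 * Returns 0 on success, -1 if any setsockopt call fails.
 */
static int set_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
#else
    (void)idle_s;   /* option not available on this platform */
#endif

#ifdef TCP_KEEPINTVL
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0)
        return -1;
#else
    (void)intvl_s;
#endif

#ifdef TCP_KEEPCNT
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
#else
    (void)count;
#endif

    return 0;
}
```

With, say, set_keepalive(fd, 30, 5, 3), a silently dead peer should be noticed
roughly 30 + 5 * 3 = 45 seconds after traffic stops, assuming the platform
honours all three options.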

> Thinking about this some more, in the future we might want to add the
> ability to explicitly probe the TCP connection, but not have it be a full
> Spread message.  That is, for no-op messages to periodically go between the
> server and client and not need to bother the whole Spread deployment.  The
> client would still have to explicitly call this fcn periodically though, as
> the Spread library does not have its own thread of control.  We'd probably
> have to use the out-of-band mechanisms of TCP to do this ... I'll need to
> think about it some more.

Yup. I can see that would be useful. In our projected use cases, we'd largely 
be having servers regularly distributing data and receivers collecting it. The 
servers would therefore find out when sending that Something Was Wrong, but 
receivers might potentially swan along in ignorance for some time.
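
A receiver-side check along those lines might look like the following. This is
only a sketch: struct liveness and both functions are invented for
illustration, the idle limit is arbitrary, and the probe itself (e.g. a no-op
message to yourself, as suggested above) is left to the caller.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical bookkeeping for a receiver that rarely sends anything. */
struct liveness {
    time_t last_activity;   /* when we last sent or received on the connection */
    int    idle_limit_s;    /* silence, in seconds, tolerated before probing */
};

/* Record activity; call this after every successful send or receive. */
static void liveness_touch(struct liveness *l, time_t now)
{
    l->last_activity = now;
}

/* True once the connection has been silent for at least idle_limit_s seconds. */
static bool liveness_probe_due(const struct liveness *l, time_t now)
{
    return difftime(now, l->last_activity) >= (double)l->idle_limit_s;
}
```

The receiver's main loop would call liveness_probe_due() on each pass and, when
it returns true, send its no-op; a failed send is then the cue to fail over to
another daemon.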

We'll get on with trying Spread out. Thanks for your help.

By the way, John, you may well find reports of mail bounces from my address. I 
found that a mail redirect was interacting badly with checks on your SPF 
information. Fixed now.
-- 
Jim Hague
jim.hague at laicatc.com              Never trust a computer you can't lift.
LAIC AG                            +44 1865 980647    Mob +44 7941 697732
