[Spread-users] Communicating between 130 hosts with Spread
ericbambach1 at discover.com
Tue Jan 21 10:58:59 EST 2014
We've used Spread version 3.17 successfully in production for a few years
on a small scale (one Spread server, a dozen clients) on AIX, compiled with
GCC (not xlc). There are always compile problems on AIX with GCC, but
nothing a little coaxing couldn't fix.
Eric Bambach | Discover
Cons. Systems Analyst, Promotion Rewards/Enterprise Rewards
2500 Lake Cook Road, Riverwoods IL 60015
P: 224.405.2896 ericbambach1 at discover.com
From: Christopher Browne <cbbrowne at afilias.info>
To: John Schultz <jschultz at spreadconcepts.com>
Cc: "spread-users at lists.spread.org Users"
<spread-users at lists.spread.org>
Date: 01/21/2014 09:33 AM
Subject: Re: [Spread-users] Communicating between 130 hosts with Spread
On Tue, Aug 6, 2013 at 10:03 AM, John Schultz <jschultz at spreadconcepts.com> wrote:
> Out of curiosity, do you know of any AIX users?
I personally do not know of any AIX users, but if any are on the list they
might chime in?
Spread is written in very portable C, uses autoconf to try to detect
system specifics and mainly depends on the Berkeley socket interface. I
see little reason why running Spread on a *nix clone like AIX should be
Obviously not an AIX user! ;-)
My environment used to use AIX a fair bit, albeit not for Spread. I would
not be surprised if there are some issues compiling and running it; that is
quite common, as AIX is definitely a bit different. Alas, I can't help;
we recently retired the AIX portion of the environment. Once upon a time,
I found problems in the TCP/IP stack when trying to get Postgres working
> I presume that client failover between daemons is something we'd need to
handle ourselves. If we do lose communication with a daemon, or a daemon
goes down, how quickly will we find out about it?
Yes, you'd have to handle failover in your client application. How
quickly you would find out about a problem depends first on whether or not
the client is on the same machine as the daemon.
If you are on the same machine, then basically the only failure that can
occur is a Spread crash. In that case, because the connection uses IPC,
you should get almost immediate notice that it has failed.
If you are connecting remotely through TCP, then the usual TCP mechanisms
determine when you get notice. If just the Spread process crashes,
then you should usually get notice quickly from its host that the
connection is dead. If the remote machine suddenly fails (e.g., power
failure) or the network suddenly partitions, then you usually need to send
some traffic from the client (e.g., a no-op message to an empty group or
to yourself) to realize that the host is gone.
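As a rough illustration of the "send some traffic" idea above, here is a
minimal Python sketch of an application-level liveness check. It is not
Spread API code: send_noop and poll_reply are hypothetical callables that
the application would wrap around its own Spread calls (for example, sending
a small message to itself and checking whether it has come back). Waiting for
the message to return within a deadline is an assumption of this sketch, not
something the thread prescribes.

```python
import time


def daemon_is_alive(send_noop, poll_reply, timeout_s=2.0, poll_interval_s=0.05):
    """Send a no-op and wait up to timeout_s for evidence that it came back.

    send_noop and poll_reply are hypothetical callables supplied by the
    application; they are NOT part of the Spread library.
    """
    try:
        send_noop()  # e.g. a tiny message addressed to ourselves
    except OSError:
        return False  # the connection reported an error right away

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if poll_reply():  # True once the no-op has been seen again
                return True
        except OSError:
            return False
        time.sleep(poll_interval_s)
    return False  # nothing came back in time; treat the daemon as unreachable


if __name__ == "__main__":
    # Stand-in callables, used only to exercise the control flow.
    print(daemon_is_alive(lambda: None, lambda: True))
```

A client could call such a check periodically and, on a False result, start
its own failover to another daemon.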
Spread does offer the ability to use TCP's keep alive semantics, but for
them to be actually useful you have to set TCP's keep alive parameters
system wide at the OS level on both sides of the connection as the default
is usually something like 75 minutes or two hours before TCP probes are
Thinking about this some more, in the future we might want to add the
ability to explicitly probe the TCP connection, but not have it be a full
Spread message. That is, for no-op messages to periodically go between
the server and client and not need to bother the whole Spread deployment.
The client would still have to explicitly call this function periodically
though, as the Spread library does not have its own thread of control.
We'd probably have to use the out-of-band mechanisms of TCP to do this
... I'll need to think about it some more.
I suspect that you may want to encourage changing the default keepalive
parameters. (tcp_keepalive_intvl, tcp_keepalive_probes,
tcp_keepalive_time are the values indicated by "man 7 tcp")
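Before changing anything, it can help to see what a host is currently using.
The following is a small Python sketch that reads the three values named
above. It assumes a Linux host, where "man 7 tcp" documents them as files
under /proc/sys/net/ipv4; the function name and the choice to report None
for anything unreadable are this sketch's own.

```python
from pathlib import Path

# File names as listed in "man 7 tcp" on Linux.
KEEPALIVE_FILES = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Return the system-wide keepalive values, or None where unreadable."""
    settings = {}
    for name in KEEPALIVE_FILES:
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None  # not Linux, or the file is missing
    return settings


if __name__ == "__main__":
    print(read_keepalive_settings())
```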
The defaults were pretty logical years back, when connectivity to the
Internet was somewhat unusual and definitely intermittent. Would-be
"highly available" systems of today need to have a rather shorter leash.
In particular, it is nowhere near reasonable to need to wait hours for
things to timeout, especially if you might be doing special things like
setting up devoted network segments/connections/interfaces.
On Linux, the keepalive values are:
- time - 7200s - how long a connection needs to be idle before probes
start to be sent
- interval - 75s - how long between keepalive probes
- probes - 9 - how many probes before giving up
Those defaults (effective since Linux 2.4) mean that a dead connection
won't time out for roughly 2 hours and 11 minutes (7200s of idle time plus
9 probes at 75s each). Other Unix flavours most likely have
similar values, and they're all most likely rather too long.
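The arithmetic behind that worst case can be written out as a short Python
sketch: idle time plus one interval per unanswered probe. The function name
is made up for illustration, and the "shorter leash" numbers are purely
hypothetical examples, not values recommended anywhere in this thread.

```python
def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Approximate time for keepalive to give up on a silent peer."""
    return idle_s + interval_s * probes


if __name__ == "__main__":
    # Linux defaults quoted above: 7200s idle, 75s interval, 9 probes.
    default = keepalive_detection_seconds(7200, 75, 9)
    minutes, seconds = divmod(default, 60)
    print(default, "seconds =", minutes, "min", seconds, "s")

    # Hypothetical tighter settings, for comparison only.
    print(keepalive_detection_seconds(60, 10, 3), "seconds")
```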