[Spread-users] Communicating between 130 hosts with Spread

Tue Jan 21 10:58:59 EST 2014

We've used Spread version 3.17 successfully in production for a few years 
on a small scale (one Spread server, a dozen clients) on AIX, compiled with 
GCC (not xlc). There are always compile problems on AIX with GCC, but 
nothing a little coaxing couldn't fix.

Eric Bambach | Discover
Cons. Systems Analyst, Promotion Rewards/Enterprise Rewards
2500 Lake Cook Road, Riverwoods IL 60015
P: 224.405.2896 ericbambach1 at discover.com

From:   Christopher Browne <cbbrowne at afilias.info>
To:     John Schultz <jschultz at spreadconcepts.com>
Cc:     "spread-users at lists.spread.org Users" 
<spread-users at lists.spread.org>
Date:   01/21/2014 09:33 AM
Subject:        Re: [Spread-users] Communicating between 130 hosts with 
Spread

On Tue, Aug 6, 2013 at 10:03 AM, John Schultz <jschultz at spreadconcepts.com> wrote:
> Out of curiosity, do you know of any AIX users?

I personally do not know of any AIX users, but if any are on the list they 
might chime in?

Spread is written in very portable C, uses autoconf to try to detect 
system specifics and mainly depends on the Berkeley socket interface.  I 
see little reason why running Spread on a *nix clone like AIX should be 
difficult.

Obviously not an AIX user! ;-) 

My environment used to use AIX a fair bit, albeit not for Spread. I would 
not be surprised if there are some issues compiling and running it; that is 
quite common, as AIX is definitely a bit different. Alas, I can't help; 
we recently retired the AIX portion of the environment. Once upon a time, 
I found problems in the TCP/IP stack when trying to get Postgres working 
on AIX.

> I presume that client failover between daemons is something we'd need to 
> handle ourselves. If we do lose communication with a daemon, or a daemon 
> goes down, how quickly will we find out about it?

Yes, you'd have to handle failover in your client application.  How 
quickly you would find out about a problem depends first on whether or not 
you are on the same machine as the daemon.
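
A minimal sketch of what client-side failover could look like, assuming the 
application keeps an ordered list of daemon addresses. The `connect` 
argument is a hypothetical callable supplied by the application (for 
example, a thin wrapper around the Spread client library's connect call); 
it is not part of Spread itself, and it is assumed to raise OSError when a 
daemon cannot be reached.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Conn = TypeVar("Conn")


def connect_with_failover(
    daemons: Iterable[str],
    connect: Callable[[str], Conn],
) -> Tuple[str, Conn]:
    """Try each daemon address in order; return the first that connects.

    `connect` is a hypothetical, application-supplied callable that must
    raise OSError when the daemon at `address` cannot be reached.
    """
    last_error: Optional[OSError] = None
    for address in daemons:
        try:
            return address, connect(address)
        except OSError as exc:
            # Remember why this daemon failed, then try the next one.
            last_error = exc
    raise ConnectionError(f"no Spread daemon reachable: {last_error}")
```

The same function can be called again whenever the current connection 
reports an error, so that the client moves on to the next daemon that 
answers.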

If you are on the same machine, then basically the only failure that can 
occur is a Spread crash.  In that case, because you are connected over IPC, 
you should get almost immediate notice on your connection that it has failed.

If you are connecting remotely through TCP, then the usual TCP mechanisms 
determine when you get notice.  If just the Spread process crashes, 
then you should usually get prompt notice from its host that the 
connection is dead.  If the remote machine suddenly fails (e.g., a power 
failure) or the network suddenly partitions, then you usually need to send 
some traffic from the client (e.g., a no-op message to an empty group or 
to yourself) to realize that the host is gone.

Spread does offer the ability to use TCP's keep alive semantics, but for 
them to be actually useful you have to set TCP's keep alive parameters 
system wide at the OS level on both sides of the connection as the default 
is usually something like 75 minutes or two hours before TCP probes are 
sent.

Thinking about this some more, in the future we might want to add the 
ability to explicitly probe the TCP connection, but not have it be a full 
Spread message.  That is, no-op messages would periodically go between 
the server and client without needing to bother the whole Spread deployment. 
The client would still have to explicitly call this function periodically 
though, as the Spread library does not have its own thread of control. 
We'd probably have to use the out-of-band mechanisms of TCP to do this 
... I'll need to think about it some more.

I suspect that you may want to encourage changing the default keepalive 
parameters.  (tcp_keepalive_intvl, tcp_keepalive_probes, 
tcp_keepalive_time are the values indicated by "man 7 tcp")
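
As a rough sketch, the current system-wide values can be inspected on Linux 
by reading the corresponding files under /proc/sys/net/ipv4. The function 
name and the `base` parameter are illustrative assumptions; `base` exists 
only so the function can be pointed at another directory for testing.

```python
from pathlib import Path
from typing import Dict

# The three sysctl names listed in "man 7 tcp".
KEEPALIVE_SYSCTLS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive_settings(base: str = "/proc/sys/net/ipv4") -> Dict[str, int]:
    """Return the system-wide TCP keepalive settings as integers.

    Linux-specific: each sysctl is exposed as a small text file under `base`.
    """
    return {
        name: int((Path(base) / name).read_text().strip())
        for name in KEEPALIVE_SYSCTLS
    }
```

Changing the values is a matter for the system administrator (for example 
via sysctl), and needs to be done on both sides of the connection.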

The defaults were pretty logical years back, when connectivity to the 
Internet was somewhat unusual and definitely intermittent.  Would-be 
"highly available" systems of today need to have a rather shorter leash. 
In particular, it is nowhere near reasonable to need to wait hours for 
things to time out, especially if you might be doing special things like 
setting up dedicated network segments/connections/interfaces.

On Linux, the keepalive values are:
- time - 7200s - how long a connection needs to be idle before probes 
start to be sent
- interval - 75s - how long between keepalive probes
- probes - 9 - how many probes before giving up

Those defaults (effective since Linux 2.4) mean that a dead connection won't 
time out for about 2 hours and 11 minutes (7200s + 9 x 75s = 7875s).  Other 
Unix flavours most likely have similar values, and they're all most likely 
rather too long.
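
The worst-case detection time follows directly from the three values: the 
idle time, plus one interval per unanswered probe. A small sketch of that 
arithmetic is below; the function names are illustrative, and the shorter 
settings in the second call are hypothetical values chosen only to show the 
effect, not a recommendation.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Worst-case seconds before TCP keepalive declares an idle peer dead."""
    return idle + interval * probes


def format_hms(seconds: int) -> str:
    """Render a duration in seconds as hours, minutes and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Linux defaults quoted above: 7200s idle, 75s interval, 9 probes.
print(format_hms(keepalive_detection_seconds(7200, 75, 9)))  # 2h 11m 15s

# Hypothetical shorter settings, for comparison only.
print(format_hms(keepalive_detection_seconds(60, 10, 3)))  # 0h 01m 30s
```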
