[Spread-users] Communicating between 130 hosts with Spread

Tue Aug 6 10:03:18 EDT 2013

> Out of curiosity, do you know of any AIX users?

I personally do not know of any AIX users, but if any are on the list, perhaps they will chime in.

Spread is written in very portable C, uses autoconf to try to detect system specifics and mainly depends on the Berkeley socket interface.  I see little reason why running Spread on a *nix clone like AIX should be difficult.

> I presume that client failover between daemons is something we'd need to handle ourselves. If we do lose communication with a daemon, or a daemon goes down, how quickly will we find out about it?

Yes, you'd have to handle failover in your client application.  How quickly you would find out about a problem depends first on whether or not you are on the same machine as the daemon.

If you are on the same machine, then basically the only failure that can occur is a crash of the Spread daemon.  In that case, if you are connected over IPC, you should get almost immediate notice on your connection that it has failed.

If you are connecting remotely through TCP, then the usual TCP mechanisms would determine when you get notice.  If just the Spread process crashes, then usually you should quickly get notices from its host that the connection is dead.  If the remote machine suddenly fails (e.g. - power failure) or the network suddenly partitions, then you usually need to send some traffic from the client (e.g. - a no-op message to an empty group or yourself) to realize that the host is gone.
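As a rough sketch of the probe-then-fail-over idea above: the client calls a check periodically from its own loop, sends one small probe, and walks the daemon list if the probe fails. Everything named here (struct failover, check_and_failover, the two hook types, and the demo_* stand-ins) is hypothetical and not part of the Spread API. In a real client the connect hook would presumably wrap SP_connect() and the probe hook a small SP_multicast() to the client's own private group, but that wiring is an assumption and is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks, not part of the Spread API. Both return 0 on success
 * and nonzero on failure. */
typedef int (*connect_fn)(const char *daemon, void *ctx);
typedef int (*probe_fn)(void *ctx);

struct failover {
    const char *const *daemons; /* e.g. "4803@host" strings, local site first */
    size_t count;               /* number of entries in daemons */
    size_t current;             /* index of the daemon we believe we are using */
    connect_fn try_connect;
    probe_fn send_probe;
    void *ctx;                  /* passed through to both hooks */
};

/* Call periodically from the application's own loop. Sends one probe; if it
 * fails, tries the other daemons in order, wrapping around and retrying the
 * current daemon last. Returns the index of the daemon in use afterwards,
 * or -1 if none could be reached. */
int check_and_failover(struct failover *f)
{
    if (f->send_probe(f->ctx) == 0)
        return (int)f->current;

    for (size_t step = 1; step <= f->count; step++) {
        size_t idx = (f->current + step) % f->count;
        if (f->try_connect(f->daemons[idx], f->ctx) == 0) {
            f->current = idx;
            return (int)idx;
        }
    }
    return -1;
}

/* Stand-in hooks for demonstration only: the "network" is a struct that
 * records which daemon is alive and which one we are connected to. */
struct demo_net {
    const char *alive;        /* name of the only reachable daemon */
    const char *connected_to; /* daemon we last connected to, or NULL */
};

int demo_connect(const char *daemon, void *ctx)
{
    struct demo_net *net = ctx;
    if (strcmp(daemon, net->alive) != 0)
        return -1;
    net->connected_to = daemon;
    return 0;
}

int demo_probe(void *ctx)
{
    struct demo_net *net = ctx;
    if (net->connected_to == NULL)
        return -1;
    return strcmp(net->connected_to, net->alive) == 0 ? 0 : -1;
}
```

Ordering the daemon list with the local site's daemons first gives the "fail over within a site, then to another site" behaviour discussed earlier in the thread. How often to call the check is a trade-off between detection latency and extra traffic.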

Spread does offer the ability to use TCP's keep-alive semantics.  For them to be actually useful, though, you have to set TCP's keep-alive parameters system wide at the OS level on both sides of the connection, as the default is usually something like two hours of idle time before TCP probes are sent.
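To put numbers on that, here is a small arithmetic sketch of the worst-case time for keep-alive to declare a silent peer dead: the idle time before the first probe, plus one probe interval per unanswered probe. The function and macro names are made up for illustration. The default values are the ones Linux documents for tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes; other systems, AIX included, should be checked separately.

```c
/* Documented Linux defaults; other operating systems may differ. */
#define LINUX_DEFAULT_IDLE_S     7200L /* tcp_keepalive_time   */
#define LINUX_DEFAULT_INTERVAL_S 75L   /* tcp_keepalive_intvl  */
#define LINUX_DEFAULT_PROBES     9L    /* tcp_keepalive_probes */

/* Approximate worst-case seconds before TCP keep-alive gives up on a peer
 * that has silently vanished (power failure, network partition). */
long keepalive_detect_seconds(long idle_s, long interval_s, long probes)
{
    return idle_s + interval_s * probes;
}
```

With the Linux defaults this comes to 7200 + 75 * 9 = 7875 seconds, a little over two hours and eleven minutes. With hypothetical tuned values of 30 seconds idle, a 5 second interval and 3 probes it drops to 45 seconds.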

Thinking about this some more, in the future we might want to add the ability to explicitly probe the TCP connection, but not have it be a full Spread message.  That is, no-op messages would periodically go between the server and client without needing to bother the whole Spread deployment.  The client would still have to explicitly call this function periodically though, as the Spread library does not have its own thread of control.  We'd probably have to use the out-of-band mechanisms of TCP to do this ... I'll need to think about it some more.

Cheers!

-----
John Lane Schultz
Spread Concepts LLC
Cell: 443 838 2200

On Aug 6, 2013, at 6:05 AM, Jim Hague wrote:

Hi John,

I somehow managed to miss your reply - just found it in the list archives.

On Wed Jul 31, John Schultz wrote:
> A spread deployment can usually handle thousands of clients served by at
> most a few tens of daemons.  We typically do not recommend casual users to
> try running Spread deployments bigger than say 40 daemons or so (especially
> across a WAN) and even that size is usually massive overkill for most
> applications.

OK, thanks. I was suspecting that this was the case, but this wasn't 
completely clear from my reading of the user guide.

> The primary advantage of running a daemon on the same machine as client
> applications is that the client-daemon communication doesn't go over the
> network.  This means less traffic on your network and also that certain
> kinds of faults can't really happen (e.g. - network partitions, only one
> side of the connection abruptly fails, etc.) and/or are more easily and
> quickly detected.  Furthermore, on *nix systems you can use IPC rather than
> TCP, which is less intensive on the CPU.  Also, the more daemons you have
> then the more distributed the load of the system (i.e. - handling clients)
> can be across all the daemons.

I should have noted that we're deployed on AIX and Linux/x86 exclusively. I'm 
sure you have a bucketload of experience with Linux. Out of curiosity, do you 
know of any AIX users? Just so I can make a few more reassuring noises at 
management should they choose to take an interest. :-)

> Which structure of deployment is best for you depends on particulars such as
> what kind of throughput in Mb/s and msgs/s you expect to go through the
> system in total, how much redundancy you want in each site, what do you
> want to happen if the WAN partitions, etc.

I think our applications could be described as 'undemanding'. Those we have in 
mind for Spread right now have throughputs of a few kb/s and maybe 10 msg/s. 

> As a rough guess at what might work for you, assuming your application isn't
> all that demanding, I'd guess that one Spread segment for your central
> office and one for each remote site each containing a few (e.g. - 3)
> daemons would suffice.  Clients would (remotely) connect to one of the
> daemons at their local site, potentially failing over within a site if a
> daemon is down or even to another site if all the daemons in its site are
> down.

That's exactly what I was envisaging. I presume that client failover between 
daemons is something we'd need to handle ourselves. If we do lose 
communication with a daemon, or a daemon goes down, how quickly will we find 
out about it?

> There's little need to get too fancy with depending on the network
> structure, VLANs, multicast, etc.  Just run each spread segment in a LAN
> and so long as all the daemons in the system can talk point-to-point with
> all the other daemons in the system and your clients can reach their
> daemons then that would likely work for you.

Good. I'm pretty sure that all the daemons will be able to communicate 
directly with each other, so we should be able to keep it simple.

> If you are having lots of trouble or have more demanding needs than I
> guessed, then my company, Spread Concepts LLC, offers support contracts and
> would be happy to work with you on your particular problems.

At this stage we're investigating the technology. Should we look like moving 
to deployment, I'm sure we'll be wanting your commercial services :-)

> PS - Does ATC stand for Air Traffic Control?

Yes, it does. We are active in that business, mainly in the Czech Republic.
-- 
Jim Hague
jim.hague at laicatc.com              Never trust a computer you can't lift.
LAIC AG                            +44 1865 980647    Mob +44 7941 697732

_______________________________________________
Spread-users mailing list
Spread-users at lists.spread.org
http://lists.spread.org/mailman/listinfo/spread-users