[Spread-users] newbie question - automatic daemon failover

Sun Nov 23 14:05:54 EST 2003

On Sun, 23 Nov 2003, Ophir Bleiberg wrote:
>
> After being impressed with the apparent performance, I am concerned about
> fault tolerance - I could see no mention of daemon failover in the
> documentation, but this must be supported somehow, right?

Daemon failover is supported in Spread by allowing you to concurrently run
many daemons in the same system configuration.  Quite often, for a given
application process, people will connect to the system through a "single"
daemon known to that process (e.g. "4803@machine5.domain.com" or
"4803@localhost").  This approach has the drawback that if an
application's known daemon is down, the application doesn't know how
to fail over to another daemon in the system.

If you don't like this approach, then I suggest leveraging DNS for
failover.  If an application's known daemon is down, then do a DNS lookup
of "spread_daemons.domain.com."  Associate all the daemons in your Spread
system with that DNS name and have DNS hand them out round-robin.  Better
yet, you could associate all the daemons in a LAN segment with a name
(e.g. segment1.domain.com) and try that name first.  If no daemons are
available in the LAN, then try the global name.  This should be an
adequate solution for daemon failover for most any application.
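
The lookup order described above (segment name first, then the global
name, trying each address handed back by DNS) could be sketched roughly
as follows.  This is only an illustration: DaemonFailover, Resolver,
Connector and connectAny are hypothetical names, not part of the Spread
API, and the actual connect call is left to the caller.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; none of these names come from the Spread library. */
final class DaemonFailover {

    /** Assumed hook: open a client connection to one daemon, or throw. */
    interface Connector<C> {
        C connect(InetAddress daemon) throws Exception;
    }

    /** Assumed hook: map a DNS name to the addresses registered under it. */
    interface Resolver {
        InetAddress[] resolve(String name) throws UnknownHostException;
    }

    /**
     * Tries each name in order (e.g. the LAN segment name, then the global
     * name) and, for each name, every address returned for it.  Returns the
     * first connection that succeeds, or empty if every daemon failed.
     */
    static <C> Optional<C> connectAny(List<String> names,
                                      Resolver resolver,
                                      Connector<C> connector) {
        for (String name : names) {
            InetAddress[] daemons;
            try {
                daemons = resolver.resolve(name);
            } catch (UnknownHostException e) {
                continue; // name did not resolve; move on to the next name
            }
            for (InetAddress daemon : daemons) {
                try {
                    return Optional.of(connector.connect(daemon));
                } catch (Exception e) {
                    // this daemon is down or refused us; try the next one
                }
            }
        }
        return Optional.empty();
    }
}
```

In a real client you would presumably pass InetAddress::getAllByName as
the resolver and a connector that wraps your library's connect call
(SpreadConnection.connect in the Java API; check its Javadoc for the
exact arguments).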

> 	When running two (Java) clients on a single group, I have them both
> connect to a single daemon out of 2 running in my configuration.  When
> taking down the daemon they have connected to, I receive a:
> 
>         spread.SpreadException: write(): java.net.SocketException: Broken pipe
>         at spread.SpreadConnection.multicast(SpreadConnection.java:1886)
>        .
>        .
>        .
> 
> 	and the messages, naturally, do not get delivered.  It seems
> pointless to develop redundant applications over spread if the daemons
> themselves are a single point of failure, so this is probably taken care of.
> Could you give some idea of what I should do?

As for the broken pipes you are seeing from the Spread daemons, I can almost
guarantee that they are not due to a daemon crash.  They are almost surely
caused by the Spread daemon explicitly disconnecting your process because it
is not receiving messages from Spread fast enough.  Remember that as
messages are sent to a message group, Spread holds (buffers) them for
client receipt.  If a client isn't keeping up with the speed of messages
in a group, Spread's buffers begin to grow.  If those buffers get too big,
Spread disconnects you for using up too much of its memory.

So your application must implement some level of flow control, or be
structured in such a way that the rate of reading is always faster than
the rate of sending.  Flow control is discussed a lot in the mail archives 
-- here is a good resource on that discussion:

http://lists.spread.org/pipermail/spread-users/2002-March/000655.html
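
One possible shape for application-level flow control is a send window:
the sender allows only a fixed number of unacknowledged messages, and
receivers periodically acknowledge what they have read.  The sketch
below is an assumption of mine, not something Spread provides and not
necessarily what the linked thread recommends.  SendWindow and its
methods are hypothetical names, and the acknowledgements would have to
be ordinary application messages that you define yourself.

```java
/** Hypothetical sender-side window; not part of the Spread API. */
final class SendWindow {
    private final int maxOutstanding; // cap on unacknowledged messages
    private int outstanding;          // messages sent but not yet acked

    SendWindow(int maxOutstanding) {
        if (maxOutstanding <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        this.maxOutstanding = maxOutstanding;
    }

    /** Call before each send; false means the window is full, so hold off. */
    synchronized boolean tryReserve() {
        if (outstanding >= maxOutstanding) {
            return false;
        }
        outstanding++;
        return true;
    }

    /** Call when a receiver acknowledges that it has read 'count' messages. */
    synchronized void onAck(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        outstanding = Math.max(0, outstanding - count);
    }

    /** Number of messages currently sent but not yet acknowledged. */
    synchronized int outstanding() {
        return outstanding;
    }
}
```

The sender would call tryReserve() before each multicast and back off
while it returns false.  With several receivers, the slowest one should
drive the acknowledgements, since that is the client whose buffer grows
in the daemon.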

Hope that helps!

--
John Lane Schultz
Spread Concepts, LLC
Phn: 443 838 2200