[Spread-users] detecting network outages

John Schultz jschultz at spreadconcepts.com
Wed Jul 8 14:23:26 EDT 2009


The Spread daemon definitely should have recognized the remote client's connection as dead, because it should have been trying continuously to send the sender's traffic to it.  The daemon's TCP socket to the remote client should have errored out with some kind of OS/network error.  Failing that, even if the daemon got an EAGAIN kind of error (i.e., TCP/IP buffers full), then after 1000 buffered msgs it should have kicked the client for not reading its messages.
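
To make the "kick after 1000 buffered msgs" rule concrete, here is a minimal sketch of that kind of policy. It is not Spread's actual code: the class name `ClientSession`, the `enqueue` method, and the exact threshold semantics (kick when one more message would exceed the limit) are assumptions made for illustration only.

```python
from collections import deque


class ClientSession:
    """Hypothetical per-client send queue with a slow-reader limit."""

    def __init__(self, max_buffered=1000):
        self.pending = deque()        # messages the client has not read yet
        self.max_buffered = max_buffered
        self.connected = True

    def enqueue(self, msg):
        """Buffer msg for the client; return False if the client is gone."""
        if not self.connected:
            return False
        if len(self.pending) >= self.max_buffered:
            # Client is not draining its messages: drop it and free the queue.
            self.connected = False
            self.pending.clear()
            return False
        self.pending.append(msg)
        return True
```

The point of such a limit is that a daemon never blocks on one stuck receiver: either the socket reports an error, or the backlog hits the limit and the client is dropped.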

So, it sounds to me like your OS is doing something weird (i.e., something it shouldn't be doing) with the TCP/IP sockets when you disable the network interface.  Normally, any connected, active channels across that interface should break quite quickly with an OS/network error.

TCP/IP does leave some tolerance: if no traffic is flowing, a channel can remain valid through significant periods (e.g., minutes) of partition before it times out with an error.  However, that shouldn't be the case here, as your sender is sending traffic that should be trying to go down the receiver's channel continuously.  Also, the fact that the connection doesn't resume normally after you re-enable the interface indicates that this isn't TCP/IP handling a normal partition of some sort.

From the receiver's side, an easier way to achieve the same effect as SO_KEEPALIVE would be to have your clients send a hello msg periodically (e.g., every few seconds), which forces the TCP/IP channel to figure out whether it is alive or not.  In the scenario you laid out, the remote client should then quickly get a failure on the socket.
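
For comparison, the SO_KEEPALIVE route looks roughly like the sketch below on a plain TCP socket. The helper name `enable_keepalive` and the timing values are assumptions for illustration. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are platform-specific (available on Linux), so the sketch only sets the ones the local Python `socket` module exposes. How you get at the underlying socket of a Spread connection from your client library is not covered here.

```python
import socket


def enable_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Turn on TCP keepalive and, where supported, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": True}
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the error
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)  # None where the platform lacks it
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied
```

With the values above, and assuming all three options are supported, an idle dead connection would be reported after roughly idle_s + interval_s * probes seconds instead of the usual multi-hour default.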

Here's something to try -- rather than disabling the interface through the OS (e.g., ifconfig), simply pull the network cable out of the NIC.  So long as your OS doesn't do something dumb like auto-disable the interface, this should give you the behavior that you expect: the server should disconnect the remote client, the sender will see the remote client leave the group through a membership message of type disconnect, and the receiver may or may not bomb out quickly with an OS/network error.  If you have your remote client send hellos, then it should quickly bomb out with an OS/network error.
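
The periodic hello can be sketched as a small background thread. Everything named here is hypothetical: `HelloSender` is an illustrative helper, and `send_hello` stands for whatever call your client uses to send a small message through its Spread connection. The sketch assumes that call raises `OSError` when the socket fails. Note that the first send after a partition may still succeed into the local TCP buffer, so the error can surface on a later hello.

```python
import threading


class HelloSender:
    """Call send_hello() every interval_s seconds until it fails or stop()."""

    def __init__(self, send_hello, interval_s=2.0, on_failure=None):
        self._send_hello = send_hello
        self._interval_s = interval_s
        self._on_failure = on_failure
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.error = None  # set to the exception that ended the loop, if any

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def join(self, timeout=None):
        self._thread.join(timeout)

    def _run(self):
        # Event.wait() returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            try:
                self._send_hello()
            except OSError as exc:
                self.error = exc
                if self._on_failure is not None:
                    self._on_failure(exc)
                return
```

A typical use would pass an `on_failure` callback that closes the connection or sets a flag, so that the part of the program waiting on incoming messages can notice the outage and reconnect.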

Cheers!
John

---
John Lane Schultz
Spread Concepts LLC
Phn: 443 838 2200 
Fax: 301 560 8875

Wednesday, July 8, 2009, 1:57:22 PM, you wrote:


> In message <20090708161024.GA18687 at sewage>, Matt Garman writes:
>>So that's the crux of the problem I'm trying to solve now: if my
>>client program connects to a remote spread daemon, then calls
>>SP_receive(), and the network connection goes down, is there any way
>>for that client program to recognize what has happened?

> I apologize.  It didn't register with me the first time that one
> program was connecting to the Spread daemon over the network
> even though that's what your diagram showed.  I'm so used to always
> having applications connect to a local daemon that I forget about the
> other use case.

> If you're going to connect to a remote daemon, the only way I know of
> for the client to detect a loss of connection to the Spread daemon
> is via the usual TCP/IP client-server programming techniques.  This
> would involve setting the SO_KEEPALIVE socket option and then using
> a platform-specific method to reduce the idle time before keepalive
> probes are sent (default is usually 2 hours, which appears to be too
> much for your application).  On Linux, you can use the TCP_KEEPIDLE
> socket option for IPPROTO_TCP sockets.  On Solaris, you can use the
> TCP_KEEPALIVE option, which I thought had been included in POSIX.1g,
> but I can find no reference to it in the current standard.

> If your application can support it, it would be simpler to deploy a
> Spread daemon on each communicating node.  That way, you can let
> Spread detect network partitions, at which point a group membership
> message reporting the new view will be delivered to all group members.

> daniel






