[Spread-users] detecting network outages

Wed Jul 8 12:10:24 EDT 2009

On Thu, Jul 02, 2009 at 08:26:34PM -0400, Daniel F. Savarese wrote:
> In message <20090702221813.GA14352 at sewage>, Matt Garman writes:
> >How can the programs on both R and S reliably detect when there
> >is a network failure between R and S?
> 
> Assuming I understand your question correctly, that's what group
> membership messages are for.  You have to enable the receipt of
> group membership messages in your application.  The Spread daemon
> will send your application a group membership message with the new
> group view after a network partition event (either disconnection
> or reconnection).

I have enabled the receipt of group membership messages, but I can't
see how that will help.

Let me change the scenario from my original one:

SystemA: has a spread client program, but no spread daemon.  It
connects to a spread daemon on SystemB (via SP_connect(), with the
receipt of group membership messages enabled).  It does an
SP_join(), then calls SP_receive() in a loop (i.e. receive data,
process, receive data, process, etc).

SystemB: has the spread daemon, as well as a client program that
is sending (SP_multicast()) in a loop.

ASCII graphic:

+------------------+                     +------------------+
| SystemB          |-------network-------| SystemA          |
+------------------+                     +------------------+
| sender program   |                     | receiver program |
+------------------+                     +------------------+
| spread daemon    |
+------------------+

I did the following test:

    1. Start both programs; everything works as expected.
    2. Take down the network interface on SystemB.
        - The sending program continues to send, as though nothing
          has changed.  No group membership messages are received.
        - The receiving program is no longer receiving data, and is
          just "stuck" in SP_receive().  No group membership
          messages are received (which makes sense---the network
          connection to the spread daemon has been severed).
    3. I leave the network interface down for 31 minutes, then bring
       it back up.
        - Both sender and receiver programs are unchanged from
          before: sender still sending, receiver still "stuck" in
          SP_receive(), and no group membership messages received on
          either end.
    4. I continue to observe both programs for another 10 minutes to
       see if there is any change---there isn't.

So that's the crux of the problem I'm trying to solve now: if my
client program connects to a remote spread daemon, then calls
SP_receive(), and the network connection goes down, is there any way
for that client program to recognize what has happened?
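
One workaround I can imagine, as a minimal sketch rather than a known
Spread idiom: wait on the connection with a timeout before calling
SP_receive(), and treat a long silence as a suspected outage.  This
assumes the mailbox filled in by SP_connect() can be handed to poll()
as an ordinary descriptor.  The names wait_readable() and
demo_with_pipe() are made up for illustration, and a pipe stands in
for the daemon connection so the example builds without Spread.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Outcome of waiting for data on a descriptor. */
enum wait_result { WAIT_READY = 1, WAIT_TIMEOUT = 0, WAIT_ERROR = -1 };

/*
 * Wait up to timeout_ms for fd to become readable.
 * Assumption: in a Spread client, fd would be the mailbox from
 * SP_connect(), treated here as a plain int descriptor.
 */
static enum wait_result wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait (this restarts the timeout). */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return WAIT_ERROR;
    if (rc == 0)
        return WAIT_TIMEOUT;  /* nothing arrived: suspected outage */
    /* Readable, hung up, or in error: the next receive call reports which. */
    return WAIT_READY;
}

/*
 * Stand-in demo: a pipe plays the role of the daemon connection.
 * Returns 0 if a silent pipe times out and a written pipe is ready.
 */
static int demo_with_pipe(void)
{
    int p[2];
    int ok;

    if (pipe(p) != 0)
        return -1;
    ok = wait_readable(p[0], 100) == WAIT_TIMEOUT;      /* silent link */
    ok = ok && write(p[1], "x", 1) == 1;
    ok = ok && wait_readable(p[0], 100) == WAIT_READY;  /* data pending */
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

In the receiver, the loop would call wait_readable() on the mailbox
and only call SP_receive() on WAIT_READY.  A timeout alone cannot
distinguish a dead link from an idle sender, so this only helps if the
sender is expected to transmit at least once per timeout period.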

Thanks again,
Matt