[Spread-users] detecting network outages
Matt Garman
matthew.garman at gmail.com
Wed Jul 8 12:10:24 EDT 2009
On Thu, Jul 02, 2009 at 08:26:34PM -0400, Daniel F. Savarese wrote:
> In message <20090702221813.GA14352 at sewage>, Matt Garman writes:
> >How can the programs on both R and S reliably detect when there
> >is a network failure between R and S?
>
> Assuming I understand your question correctly, that's what group
> membership messages are for. You have to enable the receipt of
> group membership messages in your application. The Spread daemon
> will send your application a group membership message with the new
> group view after a network partition event (either disconnection
> or reconnection).

I have enabled the receipt of group membership messages, but I can't
see how that will help.

Let me change the scenario from my original one:

SystemA: has a spread client program, but no spread daemon. It
connects to a spread daemon on SystemB (via SP_connect(), with the
receipt of group membership messages enabled). It does an
SP_join(), then calls SP_receive() in a loop (i.e. receive data,
process, receive data, process, etc).
SystemB: has the spread daemon, as well as a client program that
is sending (SP_multicast()) in a loop.
ASCII graphic:
+------------------+                     +------------------+
|     SystemB      |-------network-------|     SystemA      |
+------------------+                     +------------------+
|  sender program  |                     | receiver program |
+------------------+                     +------------------+
|  spread daemon   |
+------------------+
I did the following test:
  1. Start both programs, everything is working as expected.
  2. Take down the network interface on SystemB
       - The sending program continues to send, as though nothing
         has changed.  No group membership messages are received.
       - The receiving program is no longer receiving data, and is
         just "stuck" in SP_receive().  No group membership
         messages are received (which makes sense---the network
         connection to the spread daemon has been severed).
  3. I leave the network interface down for 31 minutes, then bring
     it back up.
       - Both sender and receiver programs are unchanged from
         before: sender still sending, receiver still "stuck" in
         SP_receive(), and no group membership messages received on
         either end.
  4. I continue to observe both programs for another 10 minutes to
     see if there is any change---there isn't.
So that's the crux of the problem I'm trying to solve now: if my
client program connects to a remote spread daemon, then calls
SP_receive(), and the network connection goes down, is there any way
for that client program to recognize what has happened?
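
To make the question concrete, here is a minimal sketch of one possible
workaround. It is not something confirmed in this thread. It assumes
that the Spread mailbox returned by SP_connect() is an ordinary socket
descriptor that can be passed to select(), and that the sender
multicasts often enough that a long silence is a meaningful signal. The
helper names wait_readable() and demo_wait() are hypothetical, and a
local socket pair stands in for the connection to the daemon so the
example runs without Spread installed.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Wait until fd is readable or timeout_sec seconds pass.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 * (Hypothetical helper, not part of the Spread API.) */
int wait_readable(int fd, int timeout_sec)
{
    for (;;) {
        fd_set readable;
        struct timeval tv;

        FD_ZERO(&readable);
        FD_SET(fd, &readable);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        int rc = select(fd + 1, &readable, NULL, NULL, &tv);
        if (rc < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry with a fresh timeout */
        if (rc < 0)
            return -1;
        return rc > 0 ? 1 : 0;
    }
}

/* Stand-in for a daemon connection: a local socket pair.
 * If send_first is nonzero, one byte is written before waiting,
 * simulating traffic; otherwise the peer stays silent.
 * Returns the result of wait_readable(). */
int demo_wait(int send_first, int timeout_sec)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send_first && write(sv[0], "x", 1) != 1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    int rc = wait_readable(sv[1], timeout_sec);

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

In the receiver, the idea would be to call wait_readable(mbox, N) before
each SP_receive() and treat a timeout as "possibly disconnected", for
example by disconnecting and reconnecting. A timeout only shows that
nothing arrived; it cannot distinguish a dead link from an idle sender,
so this would only be useful if the sender (or a separate heartbeat
message) guarantees regular traffic.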
Thanks again,
Matt