[Spread-users] detecting disconnect?

matthew.garman at gmail.com matthew.garman at gmail.com
Thu Sep 6 14:04:24 EDT 2007


We have several spread daemons running across several machines.

I wrote a basic program to monitor the state of these daemons.  The
program sends out a message on every channel every 30 seconds.
Likewise, the program listens for spread messages on every channel
from every other machine.

For various reasons, I don't use select() or the Spread event loop
to detect when messages are available.  I just call SP_poll() in a
loop, sleeping about two seconds between calls.
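
To make the setup above concrete, here is a minimal sketch of one pass of such a loop.  It is not the actual monitoring program.  The names monitor_tick, poll_fn, send_fn, fake_poll and fake_send are hypothetical, and the callbacks stand in for wrappers around SP_poll() and SP_multicast().  The poll callback is assumed to follow the convention described in this message: a positive value when data is waiting, zero when nothing is waiting, negative on error.

```c
#include <assert.h>
#include <stddef.h>

#define SEND_INTERVAL_SEC 30  /* one heartbeat per channel every 30 s */
#define POLL_SLEEP_SEC     2  /* pause the caller takes between passes */

/* Hypothetical stand-ins for the Spread calls; a real monitor would wrap
 * SP_poll() and SP_multicast() on its mailboxes behind these. */
typedef int (*poll_fn)(int channel);  /* >0 data waiting, 0 none, <0 error */
typedef int (*send_fn)(int channel);  /* <0 on error */

struct tick_result {
    int sent;     /* heartbeats sent during this pass */
    int pending;  /* channels with data waiting */
    int errors;   /* channels whose poll or send failed */
};

/* One pass over all channels.  The caller is expected to sleep
 * POLL_SLEEP_SEC seconds between passes. */
struct tick_result monitor_tick(long now, long *last_send, int nchannels,
                                poll_fn poll, send_fn send)
{
    struct tick_result r = {0, 0, 0};
    assert(last_send != NULL && poll != NULL && send != NULL);

    /* Heartbeats go out only once the send interval has elapsed. */
    int due = (now - *last_send) >= SEND_INTERVAL_SEC;
    for (int ch = 0; ch < nchannels; ch++) {
        if (due) {
            if (send(ch) < 0) r.errors++;
            else r.sent++;
        }
        int n = poll(ch);
        if (n < 0) r.errors++;
        else if (n > 0) r.pending++;
    }
    if (due) *last_send = now;
    return r;
}

/* Test doubles: channel 1 has data waiting, channel 2 reports an error. */
int fake_poll(int channel) { return channel == 1 ? 10 : (channel == 2 ? -1 : 0); }
int fake_send(int channel) { (void)channel; return 0; }
```

Note that in this structure a receive-side failure shows up only if the poll callback returns a negative value; a steady stream of zeros is indistinguishable from an idle channel.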

Last night, at least one of the daemons for a particular channel
went down (it either crashed or reset itself).  I infer this from the
log, which shows that the membership for that Spread channel changed.
When my monitoring program next tried to send out its messages,
SP_multicast() returned an error (I am not sure which one).  The
program then went into a loop calling SP_disconnect(), SP_connect(),
and SP_join() until all three succeeded.  It looped many times, with
SP_join() returning -2.
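
For illustration, a retry loop of the kind described above might look roughly like the following.  This is a sketch under assumptions, not the original program: session_ops, reconnect and the fake_* functions are hypothetical names, and the three callbacks stand in for wrappers around SP_disconnect(), SP_connect() and SP_join().  Unlike the loop described above, the sketch ignores a failed disconnect, on the assumption that the old session may already be dead.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical wrappers: a real monitor would call SP_disconnect(),
 * SP_connect() and SP_join() inside these.  Each returns 0 on success
 * and a negative value on error. */
struct session_ops {
    int (*disconnect)(void *ctx);
    int (*connect)(void *ctx);
    int (*join)(void *ctx);
};

/* Retry until connect and join both succeed.  Returns the number of
 * attempts used, or -1 if max_attempts was exhausted. */
int reconnect(const struct session_ops *ops, void *ctx,
              int max_attempts, unsigned delay_sec)
{
    assert(ops && ops->disconnect && ops->connect && ops->join);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        /* Assumption: the old session may already be gone, so the
         * result of the disconnect is deliberately ignored. */
        (void)ops->disconnect(ctx);
        if (ops->connect(ctx) == 0 && ops->join(ctx) == 0)
            return attempt;
        /* Back off before the next attempt instead of spinning. */
        if (delay_sec > 0 && attempt < max_attempts)
            sleep(delay_sec);
    }
    return -1;
}

/* Test double: a daemon that refuses the first down_for connects. */
struct fake_daemon { int down_for; int connects; };

int fake_disconnect(void *ctx) { (void)ctx; return 0; }
int fake_connect(void *ctx)
{
    struct fake_daemon *d = ctx;
    d->connects++;
    if (d->down_for > 0) { d->down_for--; return -1; }
    return 0;
}
int fake_join(void *ctx) { (void)ctx; return 0; }
```

Bounding the number of attempts and sleeping between them keeps the loop from spinning while the daemon is down; the bound and the delay are illustrative choices.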

So it looks like the spread daemon went down for some time.

However, none of the calls to SP_poll() returned an error status.
Moreover, even though messages were again being sent out on that
channel once the daemon was back up, none of the peers received them
(i.e., SP_poll() always returned zero).

Another thing to note is that my sending and receiving mboxes are
different.

So it appears that the temporary down time of the spread daemon
caused my receiving mboxes to become corrupt or invalid.  How can I
detect this?  Or is this indicative of some other problem?
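
One possible application-level check, independent of whatever SP_poll() reports, would be a silence watchdog: since every peer sends on every channel every 30 seconds, a channel that has been quiet for several intervals is suspect.  The sketch below is only an assumption about how that could look.  The names channel_state, note_message, channel_is_stale and find_stale are hypothetical, and the tolerance of three missed heartbeats is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

#define HEARTBEAT_SEC 30  /* peers send on each channel every 30 s */
#define MISSED_LIMIT   3  /* assumed tolerance before declaring silence */

struct channel_state {
    long last_heard;  /* time (seconds) of the last message received */
};

/* Record that a message arrived on this channel at time `now`. */
void note_message(struct channel_state *c, long now)
{
    assert(c != NULL);
    c->last_heard = now;
}

/* A channel is stale once more than MISSED_LIMIT heartbeat intervals
 * have passed without any message. */
int channel_is_stale(const struct channel_state *c, long now)
{
    assert(c != NULL);
    return (now - c->last_heard) > (long)HEARTBEAT_SEC * MISSED_LIMIT;
}

/* Write the indices of stale channels into `out` (capacity n) and
 * return how many were found. */
size_t find_stale(const struct channel_state *chans, size_t n,
                  long now, size_t *out)
{
    size_t count = 0;
    assert(chans != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (channel_is_stale(&chans[i], now))
            out[count++] = i;
    }
    return count;
}
```

A monitor could call note_message() whenever a receive succeeds and, after each polling pass, rebuild the receiving connection for any channel that find_stale() reports.  Whether that rebuild actually cures the problem described here is untested.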

Thank you,
Matt




