[Spread-users] detecting disconnect?

Thu Sep 6 15:16:05 EDT 2007

SP_poll is simply a wrapper around an ioctl(fd, FIONREAD, &unread) call, 
which returns the number of bytes ready to be read from the file 
descriptor.
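
For illustration, here is a minimal sketch of that kind of check. It is not 
Spread's actual source: bytes_unread() and check_bytes_unread() are 
hypothetical names, and a local AF_UNIX socket pair stands in for the 
client's connection to the daemon. It assumes a Unix-like system where 
<sys/ioctl.h> defines FIONREAD.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not Spread code): number of bytes queued for
 * reading on fd, or -1 if the ioctl() call itself fails. */
int bytes_unread(int fd)
{
    int unread = 0;

    if (ioctl(fd, FIONREAD, &unread) < 0)
        return -1;
    return unread;
}

/* Self-check on a local socket pair: the count reflects queued data only. */
void check_bytes_unread(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;

    assert(bytes_unread(sv[0]) == 0);   /* nothing has been sent yet */

    ssize_t sent = write(sv[1], "hello", 5);
    assert(sent == 5);
    (void)sent;

    assert(bytes_unread(sv[0]) == 5);   /* five bytes are waiting */

    close(sv[0]);
    close(sv[1]);
}
```

The count says how much data is queued and nothing about whether the peer 
is still there.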

Given that the other end of the client's socket is gone because the daemon 
died, there isn't any new data to be read, so ioctl returning zero is 
probably correct.

You could always augment or change SP_poll to call select() or poll() with 
a zero timeout. Either one reports the socket as readable when the remote 
end of a connected socket disappears (subject to TCP's link detection), and 
a subsequent read then returns zero bytes or an error.
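
A rough sketch of that approach follows. It is not a patch to SP_poll: 
peer_gone() and check_peer_gone() are hypothetical names, the raw file 
descriptor is assumed to be available to the caller, and a local AF_UNIX 
socket pair stands in for the daemon connection. After select() reports the 
descriptor readable, a one-byte recv() with MSG_PEEK tells queued data apart 
from end-of-file without consuming anything.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not Spread code). Returns 1 if the peer has closed
 * the connection or the socket is in error, 0 if it still looks alive,
 * and -1 if fd is unusable with select() or select() fails. */
int peer_gone(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0};          /* zero timeout: never block */
    char byte;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;                        /* no data and no end-of-file */

    /* Readable: peek one byte to distinguish data from end-of-file. */
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK);
    if (n > 0)
        return 0;                        /* real data is waiting */
    if (n == 0)
        return 1;                        /* peer closed the connection */
    return (errno == EINTR || errno == EAGAIN) ? 0 : 1;
}

/* Self-check: once the peer closes, FIONREAD still reports zero bytes,
 * while the select()-based check notices the disconnect. */
void check_peer_gone(void)
{
    int sv[2];
    int unread = -1;
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    assert(peer_gone(sv[0]) == 0);       /* connected and idle */

    close(sv[1]);                        /* simulate the daemon dying */

    rc = ioctl(sv[0], FIONREAD, &unread);
    assert(rc == 0 && unread == 0);      /* same answer as an idle socket */
    (void)rc;
    assert(peer_gone(sv[0]) == 1);       /* disconnect is detected */

    close(sv[0]);
}
```

If data is still queued when the peer closes, peer_gone() keeps returning 0 
until that data has been read, so nothing already delivered is lost.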

---
John Schultz
Spread Concepts
Phn: 443 838 2200

On Thu, 6 Sep 2007 matthew.garman at gmail.com wrote:
>
> We have several spread daemons running across several machines.
>
> I wrote a basic program to monitor the state of these daemons.  The
> program sends out a message on every channel every 30 seconds.
> Likewise, the program listens for spread messages on every channel
> from every other machine.
>
> For various reasons, I don't use select() or the spread event loop
> to detect when messages are available---I just use SP_poll() in a
> tight loop, sleeping about two seconds between every call.
>
> Last night, at least one of the daemons for a particular channel
> went down (either crashed or reset itself).  This is based on the
> fact that I can see in the log that the membership for that spread
> channel changed.  Likewise, when my monitoring program tried to send
> out its messages, SP_multicast() returned an error (not sure which
> one), then my program went into a loop calling SP_disconnect,
> SP_connect, and SP_join() until all three succeeded.  It looped many
> times, with SP_join returning -2.
>
> So it looks like the spread daemon went down for some time.
>
> However, none of the calls to SP_poll() returned an error status.
> Likewise, even though messages were being sent out on that channel
> (after the daemon was back up), none of the peers were receiving the
> messages (i.e. SP_poll() always returned zero).
>
> Another thing to note is that my sending and receiving mboxes are
> different.
>
> So it appears that the temporary down time of the spread daemon
> caused my receiving mboxes to become corrupt or invalid.  How can I
> detect this?  Or is this indicative of some other problem?