[Spread-users] spread daemon rejects connections from a disconnected client

Ryan Caudy rcaudy at gmail.com
Wed Nov 10 21:40:50 EST 2004


Well, I certainly think that Spread should be checking return codes a
bit more robustly.  I've only looked at 3.17.3, but it does a
non-blocking recv call and doesn't make sure that the full length it
expected was actually read.  A return of 0 is legal and should be
checked for explicitly; it usually means that the socket was closed
from the other end.  Do you know what was going on in the client
library?

Cheers,
Ryan


On Wed, 10 Nov 2004 14:15:05 +0200, Shlomi Yaakobovich
<shlomi at exanet.com> wrote:
> Hi all,
> 
> We are still running spread 3.17.2 in our systems, but for my specific problem, I don't think that matters.
> 
> There are two nodes in our system, each running a spread daemon, and a client is connected locally to each daemon, so the group has only two members. Every few minutes we run spflooder on each node to check that the daemon is responsive and that messages are being sent. The nodes have been up without a problem for about a month, and the spread client in particular has been running for that whole month.
> 
> The problem we see is that when one of the nodes was shut down, the other node started to experience spread problems. The spread client disconnected from the spread daemon (good), but it failed to reconnect to it (bad). SP_connect returned -2 (COULD_NOT_CONNECT), and it did so every time this specific client tried to reconnect. The funny thing is that spflooder seems to be unaffected; everything works well for it.
> 
> I am attaching the spread.log and an strace report that show the problem. The first accepted connection is from spflooder and it works well; the next connection is rejected. Looking at the code, I believe this happens in session.c: Sess_accept then calls Sess_accept_continue, which fails in the block that deals with REJECT_VERSION. I don't know if this is accurate, but if it is, it is strange, because the client sends the same data every time, and it usually works. Looking at the strace, recv returned 0, and I don't see this case handled in Sess_accept_continue...
> 
> The workaround was to restart the spread daemon, after which everything went back to normal and the client succeeded in reconnecting. The fact that a restart solved the problem might suggest that something is wrong on the daemon side.
> 
> Any ideas?
> 
> Shlomi
> 
> 
> 


-- 
---------------------------------------------------------------------
Ryan W. Caudy
<rcaudy at gmail.com>
---------------------------------------------------------------------
Bloomberg L.P.
<rcaudy1 at bloomberg.net>
---------------------------------------------------------------------
[Alumnus]
<caudy at cnds.jhu.edu>         
Center for Networking and Distributed Systems
Department of Computer Science
Johns Hopkins University          
---------------------------------------------------------------------



