[Spread-users] Error

Sat Jan 19 21:09:09 EST 2002

On Sat, Jan 19, 2002 at 06:41:32AM -0500, Tim Peters wrote:
> [Jonathan Stanton]
> > The problem I remember did have to do with thread behavior when a
> > disconnect occurred on a socket. The problem is that there are races
> > when one thread gets a socket error and closes a socket (in the
> > libsp code) and other threads are also trying to use that socket. I
> > think it actually only happened when the socket was immediately
> > reconnected and the socket number (fd) got reused. We know how to
> > fix it, and I just don't recall if we have already integrated the
> > fix or not.
> 
> I'm pretty sure we've seen this happen under 3.16.1, so I don't think a fix
> has been released yet.
Yes, I'm pretty sure you are right.

> 
> [Guido van Rossum]
> > I find it kind of strange that Spread closes the socket file
> > descriptor; it would have been safer for the user if it just marked
> > that mbox as "bad" without actually closing it (the reason being the
> > file descriptor reuse case you describe).  I had to put a bandaid
> > around this problem in the Python wrapper (this bandaid isn't on the
> > distribution on the web yet).
> 
> "The bandaid" is to set our own mbox wrapper object's disconnected flag to
> true upon seeing CONNECTION_CLOSED or ILLEGAL_SESSION come back from Spread,
> right?  Alas, that doesn't really solve it, just makes it more unlikely:

Our proposed fix is to not close fds when we get errors, but rather just
mark them inactive and keep returning errors to the application if it
makes SP calls on that mbox, until it calls SP_disconnect() on it.
SP_disconnect() will be the only place that actually closes the fd.
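
A rough model of that idea, in Python rather than the libsp C code. The
names Mbox and ConnectionClosed are made up for illustration and are not
part of Spread; the only point is where the close happens:

```python
class ConnectionClosed(Exception):
    """Stand-in for an error code such as CONNECTION_CLOSED (illustrative)."""


class Mbox:
    """Hypothetical model of one mbox; not the real libsp data structure."""

    def __init__(self, sock):
        self.sock = sock      # any object with send() and close()
        self.active = True

    def send(self, data):
        if not self.active:
            # Keep returning errors until the application disconnects.
            raise ConnectionClosed("mbox is marked inactive")
        try:
            return self.sock.send(data)
        except OSError:
            # Mark inactive but do NOT close: the fd number stays reserved,
            # so the OS cannot hand it out to a new connection.
            self.active = False
            raise ConnectionClosed("socket error") from None

    def disconnect(self):
        # The only place the fd is actually closed.
        self.active = False
        self.sock.close()
```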

The only iffy part about this is old code that was lazy and didn't call
SP_disconnect() itself when an error was encountered, and instead just
called SP_connect() again: it will now leak fds. I'm probably OK with this,
but it needs to be documented and a warning given.

> because we release the global interpreter lock around the Spread API calls,
> this (for example) is possible:
> 
> Thread A                         Thread B
> call Python mbox.receive()
> passes self->disconnected check
> releases GIL
>                                  calls Python mbox.multicast()
>                                  passes self->disconnected check
> calls Spread SP_receive()
>                                  releases GIL
> reacquires GIL
> sees CONNECTION_CLOSED
> sets self->disconnected
>            *** an arbitrarily long time can pass here ***
>                                  calls Spread SP_multicast, now with
>                                      a recycled mbox descriptor

With the SP fix I propose above, this won't happen unless thread A calls
SP_disconnect() without verifying that no threads are still using the mbox.
If it verifies (maybe with a counting semaphore) that no one is currently
using the mbox, then it can disconnect it. You might have a better solution
to this.
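
One way to do that check, sketched in Python rather than in the wrapper's C
code. GuardedMbox, call and close_fn are hypothetical names, close_fn stands
in for SP_disconnect(), and a condition variable with a user count plays the
role of the counting semaphore:

```python
import threading


class GuardedMbox:
    """Hypothetical wrapper that delays the close until no call is in flight."""

    def __init__(self, close_fn):
        self._close_fn = close_fn          # stand-in for SP_disconnect()
        self._cond = threading.Condition()
        self._users = 0
        self._disconnected = False

    def call(self, fn):
        # Check the flag and register as a user in one atomic step.
        with self._cond:
            if self._disconnected:
                raise RuntimeError("mbox is disconnected")
            self._users += 1
        try:
            return fn()                    # runs without the lock held
        finally:
            with self._cond:
                self._users -= 1
                self._cond.notify_all()

    def disconnect(self):
        with self._cond:
            if self._disconnected:
                return
            self._disconnected = True      # refuse new callers from now on
            while self._users:             # wait for in-flight calls to finish
                self._cond.wait()
            self._close_fn()
```

In this sketch disconnect() blocks until every in-flight call returns, so it
must not be invoked from inside call(), and a thread stuck in a blocking
receive would hold it up. That is a design choice of the sketch, not a claim
about how the fix should be done.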

Jonathan
-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------