[Spread-users] Question about thread-safety

John Schultz jschultz at d-fusion.net
Tue Jun 17 13:50:54 EDT 2003


Joshua Goodall wrote:

> On Fri, Jun 13, 2003 at 02:31:05PM -0400, John Schultz wrote:
> 
>>The only way I can think of to avoid this race condition with the 
>>current Spread library is to instruct your OS not to reuse file 
>>descriptor IDs (see NOTE above).
> 
> I can't think of a way to achieve that. 
> 

Yes, that seems to be the consensus. This means that Spread's user 
library is currently not correct for multi-threaded applications that 
open and close multiple Spread connections concurrently. In particular, 
its habit of closing the file descriptor (via SP_kill) as soon as any 
error is detected is BAD -- the file descriptor should only be closed by 
an explicit USER call (e.g., SP_disconnect or SP_kill).

> As a band-aid, you could change the mailbox parameter to be an index
> into a lookup table.  You'd need a smaller wrapper for the lookup
> and another for SP_connect.  Assuming you can manage the table size
> effectively, you can achieve a monotonically increasing "mbox" value
> without modification of the application.
> 

Yes, this is exactly what I was thinking -- and what I've already 
implemented for my own library application. If you combine what you 
suggested with something like my fd_protector synchronization from 
before, then you can practically (though not theoretically, as Jonathan 
pointed out) defeat this race condition without foisting this 
synchronization problem onto the end user of the library.
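
A minimal sketch of such a lookup table follows. It is illustrative 
only: it is neither the fd_protector code nor anything from the Spread 
library, and every name in it (ht_register, ht_lookup, ht_release, 
HT_SLOTS) is hypothetical. The assumption is that a wrapper around 
SP_connect would register the real mailbox and hand the returned handle 
to the application, and that wrappers around the other calls would 
translate the handle back before calling into Spread.

```c
#include <pthread.h>
#include <stddef.h>

#define HT_SLOTS 64   /* hypothetical fixed table size */

/* One table slot; handle == 0 marks the slot as free. */
typedef struct {
    long handle;  /* value given to the application in place of the fd */
    int  fd;      /* the real descriptor (e.g. a Spread mailbox) */
} ht_slot;

static ht_slot         ht_table[HT_SLOTS];
static long            ht_next = 1;  /* only ever incremented */
static pthread_mutex_t ht_lock = PTHREAD_MUTEX_INITIALIZER;

/* Store fd under a fresh handle; returns -1 if the table is full. */
long ht_register(int fd)
{
    long h = -1;
    pthread_mutex_lock(&ht_lock);
    for (size_t i = 0; i < HT_SLOTS; i++) {
        if (ht_table[i].handle == 0) {
            h = ht_next++;
            ht_table[i].handle = h;
            ht_table[i].fd = fd;
            break;
        }
    }
    pthread_mutex_unlock(&ht_lock);
    return h;
}

/* Translate a handle to its fd; returns -1 for unknown or stale handles. */
int ht_lookup(long handle)
{
    int fd = -1;
    if (handle <= 0)
        return -1;
    pthread_mutex_lock(&ht_lock);
    for (size_t i = 0; i < HT_SLOTS; i++) {
        if (ht_table[i].handle == handle) {
            fd = ht_table[i].fd;
            break;
        }
    }
    pthread_mutex_unlock(&ht_lock);
    return fd;
}

/* Free the slot; returns the fd so the caller can close it, or -1. */
int ht_release(long handle)
{
    int fd = -1;
    if (handle <= 0)
        return -1;
    pthread_mutex_lock(&ht_lock);
    for (size_t i = 0; i < HT_SLOTS; i++) {
        if (ht_table[i].handle == handle) {
            fd = ht_table[i].fd;
            ht_table[i].handle = 0;
            break;
        }
    }
    pthread_mutex_unlock(&ht_lock);
    return fd;
}
```

Because a released handle is never issued again, a thread holding a 
stale handle gets -1 from ht_lookup instead of silently operating on 
someone else's new connection, even when the OS has recycled the 
underlying fd number. The counter can in principle wrap, which is one 
reason this is only a practical fix, not a theoretical one.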

The Spread user library should be doing all of this itself if it wants 
to be truly thread-safe and thread-friendly, but currently it does not.

It surprises me that POSIX systems don't offer a standard per-process 
option to make open() act differently (return an incrementing counter), 
for two main reasons:

(1) Under the default behavior, operating on file descriptors in 
multiple threads requires global synchronization, which many programmers 
might not realize/remember to do and is not trivial to do properly.

(2) If I know that neither my application nor any of its libraries 
depends on the default behavior, it is more efficient to return an 
incrementing counter instead of finding/remembering the lowest unused one.
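
The default behavior in question is that open() returns the lowest 
descriptor number not currently open in the process. A small sketch of 
why that matters here (the helper name fd_number_is_reused is 
hypothetical, and /dev/null is just a convenient file to open):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open a file, close it, and open again.
 * Returns 1 if the second open() got the same descriptor number as the
 * first, 0 if it got a different one, and -1 if open() failed.
 */
int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;
    close(first);

    /* With no other thread opening files in between, the number that
     * was just freed is the lowest one available again. */
    int second = open("/dev/null", O_RDONLY);
    if (second < 0)
        return -1;

    int reused = (second == first);
    close(second);
    return reused;
}
```

In a multi-threaded program the second open() could just as well be in 
another thread, so a thread still holding the old number ends up 
reading from, writing to, or closing a descriptor it does not own.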

James Rauser wrote:
 >
 > I can vouch for the fact that this is a problem; in my context it was
 > occurring in a server which forked child processes to handle requests.
 > The children each called close() on the inherited spread mailbox,
 > then reestablished their own connection with SP_connect().   The
 > children can't call SP_disconnect(), because we don't really want to
 > disconnect it.  But: the child's copy of spread's internal session
 > table wasn't updated, so if the new SP_connect() call obtained the
 > same numerical FD, the library got confused.
 >
This is a closely related problem that I hadn't even thought about. I 
didn't realize SP_disconnect actually tried to signal the daemon that it 
was disconnecting that mailbox. It seems that in this case SP_kill should 
be exported as a public function to support your usage. That would be my 
recommendation to you as well: just edit sp.c and sp.h to make SP_kill 
public, and use it in your child process to clean up the unwanted mailbox.

-- 
John Schultz
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443-838-2200 Fax: 707-885-1055




