[Spread-users] Question about thread-safety
John Schultz
jschultz at d-fusion.net
Tue Jun 17 13:50:54 EDT 2003
Joshua Goodall wrote:
> On Fri, Jun 13, 2003 at 02:31:05PM -0400, John Schultz wrote:
>
>>The only way I can think of to avoid this race condition with the
>>current Spread library is to instruct your OS not to reuse file
>>descriptor IDs (see NOTE above).
>
> I can't think of a way to achieve that.
>
Yes, it seems that is the consensus. This means that Spread's user
library is currently not correct for multi-threaded applications that
open and close multiple Spread connections concurrently. In particular,
its penchant for closing the file descriptor (SP_kill) as soon as any
error is detected is BAD -- the file descriptor should only be closed by
an explicit USER call (e.g., SP_disconnect or SP_kill).
> As a band-aid, you could change the mailbox parameter to be an index
> into a lookup table. You'd need a smaller wrapper for the lookup
> and another for SP_connect. Assuming you can manage the table size
> effectively, you can achieve a monotonically increasing "mbox" value
> without modification of the application.
>
Yes, this is exactly what I was thinking -- and what I've already
implemented for my own library application. If you combine what you
suggested with something like my fd_protector synchronization from
before, then you can practically (though not theoretically, as Jonathan
pointed out) defeat this race condition without foisting this
synchronization problem onto the end user of the library.

The Spread user library should be doing all of this itself if it wants
to be truly thread-safe and thread-friendly, but currently it does not.
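
For anyone who wants to try the lookup-table band-aid, here is a minimal
sketch of the table itself. Every name in it (ht_register, ht_lookup,
ht_release, HT_SLOTS) is made up for illustration and is not part of the
Spread API; it also assumes the mailbox is a plain int and that a fixed
slot count is acceptable. Handles only ever increase, so a stale handle
can never alias a newer connection the way a reused descriptor number can.

```c
#include <assert.h>
#include <pthread.h>

/* All names here are hypothetical; none of this is Spread API. */
#define HT_SLOTS 64   /* assumed cap on simultaneously open connections */
#define HT_NONE  (-1)

struct ht_entry {
    int handle;       /* value handed to the application, never reused */
    int mbox;         /* real descriptor, e.g. as filled in by SP_connect */
    int in_use;
};

static struct ht_entry ht_table[HT_SLOTS];
static int             ht_next = 1;   /* overflow is not handled in this sketch */
static pthread_mutex_t ht_lock = PTHREAD_MUTEX_INITIALIZER;

/* Record a real mailbox and return a fresh handle, or HT_NONE if full. */
int ht_register(int mbox)
{
    int handle = HT_NONE;
    pthread_mutex_lock(&ht_lock);
    for (int i = 0; i < HT_SLOTS; i++) {
        if (!ht_table[i].in_use) {
            handle = ht_next++;
            ht_table[i].handle = handle;
            ht_table[i].mbox = mbox;
            ht_table[i].in_use = 1;
            break;
        }
    }
    pthread_mutex_unlock(&ht_lock);
    return handle;
}

/* Find the live slot for a handle; the caller must hold ht_lock. */
static struct ht_entry *ht_find(int handle)
{
    for (int i = 0; i < HT_SLOTS; i++)
        if (ht_table[i].in_use && ht_table[i].handle == handle)
            return &ht_table[i];
    return 0;
}

/* Translate a handle to the real mailbox; HT_NONE if stale or unknown. */
int ht_lookup(int handle)
{
    pthread_mutex_lock(&ht_lock);
    struct ht_entry *e = ht_find(handle);
    int mbox = e ? e->mbox : HT_NONE;
    pthread_mutex_unlock(&ht_lock);
    return mbox;
}

/* Retire a handle and return the real mailbox so the caller can close it. */
int ht_release(int handle)
{
    pthread_mutex_lock(&ht_lock);
    struct ht_entry *e = ht_find(handle);
    int mbox = HT_NONE;
    if (e) {
        mbox = e->mbox;
        e->in_use = 0;
    }
    pthread_mutex_unlock(&ht_lock);
    return mbox;
}
```

A wrapper around SP_connect would pass the mailbox it gets back to
ht_register and hand the resulting handle to the application; wrappers
around the other calls would go through ht_lookup first. Note that the
descriptor returned by ht_lookup can still be closed by another thread
before it is used, so this alone only narrows the window. Closing it
completely needs something like the fd_protector locking held across
the whole call.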
It surprises me that POSIX systems don't offer a standard option to make
open() behave differently for my process (returning an incrementing
counter instead of the lowest unused descriptor), for two main reasons:

(1) Under the default behavior, operating on file descriptors in
multiple threads requires global synchronization, which many programmers
might not realize or remember to do, and which is not trivial to do
properly.

(2) If I know that neither my application nor any of its libraries
depends on the default behavior, it is more efficient to return an
incrementing counter than to find and remember the lowest unused
descriptor.
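
To make the default behavior concrete, here is a small sketch (the
function name is mine, purely for illustration). POSIX requires open()
to return the lowest-numbered descriptor not currently open in the
process, so in a single-threaded run a close followed by an open hands
back the same number. In a multi-threaded program, any thread still
holding the old number would now be operating on the new file.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Opens, closes, and reopens a file. Returns 1 if the second open()
 * handed back the same descriptor number as the first, 0 if not,
 * and -1 if either open() failed. Hypothetical helper, for illustration. */
int fd_number_is_reused(const char *path)
{
    int first = open(path, O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                    /* the number is now free again */

    int second = open(path, O_RDONLY);
    if (second < 0)
        return -1;
    close(second);

    return first == second;
}
```

Calling fd_number_is_reused("/dev/null") from a single thread shows the
reuse directly; with other threads opening and closing descriptors at the
same time the result is no longer predictable, which is the whole problem.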
James Rauser wrote:
>
> I can vouch for the fact that this is a problem; in my context it was
> occurring in a server which forked child processes to handle requests.
> The children each called close() on the inherited spread mailbox,
> then reestablished their own connection with SP_connect(). The
> children can't call SP_disconnect(), because we don't really want to
> disconnect it. But: the child's copy of spread's internal session
> table wasn't updated, so if the new SP_connect() call obtained the
> same numerical FD, the library got confused.
>
This is a closely related problem that I hadn't even thought about. I
didn't realize SP_disconnect actually tries to tell the daemon that it
is disconnecting that mailbox. In this case it seems that SP_kill should
be exported as a public function to support your usage. That would be my
recommendation to you as well: edit sp.c and sp.h to make SP_kill
public, and use it in your child process to clean up the unwanted mailbox.
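
The failure mode James describes can be reproduced in miniature. The
sketch below uses a toy per-descriptor table as a stand-in for the
library's session bookkeeping (it is not Spread's actual data structure,
and all names are invented). The child closes the inherited descriptor
with a bare close(), so its copy of the table is never updated, and the
next open() returns the same number with a stale entry still attached.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TOY_MAX_FD 1024
/* Toy stand-in for a library's per-descriptor session table. */
static int toy_session_active[TOY_MAX_FD];

/* Returns 1 if the child found a stale table entry for its freshly
 * opened descriptor, 0 if it did not, and -1 on error. */
int child_sees_stale_session(void)
{
    int fd = open("/dev/null", O_RDONLY);   /* stands in for the parent's mailbox */
    if (fd < 0)
        return -1;
    if (fd >= TOY_MAX_FD) {
        close(fd);
        return -1;
    }
    toy_session_active[fd] = 1;             /* "library" records the session */

    pid_t pid = fork();
    if (pid < 0) {
        toy_session_active[fd] = 0;
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: drop the inherited descriptor behind the table's back. */
        close(fd);
        int fresh = open("/dev/null", O_RDONLY);
        int stale = fresh >= 0 && fresh < TOY_MAX_FD
                    && toy_session_active[fresh];
        _exit(stale ? 1 : 0);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    toy_session_active[fd] = 0;             /* parent cleans up properly */
    close(fd);
    if (waited < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A cleanup call that clears the table entry as well as closing the
descriptor, without contacting the daemon, avoids the stale entry; that
is the role a public SP_kill would play in the child.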
--
John Schultz
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443-838-2200 Fax: 707-885-1055