[Spread-users] Question about thread-safety

Tue Jun 17 16:19:04 EDT 2003

Theo E. Schlossnagle wrote:

> John Schultz wrote:
> 
>> Yes, it seems that is the consensus. This means that Spread's user 
>> library is currently not correct for multi-threaded applications that 
>> open and close multiple Spread connections concurrently. In 
>> particular, its penchant for closing the file descriptor (SP_kill) as 
>> soon as any error is detected is BAD -- the file descriptor should 
>> only be closed by an explicit USER call (e.g. - SP_disconnect or 
>> SP_kill).
> 
> All of the problems described thus far are for multiprocess operation, 
> not multithreaded operation.  Multithreaded operation has locking 
> problems which are substantially different from the fork() fdset 
> inheritance problem.

Are we talking about the same problems? The race condition I've mainly 
been talking about in this discussion involves using file descriptors 
(in particular opening/closing) in multiple threads inside a single 
process (let's ignore Linux's perverted notion of a thread: a light 
weight process). I pointed out that Spread's user library summarily 
closes its file descriptors upon receiving errors, which makes it very 
difficult for the user to ensure that the mailbox "it is using" is the 
mailbox "it thinks it is using" in threaded applications. I think the 
Spread library should insulate the user from this problem.

James Rauser brought up the multiprocess/fork problem that I hadn't 
considered at all until my previous email.

>> It surprises me that Posix systems don't offer a standard option to 
>> make open act differently (return an incrementing counter) for my 
>> process for two main reasons:
>>
>> (1) Under the default behavior, operating on file descriptors in 
>> multiple threads requires global synchronization, which many 
>> programmers might not realize/remember to do and is not trivial to do 
>> properly.
> 
> Threads yes... but that is a silly approach in processes -- or you 
> likely would have been using threads in as the programming technique.

Yes, for multiprocess/fork programming there would be no reason to use 
my proposed open semantics.

>> (2) If I know that my application and none of its libraries depend on 
>> the default behavior it is more efficient to return an incrementing 
>> counter instead of finding/remebering the lowest unused one.
> 
> That would be pretty darn inconvenient for system calls like select().  

I agree about select() and I'm sure that there are other calls that 
depend on the default behavior. That is why I said, "If I know that my 
application and none of its libraries depend on the default behavior."

 > The OS manages the filedescriptor table for each process.  On all UNIX
 > operating systems is it designed to be effecient with regards to those
 > POSIX requirements.

I know it's not that much harder to find the lowest ID, but it is less 
efficient than immediately returning a counter. :)

> If for some crazy reason you needed these semantics (always 
> incrementing), 

The "crazy reason" you'd want these semantics is to avoid all of the 
synchronization in a multi-threaded program necessary to avoid the file 
descriptor ID reuse race condition that I've described and which exists 
in the Spread user library.

 > you will shortly run out of FDs (there is a limit to how high the
 > number can be)

Yes, that is why I said the counter would have to be a 64 bit counter, 
which would be "big enough" in practice IMO. I realize that this might 
break too many things as I believe file descriptors are defined to be in 
the range [1, 2^16), it was just my wishful thinking.

>> James Rauser wrote:
>>>
>>> I can vouch for the fact that this is a problem; in my context it was
>>> occuring in a server which forked child processes to handle requests.
>>> The children each called close() on the inherited spread mailbox,
>>> then reestablished their own connection with SP_connect().   The
>>> children can't call SP_disconnect(), because we don't really want to
>>> disconnect it.  But: the child's copy of spread's internal session
>>> table wasn't updated, so if the new SP_connect() call obtained the
>>> same numerical FD, the library got confused.
>>>
>> This is a closely related problem about which I hadn't even thought. I 
>> didn't realize SP_disconnect actually tried to signal the daemon it 
>> was disconnecting that mailbox. It seems in this case that SP_kill 
>> should be exported as a public fcn to support your usage. That'd be my 
>> recommendation to you as well, just edit sp.c and sp.h to make SP_kill 
>> public and use that in your child process to clean up the unwanted 
>> mailbox.
> 
> The root of the problem is that the spread library attempts to manage 
> all sorts of contextual information for the user..  It hands the user 
> back a file descriptor and keeps all the contect internal to the 
> library... This make forking and maintaining the correct context 
> _singularly_ in one of final processes non-trivial.  Ideally it would 
> act like other libraries and hand back the user a complete context... 
> That way the user could call SP_context_destroy() with or without 
> SP_disconnect().
> 
> What is really cool about this?  Spread already hands back an opaque 
> type called mailbox.  There is no reason that mailbox couldn't be a 
> pointer to all of the internal state necessary to manage the Spread 
> connection inside the library.  As I understand it, this is the way 
> openssl works (well).

I agree that the Spread library should return a reference to a 

context that it uses to implement its functionality instead of a file 

descriptor.

However, I would argue that the library should continue to hide the real 
context in an internal table and rather than return a pointer, it should 
return an "eternally unique" integer handle to the internal context. 
This would allow any SP call on a reclaimed handle to fail nicely 
(INVALID_HANDLE) instead of potentially crashing due to invalid use of 
an already reclaimed context pointer.

This would allow the user to reclaim the context at any point in time: 
they wouldn't have to ensure that all threads that might use that 
context have stopped using the context before reclaiming it. If a SP 
call returns an INVALID_HANDLE error they can assume that another thread 
reclaimed/closed the context.

Of course, the drawback to this approach is a single globally 
synchronized hashtable lookup for each SP call. IMO, this additional 
overhead is dwarfed by the system IO the Spread library does and would 
have no impact on performance.

-- 
John Schultz
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443-838-2200 Fax: 707-885-1055