[Spread-users] Race condition fix to Spread...

John Schultz jschultz at d-fusion.net
Thu Nov 13 15:51:26 EST 2003


Hey guys,

I think we have a good solution to the file descriptor race condition in 
the Spread client library when you use multiple threads and multiple 
connections and don't strongly synchronize Spread operations.

This was the problem that Theo, I and others had a long email discussion 
about where there seems to be an almost inherent race condition with 
multi-threading and asynchronous opening/closing of file descriptors on 
unix type systems.

Just to recap, if one thread is about to perform a blocking operation on 
a file descriptor it is difficult for another thread to safely close the 
file descriptor.  The most common method is to simply close the file 
descriptor from another thread and expect that the first thread should 
break out of the call with an error.

This has two problems (1) it is not guaranteed that closing the fd this 
way will break the first thread out and (2) it introduces a race 
condition where the first thread might end up performing its operation 
on a different file descriptor (inode) that was opened just after the 
close call (because the file descriptor table is global and fd ids are 
reused).

The solution is fairly simple: change sp_disconnect so that it signals 
the server to close the socket on its end, but DOESN'T close it on the 
client's end.  A second call, sp_kill (or sp_close), actually closes the 
file descriptor on the client's end.  This solves our race condition as 
the blocking thread WILL break out with an error and then (with a little 
additional synchronization code) it can safely call sp_kill/sp_close.

To be completely correct some code (either Spread's or the app's) has to 
ensure that no threads will possibly be about to perform an op on the 
context/fd when the close call is executed.

I personally believe that the Spread library code should hide all of 
this synchronization, using the table method w/ eternally unique ids and 
locks like I talked about in our previous discussion, so that a new 
version of sp_disconnect could be called at any time and would be safe 
and would clean up everything as expected.  Even then, we should still 
export a sp_kill/sp_close for the fork() case below.

The multi-thread race condition discussion started here:
http://lists.spread.org/pipermail/spread-users/2003-June/001481.html

Another problem was playing nice with fork:
http://lists.spread.org/pipermail/spread-users/2002-October/001129.html

The basic problem being that mutexes might be locked in a child process 
with no owner thread running and the child process wants to get rid of 
the parent's contexts.

Spread should install pthread_atfork() handlers that would reinitialize 
all of its synchronization variables (mutexes, etc.).  (There probably 
should be a sp_init call that initializes the spread library)

Then the child process just uses sp_close/sp_kill function to clean up 
the library's context without signaling the server.

-- 
John Schultz
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443-838-2200 Fax: 707-885-1055






More information about the Spread-users mailing list