[Spread-users] Race condition fix to Spread...
jschultz at d-fusion.net
Thu Nov 13 15:51:26 EST 2003
I think we have a good solution to the file descriptor race condition in
the Spread client library when you use multiple threads and multiple
connections and don't strongly synchronize Spread operations.
This is the problem that Theo, I, and others discussed at length by
email: there seems to be an almost inherent race condition between
multi-threading and the asynchronous opening/closing of file descriptors
on Unix-type systems.
Just to recap, if one thread is about to perform a blocking operation on
a file descriptor it is difficult for another thread to safely close the
file descriptor. The most common method is to simply close the file
descriptor from another thread and expect that the first thread should
break out of the call with an error.
This has two problems: (1) it is not guaranteed that closing the fd this
way will break the first thread out, and (2) it introduces a race
condition where the first thread might end up performing its operation
on a different file descriptor (inode) that was opened just after the
close call (because the file descriptor table is global and fd ids are
reused).
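
To make problem (2) concrete, here is a minimal sketch in C, assuming a
POSIX system with /dev/null. The function name is made up for
illustration and is not part of Spread. It only shows that a descriptor
number freed by close() comes straight back from the next open(), which
is what lets a late operation in another thread land on the wrong file.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of Spread.
 * Returns 1 if the descriptor number freed by close() is handed out
 * again by the very next open(), 0 if not, -1 on error.
 * Assumes no other thread opens or closes descriptors meanwhile. */
int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                     /* the number 'first' is free again */

    int second = open("/dev/null", O_RDONLY);
    if (second < 0)
        return -1;

    /* POSIX: open() returns the lowest-numbered unused descriptor. */
    int reused = (second == first);
    close(second);
    return reused;
}
```

In a single-threaded run this should return 1. A thread that still holds
the old number and calls recv() or write() after the second open() would
be operating on the new file.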
The solution is fairly simple: change sp_disconnect so that it signals
the server to close the socket on its end, but DOESN'T close it on the
client's end. A second call, sp_kill (or sp_close), actually closes the
file descriptor on the client's end. This solves our race condition as
the blocking thread WILL break out with an error and then (with a little
additional synchronization code) it can safely call sp_kill/sp_close.
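
A rough sketch of that two-step idea in C, using a socketpair as a
stand-in for the client/daemon connection. Everything here is an
assumption for illustration: the function names are hypothetical, and the
"server" end is simply closed directly instead of being signalled the way
the real sp_disconnect would do it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Reader thread: blocks in recv() until data arrives or the peer closes. */
static void *blocked_reader(void *arg)
{
    int fd = *(int *)arg;
    char buf[16];
    ssize_t n = recv(fd, buf, sizeof buf, 0);   /* 0 means the peer closed */
    return (void *)(intptr_t)n;
}

/* Hypothetical demo, not Spread code.
 * Returns the reader's recv() result (expected 0), or -2 on setup error. */
long demo_peer_close_unblocks_reader(void)
{
    int sv[2];  /* sv[0] = client end, sv[1] = stand-in for the server end */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -2;

    pthread_t t;
    if (pthread_create(&t, NULL, blocked_reader, &sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -2;
    }

    /* Step 1 ("disconnect"): the far end goes away. The client fd stays
     * open, so its number cannot be reused while the reader may still
     * be touching it. */
    close(sv[1]);

    void *res;
    pthread_join(t, &res);          /* the reader has left recv() */

    /* Step 2 ("kill/close"): no thread is using the fd any more. */
    close(sv[0]);
    return (long)(intptr_t)res;
}
```

The pthread_join() here plays the role of the "little additional
synchronization code": the client-side close only happens once the
blocked thread is known to be out of the call.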
To be completely correct, some code (either Spread's or the app's) has to
ensure that no thread can be about to perform an op on the context/fd
when the close call is executed.
I personally believe that the Spread library code should hide all of
this synchronization, using the table method w/ eternally unique ids and
locks like I talked about in our previous discussion, so that a new
version of sp_disconnect could be called at any time and would be safe
and would clean up everything as expected. Even then, we should still
export a sp_kill/sp_close for the fork() case below.
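
For readers who missed the earlier thread, this is roughly what I mean by
the table method. It is only a sketch under my own assumptions: the names
(conn_register, conn_acquire, conn_release, conn_retire) are hypothetical
and nothing like this exists in the library today. Callers hold an id
that is never handed out twice, so a stale id can never resolve to a
recycled fd.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_CONN 8

struct conn_slot {
    uint64_t id;       /* 0 = slot free; ids are never handed out twice */
    int      fd;
    int      users;    /* threads currently inside an operation */
    int      closing;  /* set once retire has started */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  table_idle = PTHREAD_COND_INITIALIZER;
static struct conn_slot table[MAX_CONN];
static uint64_t next_id = 1;

/* Caller must hold table_lock. */
static struct conn_slot *find_slot(uint64_t id)
{
    if (id == 0)
        return NULL;
    for (int i = 0; i < MAX_CONN; i++)
        if (table[i].id == id)
            return &table[i];
    return NULL;
}

/* Returns a fresh id for fd, or 0 if the table is full. */
uint64_t conn_register(int fd)
{
    uint64_t id = 0;
    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < MAX_CONN; i++) {
        if (table[i].id == 0) {
            table[i] = (struct conn_slot){ next_id++, fd, 0, 0 };
            id = table[i].id;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return id;
}

/* Pins the connection and returns its fd, or -1 if the id is stale. */
int conn_acquire(uint64_t id)
{
    int fd = -1;
    pthread_mutex_lock(&table_lock);
    struct conn_slot *s = find_slot(id);
    if (s != NULL && !s->closing) {
        s->users++;
        fd = s->fd;
    }
    pthread_mutex_unlock(&table_lock);
    return fd;
}

/* Unpins the connection after an operation has finished. */
void conn_release(uint64_t id)
{
    pthread_mutex_lock(&table_lock);
    struct conn_slot *s = find_slot(id);
    if (s != NULL && s->users > 0 && --s->users == 0)
        pthread_cond_broadcast(&table_idle);
    pthread_mutex_unlock(&table_lock);
}

/* Refuses new users, waits for current ones to leave, then closes the fd. */
int conn_retire(uint64_t id)
{
    pthread_mutex_lock(&table_lock);
    struct conn_slot *s = find_slot(id);
    if (s == NULL || s->closing) {
        pthread_mutex_unlock(&table_lock);
        return -1;
    }
    s->closing = 1;
    while (s->users > 0)
        pthread_cond_wait(&table_idle, &table_lock);
    int fd = s->fd;
    s->id = 0;                      /* slot is free; the id is gone for good */
    pthread_mutex_unlock(&table_lock);
    return close(fd);
}
```

In a real version the retire step would first signal the server, so that
any thread blocked inside an operation breaks out and releases its pin.
Without that, conn_retire would wait for as long as the operation blocks.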
The multi-thread race condition discussion started here:
Another problem was playing nice with fork:
The basic problem is that mutexes might be locked in a child process
with no owner thread running, while the child process wants to get rid
of the parent's contexts.
Spread should install pthread_atfork() handlers that would reinitialize
all of its synchronization variables (mutexes, etc.). (There probably
should be an sp_init call that initializes the Spread library.)
Then the child process just uses the sp_close/sp_kill function to clean
up the library's context without signaling the server.
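
A minimal sketch of the atfork idea in C. The names lib_lock, lib_init
and demo_fork_with_lock_held are hypothetical stand-ins, not existing
Spread symbols. For brevity the forking thread itself holds the lock; in
the case described above the owner would be some other thread that does
not exist in the child. Strictly speaking, POSIX does not promise that
re-initializing a mutex that was locked at fork time is well defined, so
treat this as an illustration rather than a finished design.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for one of the library's internal locks. */
static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the child right after fork(): whatever state the lock had in
 * the parent, give the child a fresh, unlocked one. */
static void reinit_lock_in_child(void)
{
    pthread_mutex_init(&lib_lock, NULL);
}

/* Hypothetical stand-in for an sp_init-style call. Returns 0 on success. */
int lib_init(void)
{
    return pthread_atfork(NULL, NULL, reinit_lock_in_child);
}

/* Forks while lib_lock is held. Returns the child's exit status
 * (0 = the child could take the lock), or -1 on error. */
int demo_fork_with_lock_held(void)
{
    pthread_mutex_lock(&lib_lock);
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: without the handler this trylock would report EBUSY. */
        int rc = pthread_mutex_trylock(&lib_lock);
        _exit(rc == 0 ? 0 : 1);
    }
    pthread_mutex_unlock(&lib_lock);
    if (pid < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

After lib_init() has been called once, demo_fork_with_lock_held() should
return 0. At that point the child would go on to close its inherited
contexts locally, without telling the server anything.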
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443-838-2200 Fax: 707-885-1055