[Spread-users] Question about thread-safety

Tue Jun 17 14:51:58 EDT 2003

John Schultz wrote:
> Yes, it seems that is the consensus. This means that Spread's user 
> library is currently not correct for multi-threaded applications that 
> open and close multiple Spread connections concurrently. In particular, 
> its penchant for closing the file descriptor (SP_kill) as soon as any 
> error is detected is BAD -- the file descriptor should only be closed by 
> an explicit USER call (e.g. - SP_disconnect or SP_kill).

All of the problems described thus far are for multiprocess operation, not 
multithreaded operation.  Multithreaded operation has locking problems which 
are substantially different from the fork() fdset inheritance problem.

Oracle libraries have the same problem if you use them this way.

> It surprises me that Posix systems don't offer a standard option to make 
> open act differently (return an incrementing counter) for my process for 
> two main reasons:
> 
> (1) Under the default behavior, operating on file descriptors in 
> multiple threads requires global synchronization, which many programmers 
> might not realize/remember to do and is not trivial to do properly.

Threads yes... but that is a silly approach in processes -- or you likely 
would have been using threads in as the programming technique.

> (2) If I know that my application and none of its libraries depend on 
> the default behavior it is more efficient to return an incrementing 
> counter instead of finding/remebering the lowest unused one.

That would be pretty darn inconvenient for system calls like select().  The OS 
manages the filedescriptor table for each process.  On all UNIX operating 
systems is it designed to be effecient with regards to those POSIX requirements.

If for some crazy reason you needed these semantics (always incrementing), you 
will shortly run out of FDs (there is a limit to how high the number can be). 
  But you can implement a wrapper around open that does:

int _co_current = -1;

int crazy_open(const char *file, int mode) {
  int retval = -1;
  int bad;
  global_lock();
  bad = open();
  if(bad < 0)
   goto bail;
  if(dup2(bad, ++_co_current) < 0) {
   _co_current--;
   close(bad);
   goto bail;
  }
  retval = _co_current;
bail:
  global_unlock();
  return retval;
}

You'd need to make this much more robust, because many system calls open file 
descriptors behind your bad... and this will fail as soon as _co_current hit 
an FD that was previously opened by something else.

> James Rauser wrote:
>  >
>  > I can vouch for the fact that this is a problem; in my context it was
>  > occuring in a server which forked child processes to handle requests.
>  > The children each called close() on the inherited spread mailbox,
>  > then reestablished their own connection with SP_connect().   The
>  > children can't call SP_disconnect(), because we don't really want to
>  > disconnect it.  But: the child's copy of spread's internal session
>  > table wasn't updated, so if the new SP_connect() call obtained the
>  > same numerical FD, the library got confused.
>  >
> This is a closely related problem about which I hadn't even thought. I 
> didn't realize SP_disconnect actually tried to signal the daemon it was 
> disconnecting that mailbox. It seems in this case that SP_kill should be 
> exported as a public fcn to support your usage. That'd be my 
> recommendation to you as well, just edit sp.c and sp.h to make SP_kill 
> public and use that in your child process to clean up the unwanted mailbox.

The root of the problem is that the spread library attempts to manage all 
sorts of contextual information for the user..  It hands the user back a file 
descriptor and keeps all the contect internal to the library... This make 
forking and maintaining the correct context _singularly_ in one of final 
processes non-trivial.  Ideally it would act like other libraries and hand 
back the user a complete context... That way the user could call 
SP_context_destroy() with or without SP_disconnect().

What is really cool about this?  Spread already hands back an opaque type 
called mailbox.  There is no reason that mailbox couldn't be a pointer to all 
of the internal state necessary to manage the Spread connection inside the 
library.  As I understand it, this is the way openssl works (well).

For now, the duct-tape approach would be (for multiprocesses).  This approach 
works well for other such suffering libraries like Oracle's OCI C libs:

connect(&mbox);
prefork stuff;
disconnect(mbox);
ret = fork();
if(ret > 0) {
   connect(&mbox);
   do parent;
} else if(ret == 0) {
   connect(&mbox);
   do child;
} else {
   perror("fork");
   handle error;
}

More permanent duct-tape is as John suggested:
un-static SP_kill in sp.c and remake you libraries.

Then:
connect(&mbox);
ret = fork();
if(ret > 0) {
   do parent;
} else if (ret == 0) {
   SP_kill(mbox);
   connect(&mbox);
} else {
   perror("fork");
   handle error;
}

-- 
Theo Schlossnagle
Principal Consultant
OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Phone:  +1 410 872 4910 x201     Fax:  +1 410 872 4911
1024D/82844984/95FD 30F1 489E 4613 F22E  491A 7E88 364C 8284 4984
2047R/33131B65/71 F7 95 64 49 76 5D BA  3D 90 B9 9F BE 27 24 E7