[Spread-users] Question about thread-safety

Wed Jun 18 12:33:41 EDT 2003

On Wed, 18 Jun 2003, Theo E. Schlossnagle wrote:

> John Schultz wrote:
>
> > Your method would still require that the user ensure that all threads
> > that might use the context, stop using the context before it is reclaimed.
>
> No.  That safety can be completely handled within the library.

Yes, the safety can be completely handled within the library: by
internally employing a method such as my table method.

> > This is common practice for libraries but with almost no additional work
> > (a lookup per call) the table method takes that synchronization problem
> > off of the user's hands.  The user can reclaim/close the handle at any
> > time in any thread and any threads that subsequently try to use that
> > handle will get an INVALID_HANDLE error and can assume that some other
> > thread already closed/reclaimed the handle.
>
> In the context situation, you require no synchronization on the user's part.
> All synchronization facilities are _inside_ the context, which is opaque.  The
> library calls do what is necessary on the context.

Here is an example of why you require the user to perform reclamation
synchronization:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sp.h>

void *worker_thread(void *arg);

int foo(void)
{
  pthread_t tid;
  int err = 0;
  int i;

  SP_context sp_ref = SP_connect(...);

  for (i = 0; i < 500; ++i) {
    if (pthread_create(&tid, NULL, worker_thread, (void*) sp_ref) != 0) {
      fprintf(stderr, "Couldn't spawn worker threads!\n");
      exit(EXIT_FAILURE);
    }
  }

  ...   /* go into a recv loop on sp_ref */

  if (err != 0) {  /* a "fatal" application or Spread error */
    SP_disconnect(sp_ref);
  }

  return err;
}

void *worker_thread(void *arg)
{
  SP_context sp_ref = (SP_context) arg;
  int err = 0;

  pthread_detach(pthread_self());

  ...   /* go into a worker/send loop on sp_ref */

  if (err != 0) {  /* a "fatal" application or Spread error */
    SP_disconnect(sp_ref);
  }

  return NULL;
}

In your setup SP_context is a pointer to some opaque context.  In my setup
SP_context is an integer handle to some internal context; this makes my
casts to and from void* dubious, but let that slide for now.

In the code, all the threads call SP_disconnect (reclaim sp_ref) upon
detecting an error. This code is not valid under your setup, but it is
valid under mine.  The difference comes in the way the threads reclaim
sp_ref's context under our setups.

Under your setup, when an error is detected, ONE AND ONLY ONE thread can
ACTUALLY reclaim sp_ref's resources and it has to be the LAST thread to
use sp_ref.

This HAS to be the case because all of your synchronization primitives are
stored inside the context.  If you used a global lock to ensure that the
context was still valid, you would arrive at exactly my setup -- you would
have a global table of valid pointers.  If sp_ref's context were reclaimed
before all other threads were done using it, then any later attempt to get
the lock on the context would be invalid, because the lock was reclaimed
along with it.  The only other option would be a memory leak:  allow the
context's lock to remain valid forever so that any thread could check for
an error code on sp_ref in the future.

Therefore, to be 100% correct your solution still forces the user to
synchronize all threads using sp_ref so that one and only one thread
actually reclaims sp_ref's context and it is the last thread to use
sp_ref.  My setup takes this burden off of the user's hands.

Under my setup, when an error is detected all the threads working on
sp_ref call SP_disconnect.  The first such thread to get the lock on the
global lookup table atomically removes the handle from the table and
releases the table lock.  This means any subsequent SP calls using sp_ref
will fail with an INVALID_HANDLE error because that handle is not in the
global lookup table.  Therefore, it is completely valid for all the other
threads to call SP_disconnect on sp_ref at any time.
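
As a rough illustration of that first step, here is a minimal sketch of a
handle table guarded by a single global lock.  Every name in it
(table_insert, table_remove, MAX_HANDLES, the value of INVALID_HANDLE, the
layout of struct context) is made up for the example and is not part of
the Spread API.  It only shows that the first remover wins and every later
caller sees INVALID_HANDLE.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_HANDLES    64     /* hypothetical table size */
#define INVALID_HANDLE (-1)   /* hypothetical error code */

struct context { int fd; };   /* stand-in for the opaque internal context */

/* The lock lives outside any context, so it outlives all of them */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct { int handle; struct context *ctx; } table[MAX_HANDLES];

/* Register ctx under handle (> 0); INVALID_HANDLE if the table is full */
int table_insert(int handle, struct context *ctx)
{
  int i, ret = INVALID_HANDLE;

  pthread_mutex_lock(&table_lock);
  for (i = 0; handle > 0 && i < MAX_HANDLES; ++i) {
    if (table[i].ctx == NULL) {       /* free slot */
      table[i].handle = handle;
      table[i].ctx = ctx;
      ret = 0;
      break;
    }
  }
  pthread_mutex_unlock(&table_lock);
  return ret;
}

/* Atomically remove handle; only the first caller gets the context back */
int table_remove(int handle, struct context **out)
{
  int i, ret = INVALID_HANDLE;

  assert(out != NULL);
  pthread_mutex_lock(&table_lock);
  for (i = 0; i < MAX_HANDLES; ++i) {
    if (table[i].ctx != NULL && table[i].handle == handle) {
      *out = table[i].ctx;
      table[i].ctx = NULL;            /* handle is now invalid for everyone */
      ret = 0;
      break;
    }
  }
  pthread_mutex_unlock(&table_lock);
  return ret;
}
```

A disconnect built on this would call table_remove first and simply
return INVALID_HANDLE to every thread that loses the race.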

Next, the disconnector thread internally sets an error code on sp_ref's
context and wakes any blocked threads that currently hold references to
sp_ref's context.  These are threads that are already inside SP functions
on sp_ref and have already obtained the context from the global lookup
table; each such lookup increments a reference count on the context.
These threads all notice the error condition, quickly let go of the
context and return an INVALID_HANDLE error to the user.

The disconnector thread waits until the reference count on sp_ref's
context goes to zero, and when it does, it completely reclaims the context.
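
The reference counting and the final wait can be sketched as follows.
Again, this is only a sketch under my own assumptions:  ctx_open,
ctx_acquire, ctx_release and ctx_close are hypothetical names, the handle
is just the table index to keep the example short (a real table would use
never-reused handles), and the step that wakes threads blocked inside a
call is reduced to setting the error flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_CTX        8      /* hypothetical table size */
#define INVALID_HANDLE (-1)   /* hypothetical error code */

struct context {
  int refs;    /* threads currently inside a call on this context */
  int error;   /* set by the disconnector; in-flight calls must bail out */
};

/* Lock, condition and table are global, so they outlive every context */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  released = PTHREAD_COND_INITIALIZER;
static struct context *table[MAX_CTX];   /* handle == index, for brevity */

/* Create a context and publish it; returns its handle or INVALID_HANDLE */
int ctx_open(void)
{
  struct context *ctx = calloc(1, sizeof *ctx);
  int h, ret = INVALID_HANDLE;

  if (ctx == NULL) return INVALID_HANDLE;
  pthread_mutex_lock(&lock);
  for (h = 0; h < MAX_CTX; ++h) {
    if (table[h] == NULL) { table[h] = ctx; ret = h; break; }
  }
  pthread_mutex_unlock(&lock);
  if (ret == INVALID_HANDLE) free(ctx);
  return ret;
}

/* Entry to every call: look up the handle and take a reference, atomically */
struct context *ctx_acquire(int h)
{
  struct context *ctx = NULL;

  pthread_mutex_lock(&lock);
  if (h >= 0 && h < MAX_CTX && table[h] != NULL) {
    ctx = table[h];
    ++ctx->refs;
  }
  pthread_mutex_unlock(&lock);
  return ctx;
}

/* Exit from every call: drop the reference, wake a waiting disconnector */
void ctx_release(struct context *ctx)
{
  pthread_mutex_lock(&lock);
  assert(ctx->refs > 0);
  if (--ctx->refs == 0) pthread_cond_broadcast(&released);
  pthread_mutex_unlock(&lock);
}

/* Disconnect: unpublish, flag the error, wait for refs to drain, free */
int ctx_close(int h)
{
  struct context *ctx;

  pthread_mutex_lock(&lock);
  if (h < 0 || h >= MAX_CTX || table[h] == NULL) {
    pthread_mutex_unlock(&lock);
    return INVALID_HANDLE;            /* some other thread already closed it */
  }
  ctx = table[h];
  table[h] = NULL;                    /* no new references from here on */
  ctx->error = 1;                     /* real code would also wake blocked calls */
  while (ctx->refs > 0) pthread_cond_wait(&released, &lock);
  pthread_mutex_unlock(&lock);
  free(ctx);                          /* safe: nobody else can reach ctx now */
  return 0;
}
```

The point of the sketch is that free() runs only after the handle has left
the table and the count has drained, so no thread can touch the context
afterwards, and none of that requires cooperation from the user's code.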

> > Finally, we can guarantee in practice (use a big counter and exit if we
> > flip it) that our integer handles are "eternally unique" whereas malloc
> > _might_ return the same pointer twice.
>
> Not if it has been freed.  If it has been freed, then of course malloc could
> return the same value.

At some point in time your contexts have to be freed or you have a memory
leak.
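
For comparison, the "big counter" I mentioned for integer handles could
look roughly like the sketch below.  The 64-bit width and the name
new_handle are assumptions made for the example.  The counter only ever
moves forward, and the process exits rather than reuse a value, so a stale
handle can never alias a newer context the way a recycled malloc pointer
can.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t next_handle = 1;      /* 0 is reserved as "no handle" */

/* Return a handle that has never been issued before in this process */
uint64_t new_handle(void)
{
  uint64_t h;

  pthread_mutex_lock(&counter_lock);
  if (next_handle == UINT64_MAX) {    /* about to wrap: refuse to reuse values */
    fprintf(stderr, "handle counter exhausted\n");
    exit(EXIT_FAILURE);
  }
  h = next_handle++;
  pthread_mutex_unlock(&counter_lock);

  assert(h != 0);                     /* the reserved value is never handed out */
  return h;
}
```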

> The point being that SP_kill and SP_disconnect (or any other error for
> that matter) could never free the context.  The user must call
> SP_context_destroy(mailbox) [or some such function].

You just re-highlighted the fact that one and only one of your threads can
ACTUALLY completely reclaim the context and it has to be the last thread
that might use that context.

> The problem with the lookup table is that you just reimplement the
> problem of duplicating the file descriptor table.  Now you duplicate the
> lookup table and the user (with no visibility to that) has no recourse.

I don't understand your last comment.

---
John Schultz
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443 838 2200