[Spread-users] Spread python module

Tim Peters tim at zope.com
Thu Apr 25 00:29:16 EDT 2002


Thought people might want to know how this one is likely to turn out [Guido,
skip to the end].

[Magnus Heino]
> I have a multithreaded app written in python that uses spread and
> the python spread module.
>
> ... [and it deadlocks sometimes] ...
>
> Looking at the diff between SpreadModule 1.0 and 1.1, the major
> difference is this:
>
> - Fixed race conditions having to do with disconnects in
>   multi-threaded programs.  When one of the errors CONNECTION_CLOSED
>   or ILLEGAL_SESSION is received, mark the mbox as closed.  In
>   addition, use a lock to serialize access to the disconnected flag.
>
> Is this needed in 3.16.2 too? The SpreadModule is written against
> 3.16.2rc1

Here's the tail end of the history of the problem this was trying to solve:

http://lists.spread.org/pipermail/spread-users/2002-January/000538.html

In short, the mboxes Spread passes out are actually sockets (handles or
descriptors, depending on OS), and if a session gets the boot for whatever
reason, the socket gets closed by Spread.  But if you have other threads
hanging on to the same mbox, it's possible, e.g., for Spread to open another
socket, get back the very same magic integer from the platform socket
library, and then the other threads can end up using a different socket
without ever knowing something had gone wrong.  Or they can plain go insane.
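
Here's a tiny standalone sketch of why a stale mbox value is dangerous;
nothing Spread-specific, just plain sockets and the kernel's habit of
handing out the lowest free descriptor:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int
main(void)
{
	int a = socket(AF_INET, SOCK_STREAM, 0);
	printf("first socket:  %d\n", a);
	close(a);	/* what Spread does when a session gets the boot */
	int b = socket(AF_INET, SOCK_STREAM, 0);
	printf("second socket: %d\n", b);	/* almost certainly == a */
	/* A thread still holding `a` is now talking to b's socket
	   without knowing anything went wrong. */
	close(b);
	return 0;
}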

We first tried to worm around this by adding a "disconnected" flag to our
own mbox wrapper, and checking it before every Spread operation.  However,
that was vulnerable to races too, so I added another hack:  a full-blown
lock of our own to serialize calls to Spread, making each Spread call
atomic with our pre-call inspection, and possible post-call setting, of
the mbox wrapper's disconnected flag.
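
In outline, the workaround looked like this.  This is a simplified sketch,
not the module's actual code:  it assumes Spread's sp.h, a pthread mutex
stands in for the mbox lock (you'll see it as ACQUIRE_MBOX_LOCK in the
snippet quoted later), and real Spread calls take more arguments than the
one-argument function pointer used here:

#include <sp.h>
#include <pthread.h>

typedef struct {
	mailbox mbox;		/* the socket Spread handed out */
	int disconnected;
	pthread_mutex_t lock;	/* stand-in for ACQUIRE_MBOX_LOCK */
} Wrapper;

/* Make one Spread call atomic with the pre-call check and the possible
   post-call setting of the wrapper's disconnected flag. */
static int
guarded_call(Wrapper *w, int (*sp_call)(mailbox))
{
	int err;
	pthread_mutex_lock(&w->lock);
	if (w->disconnected) {
		pthread_mutex_unlock(&w->lock);
		return ILLEGAL_SESSION;	/* already closed; don't touch the fd */
	}
	err = sp_call(w->mbox);
	if (err == CONNECTION_CLOSED || err == ILLEGAL_SESSION)
		w->disconnected = 1;	/* set while still holding the lock */
	pthread_mutex_unlock(&w->lock);
	return err;
}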

That plugged the hole in a way that always worked for us, but I'm afraid
it's guaranteed to cause deadlock for you:  in the apps we've written, we
never do an mbox read unless a select() tells us the mbox is ready to
deliver data, *and* we never have more than one thread per process reading
from an mbox.  Therefore an mbox read always "comes back" pretty quickly in
our apps, and so our wrapper lock is never held for a long time.
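
For concreteness, the select()-guarded read looks about like this against
Spread's C API (a sketch; error handling elided):

#include <sp.h>
#include <sys/select.h>

static void
read_when_ready(mailbox mbox)
{
	fd_set fds;
	FD_ZERO(&fds);
	FD_SET(mbox, &fds);	/* an mbox really is a socket/descriptor */
	/* Block in select(), not in SP_receive, so no lock of ours is
	   held while waiting for data to arrive. */
	if (select(mbox + 1, &fds, NULL, NULL, NULL) > 0 &&
	    FD_ISSET(mbox, &fds)) {
		service srvc = 0;
		char sender[MAX_GROUP_NAME], groups[4][MAX_GROUP_NAME];
		char mess[1024];
		int num_groups, endian_mismatch;
		int16 mess_type;
		/* Data is already waiting, so this returns quickly. */
		SP_receive(mbox, &srvc, sender, 4, &num_groups, groups,
			   &mess_type, &endian_mismatch,
			   (int)sizeof(mess), mess);
	}
}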

In your app, an mbox read is in a loop without a select() guard, and it's
waiting for a "time to quit" message to be sent via Spread by another thread
in the same process.  Alas, our wrapper holds its Spread-serialization lock
around the call to Spread's mbox read, so any attempt to send via Spread
from another thread in the same process blocks, waiting for that lock to
get released.  Deadlock results.
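
The shape of the deadlock is easy to reproduce without Spread at all.  In
this self-contained pthread sketch, a pipe stands in for the mbox and a
mutex for our serialization lock (fair warning:  the program hangs by
design):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t wrapper_lock = PTHREAD_MUTEX_INITIALIZER;
static int fds[2];	/* fds[0] is the read end, fds[1] the write end */

static void *
reader(void *arg)
{
	char buf[1];
	pthread_mutex_lock(&wrapper_lock);	/* held around a blocking read */
	read(fds[0], buf, 1);			/* blocks: nothing written yet */
	pthread_mutex_unlock(&wrapper_lock);
	return NULL;
}

static void *
writer(void *arg)
{
	pthread_mutex_lock(&wrapper_lock);	/* never acquired: reader holds it */
	write(fds[1], "q", 1);			/* the "time to quit" message */
	pthread_mutex_unlock(&wrapper_lock);
	return NULL;
}

int
main(void)
{
	pthread_t r, w;
	pipe(fds);
	pthread_create(&r, NULL, reader, NULL);
	sleep(1);	/* give the reader time to grab the lock */
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(r, NULL);	/* never returns */
	return 0;
}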

As Jonathan Stanton said in the msg referenced above, this is something
Spread should be guarding against itself (with help from the application, to
let Spread know when an app is finished with an mbox).  Trying to patch
around Spread's lack of paranoia here is making our code too convoluted, and
I don't have a way to fix this deadlock correctly without resorting to
poll-and-sleep loops.

So I expect we're just going to comment out our workarounds in the next
release of the Python Spread module, and multithreaded apps will be on their
own for dealing with unexpected disconnects.  I suspect a better solution is
possible, but just commenting out the #define certainly fits my available
time better <wink>.

If Spread eventually changes in the way Jonathan suggests in the above, our
Spread wrapper will automatically play nice with it:  Python's normal
threadsafe refcount semantics will automatically cause SP_disconnect() to
get called when the last reference (in your code) to a mailbox wrapper goes
away; and before then, Spread won't close the original socket.
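
The disconnect would then ride on the wrapper's deallocation.  A sketch of
the shape (not necessarily the module's actual dealloc):

static void
mailbox_dealloc(MailboxObject *self)
{
	/* Runs when the last Python reference to the wrapper dies, so
	   SP_disconnect gets called exactly once; only then does Spread
	   close (and possibly reuse) the underlying socket. */
	if (!self->disconnected) {
		self->disconnected = 1;
		SP_disconnect(self->mbox);
	}
	PyObject_Del(self);
}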

Guido, against that day, I believe we have another bug here (*currently*
impossible to trigger):

static PyObject *
mailbox_disconnect(MailboxObject *self, PyObject *args)
{
	PyObject *result = Py_None;

	if (!PyArg_ParseTuple(args, ":disconnect"))
		return NULL;
	if (!self->disconnected) {
		ACQUIRE_MBOX_LOCK(self);
		if (!self->disconnected) {
			int err;
			Py_BEGIN_ALLOW_THREADS
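			/* The GIL is released here, so the set of
			   self->disconnected below is no longer atomic
			   with the test of it above. */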
			self->disconnected = 1;
			err = SP_disconnect(self->mbox);

We should move the set of self->disconnected up a line, so the testing and
setting of it are made indivisible by the GIL.  Else when the mbox lock goes
away, there's nothing to stop two (or more) threads from getting into this
code at the same time, and all calling SP_disconnect on the same mbox.  For
correctness, Jonathan's scheme requires that we guarantee to call
SP_disconnect at most once on an mbox (and I believe moving that line up one
is all we need to guarantee that).
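
With that change, the tail of the quoted code would read (the only
difference is where self->disconnected gets set):

	if (!self->disconnected) {
		ACQUIRE_MBOX_LOCK(self);
		if (!self->disconnected) {
			int err;
			self->disconnected = 1;	/* atomic with the test: GIL still held */
			Py_BEGIN_ALLOW_THREADS
			err = SP_disconnect(self->mbox);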
