[Spread-users] Error

Tim Peters tim at zope.com
Fri Jan 18 03:14:27 EST 2002


[Jonathan Stanton, to Guido van Rossum]
> I'll check this out tomorrow ( I'm at a conference and don't have the
> code handy.) Are you using the 'auto generate random id's' option or are
> you creating session names yourself?

We create them ourselves, with no danger of collision.  There were usually 5
sessions total in these tests, spread across 4 machines (each with its own
Spread daemon).

> You are using version 3.16.1 correct? I'm just checking because that
> version corrected several known races that caused this error.

Yes, 3.16.1, on Linux.

> To debug, the first step is to turn on SESSION and GROUP alarm flags and
> log to a file. Then when the error occurs, email the end of the file
> including any messages from Session about rejecting/refusing/errors on a
> connection.
>
> I don't have any known causes of this problem in 3.16.1, you may have
> found a rare race.
>
> Wait! Is this in a multithreaded application linked with the spread
> library?

Yes, it's multithreaded.  Too much so <wink>.

> If so then I may know the problem. If it is, I'll get back to you
> about it.

We would appreciate hearing the theory, but we're no longer sure this
problem "is real".  After Guido wrote his initial message, we figured out we
were bumping into the 1000-msg-backlog disconnection feature, and since
these *are* multi-threaded apps, threads doing writes to the same mbox had
no idea that another thread disconnected while reading.  So other threads
saw a wide variety of weird problems, but they *may* all have been
consequences of a seminal disconnect of a shared mbox.  In all the detailed
cases I saw, it was at least plausible that a disconnect happened first, and
usually obvious that a disconnect came first.
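
A minimal sketch, in Python, of one way to make such a disconnect visible to
every thread sharing an mbox.  All names here (SharedMailbox,
MailboxDisconnected, FakeConn) are hypothetical and not part of the Spread
API; the wrapped object is only assumed to have send() and receive() methods
that raise ConnectionError once the daemon drops the connection.  This is not
the code we run, just an illustration of the idea.

```python
import threading


class MailboxDisconnected(Exception):
    """Raised once the shared mailbox is known to be disconnected."""


class SharedMailbox:
    """Wraps one connection shared by several threads (hypothetical helper).

    The first thread to see a disconnect records it, so later callers fail
    immediately with a clear error instead of touching a dead connection.
    """

    def __init__(self, conn):
        self._conn = conn
        self._lock = threading.Lock()   # guards _dead and first_error
        self._dead = False
        self.first_error = None         # the original cause, kept for logs

    def _mark_dead(self, exc):
        with self._lock:
            if not self._dead:
                self._dead = True
                self.first_error = exc

    def _check_alive(self):
        with self._lock:
            if self._dead:
                raise MailboxDisconnected("mailbox already disconnected")

    def send(self, msg):
        self._check_alive()
        try:
            return self._conn.send(msg)
        except ConnectionError as exc:
            self._mark_dead(exc)
            raise MailboxDisconnected("disconnected during send") from exc

    def receive(self):
        self._check_alive()
        try:
            return self._conn.receive()
        except ConnectionError as exc:
            self._mark_dead(exc)
            raise MailboxDisconnected("disconnected during receive") from exc


class FakeConn:
    """Stand-in connection for illustration: drops on the first receive()."""

    def __init__(self):
        self.sent = []
        self.alive = True

    def send(self, msg):
        if not self.alive:
            raise ConnectionError("connection closed")
        self.sent.append(msg)

    def receive(self):
        self.alive = False
        raise ConnectionError("backlog limit exceeded")
```

With a wrapper like this, a writer thread that calls send() after a reader
thread hit the disconnect gets MailboxDisconnected right away, and
first_error still names the original cause.  It does not serialize the
underlying send/receive calls themselves; whether that is also needed depends
on the thread-safety of the library underneath.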

Still, Guido saw some other failures that didn't fit this scenario, like
"Sess_validate_read_header: Message has negative or too large num_groups
field" (btw, the Alarm that displays this is missing a %d in its format
string, so it doesn't show the offending num_groups value also passed to the
Alarm call).

We'd love to hear anything about possible problems with multithreaded apps
regardless.





