[Spread-users] BUFFER_TOO_SHORT && endian_mismatch >= 0

Wed Jul 3 18:08:52 EDT 2002

[John Schultz, with some good ideas]
> Just a quick shot in the dark:
>
> Are you sure that you are using ints instead of unsigneds as the return
> parameters? Your compiler should warn you if you are, but this could
> explain why endian comes back as non-negative when it should be negative.

Yes, they're auto (stack) ints:

static PyObject *
mailbox_receive(MailboxObject *self, PyObject *args)
{
	service svc_type;
	int num_groups, endian, size;
	int16 msg_type;

	char senderbuffer[MAX_GROUP_NAME];
	char groupbuffer[DEFAULT_GROUPS_SIZE][MAX_GROUP_NAME];
	char databuffer[DEFAULT_BUFFER_SIZE];

	int max_groups = DEFAULT_GROUPS_SIZE;
	char (*groups)[MAX_GROUP_NAME] = groupbuffer;

	int bufsize = DEFAULT_BUFFER_SIZE;
	char *pbuffer = databuffer;

	PyObject *sender = NULL, *data = NULL, *msg = NULL;

The BUFFER_TOO_SHORT code tries to recover invisibly, by allocating a larger
buffer and trying again.
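
In outline, the recovery looks something like the sketch below.  This is not
the wrapper's actual code: sketch_receive() is a made-up stand-in for
SP_receive(), the error value is a placeholder for the one in sp.h, and
receive_with_retry() is a hypothetical name.  It assumes, as I read the man
page, that on BUFFER_TOO_SHORT the required size comes back negated in the
endian_mismatch slot.

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder; the real value comes from Spread's sp.h. */
#define SKETCH_BUFFER_TOO_SHORT (-15)

static const char sketch_msg[] = "a message longer than the initial buffer";

/* Hypothetical stand-in for SP_receive(): delivers sketch_msg if it fits,
   else returns the error and stores the negated required size in *endian. */
static int
sketch_receive(char *buf, int bufsize, int *endian)
{
	int need = (int)sizeof(sketch_msg);

	if (bufsize < need) {
		*endian = -need;
		return SKETCH_BUFFER_TOO_SHORT;
	}
	memcpy(buf, sketch_msg, (size_t)need);
	*endian = 0;
	return need;
}

/* Try the caller's stack buffer first; on BUFFER_TOO_SHORT, allocate the
   size hinted at by *endian and retry.  *heap is the malloc'ed buffer
   (caller frees) or NULL; *out points at whichever buffer was used. */
static int
receive_with_retry(char *stackbuf, int stacksize, char **heap, char **out)
{
	char *buf = stackbuf;
	int bufsize = stacksize, endian, size;

	*heap = NULL;
	*out = NULL;
	for (;;) {
		size = sketch_receive(buf, bufsize, &endian);
		if (size != SKETCH_BUFFER_TOO_SHORT)
			break;
		if (endian >= 0)
			break;	/* no usable size hint: can't recover */
		bufsize = -endian;
		free(*heap);
		*heap = malloc((size_t)bufsize);
		if (*heap == NULL)
			return SKETCH_BUFFER_TOO_SHORT;
		buf = *heap;
	}
	*out = buf;
	return size;
}
```

The endian >= 0 branch is the case this thread is about: the error says the
buffer was too small, but there's no negative size to tell us how big to make
the next one.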

> It looks like your num_groups also gets messed up even on successful
> calls to receive sometimes?

I figure you're talking about this:

		if (size >= 0) {
			if (num_groups < 0) {
				/* XXX This really happens!
				   Don't dare retry the receive, since we
				   didn't get an error.  The extra names
				   are forever lost. */
				num_groups = max_groups;
			}
			break;
		}

My boss added that code early on because he saw that combination happening.
It doesn't seem too surprising to me after reading the SP_receive() man
page:

    For example, if your groups array could store 5 group names, but a
    message for 7 groups arrived, the first five group names would
    appear in the groups array and num_groups would be set to -7.

although I don't find the docs entirely clear about exactly when
GROUPS_TOO_SHORT is returned.
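
Taking the man page's example at face value, a negative num_groups can be
decoded along these lines.  The struct and function names are invented for
illustration and aren't part of Spread or of our wrapper.

```c
/* How many names landed in the groups array, and how many didn't fit. */
struct group_count {
	int stored;
	int dropped;
};

/* Interpret num_groups as reported by SP_receive(), given the capacity
   (max_groups) that was passed in.  A negative num_groups means the
   message was addressed to -num_groups groups but only max_groups names
   were stored. */
static struct group_count
interpret_num_groups(int num_groups, int max_groups)
{
	struct group_count gc;

	if (num_groups < 0) {
		gc.stored = max_groups;
		gc.dropped = -num_groups - max_groups;
	}
	else {
		gc.stored = num_groups;
		gc.dropped = 0;
	}
	return gc;
}
```

With the man page's numbers (room for 5, num_groups == -7), that gives 5 names
stored and 2 dropped.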

Is it your testimony that GROUPS_TOO_SHORT will always be returned when
max_groups is too small and no other kind of error return is competing for
attention?  Even for a REG_MESSAGE, where the user may well not care about
the group list?

> What kind of values are you getting back for the different parameters
> when things fail (I know you said you don't know, but maybe you could
> add in some more diagnostic code)? Are they completely off the wall or
> could they be correct but of the wrong sign or zero or what?

That's another excellent question and I can't answer it.  The user reported
two occurrences of this so far, and there's no known way to provoke it
deliberately.  We'll have to ship them a wrapper that produces more logging
in this case and hope it happens again.
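
The extra logging hasn't been written yet, but I'd expect it to amount to
something like this sketch.  The function names and the error value are
placeholders (the real constant comes from sp.h); the point is just to capture
every output parameter at the moment the odd combination shows up.

```c
#include <stdio.h>

/* Placeholder; the real value comes from Spread's sp.h. */
#define SKETCH_BUFFER_TOO_SHORT (-15)

/* True for the combination we can't explain: the call says the buffer was
   too short, yet endian_mismatch isn't a negated size. */
static int
is_anomalous(int size, int endian)
{
	return size == SKETCH_BUFFER_TOO_SHORT && endian >= 0;
}

/* Format one log line with everything the receive call handed back.
   Returns whatever snprintf() returns. */
static int
format_receive_diagnostics(char *out, size_t outsize, int size, int endian,
			   int num_groups, int max_groups, int bufsize)
{
	return snprintf(out, outsize,
			"SP_receive anomaly: size=%d endian=%d "
			"num_groups=%d max_groups=%d bufsize=%d",
			size, endian, num_groups, max_groups, bufsize);
}
```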

> If you are having several parameters come back messed up on different
> occasions

The only oddity we know of is this exceedingly shy combination of a
BUFFER_TOO_SHORT return with endian_mismatch >= 0.  Terabytes of messages
have gone thru this code without other surprises.

> I would guess that you might be getting stack or heap corruption.

Very selective corruption if so <wink>, but, yes, that is a possibility I
can't rule out.

> It looks like you are using multiple threads.

Oh yes.  Many, although in this specific app only one thread ever tries to
read from the mbox.  I've got about 20 years' thread experience myself, so
you can be sure I'm suitably paranoid here.

> Are you sure that different threads aren't trying to use the same
> memory variables (i.e. - are they using local stack variables or not)?

Yes, they're autos.

> Is it possible that a parallel memcpy or something could be overwriting
> that portion of your data?

Or Spread's internal data, or the network's, or ... I can't rule those out
either, but they're not high-probability causes.  I can't rule out flaky
hardware at this point, either!  The symptoms are consistent with that
low-probability hypothesis too.