[Spread-users] ppc version of spread 4.1.0 closing connection
Mark Swan
mswan at cray.com
Fri Sep 10 16:29:30 EDT 2010
Excellent suggestion. Thanks. I've compiled a list of successes and failures below.
First, here's the most stripped down version of the source code I'm using to demonstrate the problem:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* getpid(), sleep() */
#include <sp.h>

#define MY_MAX_NUM_GROUPS 1000
#define MY_MAX_MESS_SIZE 102400

int main (int argc, char **argv)
{
    int rc;
    char private_name[MAX_GROUP_NAME];
    int priority = 0;
    int group_membership = 1;
    mailbox mbox;
    char private_group[MAX_GROUP_NAME];
    service service_type;
    char sender[MAX_GROUP_NAME];
    int num_groups;
    char groups[MY_MAX_NUM_GROUPS][MAX_GROUP_NAME];
    int16 mess_type;
    int endian_mismatch;
    int mess_len;
    char mess[MY_MAX_MESS_SIZE];

    /* Use the pid to get a unique private name for this connection. */
    sprintf(private_name,"P%d",(int)getpid());

    /* argv[1] is the spread name to connect to. */
    rc = SP_connect(argv[1],private_name,priority,group_membership,&mbox,private_group);
    if (rc < 0) {
        printf("SP_connect() failed - %d\n",rc);
        exit(1);
    }
    printf("SP_connect() returned %d\n",rc);
    printf("mbox=%d,private_group='%s'\n",mbox,private_group);

    rc = SP_join(mbox,"xyz");
    if (rc < 0) {
        printf("SP_join() failed - %d\n",rc);
        exit(1);
    }
    printf("SP_join() returned %d\n",rc);

    /* Give the daemon time to process (or reject) the join. */
    sleep(5);

    rc = SP_receive(mbox, &service_type, sender, MY_MAX_NUM_GROUPS,
                    &num_groups, groups, &mess_type, &endian_mismatch,
                    MY_MAX_MESS_SIZE, mess);
    if (rc < 0) {
        printf("SP_receive() failed - %d\n",rc);
        /* NOTE: mess_len is never assigned, so this call reports an
           arbitrary code; SP_error(rc) would report the actual error. */
        SP_error(mess_len);
        exit(1);
    }
    printf("SP_receive() returned %d\n",rc);
    exit(0);
}
I built this code on both PPC and X86 platforms against both the 4.0.0 and 4.1.0 releases of spread. I also have PPC and X86 spread daemons built and running from both the 4.0.0 and 4.1.0 releases. My biggest heartburn, obviously, is that a 4.1.0 ppc can't talk to a 4.1.0 x86 and vice versa.
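Briefly, it's the SP_join() that's failing, not the SP_receive(). SP_join() itself still returns 0 in the client; the daemon then rejects the join message and kills the session, so the client only sees the failure at its next SP_receive(). When this happens, the spread daemon's error message is typically "Sess_validate_read_header: Message has illegal type field 0x80000180".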
A typical sequence seen in the daemon's log file for a successful execution of my test code is below. The "Bad file descriptor" error is simply the daemon noticing that my app has exited after the SP_receive() that follows the 5 second sleep.
[Fri 10 Sep 2010 10:32:29] Sess_accept: set sndbuf/rcvbuf to 204800
[Fri 10 Sep 2010 10:32:29] Sess_recv_client_auth: Client requested NULL type authentication
[Fri 10 Sep 2010 10:32:29] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18315 on mailbox 9
[Fri 10 Sep 2010 10:32:29] Sess_read: Message has type field 0x10000
[Fri 10 Sep 2010 10:32:29] Sess_read: queueing message of type 8 with len 0 to the protocol
[Fri 10 Sep 2010 10:32:34] Sess_read: failed receiving header on session 9: ret 0: error: Bad file descriptor
[Fri 10 Sep 2010 10:32:34] Sess_kill: killing session P18315 ( mailbox 9 )
In this same successful run, my test code spits out this:
SP_connect() returned 1
mbox=3,private_group='#P18315#localhost'
SP_join() returned 0
SP_receive() returned 56
A typical sequence in a failed SP_join() looks like this:
[Fri 10 Sep 2010 10:28:11] Sess_accept: set sndbuf/rcvbuf to 204800
[Fri 10 Sep 2010 10:28:11] Sess_recv_client_auth: Client requested NULL type authentication
[Fri 10 Sep 2010 10:28:11] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18147 on mailbox 9
[Fri 10 Sep 2010 10:28:11] Sess_read: Message has type field 0x80000180
[Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
[Fri 10 Sep 2010 10:28:11] Sess_kill: killing session P18147 ( mailbox 9 )
And my code spits out this:
SP_connect() returned 1
mbox=3,private_group='#P18147#localhost'
SP_join() returned 0
SP_receive() failed - -8
SP_error: (0) unrecognized error
Here's a summary of all sixteen client/daemon combinations. For each failure, the daemon's log message follows on the next line.
4.0.0/ppc client talking to 4.0.0/ppc daemon: success
4.0.0/ppc client talking to 4.1.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
4.0.0/ppc client talking to 4.0.0/x86 daemon: success
4.0.0/ppc client talking to 4.1.0/x86 daemon: success

4.1.0/ppc client talking to 4.0.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:48:43] Sess_validate_read_header: Message has illegal type field 0x100
4.1.0/ppc client talking to 4.1.0/ppc daemon: success
4.1.0/ppc client talking to 4.0.0/x86 daemon: failure
    [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
4.1.0/ppc client talking to 4.1.0/x86 daemon: failure
    [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180

4.0.0/x86 client talking to 4.0.0/ppc daemon: success
4.0.0/x86 client talking to 4.1.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:52:40] Sess_validate_read_header: Message has illegal type field 0x80000180
4.0.0/x86 client talking to 4.0.0/x86 daemon: success
4.0.0/x86 client talking to 4.1.0/x86 daemon: success

4.1.0/x86 client talking to 4.0.0/ppc daemon: success
4.1.0/x86 client talking to 4.1.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:57:11] Sess_validate_read_header: Message has illegal type field 0x80000180
4.1.0/x86 client talking to 4.0.0/x86 daemon: success
4.1.0/x86 client talking to 4.1.0/x86 daemon: success