[Spread-users] ppc version of spread 4.1.0 closing connection

Mark Swan mswan at cray.com
Fri Sep 10 16:29:30 EDT 2010


Excellent suggestion.  Thanks.  I've compiled a list of successes and failures below.

First, here's the most stripped-down version of the source code I'm using to demonstrate the problem:

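#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* getpid(), sleep() */

#include <sp.h>

#define MY_MAX_NUM_GROUPS 1000
#define MY_MAX_MESS_SIZE 102400

int main (int argc, char **argv)
{
    int rc;
    char private_name[MAX_GROUP_NAME];
    int priority = 0;
    int group_membership = 1;
    mailbox mbox;
    char private_group[MAX_GROUP_NAME];
    service service_type;
    char sender[MAX_GROUP_NAME];
    int num_groups;
    char groups[MY_MAX_NUM_GROUPS][MAX_GROUP_NAME];
    int16 mess_type;
    int endian_mismatch;
    char mess[MY_MAX_MESS_SIZE];

    /* argv[1] is the spread name of the daemon to connect to */
    if (argc < 2) {
        fprintf(stderr,"usage: %s <spread_name>\n",argv[0]);
        exit(1);
    }

    /* Build a private name from the pid so each run gets a distinct one */
    snprintf(private_name,sizeof(private_name),"P%d",(int)getpid());
    rc = SP_connect(argv[1],private_name,priority,group_membership,&mbox,private_group);
    if (rc < 0) {
        printf("SP_connect() failed - %d\n",rc);
        exit(1);
    }
    printf("SP_connect() returned %d\n",rc);
    printf("mbox=%d,private_group='%s'\n",mbox,private_group);

    rc = SP_join(mbox,"xyz");
    if (rc < 0) {
        printf("SP_join() failed - %d\n",rc);
        exit(1);
    }
    printf("SP_join() returned %d\n",rc);

    /* Give the daemon time to process the join before reading */
    sleep(5);

    /* service_type is also read by SP_receive() as input flags, so start it at 0 */
    service_type = 0;
    rc = SP_receive(mbox, &service_type, sender, MY_MAX_NUM_GROUPS,
                    &num_groups, groups, &mess_type, &endian_mismatch,
                    MY_MAX_MESS_SIZE, mess);
    if (rc < 0) {
        printf("SP_receive() failed - %d\n",rc);
        /* The runs logged below passed an uninitialized mess_len here, hence
           their "(0) unrecognized error" line; rc is what SP_error() expects. */
        SP_error(rc);
        exit(1);
    }
    printf("SP_receive() returned %d\n",rc);

    exit(0);
}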


I built this code on both PPC and X86 platforms against both the 4.0.0 and 4.1.0 releases of spread.  I also have PPC and X86 spread daemons built and running from both the 4.0.0 and 4.1.0 releases.  My biggest heartburn, obviously, is that a 4.1.0 ppc can't talk to a 4.1.0 x86 and vice versa.

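Briefly, the failure originates with SP_join(), not SP_receive().  SP_join() itself returns 0 on the client, but the daemon rejects the join message and kills the session, so the later SP_receive() is where the client first sees an error.  When this happens, the spread daemon's error message is typically "Sess_validate_read_header: Message has illegal type field 0x80000180".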
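
One way to read the rejected type values is as a byte-order problem.  The sketch below is only an illustration of that reading, not a diagnosis: it byte-swaps the join type that the daemon logs on a good run (0x10000) and compares it with the rejected values in this report (0x80000180 and 0x100).  The mask 0x80000080 is an assumption on my part about which bits carry the endian marker (I believe it matches ENDIAN_RESERVED in sp.h, but check your headers), and the helper names are made up for this example.

```c
#include <stdint.h>

/* ASSUMPTION: bits of the type field used as the endian marker.
 * Believed to match ENDIAN_RESERVED in sp.h; verify against your copy. */
#define ASSUMED_ENDIAN_BITS 0x80000080u

/* Type field the daemon logs for an accepted join in this report. */
#define JOIN_TYPE_AS_LOGGED 0x00010000u

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Clear the assumed endian-marker bits, leaving the rest of the type. */
uint32_t strip_endian_bits(uint32_t type)
{
    return type & ~ASSUMED_ENDIAN_BITS;
}

/* Returns 1 if, once the assumed marker bits are removed, the rejected
 * value equals the byte-swapped join type; 0 otherwise. */
int looks_like_swapped_join(uint32_t rejected)
{
    return strip_endian_bits(rejected) == swap32(JOIN_TYPE_AS_LOGGED);
}
```

Under that assumption, looks_like_swapped_join() returns 1 for both 0x80000180 and 0x100, and 0 for the accepted value 0x10000, which would be consistent with one side swapping the header when it should not (or not swapping when it should).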

A typical sequence seen in the daemon's log file for a successful execution of my test code is below.  The "Bad file descriptor" error is logged simply because my app exits, without disconnecting, as soon as its SP_receive() returns after the 5 second sleep.

[Fri 10 Sep 2010 10:32:29] Sess_accept: set sndbuf/rcvbuf to 204800
[Fri 10 Sep 2010 10:32:29] Sess_recv_client_auth: Client requested NULL type authentication
[Fri 10 Sep 2010 10:32:29] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18315 on mailbox 9
[Fri 10 Sep 2010 10:32:29] Sess_read: Message has type field 0x10000
[Fri 10 Sep 2010 10:32:29] Sess_read: queueing message of type 8 with len 0 to the protocol
[Fri 10 Sep 2010 10:32:34] Sess_read: failed receiving header on session 9: ret 0: error: Bad file descriptor 
[Fri 10 Sep 2010 10:32:34] Sess_kill: killing session P18315 ( mailbox 9 )

In this same successful run, my test code spits out this:

SP_connect() returned 1
mbox=3,private_group='#P18315#localhost'
SP_join() returned 0
SP_receive() returned 56

A typical sequence in a failed SP_join() looks like this:

[Fri 10 Sep 2010 10:28:11] Sess_accept: set sndbuf/rcvbuf to 204800
[Fri 10 Sep 2010 10:28:11] Sess_recv_client_auth: Client requested NULL type authentication
[Fri 10 Sep 2010 10:28:11] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18147 on mailbox 9
[Fri 10 Sep 2010 10:28:11] Sess_read: Message has type field 0x80000180
[Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
[Fri 10 Sep 2010 10:28:11] Sess_kill: killing session P18147 ( mailbox 9 )

And my code spits out this:

SP_connect() returned 1
mbox=3,private_group='#P18147#localhost'
SP_join() returned 0
SP_receive() failed - -8
SP_error: (0) unrecognized error
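
The SP_error line in that run reports code 0 rather than the -8 that SP_receive() returned, so it does not name the actual failure.  As a rough aid for reading the numbers in this message, here is a small lookup sketch.  The helper sp_code_to_name() is hypothetical (it is not part of the Spread API), and the values are what I believe sp.h defines in the 4.x releases, so treat them as assumptions and check them against your own header.

```c
#include <stddef.h>

/* Hypothetical lookup table; NOT part of the Spread API.
 * ASSUMPTION: values match the #defines in sp.h for Spread 4.x. */
struct sp_code_name {
    int         code;
    const char *name;
};

static const struct sp_code_name sp_code_names[] = {
    {   1, "ACCEPT_SESSION"      },
    {  -1, "ILLEGAL_SPREAD"      },
    {  -2, "COULD_NOT_CONNECT"   },
    {  -7, "REJECT_VERSION"      },
    {  -8, "CONNECTION_CLOSED"   },
    { -11, "ILLEGAL_SESSION"     },
    { -15, "BUFFER_TOO_SHORT"    },
    { -16, "GROUPS_TOO_SHORT"    },
};

/* Return the symbolic name for a Spread return code, or "UNKNOWN"
 * if the code is not in the (deliberately partial) table above. */
const char *sp_code_to_name(int code)
{
    size_t i;
    for (i = 0; i < sizeof(sp_code_names) / sizeof(sp_code_names[0]); i++) {
        if (sp_code_names[i].code == code)
            return sp_code_names[i].name;
    }
    return "UNKNOWN";
}
```

With those values, the 1 from SP_connect() maps to ACCEPT_SESSION and the -8 from SP_receive() maps to CONNECTION_CLOSED, which fits the daemon killing the session right after the bad join header.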

Here's a summary of the results for every client/daemon combination.  For each failure, the daemon's error message follows.

4.0.0/ppc client talking to 4.0.0/ppc daemon: success
4.0.0/ppc client talking to 4.1.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
4.0.0/ppc client talking to 4.0.0/x86 daemon: success
4.0.0/ppc client talking to 4.1.0/x86 daemon: success

4.1.0/ppc client talking to 4.0.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:48:43] Sess_validate_read_header: Message has illegal type field 0x100
4.1.0/ppc client talking to 4.1.0/ppc daemon: success
4.1.0/ppc client talking to 4.0.0/x86 daemon: failure
    [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
4.1.0/ppc client talking to 4.1.0/x86 daemon: failure
    [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180

4.0.0/x86 client talking to 4.0.0/ppc daemon: success
4.0.0/x86 client talking to 4.1.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:52:40] Sess_validate_read_header: Message has illegal type field 0x80000180
4.0.0/x86 client talking to 4.0.0/x86 daemon: success
4.0.0/x86 client talking to 4.1.0/x86 daemon: success

4.1.0/x86 client talking to 4.0.0/ppc daemon: success
4.1.0/x86 client talking to 4.1.0/ppc daemon: failure
    [Fri 10 Sep 2010 10:57:11] Sess_validate_read_header: Message has illegal type field 0x80000180
4.1.0/x86 client talking to 4.0.0/x86 daemon: success
4.1.0/x86 client talking to 4.1.0/x86 daemon: success
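
Transcribing the sixteen results into a client-by-daemon matrix makes the pattern easier to see.  The sketch below does that and checks one reading of the data: that a run fails exactly when one side, and only one side, is a given build.  The array contents come straight from the list above; the function names and the index order are just choices made for this example.

```c
#include <stdio.h>

enum { N_BUILDS = 4 };

/* Index order used for both clients (rows) and daemons (columns). */
static const char *const build_name[N_BUILDS] = {
    "4.0.0/ppc", "4.1.0/ppc", "4.0.0/x86", "4.1.0/x86"
};

/* result[client][daemon]: 1 = success, 0 = failure, as listed above. */
static const int result[N_BUILDS][N_BUILDS] = {
    /* 4.0.0/ppc client */ { 1, 0, 1, 1 },
    /* 4.1.0/ppc client */ { 0, 1, 0, 0 },
    /* 4.0.0/x86 client */ { 1, 0, 1, 1 },
    /* 4.1.0/x86 client */ { 1, 0, 1, 1 },
};

/* Count the failing client/daemon combinations. */
int count_failures(void)
{
    int c, d, n = 0;
    for (c = 0; c < N_BUILDS; c++)
        for (d = 0; d < N_BUILDS; d++)
            if (!result[c][d])
                n++;
    return n;
}

/* Returns 1 if every failure has build `b` on exactly one side and
 * every success does not; returns 0 if any combination disagrees. */
int fails_iff_exactly_one_side_is(int b)
{
    int c, d;
    for (c = 0; c < N_BUILDS; c++)
        for (d = 0; d < N_BUILDS; d++) {
            int one_side = (c == b) != (d == b);
            if (one_side != !result[c][d])
                return 0;
        }
    return 1;
}

/* Print the matrix, one client per row. */
void print_matrix(FILE *out)
{
    int c, d;
    for (c = 0; c < N_BUILDS; c++) {
        fprintf(out, "%s client:", build_name[c]);
        for (d = 0; d < N_BUILDS; d++)
            fprintf(out, " %s=%s", build_name[d], result[c][d] ? "ok" : "FAIL");
        fprintf(out, "\n");
    }
}
```

Calling fails_iff_exactly_one_side_is(1), where index 1 is 4.1.0/ppc, returns 1 for this data, and count_failures() returns 6.  In other words, the 4.1.0/ppc build appears to interoperate only with itself, while the other three builds all interoperate with each other.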
