[Spread-users] ppc version of spread 4.1.0 closing connection

John Schultz jschultz at spreadconcepts.com
Sat Sep 11 14:20:42 EDT 2010


I agree it looks like an endian issue, but Spread (clients and daemons) always sends in host byte order and sets the endian flags on the message so that if a receiver has opposite endianness it can flip as necessary.
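
As a rough illustration of that scheme, here is a minimal sketch. The names are invented for this example and are not Spread's actual macros; the 0x80000080 flag value is the one quoted from the code further down in this thread. The point is that the flag bits read the same after a 32-bit byte swap, so a receiver can test them before it knows the sender's byte order.

```c
#include <stdint.h>

/* Hypothetical names. 0x80000080 is unchanged by a 32-bit byte swap. */
#define SKETCH_ENDIAN_MASK   0x80000080u
#define SKETCH_LITTLE_ENDIAN 0x80000080u
#define SKETCH_BIG_ENDIAN    0x00000000u

/* Reverse the four bytes of a 32-bit value. */
static uint32_t sketch_flip32(uint32_t v)
{
    return ((v >> 24) & 0x000000ffu) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | ((v << 24) & 0xff000000u);
}

/* Sender side: stamp the sender's own endian marker into the type field. */
static uint32_t sketch_set_endian(uint32_t type, uint32_t host_marker)
{
    return (type & ~SKETCH_ENDIAN_MASK) | host_marker;
}

/* Receiver side: raw_type is the 32-bit word as loaded on the receiver.
 * If the marker differs from the receiver's own, flip the bytes, then
 * clear the marker bits to recover the type in host order. */
static uint32_t sketch_decode_type(uint32_t raw_type, uint32_t host_marker)
{
    if ((raw_type & SKETCH_ENDIAN_MASK) != host_marker)
        raw_type = sketch_flip32(raw_type);
    return raw_type & ~SKETCH_ENDIAN_MASK;
}
```

With these helpers, a little-endian sender's join type 0x00010000 is stamped as 0x80010080, which a big-endian receiver loads as 0x80000180. Decoding with the big-endian marker gives back 0x00010000; decoding with the wrong (little-endian) marker skips the flip and leaves 0x00000100.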

There shouldn't be a difference between 4.0 and 4.1 in terms of the endian flipping or setting code.  If anything, the autoconf / build system might have changed between the two versions such that the endianness is getting set wrong at compile time on one or more of your endpoints.

Are you cross-compiling or anything weird like that?

Cheers!

-----
John Lane Schultz
Spread Concepts LLC
Phn: 301 830 8100
Cell: 443 838 2200

On Sep 11, 2010, at 1:01 AM, Mike Root wrote:

You are compiling with the wrong ARCH_ENDIAN flag set for 4.1 ppc.

Looks like an endianness issue.  Probably the 4.1 code base isn't doing
the network byte order correctly.  4.0 sends and receives everything
in network byte order, so it can talk to x86 correctly.  4.1 doesn't
send in network byte order, so it can talk fine to itself, but it
can't send data to people that talk in network byte order.

Here is the truth table based on your results

https://spreadsheets.google.com/pub?key=0Aj_CwkfwmcEZdEw4QWNLZkxpMGtQLWc2RHlWSmxNeFE&hl=en&single=true&gid=0&output=html

Looking at the code...
0x80000080 is ARCH_ENDIAN (representing little-endian) and is set
locally after the message is received.
So for your type message 0x80000180 the received type is really
0x00000100  (the ARCH_ENDIAN added after the fact).
0x00000100  in little-endian is the same as 0x00010000 in big-endian.
0x10000 is the type representing a join (JOIN_MESS).
0x00000100 is not a valid type -- (CAUSED_BY_JOIN would have additional
bits set besides 0x100)
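
The arithmetic above can be checked with a few lines of C. This is only a sketch: the macro and function names are hypothetical, and the constants are the values quoted in this message rather than copied from the Spread headers.

```c
#include <stdint.h>

/* Values as quoted in this message; names are made up for the example. */
#define EX_ENDIAN_FLAG 0x80000080u  /* little-endian ARCH_ENDIAN value */
#define EX_JOIN_TYPE   0x00010000u  /* type representing a join */

/* Remove the endian flag bits that were added to the type field. */
static uint32_t ex_strip_flag(uint32_t type)
{
    return type & ~EX_ENDIAN_FLAG;
}

/* Reverse the byte order of a 32-bit value. */
static uint32_t ex_swap32(uint32_t v)
{
    return ((v >> 24) & 0x000000ffu) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | ((v << 24) & 0xff000000u);
}

/* Returns 1 if the logged type turns into a join once the flag is
 * stripped and the bytes are swapped, 0 otherwise. */
static int ex_is_swapped_join(uint32_t logged_type)
{
    return ex_swap32(ex_strip_flag(logged_type)) == EX_JOIN_TYPE;
}
```

Calling ex_is_swapped_join(0x80000180u) returns 1: stripping the flag leaves 0x00000100, and swapping the bytes of that gives 0x00010000.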

****Looking at the errors with the 4.1ppc daemon
4.1.0/x86 client talking to 4.1.0/ppc daemon: failure
[Fri 10 Sep 2010 10:57:11] Sess_validate_read_header: Message has
illegal type field 0x80000180
The type flag is bad and the daemon is running in little-endian mode.

****Looking at the errors with the 4.1ppc client
4.1.0/ppc client talking to 4.0.0/ppc daemon: failure
[Fri 10 Sep 2010 10:48:43] Sess_validate_read_header: Message has
illegal type field 0x100
The 4.0.0 ppc daemon sees a bad field type, but is running in BIG-ENDIAN mode.

4.1.0/ppc client talking to 4.0.0/x86 daemon: failure
[Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has
illegal type field 0x80000180
The type flag is bad and the daemon is running in little-endian mode.

The ppc and x86 should not be running with the same endianness (unless
you have the ppc hardware tweaked to run in little-endian mode).  The
4.0 ppc spread library seems to be running fine in big-endian, so you
probably want the 4.1 library to also run in big-endian.

***********************************************
When you compile the 4.1 ppc version of the spread library make sure
ARCH_ENDIAN is 0x00000000 to represent big-endian mode.  You can do
this by compiling with -D WORDS_BIGENDIAN
(see arch.h in the source code).
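
One way to sanity-check a build, as a sketch: compare what the hardware reports at run time with what the compiler was told. WORDS_BIGENDIAN is the macro named above; the function names are invented for this example, and the check assumes it is compiled with the same flags as the library.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big-endian host, 0 on a little-endian host, by
 * looking at which byte of a known 32-bit value is stored first. */
static int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 0x01;
}

/* Returns 1 if the compile-time setting agrees with the hardware.
 * Assumption: WORDS_BIGENDIAN defined means the build expects
 * big-endian, undefined means it expects little-endian. */
static int build_matches_host(void)
{
#ifdef WORDS_BIGENDIAN
    return host_is_big_endian();
#else
    return !host_is_big_endian();
#endif
}
```

If build_matches_host() returns 0 on the ppc box, the compile-time endianness setting does not match the hardware, which is the kind of mismatch described above.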


On Fri, Sep 10, 2010 at 1:29 PM, Mark Swan <mswan at cray.com> wrote:
> Excellent suggestion.  Thanks.  I've compiled a list of success and failure below.
> 
> First, here's the most stripped down version of the source code I'm using to demonstrate the problem:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>   /* getpid(), sleep() */
> 
> #include <sp.h>
> 
> #define MY_MAX_NUM_GROUPS 1000
> #define MY_MAX_MESS_SIZE 102400
> 
> int main (int argc, char **argv)
> {
>    int rc;
>    char private_name[MAX_GROUP_NAME];
>    int priority = 0;
>    int group_membership = 1;
>    mailbox mbox;
>    char private_group[MAX_GROUP_NAME];
>    service service_type;
>    char sender[MAX_GROUP_NAME];
>    int num_groups;
>    char groups[MY_MAX_NUM_GROUPS][MAX_GROUP_NAME];
>    int16 mess_type;
>    int endian_mismatch;
>    int mess_len;
>    char mess[MY_MAX_MESS_SIZE];
> 
>    sprintf(private_name,"P%d",getpid());
>    rc = SP_connect(argv[1],private_name,0,1,&mbox,private_group);
>    if (rc < 0) {
>        printf("SP_connect() failed - %d\n",rc);
>        exit(1);
>    }
>    printf("SP_connect() returned %d\n",rc);
>    printf("mbox=%d,private_group='%s'\n",mbox,private_group);
> 
>    rc = SP_join(mbox,"xyz");
>    if (rc < 0) {
>        printf("SP_join() failed - %d\n",rc);
>        exit(1);
>    }
>    printf("SP_join() returned %d\n",rc);
> 
> 
>    sleep(5);
> 
>    rc = SP_receive(mbox, &service_type, sender, MY_MAX_NUM_GROUPS,
>                    &num_groups, groups, &mess_type, &endian_mismatch,
>                    MY_MAX_MESS_SIZE, mess);
>    if (rc < 0) {
>        printf("SP_receive() failed - %d\n",rc);
>        /* mess_len is never set; passing rc here would report the real error */
>        SP_error(mess_len);
>        exit(1);
>    }
>    printf("SP_receive() returned %d\n",rc);
> 
>    exit(0);
> }
> 
> 
> I built this code on both PPC and X86 platforms against both the 4.0.0 and 4.1.0 releases of spread.  I also have PPC and X86 spread daemons built and running from both the 4.0.0 and 4.1.0 releases.  My biggest heartburn, obviously, is that a 4.1.0 ppc can't talk to a 4.1.0 x86 and vice versa.
> 
> Briefly, it's the SP_join() that's failing, not the SP_receive().  When SP_join() fails, the spread daemon's error message is typically "Sess_validate_read_header: Message has illegal type field 0x80000180".
> 
> A typical sequence seen in the daemon's log file for a successful execution of my test code is below.  The "Bad file descriptor" error is simply when my app executes the SP_receive() after the 5 second sleep and then exits.
> 
> [Fri 10 Sep 2010 10:32:29] Sess_accept: set sndbuf/rcvbuf to 204800
> [Fri 10 Sep 2010 10:32:29] Sess_recv_client_auth: Client requested NULL type authentication
> [Fri 10 Sep 2010 10:32:29] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18315 on mailbox 9
> [Fri 10 Sep 2010 10:32:29] Sess_read: Message has type field 0x10000
> [Fri 10 Sep 2010 10:32:29] Sess_read: queueing message of type 8 with len 0 to the protocol
> [Fri 10 Sep 2010 10:32:34] Sess_read: failed receiving header on session 9: ret 0: error: Bad file descriptor
> [Fri 10 Sep 2010 10:32:34] Sess_kill: killing session P18315 ( mailbox 9 )
> 
> In this same successful run, my test code spits out this:
> 
> SP_connect() returned 1
> mbox=3,private_group='#P18315#localhost'
> SP_join() returned 0
> SP_receive() returned 56
> 
> A typical sequence in a failed SP_join() looks like this:
> [Fri 10 Sep 2010 10:28:11] Sess_accept: set sndbuf/rcvbuf to 204800
> [Fri 10 Sep 2010 10:28:11] Sess_recv_client_auth: Client requested NULL type authentication
> [Fri 10 Sep 2010 10:28:11] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18147 on mailbox 9
> [Fri 10 Sep 2010 10:28:11] Sess_read: Message has type field 0x80000180
> [Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> [Fri 10 Sep 2010 10:28:11] Sess_kill: killing session P18147 ( mailbox 9 )
> 
> And my code spits out this:
> 
> SP_connect() returned 1
> mbox=3,private_group='#P18147#localhost'
> SP_join() returned 0
> SP_receive() failed - -8
> SP_error: (0) unrecognized error
> 
> Here's a summary of the successes and failures.
> 
> 4.0.0/ppc client talking to 4.0.0/ppc daemon: success
> 4.0.0/ppc client talking to 4.1.0/ppc daemon: failure
> [Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> 4.0.0/ppc client talking to 4.0.0/x86 daemon: success
> 4.0.0/ppc client talking to 4.1.0/x86 daemon: success
> 
> 4.1.0/ppc client talking to 4.0.0/ppc daemon: failure
> [Fri 10 Sep 2010 10:48:43] Sess_validate_read_header: Message has illegal type field 0x100
> 4.1.0/ppc client talking to 4.1.0/ppc daemon: success
> 4.1.0/ppc client talking to 4.0.0/x86 daemon: failure
> [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
> 4.1.0/ppc client talking to 4.1.0/x86 daemon: failure
> [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
> 
> 4.0.0/x86 client talking to 4.0.0/ppc daemon: success
> 4.0.0/x86 client talking to 4.1.0/ppc daemon: failure
> [Fri 10 Sep 2010 10:52:40] Sess_validate_read_header: Message has illegal type field 0x80000180
> 4.0.0/x86 client talking to 4.0.0/x86 daemon: success
> 4.0.0/x86 client talking to 4.1.0/x86 daemon: success
> 
> 4.1.0/x86 client talking to 4.0.0/ppc daemon: success
> 4.1.0/x86 client talking to 4.1.0/ppc daemon: failure
> [Fri 10 Sep 2010 10:57:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> 4.1.0/x86 client talking to 4.0.0/x86 daemon: success
> 4.1.0/x86 client talking to 4.1.0/x86 daemon: success
> 



--
Mike
