[Spread-users] ppc version of spread 4.1.0 closing connection

Mark Swan mswan at cray.com
Mon Sep 13 10:54:54 EDT 2010


This is a native compile on the ppc machine.

Looking at the 4.1.0 config.log file, there's an error message ...

configure:3528: checking whether byte ordering is bigendian
configure:3553: gcc -c -g -O2  conftest.c >&5
configure:3560: $? = 0
configure:3902: result: universal
configure:3929: checking how to run the C preprocessor
configure:4047: result: gcc -E
configure:4076: gcc -E  conftest.c
configure:4083: $? = 0
configure:4114: gcc -E  conftest.c
conftest.c:18:28: error: ac_nonexistent.h: No such file or directory

and we end up with ac_cv_c_bigendian=universal

In the 4.0.0 config.log file, it comes up with a clear "ac_cv_c_bigendian=yes".
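
One way to cross-check what configure decided is to compare the compile-time setting with what the hardware actually does. The sketch below is only an illustration: host_is_big_endian(), configured_big_endian() and build_matches_host() are hypothetical helper names, not part of Spread, and it assumes WORDS_BIGENDIAN is the macro the build ends up defining on a big-endian host (which is what autoconf's AC_C_BIGENDIAN normally arranges).

```c
#include <stdint.h>
#include <string.h>

/* Run-time probe: returns 1 if the most significant byte is stored first. */
static int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x01;
}

/* Compile-time view: returns 1 if this file was built with WORDS_BIGENDIAN. */
static int configured_big_endian(void)
{
#ifdef WORDS_BIGENDIAN
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 when the build-time setting agrees with the hardware. */
static int build_matches_host(void)
{
    return host_is_big_endian() == configured_big_endian();
}
```

If build_matches_host() returns 0 on the ppc box for a file compiled with the same flags as the library, the build picked up the wrong byte order.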

--
Mark Swan (mswan at cray.com)
office:651-605-9068/cell:651-239-8000

> -----Original Message-----
> From: spread-users-bounces at lists.spread.org [mailto:spread-users-
> bounces at lists.spread.org] On Behalf Of John Schultz
> Sent: Saturday, September 11, 2010 1:21 PM
> To: spread-users at lists.spread.org
> Subject: Re: [Spread-users] ppc version of spread 4.1.0 closing
> connection
> 
> I agree it looks like an endian issue, but Spread (clients and daemons)
> always sends in host byte order and sets the endian flags on the
> message so that if a receiver has opposite endianness it can flip as
> necessary.
> 
> There shouldn't be a difference between 4.0 and 4.1 in terms of the
> endian flipping or setting code.  If anything the autoconf / build
> system might have changed between the two versions such that the
> endianness is getting set wrong at compile time on one or more of your
> endpoints.
> 
> Are you cross compiling or anything weird like that?
> 
> Cheers!
> 
> -----
> John Lane Schultz
> Spread Concepts LLC
> Phn: 301 830 8100
> Cell: 443 838 2200
> 
> On Sep 11, 2010, at 1:01 AM, Mike Root wrote:
> 
> You are compiling with the wrong ARCH_ENDIAN flag set for 4.1 ppc.
> 
> Looks like an endianness issue.  Probably the 4.1 code base isn't doing
> the network byte order correctly.  4.0 sends and receives everything in
> network byte order, so it can talk to x86 correctly.  4.1 doesn't send
> in network byte order, so it can talk fine to itself, but it can't send
> data to people that talk in network byte order.
> 
> Here is the truth table based on your results
> 
> https://spreadsheets.google.com/pub?key=0Aj_CwkfwmcEZdEw4QWNLZkxpMGtQLWc2RHlWSmxNeFE&hl=en&single=true&gid=0&output=html
> 
> Looking at the code...
>  0x80000080 is ARCH_ENDIAN  (representing little-endian) and is set
> locally after the message is received.
> So for your type message 0x80000180 the received type is really
> 0x00000100  (the ARCH_ENDIAN added after the fact).
> 0x00000100  in little-endian is the same as 0x00010000 in big-endian.
> 0x10000 is the type representing a join (JOIN_MESS).
> 0x00100 is not a valid type -- (CAUSED_BY_JOIN would have additional
> bits set besides 0x100)
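
The arithmetic above can be checked with a few lines of C. This is a sketch only: flip_uint32() and strip_endian_flag() are hypothetical helpers written for this example (not Spread's own routines), and ENDIAN_FLAG_MASK simply reuses the 0x80000080 value quoted above.

```c
#include <stdint.h>

/* Assumed name; the value is the ARCH_ENDIAN flag quoted in this thread. */
#define ENDIAN_FLAG_MASK 0x80000080u

/* Reverse the byte order of a 32-bit value. */
static uint32_t flip_uint32(uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) |
           ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) |
           ((v << 24) & 0xFF000000u);
}

/* Remove the endian flag bits so only the message type bits remain. */
static uint32_t strip_endian_flag(uint32_t type)
{
    return type & ~ENDIAN_FLAG_MASK;
}
```

strip_endian_flag(0x80000180) gives 0x00000100, and flip_uint32(0x00000100) gives 0x00010000, the JOIN_MESS value mentioned above.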
> 
> ****Looking at the errors with the 4.1ppc daemon
> 4.1.0/x86 client talking to 4.1.0/ppc daemon: failure [Fri 10 Sep 2010 10:57:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> The type flag is bad and the daemon is running in little-endian mode.
> 
> ****Looking at the errors with the 4.1ppc client
> 4.1.0/ppc client talking to 4.0.0/ppc daemon: failure [Fri 10 Sep 2010 10:48:43] Sess_validate_read_header: Message has illegal type field 0x100
> The 4.0.0 ppc daemon sees a bad field type, but is running in BIG-ENDIAN mode.
> 
> 4.1.0/ppc client talking to 4.0.0/x86 daemon: failure [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
> The type flag is bad and the daemon is running in little-endian mode.
> 
> The ppc and x86 should not be running with the same endianness (unless
> you have the ppc hardware tweaked to run in little-endian mode).  The
> 4.0ppc spread library seems to be running fine in big-endian, so you
> probably want the 4.1 library to also run in big-endian.
> 
> ***********************************************
> When you compile the 4.1 ppc version of the spread library, make sure
> ARCH_ENDIAN is 0x00000000 to represent big-endian mode.  You can do
> this by compiling with -DWORDS_BIGENDIAN (see arch.h in the source code).
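
A rough sketch of the compile-time selection being described, assuming arch.h keys the flag off WORDS_BIGENDIAN as stated above. The SKETCH_ names, sketch_tag_type() and sketch_same_endian() are made up for this example and are not Spread identifiers; only the two flag values come from this thread.

```c
#include <stdint.h>

/* Bits that carry the endian flag (value taken from the thread). */
#define SKETCH_ENDIAN_MASK 0x80000080u

/* Pick the local flag at compile time, as the advice above describes. */
#ifdef WORDS_BIGENDIAN
#define SKETCH_ARCH_ENDIAN 0x00000000u   /* big-endian build */
#else
#define SKETCH_ARCH_ENDIAN 0x80000080u   /* little-endian build */
#endif

/* Stamp a message type with the flag of the machine that built this code. */
static uint32_t sketch_tag_type(uint32_t type)
{
    return (type & ~SKETCH_ENDIAN_MASK) | SKETCH_ARCH_ENDIAN;
}

/* Returns 1 if a received type carries the same flag as this build. */
static int sketch_same_endian(uint32_t type)
{
    return (type & SKETCH_ENDIAN_MASK) == SKETCH_ARCH_ENDIAN;
}
```

Built with cc -DWORDS_BIGENDIAN the local flag is 0x00000000; built without it the flag is 0x80000080, which is what a little-endian x86 build would be expected to use. A ppc build that ends up without the define would therefore tag and interpret messages as if it were little-endian.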
> 
> 
> On Fri, Sep 10, 2010 at 1:29 PM, Mark Swan <mswan at cray.com> wrote:
> > Excellent suggestion.  Thanks.  I've compiled a list of success and
> > failure below.
> >
> > First, here's the most stripped down version of the source code I'm
> > using to demonstrate the problem:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>   /* getpid(), sleep() */
> >
> > #include <sp.h>
> >
> > #define MY_MAX_NUM_GROUPS 1000
> > #define MY_MAX_MESS_SIZE 102400
> >
> > int main (int argc, char **argv)
> > {
> >    int rc;
> >    char private_name[MAX_GROUP_NAME];
> >    int priority = 0;
> >    int group_membership = 1;
> >    mailbox mbox;
> >    char private_group[MAX_GROUP_NAME];
> >    service service_type = 0;   /* also read by SP_receive(); 0 = no DROP_RECV */
> >    char sender[MAX_GROUP_NAME];
> >    int num_groups;
> >    char groups[MY_MAX_NUM_GROUPS][MAX_GROUP_NAME];
> >    int16 mess_type;
> >    int endian_mismatch;
> >    char mess[MY_MAX_MESS_SIZE];
> >
> >    /* argv[1] is the spread name of the daemon to connect to */
> >    if (argc < 2) {
> >        fprintf(stderr,"usage: %s <spread_name>\n",argv[0]);
> >        exit(1);
> >    }
> >
> >    /* connect with a private name derived from the pid */
> >    snprintf(private_name,sizeof(private_name),"P%d",(int)getpid());
> >    rc = SP_connect(argv[1],private_name,priority,group_membership,
> >                    &mbox,private_group);
> >    if (rc < 0) {
> >        printf("SP_connect() failed - %d\n",rc);
> >        exit(1);
> >    }
> >    printf("SP_connect() returned %d\n",rc);
> >    printf("mbox=%d,private_group='%s'\n",mbox,private_group);
> >
> >    rc = SP_join(mbox,"xyz");
> >    if (rc < 0) {
> >        printf("SP_join() failed - %d\n",rc);
> >        exit(1);
> >    }
> >    printf("SP_join() returned %d\n",rc);
> >
> >    /* give the daemon time to process the join */
> >    sleep(5);
> >
> >    rc = SP_receive(mbox, &service_type, sender, MY_MAX_NUM_GROUPS,
> >                    &num_groups, groups, &mess_type, &endian_mismatch,
> >                    MY_MAX_MESS_SIZE, mess);
> >    if (rc < 0) {
> >        printf("SP_receive() failed - %d\n",rc);
> >        SP_error(rc);   /* report the failing return code */
> >        exit(1);
> >    }
> >    printf("SP_receive() returned %d\n",rc);
> >
> >    exit(0);
> > }
> >
> >
> > I built this code on both PPC and X86 platforms against both the
> > 4.0.0 and 4.1.0 releases of spread.  I also have PPC and X86 spread
> > daemons built and running from both the 4.0.0 and 4.1.0 releases.  My
> > biggest heartburn, obviously, is that a 4.1.0 ppc can't talk to a 4.1.0
> > x86 and vice versa.
> >
> > Briefly, it's the SP_join() that's failing, not the SP_receive().
> > When SP_join() fails, the spread daemon's error message is typically
> > "Sess_validate_read_header: Message has illegal type field 0x80000180".
> >
> > A typical sequence seen in the daemon's log file for a successful
> > execution of my test code is below.  The "Bad file descriptor" error is
> > simply when my app executes the SP_receive() after the 5 second sleep
> > and then exits.
> >
> > [Fri 10 Sep 2010 10:32:29] Sess_accept: set sndbuf/rcvbuf to 204800
> > [Fri 10 Sep 2010 10:32:29] Sess_recv_client_auth: Client requested NULL type authentication
> > [Fri 10 Sep 2010 10:32:29] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18315 on mailbox 9
> > [Fri 10 Sep 2010 10:32:29] Sess_read: Message has type field 0x10000
> > [Fri 10 Sep 2010 10:32:29] Sess_read: queueing message of type 8 with len 0 to the protocol
> > [Fri 10 Sep 2010 10:32:34] Sess_read: failed receiving header on session 9: ret 0: error: Bad file descriptor
> > [Fri 10 Sep 2010 10:32:34] Sess_kill: killing session P18315 ( mailbox 9 )
> >
> > In this same successful run, my test code spits out this:
> >
> > SP_connect() returned 1
> > mbox=3,private_group='#P18315#localhost'
> > SP_join() returned 0
> > SP_receive() returned 56
> >
> > A typical sequence in a failed SP_join() looks like this:
> > [Fri 10 Sep 2010 10:28:11] Sess_accept: set sndbuf/rcvbuf to 204800
> > [Fri 10 Sep 2010 10:28:11] Sess_recv_client_auth: Client requested NULL type authentication
> > [Fri 10 Sep 2010 10:28:11] Sess_session_authorized: Accepting from 0.0.0.0 with private name P18147 on mailbox 9
> > [Fri 10 Sep 2010 10:28:11] Sess_read: Message has type field 0x80000180
> > [Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> > [Fri 10 Sep 2010 10:28:11] Sess_kill: killing session P18147 ( mailbox 9 )
> >
> > And my code spits out this:
> >
> > SP_connect() returned 1
> > mbox=3,private_group='#P18147#localhost'
> > SP_join() returned 0
> > SP_receive() failed - -8
> > SP_error: (0) unrecognized error
> >
> > Here's a summary of the failures.
> >
> > 4.0.0/ppc client talking to 4.0.0/ppc daemon: success
> > 4.0.0/ppc client talking to 4.1.0/ppc daemon: failure [Fri 10 Sep 2010 10:28:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> > 4.0.0/ppc client talking to 4.0.0/x86 daemon: success
> > 4.0.0/ppc client talking to 4.1.0/x86 daemon: success
> >
> > 4.1.0/ppc client talking to 4.0.0/ppc daemon: failure [Fri 10 Sep 2010 10:48:43] Sess_validate_read_header: Message has illegal type field 0x100
> > 4.1.0/ppc client talking to 4.1.0/ppc daemon: success
> > 4.1.0/ppc client talking to 4.0.0/x86 daemon: failure [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
> > 4.1.0/ppc client talking to 4.1.0/x86 daemon: failure [Fri 10 Sep 2010 10:47:38] Sess_validate_read_header: Message has illegal type field 0x80000180
> >
> > 4.0.0/x86 client talking to 4.0.0/ppc daemon: success
> > 4.0.0/x86 client talking to 4.1.0/ppc daemon: failure [Fri 10 Sep 2010 10:52:40] Sess_validate_read_header: Message has illegal type field 0x80000180
> > 4.0.0/x86 client talking to 4.0.0/x86 daemon: success
> > 4.0.0/x86 client talking to 4.1.0/x86 daemon: success
> >
> > 4.1.0/x86 client talking to 4.0.0/ppc daemon: success
> > 4.1.0/x86 client talking to 4.1.0/ppc daemon: failure [Fri 10 Sep 2010 10:57:11] Sess_validate_read_header: Message has illegal type field 0x80000180
> > 4.1.0/x86 client talking to 4.0.0/x86 daemon: success
> > 4.1.0/x86 client talking to 4.1.0/x86 daemon: success
> >
> >
> >
> >
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users
> >
> 
> 
> 
> --
> Mike




