[Spread-users] after fork: session 'xxx' trying to make session 'yyy' do something

James Rauser j.rauser at science-factory.com
Tue Oct 1 08:35:59 EDT 2002


Yair Amir wrote:
> 
> I am missing some information. For example I need to see the EXACT
> SP_multicast call you do (including all of its parameters).
> 
>              :) Yair.

The basic question is: how can I get rid of an inherited spread connection
in a forked child process *without* disconnecting the parent process?
Just doing a close() on the mbox gets rid of the unix file descriptor,
but apparently doesn't update the session table maintained in the client
library, which then leads to inconsistencies in the communication with
the daemon.  Doing an SP_disconnect() in the child will clean up the
session table, but will also disconnect the parent.  Note that I don't
exec() anything in the child, so the fd's close-on-exec flag is
irrelevant.
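
The descriptor reuse behind this can be seen without Spread at all. What
follows is only a minimal sketch in plain POSIX C, not Spread code: a
descriptor opened on /dev/null stands in for the socket that SP_connect
would create, and the function name reopen_after_close is made up for
illustration. It relies on the POSIX rule that open() returns the lowest
free descriptor number.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Close old_fd (the inherited "mbox"), then open a fresh descriptor that
 * stands in for the child's new connection.  If hold_placeholder is
 * nonzero, a dummy descriptor is opened first so that it occupies the
 * slot just freed; it is closed again once the new descriptor exists.
 * Returns the new descriptor number, or -1 on error.  The caller owns
 * the returned descriptor. */
int reopen_after_close(int old_fd, int hold_placeholder)
{
    int placeholder = -1;
    int new_fd;

    assert(old_fd >= 0);
    if (close(old_fd) != 0)
        return -1;

    if (hold_placeholder) {
        /* Takes the lowest free number, i.e. normally old_fd itself. */
        placeholder = open("/dev/null", O_RDONLY);
        if (placeholder < 0)
            return -1;
    }

    /* Stand-in for the descriptor SP_connect would hand back as mbox. */
    new_fd = open("/dev/null", O_RDWR);

    if (placeholder >= 0)
        close(placeholder);
    return new_fd;
}
```

Called with hold_placeholder == 0 on the most recently opened descriptor,
the function hands back the same number that was just closed, which is
the situation in which the stale session entry gets picked up. With
hold_placeholder != 0 the new descriptor lands on a different number.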

What follows is (annotated) trace output showing the parameters
of all of the spread calls except for SP_receive.  The message contents
are also omitted.  In this run, there's only one server process running,
which forks itself to execute one job.  There is only one spread group
(named "default").  The server process ID is 27352, the child is 27368.

// server process initialization, connect and join
27352[1] SPREAD_TRACE: SP_connect('4803@benin', 'wKgAEGrYS', 0, 1, 0xbffff4ac, 0xbffff424)
27352[1] SPREAD_TRACE: SP_connect() returns 1, mbox=4, private_group=#wKgAEGrYS#benin
27352[1] SPREAD_TRACE: SP_join(4, 'default')
27352[1] SPREAD_TRACE: SP_join() returns 0

// server notified of its own membership
27352[1] SPREAD_TRACE: received spread membership message

// messages are sent and received to trigger the execution of a job
27352[1] SPREAD_TRACE: SP_multicast(4, RELIABLE_MESS, 'default', 0 , 0x4064d000, 11)
27352[1] SPREAD_TRACE: SP_multicast() returns 11
27352[1] SPREAD_TRACE: received spread data message
27352[1] SPREAD_TRACE: SP_multicast(4, RELIABLE_MESS, '#wKgAEGrYS#benin', 0 , 0x404be000, 30)
27352[1] SPREAD_TRACE: SP_multicast() returns 30
27352[1] SPREAD_TRACE: received spread data message

// job process is forked, inherits and then closes the server's mbox (FD 4)
27368[1] SPREAD_TRACE: worker process started
27368[1] SPREAD_TRACE: close mbox 4

// job process reconnects, its mbox is also FD 4
27368[1] SPREAD_TRACE: SP_connect('4803@benin', 'wKgAEGroJ', 0, 0, 0xbffff23c, 0xbffff194)
27368[1] SPREAD_TRACE: SP_connect() returns 1, mbox=4, private_group=#wKgAEGroJ#benin
27368[1] SPREAD_TRACE: SP_join(4, 'default')

// This message is from the spread daemon, and appears in the console output
// *before* the SP_join call returns 0
Sess_validate_read_header: Session wKgAEGrYS trying to make session wKgAEGroJ do something
27368[1] SPREAD_TRACE: SP_join() returns 0

// The child's first multicast dies
27368[1] SPREAD_TRACE: SP_multicast(4, RELIABLE_MESS, 'default', 0 , 0x406ff000, 95)
27368[1] SPREAD_TRACE: terminate after SP_multicast returns -8

My theory: when the child's SP_connect takes place, the client library's
session table still has an entry in it for mbox 4, inherited from the parent
process.  The code in SP_connect just appends a new session entry to the
table, which also happens to have mbox 4.  When the child performs the
SP_join, the call to SP_get_session() finds the inherited entry, with
the parent's private ID "wKgAEGrYS", and uses it to transmit the join
message.  But: the daemon has associated the connection to the child's mbox
with the private ID "wKgAEGroJ", resulting in the authorization failure.
What I need is a way to get the client library to remove the inherited
session entry *without* disconnecting.
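
To make the theory concrete, here is a toy model of such a table. This is
a sketch under my assumptions only: the toy_* names and the struct layout
are invented for illustration and are not the real structures or
functions in sp.c. It shows an append-only connect, a first-match lookup
by mbox, and the kind of purely local "forget" operation I am asking
for, which touches the table and never contacts the daemon.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_SESSIONS 8
#define TOY_NAME_LEN     32

/* Toy stand-ins for the client library's session table (hypothetical). */
struct toy_session { int mbox; char private_name[TOY_NAME_LEN]; };
struct toy_table   { struct toy_session s[TOY_MAX_SESSIONS]; int count; };

/* Append a new entry without checking whether mbox is already present,
 * which is what the theory assumes the connect path does. */
int toy_connect(struct toy_table *t, int mbox, const char *name)
{
    assert(t != NULL && name != NULL);
    if (t->count >= TOY_MAX_SESSIONS)
        return -1;
    t->s[t->count].mbox = mbox;
    strncpy(t->s[t->count].private_name, name, TOY_NAME_LEN - 1);
    t->s[t->count].private_name[TOY_NAME_LEN - 1] = '\0';
    return t->count++;
}

/* Return the private name of the FIRST entry whose mbox matches. */
const char *toy_lookup(const struct toy_table *t, int mbox)
{
    assert(t != NULL);
    for (int i = 0; i < t->count; i++)
        if (t->s[i].mbox == mbox)
            return t->s[i].private_name;
    return NULL;
}

/* Local cleanup: drop every entry for mbox, no daemon traffic involved.
 * Returns the number of entries removed. */
int toy_forget(struct toy_table *t, int mbox)
{
    int kept = 0, removed = 0;

    assert(t != NULL);
    for (int i = 0; i < t->count; i++) {
        if (t->s[i].mbox != mbox)
            t->s[kept++] = t->s[i];
        else
            removed++;
    }
    t->count = kept;
    return removed;
}
```

In this model, connecting 4/"wKgAEGrYS" and then 4/"wKgAEGroJ" makes
toy_lookup(&t, 4) return the parent's name, which mirrors the daemon's
complaint. Calling toy_forget(&t, 4) before the second connect makes the
lookup return the child's name instead.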

Thanks, Jim


> 
> James> Hi,
> 
> James> I'm a Spread beginner.  We're using Spread 3.16.2 in a system which
> James> distributes work from a shared queue to a number of distributed
> James> compute servers.  The servers each join spread groups with the same
> James> name as the queue, and I use the group ordering to designate one as
> James> the "leader".  The leader is responsible for monitoring the queue
> James> and dispatching jobs to the remaining servers.
> 
> James> When a job is dispatched, the server which is running it forks a
> James> new process in which to execute it.  The FD for the spread connection
> James> is (of course) inherited across the fork.  The child needs its own
> James> connection to the spread daemon, so it closes the parent's mbox
> James> (with close(2)), and then calls SP_connect to reconnect.
> James> Schematically:
> 
> James>    parent_mbox = SP_connect("xxx");
> James>    SP_join(parent_mbox, "queue");
> James>    while ( ... ) {
> James>      SP_receive();
> James>      if ( fork() == 0 ) {
> James>         // child process for a new job
> James>         close(parent_mbox);
> James>         child_mbox = SP_connect("yyy");
> James>         SP_join(child_mbox, "queue");
> James>         SP_multicast()
> James>      }
> James>    }
> 
> James> If parent_mbox and child_mbox happen to be assigned the same (numerical)
> James> FD, then the spread daemon produces the error message quoted in the
> James> subject line, where "xxx" and "yyy" are the private names of the parent
> James> and child, respectively, and the child's first attempt to SP_multicast()
> James> dies with error -8: "connection closed by spread".  I can work around
> James> the bug by doing a dummy open() or pipe() call between the close()
> James> and the SP_connect() in the child.
> 
> James> After a quick examination of sp.c, it appears that I don't want to do
> James> an SP_disconnect() in the child, because that actually updates the
> James> membership at the daemon and the parent is not actually disconnecting.
> James> The internal function SP_kill() looks like it does what I want (close
> James> the mbox and update the client-side session table), but it's static and
> James> undocumented.
> 
> James> Suggestions? Any help would be appreciated.
> 
> James> Thanks, Jim

-- 
------------------------------------------------------------------------
Jim Rauser                                          Science Factory GmbH
mailto:j.rauser at science-factory.com                       Unter Käster 1
Tel: +49 221 277 399 204                          50667 Cologne, Germany



