[Spread-users] after fork: session 'xxx' trying to make session 'yyy' do something

Yair Amir yairamir at cnds.jhu.edu
Thu Oct 3 07:21:34 EDT 2002


Hi,

Try not touching the original connection: just connect in the child
with a new name, use that new connection, and see if that works.
I know it is not nice to leave the first connection open in the
child, but I would first verify that closing it is the problem.

    Cheers,
    
    :) Yair.
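
The experiment suggested above can be tried without Spread at all, at
least as far as descriptor numbering goes. The following is a minimal
sketch, not Spread code: /dev/null stands in for both the parent's mbox
and the child's new connection, and the function name is hypothetical.
It only shows that a child which leaves the inherited descriptor open
cannot be handed the same number for its next descriptor. Whether that
also avoids the daemon error is an assumption to verify against a real
SP_connect.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child that keeps the inherited descriptor open
 * receives a different number for its next descriptor, 0 if it receives
 * the same number, and -1 on error.
 */
int child_fd_differs_when_inherited_left_open(void)
{
    /* Stand-in for the parent's mbox. */
    int inherited = open("/dev/null", O_RDONLY);
    if (inherited < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(inherited);
        return -1;
    }

    if (pid == 0) {
        /* Child: deliberately do NOT close the inherited descriptor. */
        int fresh = open("/dev/null", O_RDONLY); /* stand-in for the new connection */
        if (fresh < 0)
            _exit(2);
        _exit(fresh != inherited ? 0 : 1);
    }

    /* Parent: collect the child's verdict from its exit status. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        close(inherited);
        return -1;
    }
    close(inherited);

    if (!WIFEXITED(status) || WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

In a real test the two open() calls would be replaced by SP_connect with
distinct private names, and the child would go on to join and multicast.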
    
James> Yair Amir wrote:
>> 
>> I am missing some information. For example, I need to see the EXACT
>> SP_multicast call you make (including all of its parameters).
>> 
>>              :) Yair.

James> The basic question is: how can I get rid of an inherited spread connection
James> in a forked child process *without* disconnecting the parent process?
James> Just doing a close() on the mbox gets rid of the unix file descriptor,
James> but apparently doesn't update the session table maintained in the client
James> library, which then leads to inconsistencies in the communication with
James> the daemon.  Doing an SP_disconnect() in the child will clean up the
James> session table, but will also disconnect the parent.  Note that I don't
James> exec() anything in the child, so the fd's close-on-exec flag is
James> irrelevant.

James> What follows is (annotated) trace output showing the parameters
James> of all of the spread calls except for SP_receive.  The message contents
James> are also omitted.  In this run, there's only one server process running,
James> which forks itself to execute one job.  There is only one spread group
James> (named "default").  The server process ID is 27352, the child is 27368.

James> // server process initialization, connect and join
James> 27352[1] SPREAD_TRACE: SP_connect('4803@benin', 'wKgAEGrYS', 0, 1, 0xbffff4ac, 0xbffff424)
James> 27352[1] SPREAD_TRACE: SP_connect() returns 1, mbox=4, private_group=#wKgAEGrYS#benin
James> 27352[1] SPREAD_TRACE: SP_join(4, 'default')
James> 27352[1] SPREAD_TRACE: SP_join() returns 0

James> // server notified of its own membership
James> 27352[1] SPREAD_TRACE: received spread membership message

James> // messages are sent and received to trigger the execution of a job
James> 27352[1] SPREAD_TRACE: SP_multicast(4, RELIABLE_MESS, 'default', 0 , 0x4064d000, 11)
James> 27352[1] SPREAD_TRACE: SP_multicast() returns 11
James> 27352[1] SPREAD_TRACE: received spread data message
James> 27352[1] SPREAD_TRACE: SP_multicast(4, RELIABLE_MESS, '#wKgAEGrYS#benin', 0 , 0x404be000, 30)
James> 27352[1] SPREAD_TRACE: SP_multicast() returns 30
James> 27352[1] SPREAD_TRACE: received spread data message

James> // job process is forked, inherits and then closes the server's mbox (FD 4)
James> 27368[1] SPREAD_TRACE: worker process started
James> 27368[1] SPREAD_TRACE: close mbox 4

James> // job process reconnects, its mbox is also FD 4
James> 27368[1] SPREAD_TRACE: SP_connect('4803@benin', 'wKgAEGroJ', 0, 0, 0xbffff23c, 0xbffff194)
James> 27368[1] SPREAD_TRACE: SP_connect() returns 1, mbox=4, private_group=#wKgAEGroJ#benin
James> 27368[1] SPREAD_TRACE: SP_join(4, 'default')

James> // This message is from the spread daemon, and appears in the console output
James> // *before* the SP_join call returns 0
James> Sess_validate_read_header: Session wKgAEGrYS trying to make session wKgAEGroJ do something
James> 27368[1] SPREAD_TRACE: SP_join() returns 0

James> // The child's first multicast dies
James> 27368[1] SPREAD_TRACE: SP_multicast(4, RELIABLE_MESS, 'default', 0 , 0x406ff000, 95)
James> 27368[1] SPREAD_TRACE: terminate after SP_multicast returns -8

James> My theory: when the child's SP_connect takes place, the client library's
James> session table still has an entry in it for mbox 4, inherited from the parent
James> process.  The code in SP_connect just appends a new session entry to the
James> table, which also happens to have mbox 4.  When the child performs the
James> SP_join, the call to SP_get_session() finds the inherited entry, with
James> the parent's private ID "wKgAEGrYS", and uses it to transmit the join
James> message.  But: the daemon has associated the connection to the child's mbox
James> with the private ID "wKgAEGroJ", resulting in the authorization failure.
James> What I need is a way to get the client library to remove the inherited
James> session entry *without* disconnecting.
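
The theory above can be illustrated with a toy model. This is a minimal
sketch under stated assumptions, not the code in sp.c: the struct, the
table, and the function names table_append, table_lookup, and
table_forget are all hypothetical. It assumes only what the theory
states, namely that connecting appends an entry and that lookup returns
the first entry whose mbox matches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SESSIONS 8
#define NAME_LEN 32

/* Hypothetical stand-in for one client-side session entry. */
struct session {
    int  mbox;                 /* file descriptor used as the mailbox */
    char private_id[NAME_LEN]; /* private name given at connect time  */
};

struct session table[MAX_SESSIONS];
int num_sessions = 0;

/* Append an entry without checking whether the mbox is already present. */
void table_append(int mbox, const char *private_id)
{
    assert(num_sessions < MAX_SESSIONS);
    table[num_sessions].mbox = mbox;
    strncpy(table[num_sessions].private_id, private_id, NAME_LEN - 1);
    table[num_sessions].private_id[NAME_LEN - 1] = '\0';
    num_sessions++;
}

/* Return the private id of the first entry whose mbox matches, or NULL. */
const char *table_lookup(int mbox)
{
    for (int i = 0; i < num_sessions; i++)
        if (table[i].mbox == mbox)
            return table[i].private_id;
    return NULL;
}

/* Drop every entry for mbox on the client side only; returns the count removed. */
int table_forget(int mbox)
{
    int kept = 0, removed = 0;
    for (int i = 0; i < num_sessions; i++) {
        if (table[i].mbox == mbox)
            removed++;
        else
            table[kept++] = table[i];
    }
    num_sessions = kept;
    return removed;
}
```

Appending (4, "wKgAEGrYS") and then (4, "wKgAEGroJ") makes
table_lookup(4) return the parent's ID, which matches the mismatch the
daemon reported. Calling table_forget(4) before the second append is the
kind of client-side-only cleanup being asked for.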

James> Thanks, Jim


>> 
>> James> Hi,
>> 
>> James> I'm a Spread beginner.  We're using Spread 3.16.2 in a system which
>> James> distributes work from a shared queue to a number of distributed
>> James> compute servers.  The servers each join spread groups with the same
>> James> name as the queue, and I use the group ordering to designate one as
>> James> the "leader".  The leader is responsible for monitoring the queue
>> James> and dispatching jobs to the remaining servers.
>> 
>> James> When a job is dispatched, the server which is running it forks a
>> James> new process in which to execute it.  The FD for the spread connection
>> James> is (of course) inherited across the fork.  The child needs its own
>> James> connection to the spread daemon, so it closes the parent's mbox
>> James> (with close(2)), and then calls SP_connect to reconnect.
>> James> Schematically:
>> 
>> James>    parent_mbox = SP_connect("xxx");
>> James>    SP_join(parent_mbox, "queue");
>> James>    while ( ... ) {
>> James>      SP_receive();
>> James>      if ( fork() == 0 ) {
>> James>         // child process for a new job
>> James>         close(parent_mbox);
>> James>         child_mbox = SP_connect("yyy");
>> James>         SP_join(child_mbox, "queue");
>> James>         SP_multicast()
>> James>      }
>> James>    }
>> 
>> James> If parent_mbox and child_mbox happen to be assigned the same (numerical)
>> James> FD, then the spread daemon produces the error message quoted in the
>> James> subject line, where "xxx" and "yyy" are the private names of the parent
>> James> and child, respectively, and the child's first attempt to SP_multicast()
>> James> dies with error -8: "connection closed by spread".  I can work around
>> James> the bug by doing a dummy open() or pipe() call between the close()
>> James> and the SP_connect() in the child.
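
The workaround described above relies on POSIX handing out the lowest
free descriptor number. The following is a minimal sketch of that
effect only: /dev/null stands in for the Spread connection, and the
function names open_stand_in and reopen_after_close are hypothetical.
In real code the final open would be the child's SP_connect.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open a stand-in for a connection; returns the descriptor or -1. */
int open_stand_in(void)
{
    return open("/dev/null", O_RDONLY);
}

/*
 * Close old_fd, then open a replacement descriptor.
 * If hold_placeholder is nonzero, a dummy descriptor is opened first so
 * that it occupies the slot just freed; it is returned via *placeholder
 * and must stay open for the replacement to keep a different number.
 * Returns the replacement descriptor, or -1 on error.
 */
int reopen_after_close(int old_fd, int hold_placeholder, int *placeholder)
{
    *placeholder = -1;
    if (close(old_fd) != 0)
        return -1;

    if (hold_placeholder) {
        /* The dummy open takes the lowest free number, i.e. old_fd's slot. */
        *placeholder = open("/dev/null", O_RDONLY);
        if (*placeholder < 0)
            return -1;
    }

    /* Without a placeholder this reuses old_fd's number. */
    return open_stand_in();
}
```

Without the placeholder the replacement comes back with the same number
as the descriptor just closed, which is the colliding case. With the
placeholder held open, the replacement is forced onto a different
number.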
>> 
>> James> After a quick examination of sp.c, it appears that I don't want to do
>> James> an SP_disconnect() in the child, because that actually updates the
>> James> membership at the daemon and the parent is not actually disconnecting.
>> James> The internal function SP_kill() looks like it does what I want (close
>> James> the mbox and update the client-side session table), but it's static and
>> James> undocumented.
>> 
>> James> Suggestions? Any help would be appreciated.
>> 
>> James> Thanks, Jim




