[Spread-users] Problem with "name not unique" on Solaris 10

Andres Heinloo andres at gfz-potsdam.de
Wed Jul 29 14:30:14 EDT 2009



On Tue, 21 Jul 2009, Andres Heinloo wrote:

> 
> 
> On Fri, 17 Jul 2009, Daniel F. Savarese wrote:
> 
> > 
> > In message <Pine.LNX.4.64.0907171604170.5907 at st55.gfz-potsdam.de>, Andres Heinloo writes:
> > >User>j PICK
> > >
> > >User>l PICK
> > >
> > >User>q
> > 
> > If that's a complete log, then you're missing the receipt of two
> > group membership messages.  Something like the following after the
> > join:
> > 
> > ============================                            
> > Received REGULAR membership for group PICK with 1 members, where I am member 0:
> >         #tttt#localhost
> > grp id is -1062731517 1247454353 1                      
> > Due to the JOIN of #tttt#localhost
> > 
> > And something like the following after the leave:
> > 
> > ============================
> > received membership message that left group PICK
> > 
> > That may be why the connection name isn't being freed up immediately
> > (there are pending messages, so the Spread daemon may wait before
> > killing the connection's queue of messages).  Try waiting for the
> > membership messages to arrive before issuing the quit command and see
> > how long that takes.  That may help the Spread team tell you what's
> > going on.  I wasn't able to reproduce your problem, but I'm running
> > a patched version of Spread with some custom modifications.
> 
> Indeed, the membership messages are missing. It takes several minutes until 
> a membership message arrives. However, once the message arrives, Spread 
> starts to work normally.
> 
> I think it would not make sense to keep the message queue after a TCP 
> disconnect, especially if the client cannot reconnect anyway (name not 
> unique).
> 
> It looks like the Spread daemon is blocked somewhere at startup; however, it 
> still accepts and handles TCP connections.
> 
> As I said, the problem does *not* occur on Linux. On Solaris 10, it seems 
> that Spread works better when compiled with SunStudio rather than gcc (I 
> never got a membership message when compiled with gcc, but maybe I did not 
> wait long enough).
> 
> Any help would be very much appreciated.

Looks like I solved the problem.

Running Spread with debug enabled, I noticed repeating messages like this:

[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: element[0]: 32 bytes
[Wed 29 Jul 2009 19:52:19] DL_send: error: Invalid argument
 sending 32 bytes on channel 5 to address 127.0.0.1
[Wed 29 Jul 2009 19:52:19] DL_send: sent a message of -1 bytes to (127.0.0.1,4804) on channel 5

truss shows:
sendmsg(5, 0x004CD120, 0)                       Err#22 EINVAL

So I manually changed 'ARCH_SCATTER_ACCRIGHTS 1' to 'ARCH_SCATTER_NONE 1' 
in config.h, so that sendto() is used instead of sendmsg() in 
data_link.c:

#ifndef ARCH_SCATTER_NONE
        ret = sendmsg(chan, &msg, 0);
#else   /* ARCH_SCATTER_NONE */
        ret = sendto(chan, pseudo_scat, total_len, 0,
                 (struct sockaddr *)&soc_addr, sizeof(soc_addr) );
#endif  /* ARCH_SCATTER_NONE */

That apparently solved the problem.
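
For anyone who wants to compare the two code paths outside of Spread, here 
is a minimal sketch. It is not Spread's code: the function names 
(open_loopback_udp, send_scatter, send_flat) and the 1024-byte buffer are 
made up for illustration. It sends the same scattered buffers over a 
loopback UDP socket once with sendmsg() and once with sendto() after 
copying them into one flat buffer. The msghdr is zeroed before use so that 
no field (including the access-rights/ancillary fields) holds garbage; I 
have not verified whether that is related to the EINVAL above.

```c
/* Sketch only, not taken from Spread. All function names are hypothetical. */
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Open a UDP socket bound to 127.0.0.1 on an ephemeral port.
 * On success returns the descriptor and stores the bound address in *addr. */
static int open_loopback_udp(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;                       /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Scatter/gather path: a single sendmsg() over several buffers. */
static ssize_t send_scatter(int fd, const struct sockaddr_in *to,
                            struct iovec *iov, int iovcnt)
{
    struct msghdr msg;

    assert(to != NULL && iov != NULL);
    memset(&msg, 0, sizeof(msg));             /* zero every msghdr field */
    msg.msg_name = (void *)to;
    msg.msg_namelen = sizeof(*to);
    msg.msg_iov = iov;
    msg.msg_iovlen = iovcnt;
    return sendmsg(fd, &msg, 0);
}

/* Fallback path: copy the pieces into one flat buffer, then sendto(). */
static ssize_t send_flat(int fd, const struct sockaddr_in *to,
                         const struct iovec *iov, int iovcnt)
{
    char flat[1024];                          /* arbitrary size for the sketch */
    size_t total = 0;
    int i;

    assert(to != NULL && iov != NULL);
    for (i = 0; i < iovcnt; i++) {
        if (total + iov[i].iov_len > sizeof(flat))
            return -1;                        /* would not fit */
        memcpy(flat + total, iov[i].iov_base, iov[i].iov_len);
        total += iov[i].iov_len;
    }
    return sendto(fd, flat, total, 0,
                  (const struct sockaddr *)to, sizeof(*to));
}
```

Calling both functions with the same iovec array against the socket 
returned by open_loopback_udp() and reading back with recv() should yield 
identical datagrams. If send_scatter() returns -1, errno tells you whether 
it is the same EINVAL. On Solaris 10 the program needs to be linked with 
-lsocket -lnsl.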


Andres.


P.S. Why does Spread daemon send messages to itself via port 4804 anyway?



