[Spread-users] Problem with "name not unique" on Solaris 10
Andres Heinloo
andres at gfz-potsdam.de
Wed Jul 29 14:30:14 EDT 2009
On Tue, 21 Jul 2009, Andres Heinloo wrote:
>
>
> On Fri, 17 Jul 2009, Daniel F. Savarese wrote:
>
> >
> > In message <Pine.LNX.4.64.0907171604170.5907 at st55.gfz-potsdam.de>, Andres Heinloo writes:
> > >User>j PICK
> > >
> > >User>l PICK
> > >
> > >User>q
> >
> > If that's a complete log, then you're missing the receipt of two
> > group membership messages. Something like the following after the
> > join:
> >
> > ============================
> > Received REGULAR membership for group PICK with 1 members, where I am member 0:
> > #tttt#localhost
> > grp id is -1062731517 1247454353 1
> > Due to the JOIN of #tttt#localhost
> >
> > And something like the following after the leave:
> >
> > ============================
> > received membership message that left group PICK
> >
> > That may be why the connection name isn't being freed up immediately
> > (there are pending messages, so the Spread daemon may wait before
> > killing the connection's queue of messages). Try waiting for the
> > membership messages to arrive before issuing the quit command and see
> > how long that takes. That may help the Spread team tell you what's
> > going on. I wasn't able to reproduce your problem, but I'm running
> > a patched version of Spread with some custom modifications.
>
> Indeed the membership messages are missing. It takes several minutes until
> the membership message arrives. However, once the message arrives, Spread
> starts to work normally.
>
> I think it would not make sense to keep the message queue after a TCP
> disconnect, especially if the client cannot reconnect anyway (name not
> unique).
>
> Looks like the Spread daemon is blocked somewhere at startup; however, it
> still accepts and handles TCP connections.
>
> As I said, the problem does *not* occur on Linux. On Solaris 10, it seems
> that Spread works better when compiled with SunStudio rather than gcc (I
> never got membership message when compiled with gcc, but maybe I did not
> wait long enough).
>
> Any help would be very much appreciated.
Looks like I solved the problem.

Running Spread with debugging enabled, I noticed repeated messages like this:
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: delaying after failure in send to 127.0.0.1, ret is -1
[Wed 29 Jul 2009 19:52:19] DL_send: element[0]: 32 bytes
[Wed 29 Jul 2009 19:52:19] DL_send: error: Invalid argument
sending 32 bytes on channel 5 to address 127.0.0.1
[Wed 29 Jul 2009 19:52:19] DL_send: sent a message of -1 bytes to (127.0.0.1,4804) on channel 5
truss shows:
sendmsg(5, 0x004CD120, 0) Err#22 EINVAL
So I manually changed 'ARCH_SCATTER_ACCRIGHTS 1' to 'ARCH_SCATTER_NONE 1'
in config.h, so that sendto() is used instead of sendmsg() in
data_link.c:
#ifndef ARCH_SCATTER_NONE
    ret = sendmsg(chan, &msg, 0);
#else /* ARCH_SCATTER_NONE */
    ret = sendto(chan, pseudo_scat, total_len, 0,
                 (struct sockaddr *)&soc_addr, sizeof(soc_addr) );
#endif /* ARCH_SCATTER_NONE */
That apparently solved the problem.
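
For anyone who wants to compare the two code paths outside of Spread, here is a minimal sketch of the difference: one path hands a scatter list to sendmsg(), the other copies the elements into a single buffer and calls sendto(). The function names (flatten_scatter, send_gather, send_flat) and the 1500-byte buffer size are made up for illustration and are not taken from data_link.c. The sketch does not explain why sendmsg() returns EINVAL on Solaris 10; it only shows what the workaround changes.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/* Copy every element of a scatter list into one contiguous buffer.
 * Returns the total number of bytes copied, or -1 if they do not fit.
 * (Hypothetical helper, not part of Spread.) */
long flatten_scatter(const struct iovec *iov, int iovcnt,
                     char *buf, size_t buf_len)
{
    size_t total = 0;
    int i;

    for (i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len > buf_len - total)
            return -1;
        memcpy(buf + total, iov[i].iov_base, iov[i].iov_len);
        total += iov[i].iov_len;
    }
    return (long)total;
}

/* Gather path: pass the scatter list to the kernel in one sendmsg() call. */
ssize_t send_gather(int chan, struct iovec *iov, int iovcnt,
                    struct sockaddr_in *to)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));   /* leave no field uninitialized */
    msg.msg_name = to;
    msg.msg_namelen = sizeof(*to);
    msg.msg_iov = iov;
    msg.msg_iovlen = iovcnt;
    return sendmsg(chan, &msg, 0);
}

/* Flat path: copy first, then send the single buffer with sendto(). */
ssize_t send_flat(int chan, const struct iovec *iov, int iovcnt,
                  struct sockaddr_in *to)
{
    char buf[1500];                 /* assumed maximum datagram size */
    long total = flatten_scatter(iov, iovcnt, buf, sizeof(buf));

    if (total < 0)
        return -1;
    return sendto(chan, buf, (size_t)total, 0,
                  (struct sockaddr *)to, sizeof(*to));
}
```

Calling send_flat() with a one-element list of 32 bytes would correspond to the failing 32-byte send in the log above, but routed through sendto(). The flat path costs an extra copy per message.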
Andres.
P.S. Why does the Spread daemon send messages to itself via port 4804 anyway?