[Spread-users] RE: Spread on Linux question

Tue Jul 12 13:52:58 EDT 2005

Hi Theo,

	I was finally able to recreate the issue.  It seems to occur when one
starts a spread daemon, connects to groups, and then brings up a separate
daemon.  I tried the patch that you suggested, and it seems to have
helped... a little bit at least :(  I'm still getting invalid memory
accesses in some other places now.  It seems that skiplist.c (at least for
the most part) is not properly cleaning up a bunch of memory.  Are the
developers aware of this?


[Tue 12 Jul 2005 13:17:18] Send_join: State is 4
[Tue 12 Jul 2005 13:17:18] Memb_handle_message: handling join message from
-1806277333, State is 4
[Tue 12 Jul 2005 13:17:19] Sess_read: Message has type field 0x80040080
[Tue 12 Jul 2005 13:17:19] Sess_read: disconnecting session r8312-13 (
mailbox 13 )
[Tue 12 Jul 2005 13:17:19] Sess_read: queueing message of type 8 with len 0
to the protocol
[Tue 12 Jul 2005 13:17:19] Send_join: State is 4
[Tue 12 Jul 2005 13:17:19] Memb_handle_message: handling join message from
-1806277333, State is 4
[Tue 12 Jul 2005 13:17:19] Memb_handle_message: handling join message from
-1710970211, State is 4
==24738==
==24738== Invalid read of size 4
==24738==    at 0x805E43E: sl_next (skiplist.c:199)
==24738==    by 0x80507D2: G_handle_kill (groups.c:1273)
==24738==    by 0x804E7BE: Sess_handle_kill (session.c:1957)
==24738==    by 0x804E67E: Sess_deliver_message (session.c:1855)
==24738==    by 0x804B396: Discard_packets (protocol.c:1183)
==24738==    by 0x804A6DB: Prot_handle_token (protocol.c:639)
==24738==    by 0x80537AC: E_handle_events (events.c:673)
==24738==    by 0x80497B0: main (spread.c:193)
==24738==  Address 0x1C953BB4 is 4 bytes inside a block of size 32 free'd
==24738==    at 0x1B903A5D: free (vg_replace_malloc.c:152)
==24738==    by 0x805EA1C: sli_remove (skiplist.c:506)
==24738==    by 0x805EB27: sl_remove_compare (skiplist.c:536)
==24738==    by 0x805E922: sl_remove (skiplist.c:475)
==24738==    by 0x8050ABD: G_handle_kill (groups.c:1294)
==24738==    by 0x804E7BE: Sess_handle_kill (session.c:1957)
==24738==    by 0x804E67E: Sess_deliver_message (session.c:1855)
==24738==    by 0x804B396: Discard_packets (protocol.c:1183)
==24738==    by 0x804A6DB: Prot_handle_token (protocol.c:639)
==24738==    by 0x80537AC: E_handle_events (events.c:673)
==24738==    by 0x80497B0: main (spread.c:193)
[Tue 12 Jul 2005 13:17:20] Memb_handle_message: handling join message from
-1710970211, State is 4
[Tue 12 Jul 2005 13:17:20] Send_join: State is 4
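
In the trace above, the freed block and the invalid read both come from
G_handle_kill (sl_remove at groups.c:1294, sl_next at groups.c:1273), which
looks like a node being removed while a loop is still walking the list.  I
have not confirmed that against the source, so treat it as an assumption.
A minimal sketch of the safe pattern, using a hypothetical plain linked list
(struct node, push, remove_all_matching and list_demo are illustration-only
names, not the Spread skiplist API):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list; a stand-in for skiplist nodes. */
struct node {
    int value;
    struct node *next;
};

static struct node *push(struct node *head, int value) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->value = value;
    n->next = head;
    return n;
}

/* Remove every node equal to `value`.  The successor is read from the node
 * BEFORE free(), so the loop never touches memory that was released. */
static struct node *remove_all_matching(struct node *head, int value,
                                        int *removed) {
    struct node **link = &head;
    *removed = 0;
    while (*link != NULL) {
        struct node *cur = *link;
        if (cur->value == value) {
            *link = cur->next;  /* unlink while cur is still valid */
            free(cur);          /* cur must not be read after this */
            (*removed)++;
        } else {
            link = &cur->next;  /* advance only past nodes we keep */
        }
    }
    return head;
}

static int length(const struct node *head) {
    int n = 0;
    for (; head != NULL; head = head->next) n++;
    return n;
}

static void free_list(struct node *head) {
    while (head != NULL) {
        struct node *next = head->next;  /* save successor before free */
        free(head);
        head = next;
    }
}

/* Builds 3 -> 1 -> 2 -> 1, removes the 1s, returns how many nodes remain. */
static int list_demo(void) {
    struct node *head = NULL;
    int removed = 0;
    head = push(head, 1);
    head = push(head, 2);
    head = push(head, 1);
    head = push(head, 3);
    head = remove_all_matching(head, 1, &removed);
    assert(removed == 2);
    int remaining = length(head);
    free_list(head);
    return remaining;
}
```

Running list_demo() under valgrind should report no invalid reads, since no
pointer is followed after its node has been freed.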


I also see an additional message from valgrind (though I'm not sure whether
this is an error):


==24738== Syscall param socketcall.sendmsg(msg.msg_iov[i] points to
uninitialised byte(s)
==24738==    at 0x1B9FFEFC: sendmsg (in /lib/tls/libc-2.3.2.so)
==24738==    by 0x8059747: Net_ucast_token (network.c:673)
==24738==    by 0x805645D: Create_form1 (membership.c:1401)
==24738==    by 0x8055790: Form_or_fail (membership.c:905)
==24738==    by 0x80533F0: E_handle_events (events.c:605)
==24738==    by 0x80497B0: main (spread.c:193)
==24738==  Address 0x52BFEA50 is on thread 1's stack
==24738==
==24738== Syscall param socketcall.sendmsg(msg.msg_iov[i] points to
uninitialised byte(s)
==24738==    at 0x1B9FFEFC: sendmsg (in /lib/tls/libc-2.3.2.so)
==24738==    by 0x8059747: Net_ucast_token (network.c:673)
==24738==    by 0x80563D3: Create_form1 (membership.c:1397)
==24738==    by 0x8055790: Form_or_fail (membership.c:905)
==24738==    by 0x80533F0: E_handle_events (events.c:605)
==24738==    by 0x80497B0: main (spread.c:193)
==24738==  Address 0x52BFEA50 is on thread 1's stack
[Mon 11 Jul 2005 18:38:53] Set Alarm mask to: 386
[Mon 11 Jul 2005 18:38:53] ENABLING Dangerous Monitor Commands! Make sure
Spread network is secured
[Mon 11 Jul 2005 18:38:53] Finished configuration file.

This seems to occur on startup for the first daemon in the group.
Unfortunately the memory errors seem to cause either random hangs or
crashes.  Has this code been revamped for version 4?


Any help is appreciated.


Thanks again,
Mayer


-----Original Message-----
From: Theo Schlossnagle [mailto:jesus at omniti.com] 
Sent: Wednesday, July 06, 2005 9:29 PM
To: Crystal, Mayer
Cc: Theo Schlossnagle; 'spread-users at lists.spread.org'
Subject: Re: [Spread-users] RE: Spread on Linux question


On Jul 6, 2005, at 8:33 PM, Crystal, Mayer wrote:

> OK, it took a little while to generate (still not sure what is the 
> root cause yet), but I ran my setup under valgrind and received the 
> following errors in the middle of execution (the daemons are still 
> running, but I have a feeling that this is not the desired behavior).  
> Sorry for the long post, but I hope more information will be helpful.  
> Has anyone seen anything like this?  Is this intended, known and/or is 
> there a patch if this is not intended?

clearly not intended.  I've learned not to argue with valgrind -- you always
lose.

It appears that the member lists are used even after they are destroyed.
this might be fixed by carefully resetting the skiplist structures to a
post-init state after calling sl_destruct.  The second error is sl_destruct
not cleaning up its endnodes.

can you reproduce the error by using spmonitor to fake a partition and then
undoing that partition?  It seems that you should be able to.

so, without trying to repeat your error, give this a whirl:

--- skiplist.c.old      2004-04-16 12:50:34.000000000 -0400
+++ skiplist.c  2005-07-06 21:25:20.000000000 -0400
@@ -552,6 +552,7 @@
      m = p;
    }
    sl->top = sl->bottom = NULL;
+  sl->topend = sl->bottomend = NULL;
    sl->height = 0;
    sl->size = 0;
}
@@ -563,6 +564,7 @@
    if(sl->index) {
      sl_remove_all(sl->index, (FreeFunc)sli_destruct_free);
      free(sl->index);
+    sl->index = NULL;
    }
    sl_remove_all(sl, myfree);
}
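
The idea behind the patch (put every pointer back to its post-init value
once the memory behind it is freed) can be sketched on a much smaller
structure.  This is only an illustration under stated assumptions: struct
intbuf and the intbuf_* functions are hypothetical names and are not part of
skiplist.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable int buffer; NOT the Spread Skiplist type. */
struct intbuf {
    int *items;
    size_t size;
};

static void intbuf_init(struct intbuf *b) {
    b->items = NULL;
    b->size = 0;
}

/* Append one value; returns 0 on success, -1 if allocation fails. */
static int intbuf_push(struct intbuf *b, int v) {
    /* realloc(NULL, n) behaves like malloc(n), so this also works on a
     * freshly initialised or freshly destroyed buffer. */
    int *grown = realloc(b->items, (b->size + 1) * sizeof *grown);
    if (grown == NULL) return -1;
    grown[b->size] = v;
    b->items = grown;
    b->size++;
    return 0;
}

/* Free the storage AND return the struct to its post-init state.  Without
 * the two resets, a second destroy would double-free and a later push would
 * realloc a dangling pointer. */
static void intbuf_destroy(struct intbuf *b) {
    free(b->items);   /* free(NULL) is a no-op, so repeat calls are safe */
    b->items = NULL;
    b->size = 0;
}

/* Destroys twice, then reuses the buffer; returns the final size. */
static size_t intbuf_demo(void) {
    struct intbuf b;
    intbuf_init(&b);
    if (intbuf_push(&b, 7) != 0) abort();
    intbuf_destroy(&b);
    intbuf_destroy(&b);               /* harmless because of the reset */
    if (intbuf_push(&b, 9) != 0) abort();
    assert(b.items[0] == 9);
    size_t n = b.size;
    intbuf_destroy(&b);
    return n;
}
```

The two assignments after free() play the same role as the added
sl->topend = sl->bottomend = NULL and sl->index = NULL lines in the patch:
any stale use then sees NULL rather than freed memory.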