Re: [Spread-users] Spread daemon seems to "forget" its name

Ryan Caudy rcaudy at gmail.com
Tue Nov 9 20:04:35 EST 2004


Hi,

Temp_buf is misused, somewhat, in G_compute_and_notify, when the
vs_sets are built there -- the size isn't checked properly.  This is a
problem that we know about, and need to fix.

I see how it's being misused where you found it, too. 
G_build_groups_bufs, the routine that builds the messages unpacked by
G_mess_to_groups, takes great care to keep the size under 100000. 
Unfortunately, it doesn't take into account the message_header
structure (which should be 48 bytes).  In an application like yours,
there certainly is the potential to trigger this bug.
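
To make the size accounting concrete, here is a minimal sketch of the
check as I understand it. It is not the actual Spread code: the
100000-byte limit and the 48-byte header are the figures mentioned
above, and the names GROUPS_BUF_SIZE, HEADER_SIZE, body_fits_unchecked,
and body_fits_checked are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, taken from the discussion above rather than from the
 * Spread sources: the buffer limit and the message_header size. */
#define GROUPS_BUF_SIZE 100000u
#define HEADER_SIZE     48u

/* Mirrors the flawed accounting: only the body is compared to the limit,
 * so a body close to the limit leaves no room for the header. */
static int body_fits_unchecked(size_t body_len)
{
    return body_len <= GROUPS_BUF_SIZE;
}

/* Corrected accounting: header and body must fit in the buffer together. */
static int body_fits_checked(size_t body_len)
{
    assert(HEADER_SIZE <= GROUPS_BUF_SIZE); /* guards the subtraction below */
    return body_len <= GROUPS_BUF_SIZE - HEADER_SIZE;
}
```

With these assumed numbers, a 99990-byte body passes the first check but
fails the second, which is the window in which the overflow can occur.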

Did you confirm that this bug was the cause of your problem using a
debugger?  Was My.name immediately following Temp_buf in memory?
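
If a debugger isn't handy, the adjacency question can be answered with
a small helper like the one below. This is only a sketch: the function
name follows_closely and its slack parameter are hypothetical, and it
would have to be called from inside the daemon with the real Temp_buf
and My objects, whose exact declarations I am not reproducing here.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the object at `other` starts at most `slack` bytes past
 * the end of the `len`-byte buffer at `buf`, and 0 otherwise.
 * Addresses are compared as integers, so the two objects do not need
 * to belong to the same array. */
static int follows_closely(const void *buf, size_t len,
                           const void *other, size_t slack)
{
    uintptr_t end   = (uintptr_t)buf + len;
    uintptr_t start = (uintptr_t)other;

    return start >= end && start - end <= slack;
}
```

A call such as follows_closely(Temp_buf, sizeof Temp_buf, &My, 64)
would report whether My sits within 64 bytes after the buffer, assuming
Temp_buf is declared as an array; printing the two addresses in a
debugger gives the same information.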

I've attached a patch against the 3.17 branch of CVS which should fix
this bug, although I haven't tested it.  Please let me know if it
solves your problem.

Cheers,
Ryan

On Tue, 9 Nov 2004 12:09:28 +0100, Schroeder, Heiko, ADBM62
<heiko.schroeder at eads.com> wrote:
> Hi,
> 
> I think I found the problem:
> Temp_buf (in sess_body.h) seems to be too small in our
> case; it overflows in G_mess_to_groups. And the linker
> chose to place My after this buffer.
> 
> I don't fully understand the code yet, but shouldn't
> this buffer be able to hold at least MAX_MESSAGE_BODY_LEN
> bytes, which would be about 144k?
> 
> Anyway, increasing the buffer to this size solved
> our problem here.
> 
> 
> 
> CU
> 
>    Heiko
> 
> > -----Original Message-----
> > From: Ryan Caudy [mailto:rcaudy at gmail.com]
> > Sent: Tuesday, 9 November 2004 03:44
> > To: Schroeder, Heiko, ADBM62
> > Cc: spread-users at lists.spread.org
> > Subject: Re: [Spread-users] Spread daemon seems to "forget" its name
> >
> > Hi,
> >
> > What OS are you using?  What kinds of things are your clients doing?
> > This isn't something that has turned up in ordinary testing, although
> > I haven't put 3.17.3 through its paces the way I have with a slightly
> > hacked 3.17.2, or a precursor to the current CVS head based on 3.17.2.
> >  In order to reproduce the problem, it would help to know any
> > descriptive information you can think of.
> >
> > Cheers,
> > Ryan
> >
> >
> > On Mon, 8 Nov 2004 16:51:56 +0100, Schroeder, Heiko, ADBM62
> > <heiko.schroeder at eads.com> wrote:
> > > Hi,
> > >
> > > we just came across a problem which (I think) hints at a
> > > memory-management-related bug in Spread. This is
> > > with version 3.17.3.
> > >
> > > We have a system of 12 hosts that host several processes each
> > > that communicate using Spread.  When switching one of the
> > > hosts off and on again, sometimes (in about 30-50% of all
> > > cases!), the whole system breaks down. At first, the crash
> > > was because of an "illegal private name to kill" message.
> > > I changed this message into a warning to see how the
> > > system would react and switched SESSION debugging
> > > on.
> > >
> > > The following output comes from one of the hosts that
> > > were not switched off (the others produce output that is
> > > very similar):
> > >
> > > [Mon 08 Nov 2004 13:57:18] Sess_read: queueing message of type 4 with len 0 to the protocol
> > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > [Mon 08 Nov 2004 13:57:19] Sess_read: queueing message of type 4 with len 0 to the protocol
> > > Membership id is ( 176161537, 1099915040)
> > > [Mon 08 Nov 2004 13:57:19] --------------------
> > > [Mon 08 Nov 2004 13:57:19] Configuration at mfc2 is:
> > > [Mon 08 Nov 2004 13:57:19] Num Segments 1
> > > [Mon 08 Nov 2004 13:57:19]      12      10.128.255.255    4803
> > > [Mon 08 Nov 2004 13:57:19]              mfc1      10.128.3.1
> > > [Mon 08 Nov 2004 13:57:19]              mfc2      10.128.3.2
> > > [Mon 08 Nov 2004 13:57:19]              mfc3      10.128.3.3
> > > [Mon 08 Nov 2004 13:57:19]              mfc5      10.128.3.5
> > > [Mon 08 Nov 2004 13:57:19]              mfc6      10.128.3.6
> > > [Mon 08 Nov 2004 13:57:19]              siu1      10.128.2.1
> > > [Mon 08 Nov 2004 13:57:19]              siu2      10.128.2.2
> > > [Mon 08 Nov 2004 13:57:19]              siu3      10.128.2.3
> > > [Mon 08 Nov 2004 13:57:19]              siu5      10.128.2.5
> > > [Mon 08 Nov 2004 13:57:19]              gpcu1     10.128.1.1
> > > [Mon 08 Nov 2004 13:57:19]              gpcu2     10.128.1.2
> > > [Mon 08 Nov 2004 13:57:19]              gpcu3     10.128.1.3
> > > [Mon 08 Nov 2004 13:57:19] ====================
> > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> > > [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3636 ( mailbox 24 )
> > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> > > [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3669 ( mailbox 27 )
> > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> > > [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3637 ( mailbox 22 )
> > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P3636#
> > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P3669#
> > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P1274#
> > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P2135#
> > >
> > > Just before the new configuration message is output, everything seems
> > > to be fine. But after this, "My.name" is suddenly empty. All of the 11
> > > "remaining" hosts showed the same problem; the one that "came back"
> > > did not (might be by chance, though).
> > >
> > > I'll try to investigate this further, but I'd be very happy if someone
> > > who really understands the code could help here... ;-)
> > >
> > > CU
> > >
> > >    Heiko
> > >
> > > --
> > > Heiko Schröder
> > > EADS Deutschland GmbH
> > > Defence and Communication Systems
> > > Naval Combat Systems (ADBM62)
> > > Bontekai 55
> > > 26382 Wilhelmshaven - Germany
> > > Tel: +49 44 21.15 43-230
> > > Fax: +49 44 21.15 43-111
> > > e-Fax: +49 731.392-20 91 11
> > > heiko.schroeder at eads.com
> > >
> > > www.eads.com
> > >
> > > _______________________________________________
> > > Spread-users mailing list
> > > Spread-users at lists.spread.org
> > > http://lists.spread.org/mailman/listinfo/spread-users
> > >
> >
> >
> > --
> > ---------------------------------------------------------------------
> > Ryan W. Caudy
> > <rcaudy at gmail.com>
> > ---------------------------------------------------------------------
> > Bloomberg L.P.
> > <rcaudy1 at bloomberg.net>
> > ---------------------------------------------------------------------
> > [Alumnus]
> > <caudy at cnds.jhu.edu>
> > Center for Networking and Distributed Systems
> > Department of Computer Science
> > Johns Hopkins University
> > ---------------------------------------------------------------------
> >
> >
> 


-------------- next part --------------
A non-text attachment was scrubbed...
Name: groups.c.patch
Type: application/octet-stream
Size: 778 bytes
Desc: not available
Url : http://lists.spread.org/pipermail/spread-users/attachments/20041109/76286f92/attachment.obj 

