Re: Re: [Spread-users] Spread daemon seems to "forget" its name

Schroeder, Heiko, ADBM62 heiko.schroeder at eads.com
Wed Nov 10 03:19:53 EST 2004


Hi,

Yes, I checked with objdump that My (and thus name, as
its first member) does indeed immediately follow
Temp_buf. To confirm this, I added a check in
G_mess_to_groups, which showed that the buffer did overflow.
So this was definitely the cause of the problem.
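
A check of that kind could look roughly like the sketch below. It is only an illustration, not the code actually added to the daemon: demo_temp_buf, DEMO_BUF_SIZE and demo_checked_copy are hypothetical names, and the 100000-byte size is an assumption taken from the limit mentioned in the quoted message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Temp_buf; the size is an assumption. */
#define DEMO_BUF_SIZE 100000

static char demo_temp_buf[DEMO_BUF_SIZE];

/*
 * Copy len bytes from src into demo_temp_buf at the given offset.
 * Returns 0 on success, or -1 if the write would run past the end
 * of the buffer (and thus into whatever the linker placed after it).
 */
static int demo_checked_copy(size_t offset, const void *src, size_t len)
{
    assert(src != NULL);

    /* Written this way so that offset + len cannot wrap around. */
    if (offset > sizeof(demo_temp_buf) ||
        len > sizeof(demo_temp_buf) - offset) {
        return -1;
    }

    memcpy(demo_temp_buf + offset, src, len);
    return 0;
}
```

Logging the offending offset and length where the function returns -1 is enough to see whether the buffer is really being overrun.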

CU

  Heiko

> -----Original Message-----
> From: Ryan Caudy [mailto:rcaudy at gmail.com]
> Sent: Wednesday, 10 November 2004 02:05
> To: Schroeder, Heiko, ADBM62
> Cc: spread-users at lists.spread.org
> Subject: Re: Re: [Spread-users] Spread daemon seems to "forget" its name
> 
> Hi,
> 
> Temp_buf is misused, somewhat, in G_compute_and_notify, when the
> vs_sets are built there -- the size isn't checked properly.  This is a
> problem that we know about, and need to fix.
> 
> I see how it's being misused where you found it, too. 
> G_build_groups_bufs, the routine that builds the messages unpacked by
> G_mess_to_groups, takes great care to keep the size under 100000. 
> Unfortunately, it doesn't take into account the message_header
> structure (which should be 48 bytes).  In an application like yours,
> there certainly is the potential to trigger this bug.
> 
> Did you confirm that this bug was the cause of your problem using a
> debugger?  Was My.name immediately following Temp_buf in memory?
> 
> I've attached a patch against the 3.17 branch of CVS which should fix
> this bug, although I haven't tested it.  Please let me know if it
> solves your problem.
> 
> Cheers,
> Ryan
> 
> On Tue, 9 Nov 2004 12:09:28 +0100, Schroeder, Heiko, ADBM62
> <heiko.schroeder at eads.com> wrote:
> > Hi,
> > 
> > I think I found the problem:
> > Temp_buf (in sess_body.h) seems to be too small in our
> > case; it overflows in G_mess_to_groups. And the linker
> > chose to place My right after this buffer.
> > 
> > I don't fully understand the code yet, but shouldn't
> > this buffer be able to hold at least MAX_MESSAGE_BODY_LEN
> > bytes, which would be about 144k?
> > 
> > Anyway, increasing the buffer to this size solved
> > our problem here.
> > 
> > 
> > 
> > CU
> > 
> >    Heiko
> > 
> > > -----Original Message-----
> > > From: Ryan Caudy [mailto:rcaudy at gmail.com]
> > > Sent: Tuesday, 9 November 2004 03:44
> > > To: Schroeder, Heiko, ADBM62
> > > Cc: spread-users at lists.spread.org
> > > Subject: Re: [Spread-users] Spread daemon seems to "forget" its name
> > >
> > > Hi,
> > >
> > > What OS are you using?  What kinds of things are your clients doing?
> > > This isn't something that has turned up in ordinary testing, although
> > > I haven't put 3.17.3 through its paces the way I have with a slightly
> > > hacked 3.17.2, or a precursor to the current CVS head based on 3.17.2.
> > > In order to reproduce the problem, it would help to know any
> > > descriptive information you can think of.
> > >
> > > Cheers,
> > > Ryan
> > >
> > >
> > > On Mon, 8 Nov 2004 16:51:56 +0100, Schroeder, Heiko, ADBM62
> > > <heiko.schroeder at eads.com> wrote:
> > > > Hi,
> > > >
> > > > we just came across a problem which (I think) hints at some
> > > > memory-management-related bug in Spread. This is
> > > > with version 3.17.3.
> > > >
> > > > We have a system of 12 hosts, each of which hosts several processes
> > > > that communicate using Spread.  When switching one of the
> > > > hosts off and on again, sometimes (in about 30-50% of all
> > > > cases!), the whole system breaks down. At first, the crash
> > > > was because of an "illegal private name to kill" message.
> > > > I changed this message into a warning to see how the
> > > > system would react and switched SESSION debugging
> > > > on.
> > > >
> > > > The following output comes from one of the hosts that
> > > > were not switched off (the others produce output that is
> > > > very similar):
> > > >
> > > > [Mon 08 Nov 2004 13:57:18] Sess_read: queueing message of type 4 with len 0 to the protocol
> > > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > > [Mon 08 Nov 2004 13:57:19] Sess_read: queueing message of type 4 with len 0 to the protocol
> > > > Membership id is ( 176161537, 1099915040)
> > > > [Mon 08 Nov 2004 13:57:19] --------------------
> > > > [Mon 08 Nov 2004 13:57:19] Configuration at mfc2 is:
> > > > [Mon 08 Nov 2004 13:57:19] Num Segments 1
> > > > [Mon 08 Nov 2004 13:57:19]      12      10.128.255.255    4803
> > > > [Mon 08 Nov 2004 13:57:19]              mfc1     10.128.3.1
> > > > [Mon 08 Nov 2004 13:57:19]              mfc2     10.128.3.2
> > > > [Mon 08 Nov 2004 13:57:19]              mfc3     10.128.3.3
> > > > [Mon 08 Nov 2004 13:57:19]              mfc5     10.128.3.5
> > > > [Mon 08 Nov 2004 13:57:19]              mfc6     10.128.3.6
> > > > [Mon 08 Nov 2004 13:57:19]              siu1     10.128.2.1
> > > > [Mon 08 Nov 2004 13:57:19]              siu2     10.128.2.2
> > > > [Mon 08 Nov 2004 13:57:19]              siu3     10.128.2.3
> > > > [Mon 08 Nov 2004 13:57:19]              siu5     10.128.2.5
> > > > [Mon 08 Nov 2004 13:57:19]              gpcu1    10.128.1.1
> > > > [Mon 08 Nov 2004 13:57:19]              gpcu2    10.128.1.2
> > > > [Mon 08 Nov 2004 13:57:19]              gpcu3    10.128.1.3
> > > > [Mon 08 Nov 2004 13:57:19] ====================
> > > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > > [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> > > > [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3636 ( mailbox 24 )
> > > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > > [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> > > > [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3669 ( mailbox 27 )
> > > > [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> > > > [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> > > > [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3637 ( mailbox 22 )
> > > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P3636#
> > > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P3669#
> > > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P1274#
> > > > [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P2135#
> > > >
> > > > Just before the new configuration message is output, everything seems
> > > > to be fine. But after this, "My.name" is suddenly empty. All of the 11
> > > > "remaining" hosts showed the same problem; the one that "came back"
> > > > did not (might be by chance, though).
> > > >
> > > > I'll try to investigate this further, but I'd be very happy if someone who
> > > > really understands the code could help here... ;-)
> > > >
> > > > CU
> > > >
> > > >    Heiko
> > > >
> > > > --
> > > > Heiko Schröder
> > > > EADS Deutschland GmbH
> > > > Defence and Communication Systems
> > > > Naval Combat Systems (ADBM62)
> > > > Bontekai 55
> > > > 26382 Wilhelmshaven - Germany
> > > > Tel: +49 44 21.15 43-230
> > > > Fax: +49 44 21.15 43-111
> > > > e-Fax: +49 731.392-20 91 11
> > > > heiko.schroeder at eads.com
> > > >
> > > > www.eads.com
> > > >
> > > > _______________________________________________
> > > > Spread-users mailing list
> > > > Spread-users at lists.spread.org
> > > > http://lists.spread.org/mailman/listinfo/spread-users
> > > >
> > >
> > >
> > > --
> > > ---------------------------------------------------------------------
> > > Ryan W. Caudy
> > > <rcaudy at gmail.com>
> > > ---------------------------------------------------------------------
> > > Bloomberg L.P.
> > > <rcaudy1 at bloomberg.net>
> > > ---------------------------------------------------------------------
> > > [Alumnus]
> > > <caudy at cnds.jhu.edu>
> > > Center for Networking and Distributed Systems
> > > Department of Computer Science
> > > Johns Hopkins University
> > > ---------------------------------------------------------------------
> > >
> > >
> > 
> 
> 
> -- 
> ---------------------------------------------------------------------
> Ryan W. Caudy
> <rcaudy at gmail.com>
> ---------------------------------------------------------------------
> Bloomberg L.P.
> <rcaudy1 at bloomberg.net>
> ---------------------------------------------------------------------
> [Alumnus]
> <caudy at cnds.jhu.edu>         
> Center for Networking and Distributed Systems
> Department of Computer Science
> Johns Hopkins University          
> ---------------------------------------------------------------------
> 
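
The size accounting described in the quoted message can be sketched as follows. This is a hedged illustration only, not the attached patch: the DEMO_* constants and the demo_* functions are hypothetical, the 48-byte header size and the 100000-byte limit are the figures mentioned in the thread, and the assumption that the receiving buffer is itself 100000 bytes is mine.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BODY_LIMIT  100000 /* body limit mentioned in the thread */
#define DEMO_HEADER_SIZE 48     /* message_header size mentioned in the thread */

/* Total bytes needed when a header is stored in front of the body. */
static size_t demo_total_size(size_t body_len)
{
    return (size_t)DEMO_HEADER_SIZE + body_len;
}

/*
 * Returns 1 if header plus body fit into a buffer of buf_size bytes,
 * 0 otherwise. Limiting only body_len to buf_size is not sufficient.
 */
static int demo_fits(size_t buf_size, size_t body_len)
{
    assert(buf_size > 0);
    return demo_total_size(body_len) <= buf_size;
}
```

Under these assumptions, a body kept just under the 100000-byte limit still needs up to 48 bytes more than the buffer provides once the header is counted, which is the overrun discussed above.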



