[Spread-users] Spread daemon seems to "forget" its name

Ryan Caudy rcaudy at gmail.com
Mon Nov 8 21:43:52 EST 2004


Hi,

What OS are you using?  What kinds of things are your clients doing?
This isn't something that has turned up in ordinary testing, although
I haven't put 3.17.3 through its paces the way I have with a slightly
hacked 3.17.2, or a precursor to the current CVS head based on 3.17.2.
In order to reproduce the problem, it would help to have any
descriptive information you can think of.

Cheers,
Ryan


On Mon, 8 Nov 2004 16:51:56 +0100, Schroeder, Heiko, ADBM62
<heiko.schroeder at eads.com> wrote:
> Hi,
> 
> we just came across a problem which (I think) hints at a
> memory-management-related bug in Spread. This is
> with version 3.17.3.
> 
> We have a system of 12 hosts, each hosting several processes
> that communicate using Spread.  When switching one of the
> hosts off and on again, sometimes (in about 30-50% of all
> cases!), the whole system breaks down. At first, the crash
> was because of an "illegal private name to kill" message.
> I changed this message into a warning to see how the
> system would react and switched SESSION debugging
> on.
> 
> The following output comes from one of the hosts that
> were not switched off (the others produce output that is
> very similar):
> 
> [Mon 08 Nov 2004 13:57:18] Sess_read: queueing message of type 4 with len 0 to the protocol
> [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> [Mon 08 Nov 2004 13:57:19] Sess_read: queueing message of type 4 with len 0 to the protocol
> Membership id is ( 176161537, 1099915040)
> [Mon 08 Nov 2004 13:57:19] --------------------
> [Mon 08 Nov 2004 13:57:19] Configuration at mfc2 is:
> [Mon 08 Nov 2004 13:57:19] Num Segments 1
> [Mon 08 Nov 2004 13:57:19]      12      10.128.255.255    4803
> [Mon 08 Nov 2004 13:57:19]              mfc1                    10.128.3.1
> [Mon 08 Nov 2004 13:57:19]              mfc2                    10.128.3.2
> [Mon 08 Nov 2004 13:57:19]              mfc3                    10.128.3.3
> [Mon 08 Nov 2004 13:57:19]              mfc5                    10.128.3.5
> [Mon 08 Nov 2004 13:57:19]              mfc6                    10.128.3.6
> [Mon 08 Nov 2004 13:57:19]              siu1                    10.128.2.1
> [Mon 08 Nov 2004 13:57:19]              siu2                    10.128.2.2
> [Mon 08 Nov 2004 13:57:19]              siu3                    10.128.2.3
> [Mon 08 Nov 2004 13:57:19]              siu5                    10.128.2.5
> [Mon 08 Nov 2004 13:57:19]              gpcu1                   10.128.1.1
> [Mon 08 Nov 2004 13:57:19]              gpcu2                   10.128.1.2
> [Mon 08 Nov 2004 13:57:19]              gpcu3                   10.128.1.3
> [Mon 08 Nov 2004 13:57:19] ====================
> [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3636 ( mailbox 24 )
> [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3669 ( mailbox 27 )
> [Mon 08 Nov 2004 13:57:19] Sess_read: Message has type field 0x80000084
> [Mon 08 Nov 2004 13:57:19] Sess_validate_read_header: proc name mfc2 is not my name
> [Mon 08 Nov 2004 13:57:19] Sess_kill: killing session P3637 ( mailbox 22 )
> [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P3636#
> [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P3669#
> [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P1274#
> [Mon 08 Nov 2004 13:57:19] Sess_handle_kill: Illegal private name to kill #P2135#
> 
> Just before the new configuration message is output, everything seems
> to be fine. But after this, "My.name" is suddenly empty. All of the 11
> "remaining" hosts showed the same problem; the one that "came back"
> did not (might be by chance, though).
> 
> I'll try to investigate this further, but I'd be very happy if someone
> who really understands the code could help here... ;-)
> 
> CU
> 
>    Heiko
> 
> --
> Heiko Schröder
> EADS Deutschland GmbH
> Defence and Communication Systems
> Naval Combat Systems (ADBM62)
> Bontekai 55
> 26382 Wilhelmshaven - Germany
> Tel: +49 44 21.15 43-230
> Fax: +49 44 21.15 43-111
> e-Fax: +49 731.392-20 91 11
> heiko.schroeder at eads.com
> 
> www.eads.com
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users
> 
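
When comparing the SESSION debug output across the 11 remaining hosts, it may help to tally the failure events instead of reading each log by eye. The following is only a rough sketch: the regular expressions are guesses based on the log excerpt quoted above, not taken from the Spread source, and the function name `summarize` is made up for illustration. It accepts both raw daemon output and mail-quoted, line-wrapped text.

```python
import re
from collections import Counter

# Patterns inferred from the quoted log excerpt (assumptions, not Spread's API).
NAME_MISMATCH = re.compile(
    r"Sess_validate_read_header: proc name (\S+) is not\s+my name")
SESSION_KILL = re.compile(
    r"Sess_kill: killing session (\S+) \( mailbox (\d+) \)")
ILLEGAL_KILL = re.compile(
    r"Sess_handle_kill: Illegal private name to kill\s+#(\S+?)#")


def summarize(log_text):
    """Count name mismatches and list killed sessions found in log_text."""
    # Drop mail quote markers, then flatten so wrapped entries match again.
    unquoted = [re.sub(r"^>\s?", "", line) for line in log_text.splitlines()]
    flat = " ".join(line.strip() for line in unquoted)
    return {
        # proc name reported as "not my name" -> number of occurrences
        "name_mismatches": Counter(NAME_MISMATCH.findall(flat)),
        # (private name, mailbox) pairs for sessions the daemon killed
        "killed_sessions": SESSION_KILL.findall(flat),
        # private names rejected by Sess_handle_kill
        "illegal_kills": ILLEGAL_KILL.findall(flat),
    }
```

Running `summarize(open(path).read())` on each host's log (with `path` being wherever that host's output was saved) gives one small dictionary per host, which makes it easier to see whether every host rejects its own name at the same membership change.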


-- 
---------------------------------------------------------------------
Ryan W. Caudy
<rcaudy at gmail.com>
---------------------------------------------------------------------
Bloomberg L.P.
<rcaudy1 at bloomberg.net>
---------------------------------------------------------------------
[Alumnus]
<caudy at cnds.jhu.edu>         
Center for Networking and Distributed Systems
Department of Computer Science
Johns Hopkins University          
---------------------------------------------------------------------



