[Spread-users] regular crashes in spread 4.0.0rc2

matthew.garman at gmail.com
Thu Nov 1 10:49:41 EDT 2007


Hi,

We've seen an increasing number of crashes with our spread daemons
recently.  We're still running version 4.0.0rc2.  We haven't made
any configuration changes in quite a while.

I can't seem to pinpoint exactly what causes the crashes.

The following possible scenarios *sometimes* appear to cause a
spread daemon crash:

    - High machine (CPU) load
    - Heavy or bursty spread traffic
    - Heavy/bursty non-spread network traffic
    - Taking down the network interface of a machine in the spread
      segment (or rebooting a machine)

Other times the crashes just seem arbitrary (e.g. low CPU load, very
little spread or other net traffic, etc).

First question: can reducing the timeouts too much (in membership.c,
as suggested in section 2.4.1 of the Spread User's Guide) increase
the likelihood of actual spread crashes?

Next question: I've looked at the core files generated by the
crashes in gdb.  Usually they have no useful information.  See below
for an example of the backtrace.

Other times, I get what looks like a more useful core dump, but even
then the code is still too cryptic for me to get an idea why it
might be crashing.  See below.

Finally: any comments or suggestions are welcome.  Unfortunately I
cannot replicate these crashes with a version compiled with
debugging symbols in a test environment; they only seem to occur in
production.  So any thoughts or ideas anyone has will be useful.

Thank you,
Matt


Example 1: backtrace from a core without debugging symbols (not very useful):

$ gdb -c core.19596  /usr/local/sbin/spread
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `/usr/local/sbin/spread -c /usr/local/etc/spread.conf'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib64/tls/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/tls/libc.so.6... (no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
#0  0x000000000040772f in ?? ()
(gdb) bt
#0  0x000000000040772f in ?? ()
#1  0x000000000041c0fa in ?? ()
#2  0x000000000041c9c0 in ?? ()
#3  0x0000000000408a27 in ?? ()
#4  0x000000000040b2be in ?? ()
#5  0x0000000000406988 in ?? ()
#6  0x0000000000403132 in ?? ()
#7  0x000000000040425a in ?? ()
#8  0x000000000040d118 in ?? ()
#9  0x000000000040233d in ?? ()
#10 0x00000039b8e1c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#11 0x0000000000401d0a in ?? ()
#12 0x0000007fbffffcb8 in ?? ()
#13 0x000000000000001c in ?? ()
#14 0x0000000000000003 in ?? ()
#15 0x0000007fbffffe6f in ?? ()
#16 0x0000007fbffffe86 in ?? ()
#17 0x0000007fbffffe89 in ?? ()
#18 0x0000000000000000 in ?? ()


Example 2: more useful backtrace:

$ gdb -c core.7081 /usr/local/sbin/spread
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `/usr/local/sbin/spread -c /usr/local/etc/spread.conf'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib64/tls/libm.so.6...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
#0  0x000000000040772f in G_compare_proc_ids_by_conf (a=0x6cbec0, b=0x6cc0d0)
    at groups.c:189
189       int  ia = Conf_proc_by_id( **(const int32**) a, &dummy_proc );
(gdb) bt
#0  0x000000000040772f in G_compare_proc_ids_by_conf (a=0x6cbec0, b=0x6cc0d0)
    at groups.c:189
#1  0x000000000041c0fa in stdskl_low_insert ()
#2  0x000000000041c9c0 in stdskl_put_seq_n ()
#3  0x0000000000408a27 in G_send_heavyweight_join (grp=0x6cb9b0, joiner=0x0, 
    new_mbox=-1) at groups.c:1728
#4  0x000000000040b2be in G_handle_reg_memb (reg_memb=
            {hash_code = 48934469, num_segments = 1, num_total_procs = 4, segments = {{bcast_address = -1062683649, port = 4813, num_procs = 3, procs = {0x67d190, 0x67d204, 0x67d2ec, 0x67d2ec, 0x0 <repeats 124 times>}}, {bcast_address = 0, port = 0, num_procs = 0, procs = {0x0 <repeats 128 times>}} <repeats 19 times>}}, reg_memb_id=Variable "reg_memb_id" is not available.
) at groups.c:366
#5  0x0000000000406988 in Sess_deliver_reg_memb (reg_memb=
            {hash_code = 48934469, num_segments = 1, num_total_procs = 4, segments = {{bcast_address = -1062683649, port = 4813, num_procs = 3, procs = {0x67d190, 0x67d204, 0x67d2ec, 0x67d2ec, 0x0 <repeats 124 times>}}, {bcast_address = 0, port = 0, num_procs = 0, procs = {0x0 <repeats 128 times>}} <repeats 19 times>}}, reg_memb_id={proc_id = -1062683843, time = 1193866040}) at session.c:1914
#6  0x0000000000403132 in Discard_packets () at protocol.c:1184
#7  0x000000000040425a in Prot_handle_token (fd=Variable "fd" is not available.
) at protocol.c:646
#8  0x000000000040d118 in E_handle_events () at events.c:680
#9  0x000000000040233d in main (argc=Variable "argc" is not available.
) at spread.c:198
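
A note on reading the two examples together: frames #0 through #9 in Example 1 show the same addresses as frames #0 through #9 in Example 2. Assuming both cores were produced by the same build of the spread binary (same code layout, with or without symbols), the symbolized trace can be used to label the stripped one. The sketch below is one way to do that; the helper names `symbol_map` and `annotate` are made up for illustration and are not part of Spread or gdb, and only the first four frames are pasted in to keep it short.

```python
import re

# Matches the start of a gdb frame line, e.g.
# "#1  0x000000000041c0fa in stdskl_low_insert ()"
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+)")


def symbol_map(symbolized_bt):
    """Map each frame address in a symbolized backtrace to its function name."""
    symbols = {}
    for line in symbolized_bt.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(3) != "??":
            symbols[int(match.group(2), 16)] = match.group(3)
    return symbols


def annotate(stripped_bt, symbols):
    """Return (frame number, address, function name or None) per frame."""
    frames = []
    for line in stripped_bt.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            address = int(match.group(2), 16)
            frames.append((int(match.group(1)), address, symbols.get(address)))
    return frames


# First four frames of Example 2 (symbolized), copied from the gdb output.
SYMBOLIZED = """\
#0  0x000000000040772f in G_compare_proc_ids_by_conf (a=0x6cbec0, b=0x6cc0d0)
    at groups.c:189
#1  0x000000000041c0fa in stdskl_low_insert ()
#2  0x000000000041c9c0 in stdskl_put_seq_n ()
#3  0x0000000000408a27 in G_send_heavyweight_join (grp=0x6cb9b0, joiner=0x0,
    new_mbox=-1) at groups.c:1728
"""

# First four frames of Example 1 (no symbols).
STRIPPED = """\
#0  0x000000000040772f in ?? ()
#1  0x000000000041c0fa in ?? ()
#2  0x000000000041c9c0 in ?? ()
#3  0x0000000000408a27 in ?? ()
"""

if __name__ == "__main__":
    # Print the stripped trace with names filled in wherever an address matches.
    for number, address, name in annotate(STRIPPED, symbol_map(SYMBOLIZED)):
        print(f"#{number}  {address:#018x} in {name or '??'} ()")
```

If the same-build assumption holds, both cores end in G_compare_proc_ids_by_conf at groups.c:189, reached through G_send_heavyweight_join, which would suggest a single recurring fault rather than several unrelated ones. If the binaries differ, matching addresses mean nothing and the mapping should be ignored.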





