[Spread-users] Spread crashes and instabilities

Jared Go jared at hobnob.com
Mon Sep 19 00:58:22 EDT 2005


Hello everyone,

I'm currently working on a project that uses Spread for communication 
amongst 15-20 machines and have been running into some instabilities.  
Our systems are all running OpenBSD 3.7 with the OpenBSD spread package, 
which is version 3.17.03.

Our configuration places one machine in each segment, which is 
undesirable but currently necessary given the topology of the underlying 
network.

The main problem is that, occasionally, all Spread daemons crash at 
once.  We've seen this happen as often as once every two or three days 
and as rarely as once in several weeks.  Heavier usage seems to 
exacerbate the problem, but we have no conclusive evidence of this.

In the Spread logs we see two failure modes, each of which shows up on 
all machines on the network.  The first is:

G_analize_groups: Gstate is 3
Exit caused by Alarm(EXIT)

and the second is:

Net_ucast_token: Token too long for packet!
Exit caused by Alarm(EXIT)

In the second scenario, I've traced the error back up and found that we 
seem to have an excess of holes.  The token dump right before the 
Net_ucast_token error shows:

Pending Members:

Form Token reps list -- Count (10) index (1)
       0: 192.168.197.20 (T 1 SegInd 2)        1: 192.168.197.15 (T 1 SegInd 4)        2: 192.168.197.34 (T 1 SegInd 6)
       3: 192.168.197.246 (T 1 SegInd 7)       4: 192.168.197.248 (T 1 SegInd 8)       5: 192.168.197.14 (T 1 SegInd 9)
       6: 192.168.197.11 (T 1 SegInd 10)       7: 192.168.197.29 (T 1 SegInd 11)       8: 192.168.197.19 (T 1 SegInd 12)
       9: 192.168.197.250 (T 1 SegInd 13)
Form Token RING list -- Count (1)
Ring 0: MembID 192.168.197.21 - 1126633340      TransTime 0
       ARU: 9  HighSeq: 1540   NumHoles: 1524
       NumCommit: 1    NumTrans: 1
       Message Holes:  10      11      12      13      14      15      16      17      18      19      20      21      22      23
       ... (all numbers omitted for brevity) ...
       1522    1523    1524    1525    1526    1527    1528    1529    1530    1531    1532    1533
       Trans List:     0: 192.168.197.20
       Commit List:
====================================================
Net_ucast_token: Token too long for packet!

It looks as if the most recent few messages arrived with sequence 
numbers much higher than anything we had seen previously, which 
produces a list of holes too large to fit in the token packet.  On one 
machine, I noticed that the log showed two rings in the dump:

Form Token Membership ID 0.2.0.2 : -1062681315
Form Token RING list -- Count (2)
Ring 0: MembID 192.168.197.21 - 1126632477      TransTime 0
       ARU: 1531       HighSeq: 1533   NumHoles: 0
       NumCommit: 1    NumTrans: 1
-- snipped --
Ring 1: MembID 192.168.197.21 - 1126633340      TransTime 0
       ARU: 9  HighSeq: 1540   NumHoles: 1524
       NumCommit: 1    NumTrans: 1

Here, it looks like the last seven messages (sequence numbers 1534 
through 1540) were supposed to go to ring 0 but were instead sent to 
ring 1, which caused the large number of holes to appear, and thus the 
crash.  Should this machine even appear on two rings given that there 
is only one machine in each segment?
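
As a sanity check on the numbers in the two-ring dump, here is a minimal 
Python sketch that parses the two ring summaries and redoes the 
arithmetic.  The names (parse_rings, HOLE_BYTES) are made up for this 
illustration, and the 4 bytes per hole is only an assumption about how a 
sequence number might be encoded in the token, not something taken from 
the Spread source.

```python
import re

# Ring summaries copied from the token dump quoted in this message.
DUMP = """\
Ring 0: MembID 192.168.197.21 - 1126632477      TransTime 0
       ARU: 1531       HighSeq: 1533   NumHoles: 0
Ring 1: MembID 192.168.197.21 - 1126633340      TransTime 0
       ARU: 9  HighSeq: 1540   NumHoles: 1524
"""

# \s+ also matches the newline between the "Ring" line and the "ARU" line.
RING_RE = re.compile(
    r"Ring (\d+): MembID \S+ - (\d+)\s+TransTime \d+\s+"
    r"ARU: (\d+)\s+HighSeq: (\d+)\s+NumHoles: (\d+)"
)


def parse_rings(text):
    """Return one dict per 'Ring N' entry found in a token dump."""
    rings = []
    for match in RING_RE.finditer(text):
        index, ring_id, aru, high_seq, num_holes = map(int, match.groups())
        rings.append({"ring": index, "id": ring_id, "aru": aru,
                      "high_seq": high_seq, "num_holes": num_holes})
    return rings


rings = parse_rings(DUMP)
old_ring, new_ring = rings

# Messages numbered above the old ring's HighSeq (1534 through 1540).
stray_messages = new_ring["high_seq"] - old_ring["high_seq"]

# Holes run from the new ring's ARU + 1 up to the old ring's HighSeq.
implied_holes = old_ring["high_seq"] - new_ring["aru"]

# ASSUMPTION: each hole costs 4 bytes in the token (not checked in the source).
HOLE_BYTES = 4
hole_list_bytes = new_ring["num_holes"] * HOLE_BYTES

print("stray messages:", stray_messages)
print("implied holes:", implied_holes, "reported:", new_ring["num_holes"])
print("hole list size under the 4-byte assumption:", hole_list_bytes)
```

The implied hole count matches the NumHoles value in the dump, which is 
consistent with the holes being exactly the gap between ring 1's ARU and 
ring 0's HighSeq.  Whether the resulting hole list exceeds the token 
packet limit depends on the actual encoding and limit in the daemon.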

Does anyone have any insights or ideas as to why either of these 
problems is occurring, or how we can solve them?  Any help would be 
greatly appreciated.

Thanks!
-Jared Go
jared at hobnob.com



