[Spread-users] Spread crashes and instabilities
Jared Go
jared at hobnob.com
Mon Sep 19 00:58:22 EDT 2005
Hello everyone,
I'm currently working on a project that uses Spread for communication
amongst 15-20 machines and have been running into some instabilities.
Our systems are all running OpenBSD 3.7 with the OpenBSD spread package,
which is version 3.17.03.
Our configuration places one machine in each segment, which is
undesirable but currently necessary given the topology of the underlying
network.
The main problem is that, occasionally, all of the Spread daemons crash
at once. We've observed this happening as often as once every two or
three days and as infrequently as once over the course of several weeks.
Heavier usage seems to exacerbate the problem, but we have no conclusive
evidence of this.
In the spread logs, we note two failure modes, which happen across all
machines on the network. The first is:
G_analize_groups: Gstate is 3
Exit caused by Alarm(EXIT)
and the second is:
Net_ucast_token: Token too long for packet!
Exit caused by Alarm(EXIT)
In the second scenario, I've traced the error back up and found that we
seem to have an excess of holes. The token dump right before the
Net_ucast_token error shows:
Pending Members:
Form Token reps list -- Count (10) index (1)
0: 192.168.197.20 (T 1 SegInd 2)
1: 192.168.197.15 (T 1 SegInd 4)
2: 192.168.197.34 (T 1 SegInd 6)
3: 192.168.197.246 (T 1 SegInd 7)
4: 192.168.197.248 (T 1 SegInd 8)
5: 192.168.197.14 (T 1 SegInd 9)
6: 192.168.197.11 (T 1 SegInd 10)
7: 192.168.197.29 (T 1 SegInd 11)
8: 192.168.197.19 (T 1 SegInd 12)
9: 192.168.197.250 (T 1 SegInd 13)
Form Token RING list -- Count (1)
Ring 0: MembID 192.168.197.21 - 1126633340 TransTime 0
ARU: 9 HighSeq: 1540 NumHoles: 1524
NumCommit: 1 NumTrans: 1
Message Holes: 10 11 12 13 14 15
16 17 18 19 20 21 22 23 ... (all
numbers omitted for brevity) ... 1522 1523 1524 1525 1526
1527 1528 1529 1530 1531 1532 1533
Trans List: 0: 192.168.197.20
Commit List:
====================================================
Net_ucast_token: Token too long for packet!
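As a rough check of the numbers in this dump, here is a minimal Python
sketch that parses the ring summary line and estimates how much space
the hole list would need. The 4 bytes per hole and the 1500-byte packet
budget are assumptions for illustration only; they are not taken from
the Spread source, and the helper names are hypothetical.

```python
import re

# Ring summary line copied from the token dump above.
RING_LINE = "ARU: 9 HighSeq: 1540 NumHoles: 1524"

# Assumptions for illustration only, NOT taken from the Spread source:
ASSUMED_BYTES_PER_HOLE = 4      # one 32-bit sequence number per hole
ASSUMED_PACKET_BUDGET = 1500    # roughly one Ethernet MTU


def parse_ring_line(line):
    """Return (aru, high_seq, num_holes) from a ring summary line."""
    m = re.search(r"ARU:\s*(\d+)\s+HighSeq:\s*(\d+)\s+NumHoles:\s*(\d+)", line)
    if m is None:
        raise ValueError("not a ring summary line: %r" % line)
    return tuple(int(g) for g in m.groups())


def hole_stats(aru, high_seq, num_holes):
    """Summarise the gap between the ARU and the highest sequence seen."""
    span = high_seq - aru                # sequence numbers above the ARU
    received = span - num_holes          # of those, how many actually arrived
    hole_bytes = num_holes * ASSUMED_BYTES_PER_HOLE
    return {
        "span": span,
        "received": received,
        "hole_bytes": hole_bytes,
        "fits": hole_bytes <= ASSUMED_PACKET_BUDGET,
    }


stats = hole_stats(*parse_ring_line(RING_LINE))
print(stats)
```

Under those assumptions the hole list alone is several times larger than
a single packet, which would be consistent with the "Token too long for
packet!" alarm.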
It looks as if the top few messages arrived with sequence numbers much
higher than anything we had seen previously, which produces a list of
holes too large to fit in the token packet. I've noticed on one machine
that the log showed two rings in the dump:
Form Token Membership ID 0.2.0.2 : -1062681315
Form Token RING list -- Count (2)
Ring 0: MembID 192.168.197.21 - 1126632477 TransTime 0
ARU: 1531 HighSeq: 1533 NumHoles: 0
NumCommit: 1 NumTrans: 1
-- snipped --
Ring 1: MembID 192.168.197.21 - 1126633340 TransTime 0
ARU: 9 HighSeq: 1540 NumHoles: 1524
NumCommit: 1 NumTrans: 1
Here, it looks like the last seven messages were supposed to go to ring
0 but were instead sent to ring 1, which caused the large number of
holes to appear, and thus the crash. Should this machine even appear on
two rings given that there is only one machine in each segment?
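The "last seven messages" reading can be checked against the two ring
summaries with a short Python sketch. It assumes the holes on ring 1 are
contiguous starting at ARU + 1, which matches the hole list in the dump
(10 through 1533); the Ring class and helper name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    aru: int        # highest sequence number received with no gaps
    high_seq: int   # highest sequence number seen at all
    num_holes: int  # missing sequence numbers between the two


# Values copied from the two-ring dump above.
ring0 = Ring(aru=1531, high_seq=1533, num_holes=0)
ring1 = Ring(aru=9, high_seq=1540, num_holes=1524)


def received_above_aru(ring):
    """Sequence numbers above the ARU that did arrive.

    Assumes the holes form one contiguous run starting at ARU + 1.
    """
    first_received = ring.aru + ring.num_holes + 1
    return list(range(first_received, ring.high_seq + 1))


stray = received_above_aru(ring1)

# Do the messages seen on ring 1 pick up exactly where ring 0 left off?
continues_ring0 = stray[0] == ring0.high_seq + 1

print(len(stray), stray[0], stray[-1], continues_ring0)
```

The messages that did arrive on ring 1 start at the sequence number
immediately after ring 0's HighSeq, which fits the idea that they were
numbered as a continuation of ring 0.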
Does anyone have any insights or ideas as to why either of these
problems is occurring, or how we can solve them? Any help would be
greatly appreciated.
Thanks!
-Jared Go
jared at hobnob.com