[Spread-users] Delayed membership establishment issue

Scott Barvick sbarvick at revasystems.com
Tue May 16 18:09:35 EDT 2006


I've been looking into an issue I'm seeing, and I think it is time to
ask for help.  

I have Spread 3.17.3 running on 3 or 4 hosts with a small number of
groups and a fair bit of traffic among them.   The hosts are multihomed
and have configurations that look like:

Spread_Segment 239.255.3.62:4803 {
     tap2     172.16.3.65 {
     172.17.188.2
     172.16.3.65
     C 127.0.0.1 }
     tap3     172.16.3.66 {
     172.17.188.3
     172.16.3.66
     C 127.0.0.1 }
     tap4     172.16.3.62 {
     172.17.188.4
     172.16.3.62
     C 127.0.0.1 }
}

Steady state communication among the three of them works very well, but
we have noticed that with 3 nodes, a restart of the third node can take
a long time for the clients to see membership messages and even (I
think) get connected to the local daemon.  I've been looking at a bunch
of debug output and have found the following difference between a 'good'
run (quick connect) and a 'delayed' one where the connection can be
delayed for minutes sometimes but then it kicks in and all is fine:

*Good*

================
... banner ...
| Version 3.17.03 Built 15/October/2004
|
\===========================================================================/
Conf_init: using file: /var/reva/etc/spread.conf
Successfully configured Segment 0 [239.255.3.62:4803] with 4 procs:
                        tap1: 172.16.3.64
                        tap2: 172.16.3.65
                        tap3: 172.16.3.66
                        tap4: 172.16.3.62
ENABLING Dangerous Monitor Commands! Make sure Spread network is secured
Setting SO_REUSEADDR to always on -- make sure Spread daemon host is
secured!
E_init: went ok
DL_init_channel: bind for recv_channel for port 4803 with chan 6 ok
DL_init_channel: Joining multicast address 239.255.3.62 went ok
DL_init_channel: went ok on channel 6
DL_init_channel: bind for recv_channel for port 4803 with chan 8 ok
DL_init_channel: Joining multicast address 239.255.3.62 went ok
Moving process spread (pid:20378) to Fifo scheduling with priority:4
DL_init_channel: went ok on channel 8
DL_init_channel: bind for recv_channel for port 4804 with chan 9 ok
DL_init_channel: setsockopt for recv and broadcast went ok
DL_init_channel: went ok on channel 9
DL_init_channel: bind for recv_channel for port 4803 with chan 10 ok
DL_init_channel: Joining multicast address 239.255.3.62 went ok
DL_init_channel: went ok on channel 10
DL_init_channel: bind for recv_channel for port 4804 with chan 11 ok
DL_init_channel: setsockopt for recv and broadcast went ok
DL_init_channel: went ok on channel 11
DL_init_channel: setsockopt for send and broadcast went ok
DL_init_channel: setting Mcast TTL to 1
DL_init_channel: went ok on channel 12
E_attach_fd: fd 6, fd_type 0, code 0, data 0x0, priority 2
Active_priority 0
E_attach_fd: fd 8, fd_type 0, code 0, data 0x0, priority 2
Active_priority 0
E_attach_fd: fd 10, fd_type 0, code 0, data 0x0, priority 2
Active_priority 0
E_attach_fd: fd 9, fd_type 0, code 0, data 0x0, priority 1
Active_priority 0
E_attach_fd: fd 11, fd_type 0, code 0, data 0x0, priority 1
Active_priority 0
E_queue: (only) event queued func 0x80566a0 code 0 data 0x0 in future
(0:0)
E_queue: (last) event queued func 0x8056410 code 0 data 0x0 in future
(1:0)
E_dequeue: no such event
E_dequeue: no such event
E_dequeue: no such event
E_dequeue: no such event
E_dequeue: no such event
Sess_init: INET bind for port 4803 interface 172.17.188.3 ok
Sess_init: INET went ok on mailbox 13
Sess_init: INET bind for port 4803 interface 172.16.3.66 ok
Sess_init: INET went ok on mailbox 14
Sess_init: INET bind for port 4803 interface 127.0.0.1 ok
Sess_init: INET went ok on mailbox 15
Sess_init: UNIX bind for name /tmp/4803 ok
Sess_init: UNIX went ok on mailbox 16
E_attach_fd: fd 13, fd_type 0, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 13, fd_type 2, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 14, fd_type 0, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 14, fd_type 2, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 15, fd_type 0, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 15, fd_type 2, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 16, fd_type 0, code 1, data 0x0, priority 0
Active_priority 0
Sess_init: ended ok
DL_init_channel: setsockopt for send and broadcast went ok
DL_init_channel: setting Mcast TTL to 1
DL_init_channel: went ok on channel 17
E_handle_events: next event 
E_sub_time: negative time result.
E_handle_events: poll select
E_handle_events: select with timeout (0, 0)
E_handle_events: next event 
E_handle_events: exec time event 
DL_send: sent a message of 40 bytes to (239.255.3.62,4803) on channel 12
E_queue: (last) event queued func 0x80566a0 code 0 data 0x0 in future
(1:0)
E_handle_events: poll select
E_handle_events: exec handler for fd 6, fd_type 0, priority 2
DL_recv: received 40 bytes on channel 6
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: select with timeout (0, 999247)
E_handle_events: exec handler for fd 6, fd_type 0, priority 2
DL_recv: received 60 bytes on channel 6
E_dequeue: event dequeued func 0x80566a0 code 0 data 0x0
E_dequeue: first event dequeued func 0x8056410 code 0 data 0x0
E_queue: (only) event queued func 0x8056390 code 0 data 0x0 in future
(1:500000)
DL_send: sent a message of 40 bytes to (239.255.3.62,4803) on channel 12
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: exec handler for fd 6, fd_type 0, priority 2
DL_recv: received 44 bytes on channel 6
E_queue: dequeued a (first) simillar event
E_queue: (only) event queued func 0x8056390 code 0 data 0x0 in future
(1:500000)
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: exec handler for fd 6, fd_type 0, priority 2
DL_recv: received 40 bytes on channel 6
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: select with timeout (1, 499966)
Moving process tmServer (pid:20397) to Fifo scheduling with priority:2
E_handle_events: exec handler for fd 11, fd_type 0, priority 1
DL_recv: received 788 bytes on channel 11


*delayed*

| Version 3.17.03 Built 15/October/2004
|
\===========================================================================/
Conf_init: using file: /var/reva/etc/spread.conf
Successfully configured Segment 0 [239.255.3.62:4803] with 4 procs:
                        tap1: 172.16.3.64
                        tap2: 172.16.3.65
                        tap3: 172.16.3.66
                        tap4: 172.16.3.62
ENABLING Dangerous Monitor Commands! Make sure Spread network is secured
Setting SO_REUSEADDR to always on -- make sure Spread daemon host is
secured!
E_init: went ok
DL_init_channel: bind for recv_channel for port 4803 with chan 6 ok
DL_init_channel: Joining multicast address 239.255.3.62 went ok
DL_init_channel: went ok on channel 6
DL_init_channel: bind for recv_channel for port 4803 with chan 8 ok
Moving process spread (pid:15202) to Fifo scheduling with priority:4
DL_init_channel: Joining multicast address 239.255.3.62 went ok
DL_init_channel: went ok on channel 8
DL_init_channel: bind for recv_channel for port 4804 with chan 9 ok
DL_init_channel: setsockopt for recv and broadcast went ok
DL_init_channel: went ok on channel 9
DL_init_channel: bind for recv_channel for port 4803 with chan 10 ok
DL_init_channel: Joining multicast address 239.255.3.62 went ok
DL_init_channel: went ok on channel 10
DL_init_channel: bind for recv_channel for port 4804 with chan 11 ok
DL_init_channel: setsockopt for recv and broadcast went ok
DL_init_channel: went ok on channel 11
DL_init_channel: setsockopt for send and broadcast went ok
DL_init_channel: setting Mcast TTL to 1
DL_init_channel: went ok on channel 12
E_attach_fd: fd 6, fd_type 0, code 0, data 0x0, priority 2
Active_priority 0
E_attach_fd: fd 8, fd_type 0, code 0, data 0x0, priority 2
Active_priority 0
E_attach_fd: fd 10, fd_type 0, code 0, data 0x0, priority 2
Active_priority 0
E_attach_fd: fd 9, fd_type 0, code 0, data 0x0, priority 1
Active_priority 0
E_attach_fd: fd 11, fd_type 0, code 0, data 0x0, priority 1
Active_priority 0
E_queue: (only) event queued func 0x80566a0 code 0 data 0x0 in future
(0:0)
E_queue: (last) event queued func 0x8056410 code 0 data 0x0 in future
(1:0)
E_dequeue: no such event
E_dequeue: no such event
E_dequeue: no such event
E_dequeue: no such event
E_dequeue: no such event
Sess_init: INET bind for port 4803 interface 172.17.188.3 ok
Sess_init: INET went ok on mailbox 13
Sess_init: INET bind for port 4803 interface 172.16.3.66 ok
Sess_init: INET went ok on mailbox 14
Sess_init: INET bind for port 4803 interface 127.0.0.1 ok
Sess_init: INET went ok on mailbox 15
Sess_init: UNIX bind for name /tmp/4803 ok
Sess_init: UNIX went ok on mailbox 16
E_attach_fd: fd 13, fd_type 0, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 13, fd_type 2, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 14, fd_type 0, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 14, fd_type 2, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 15, fd_type 0, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 15, fd_type 2, code 2, data 0x0, priority 0
Active_priority 0
E_attach_fd: fd 16, fd_type 0, code 1, data 0x0, priority 0
Active_priority 0
Sess_init: ended ok
DL_init_channel: setsockopt for send and broadcast went ok
DL_init_channel: setting Mcast TTL to 1
DL_init_channel: went ok on channel 17
E_handle_events: next event 
E_sub_time: negative time result.
E_handle_events: poll select
E_handle_events: select with timeout (0, 0)
E_handle_events: next event 
E_handle_events: exec time event 
DL_send: sent a message of 40 bytes to (239.255.3.62,4803) on channel 12
E_queue: (last) event queued func 0x80566a0 code 0 data 0x0 in future
(1:0)
E_handle_events: poll select
E_handle_events: exec handler for fd 6, fd_type 0, priority 2
DL_recv: received 40 bytes on channel 6      <=========== **************
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: select with timeout (0, 999211)
E_handle_events: exec handler for fd 6, fd_type 0, priority 2
DL_recv: received 216 bytes on channel 6
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: select with timeout (0, 950546)

The difference starts at the ********* and I don't see any traffic on
channel 11 until it 'kicks in' and everything starts working as shown
below:

....
DL_send: sent a message of 784 bytes to (172.16.3.64,4804) on channel 12
DL_send: sent a message of 784 bytes to (172.16.3.64,4804) on channel 12
E_dequeue: first event dequeued func 0x8056740 code 0 data 0x0
E_queue: (only) event queued func 0x8055fe0 code 0 data 0x0 in future
(5:0)
E_dequeue: no such event
E_handle_events: poll select
E_handle_events: select with timeout (4, 999991)
E_handle_events: exec handler for fd 11, fd_type 0, priority 1
DL_recv: received 624 bytes on channel 11
DL_send: sent a message of 28 bytes to (172.16.3.64,4804) on channel 12
E_queue: dequeued a (first) simillar event
E_queue: (only) event queued func 0x8055fe0 code 0 data 0x0 in future
(1:0)
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: exec handler for fd 11, fd_type 0, priority 1
DL_recv: received 624 bytes on channel 11
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: select with timeout (0, 999958)
E_handle_events: exec handler for fd 11, fd_type 0, priority 1
DL_recv: received 28 bytes on channel 11
DL_send: sent a message of 28 bytes to (172.16.3.64,4804) on channel 12
E_queue: dequeued a (first) simillar event
E_queue: (only) event queued func 0x8055fe0 code 0 data 0x0 in future
(1:0)
E_handle_events: next event 
E_handle_events: poll select
E_handle_events: select with timeout (0, 999987)
E_handle_events: exec handler for fd 11, fd_type 0, priority 1
DL_recv: received 28 bytes on channel 11
DL_send: sent a message of 28 bytes to (172.16.3.64,4804) on channel 12
E_queue: dequeued a (first) simillar event
E_queue: (only) event queued func 0x8055fe0 code 0 data 0x0 in future
(1:0)
Membership id is ( -1408236736, 1147787491)


I've watched the network during these times and the interesting thing is
that from the network point of view, when things become unblocked, a
unicast packet comes out of the blocked node to start things off 
21:06:55.826651 tap2.33264 > tap3.4803: udp 48 (DF)
21:06:56.826620 tap3.33022 > tap2.4804: udp 784 (DF)
21:06:56.826638 tap3.33022 > tap2.4804: udp 784 (DF)
21:07:01.826917 arp who-has tap2 tell tap3
21:07:01.827074 arp reply tap2 is-at 0:90:fb:5:32:50
21:07:02.827648 tap2.33264 > tap3.4803: udp 48 (DF)
21:07:03.827587 tap3.33022 > tap2.4804: udp 784 (DF)
21:07:03.827604 tap3.33022 > tap2.4804: udp 784 (DF)

... 2 minutes 40 seconds of not establishing group membership ....

21:09:43.852675 tap2.33264 > tap3.4803: udp 48 (DF)
21:09:44.852625 tap3.33022 > tap2.4804: udp 784 (DF)
21:09:44.852640 tap3.33022 > tap2.4804: udp 784 (DF)
21:09:44.853624 tap2.33264 > tap3.4804: udp 624 (DF)
21:09:44.853681 tap2.33264 > tap3.4804: udp 624 (DF)
21:09:44.853744 arp who-has tap4 tell tap3
21:09:44.853885 arp reply tap4 is-at 0:90:fb:2:86:12
21:09:44.853892 tap3.33022 > tap4.4804: udp 624 (DF)   <======== ******
21:09:44.853896 tap3.33022 > tap4.4804: udp 624 (DF)

In this run, tap2 and tap4 were up and tap3 took a long time to connect.
There were no packets from tap3 to tap4 or vice versa prior to this, and
afterward, all was well.

I have seen no problems like this at all with just two nodes.

Any similar experiences or pointers where to look would be appreciated.

Thanks,
Scott




-- 





More information about the Spread-users mailing list