[Spread-users] Spread 4.1.0 gets very confused when a daemon can't see every other daemon

Melissa Jenkins melissa-spread at temeletry.co.uk
Thu Sep 9 17:46:54 EDT 2010


Hi,

I have a working Spread configuration that I've been playing with for quite a while now.

I've noticed that if one of the Spread daemons in a configuration can't see all the others, the whole 'network' stops passing messages.

For example:

Host A <=wan=> Host B <= wan => -+---   Host C --+
                                 |               |-- LAN
                                 +---   Host D --+


A and B are each in their own segment; C and D share a segment.  The link from A to B is WAN, as is the link from B to C and D.  C and D are directly connected and in the same broadcast domain.

Spread_Segment  .255  {
                site3_C          C 
                site3_D          D 
}

Spread_Segment  B {
               site2_B            B
}

Spread_Segment A {
               site1_A          A
}


If A loses its route to host C and/or D, but B can still see C and D, then the whole Spread topology seems to get stuck.

The same happens if B can see host C but not D, while C and D can see each other.  The following log snippets were taken when host A had been manually stopped and host B could only see host C, whereas C and D could see each other.  Obviously this was caused by a problem with the physical LAN; however, I'd really like a way to stop Spread 'dying' in this situation, even if it means that each host forms its own 'group' and they don't communicate with each other.
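
To make the failing condition concrete, here is a minimal sketch in Python (not part of Spread) that models the second scenario above as a set of symmetric links and reports which daemon pairs have no direct path. The names `links`, `daemons`, `missing_links` and `is_connected` are made up for illustration, and the assumption that links are symmetric is mine.

```python
from itertools import combinations

# Hypothetical reachability map for the second scenario described above:
# A is stopped, B can reach C but not D, and C and D can reach each other.
# Links are assumed to be symmetric.
links = {
    frozenset({"B", "C"}),
    frozenset({"C", "D"}),
}
daemons = ["B", "C", "D"]


def missing_links(daemons, links):
    """Return daemon pairs that cannot see each other directly."""
    return [
        tuple(sorted(pair))
        for pair in map(frozenset, combinations(daemons, 2))
        if pair not in links
    ]


def is_connected(daemons, links):
    """Return True if every daemon is reachable via some chain of links."""
    seen, stack = set(), [daemons[0]]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Follow every link that touches this node to its other endpoint.
        stack.extend(
            other for link in links if node in link for other in link - {node}
        )
    return seen == set(daemons)


print("pairs with no direct path:", missing_links(daemons, links))
print("still connected via relays:", is_connected(daemons, links))
```

In this model the hosts are still connected as a graph (B reaches D through C), but the B-D pair has no direct path, which is the situation in which the daemons never settle.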

The logs on C and D look similar to the following (repeated over and over again):
[Wed 08 Sep 2010 04:56:58] Memb_handle_message: handling join message from B, State is 1
[Wed 08 Sep 2010 04:56:58] Handle_join in OP
[Wed 08 Sep 2010 04:58:03] Memb_handle_message: handling join message from B, State is 1
[Wed 08 Sep 2010 04:58:03] Handle_join in OP
[Wed 08 Sep 2010 04:59:08] Memb_handle_message: handling join message from B, State is 1
[Wed 08 Sep 2010 04:59:08] Handle_join in OP
[Wed 08 Sep 2010 05:00:13] Memb_handle_message: handling join message from B, State is 1
[Wed 08 Sep 2010 05:00:13] Handle_join in OP

The log on B looks like the following (repeated over and over again):
[Wed 08 Sep 2010 08:39:19] Memb_handle_message: handling refer message from D, State is 4
[Wed 08 Sep 2010 08:39:19] Handle_refer in GATHER
[Wed 08 Sep 2010 08:39:19] Memb_handle_message: handling join message from C, State is 4
[Wed 08 Sep 2010 08:39:20] Send_join: State is 4
[Wed 08 Sep 2010 08:39:21] Send_join: State is 4
[Wed 08 Sep 2010 08:39:22] Send_join: State is 4
[Wed 08 Sep 2010 08:39:23] Send_join: State is 4
[Wed 08 Sep 2010 08:39:24] Send_join: State is 4
[Wed 08 Sep 2010 08:39:24] Form_or_fail:failed, return to OP
[Wed 08 Sep 2010 08:40:24] Send_join: State is 4
[Wed 08 Sep 2010 08:40:24] Memb_handle_message: handling join message from C, State is 4
[Wed 08 Sep 2010 08:40:24] Memb_handle_message: handling refer message from D, State is 4
[Wed 08 Sep 2010 08:40:24] Handle_refer in GATHER
[Wed 08 Sep 2010 08:40:25] Send_join: State is 4
[Wed 08 Sep 2010 08:40:26] Send_join: State is 4
[Wed 08 Sep 2010 08:40:27] Send_join: State is 4
[Wed 08 Sep 2010 08:40:28] Send_join: State is 4
[Wed 08 Sep 2010 08:40:29] Form_or_fail:failed, return to OP
[Wed 08 Sep 2010 08:41:29] Send_join: State is 4

Any messages sent on B seem to block indefinitely in this situation.  (I believe they also block on C & D, but I haven't confirmed this)
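
For anyone wanting to quantify the loop from a daemon log, the following is a minimal Python sketch that pulls out the `Form_or_fail:failed` entries and the time between them. It assumes only the timestamp and message layout visible in the excerpts above (and an English/C locale for the day and month names); the function names are illustrative and not part of Spread.

```python
import re
from datetime import datetime

# Matches the line layout seen in the log excerpts, e.g.
# "[Wed 08 Sep 2010 08:39:24] Form_or_fail:failed, return to OP"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")
TS_FORMAT = "%a %d %b %Y %H:%M:%S"


def failed_attempts(lines):
    """Return the timestamps of every 'Form_or_fail:failed' entry."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg").startswith("Form_or_fail:failed"):
            stamps.append(datetime.strptime(match.group("ts"), TS_FORMAT))
    return stamps


def gaps_in_seconds(stamps):
    """Return the number of seconds between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# A few lines copied from the log on B above.
sample = """\
[Wed 08 Sep 2010 08:39:24] Send_join: State is 4
[Wed 08 Sep 2010 08:39:24] Form_or_fail:failed, return to OP
[Wed 08 Sep 2010 08:40:24] Send_join: State is 4
[Wed 08 Sep 2010 08:40:29] Form_or_fail:failed, return to OP
""".splitlines()

stamps = failed_attempts(sample)
print(len(stamps), "failed attempts, seconds apart:", gaps_in_seconds(stamps))
```

Run against a full log, a steady stream of evenly spaced failures with no successful membership in between would confirm that the daemon is cycling rather than settling.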

Does anybody have any ideas on how to encourage Spread to settle and allow messages to pass in this situation?  I'm happy to look at code, but would need guidance on where to look!

Thanks!
Mel

PS: I've had similar problems if host B is multihomed on two networks, with A on one and C/D on the other: it will never "form".  It seems to get confused about B having two addresses (one on each LAN).  I worked around this by just removing one Spread segment and directly connecting the client.
