[Spread-users] operation after partition heals

Scott Barvick sbarvick at revasystems.com
Fri May 20 14:41:23 EDT 2005


Following up, I think the answer in this case is (in membership.c):

        /* Lookup timeout when only one segment exists can be longer,
         * since no remote segments need to be probed
         */
        if ( Cn.num_segments == 1 )
            Lookup_timeout.sec = 300;

We only have one segment, but we are running Spread over multicast on a
switch with IGMP snooping/filtering, so it looks like the merge has to
wait for the lookup unicasts to kick things over (5 minutes later, i.e.
the 300-second timeout above).  On a hub, the repair is quick.
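One way to shorten that wait without hand-editing the literal 300 is a
build-time knob.  This is only a minimal sketch, assuming a hypothetical
macro SPREAD_SINGLE_SEGMENT_LOOKUP_SEC and a stand-in lookup_timeout_t
struct; neither comes from the Spread source, and the real timeout type
in membership.c may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build-time knob (NOT an existing Spread option).
 * Override with e.g. -DSPREAD_SINGLE_SEGMENT_LOOKUP_SEC=30 */
#ifndef SPREAD_SINGLE_SEGMENT_LOOKUP_SEC
#define SPREAD_SINGLE_SEGMENT_LOOKUP_SEC 300
#endif

/* Stand-in for the timeout struct used in membership.c (assumption). */
typedef struct {
    long sec;
    long usec;
} lookup_timeout_t;

/* Mirrors the quoted logic: only the single-segment case is lengthened;
 * otherwise the existing value is left untouched. */
static void set_single_segment_lookup(lookup_timeout_t *t, int num_segments)
{
    assert(t != NULL);
    if (num_segments == 1)
        t->sec = SPREAD_SINGLE_SEGMENT_LOOKUP_SEC;
}
```

Building with -DSPREAD_SINGLE_SEGMENT_LOOKUP_SEC=30 would make the
single-segment case probe every 30 seconds in this sketch, at the cost of
more probe traffic.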

Thanks,
Scott




On Wed, 2005-05-18 at 19:36, Jonathan Stanton wrote:
> You should receive a second membership showing that the partition has 
> healed. How long that takes depends on your network setup and traffic. 
> 
> By default, if no messages are seen, every 60-90 seconds each daemon
> will 'ping' all the other daemons it can't talk to in order to try to
> 'reconnect' with them. So if nothing happens before then, that should
> trigger the merge. More often, with LAN setups, if any messages are sent
> through Spread once the wire is plugged back in, the other daemons
> will see those broadcasts, learn that the other daemons are alive and
> connected again, and trigger a membership change then. That is much
> faster: just a few seconds after the message. This doesn't work with
> wide-area daemons, as they don't get the LAN broadcasts.
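A rough model of the two merge triggers described above, as I read
them: a probe fires at the next lookup tick, but if broadcast/multicast
traffic from the other daemon reaches us earlier (plain LAN, no IGMP
filtering), the first message after the heal triggers the merge instead.
This is a hypothetical sketch: next_probe and merge_trigger_time are
illustrative names, probes are assumed to fire at fixed multiples of the
period, and the few seconds of discovery delay are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Time of the first lookup probe at or after t, assuming probes fire at
 * fixed multiples of period (a simplification; real timers may drift). */
static long next_probe(long t, long period)
{
    assert(t >= 0 && period > 0);
    long r = t % period;
    return r == 0 ? t : t + (period - r);
}

/* Hypothetical model: time at which a merge is triggered after the
 * partition heals at heal_t.  first_msg_t is the time of the first
 * message sent after the heal (-1 if none); broadcasts_reach says
 * whether that traffic actually reaches the other daemon. */
static long merge_trigger_time(long heal_t, long period,
                               long first_msg_t, bool broadcasts_reach)
{
    long probe = next_probe(heal_t, period);
    if (broadcasts_reach && first_msg_t >= heal_t && first_msg_t < probe)
        return first_msg_t;   /* broadcast seen first: fast merge */
    return probe;             /* otherwise wait for the lookup probe */
}
```

With a 300-second period and a heal at t = 100 s,
merge_trigger_time(100, 300, 110, true) returns 110, while the same call
with broadcasts_reach set to false returns 300, which matches the
5-minute delay we see behind IGMP snooping.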
> 
> The 60-90 second timeout can be changed by editing membership.c
> (Lookup_timeout) and recompiling and some users do that for specialized
> environments where they need fast recovery and are willing to pay an
> increased bandwidth overhead cost. 
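To put a rough number on that overhead, here is a back-of-the-envelope
sketch, assuming each daemon sends one probe of probe_bytes to every
daemon it cannot reach, once per lookup period.  The probe size and
schedule are assumptions, not measured Spread behavior, and
probe_bytes_per_sec is a hypothetical helper.

```c
#include <assert.h>

/* Estimated probe traffic (bytes/second) while daemons are partitioned,
 * under the one-probe-per-unreachable-peer-per-period assumption. */
static double probe_bytes_per_sec(int daemons, int unreachable_each,
                                  int probe_bytes, double period_sec)
{
    assert(period_sec > 0.0);
    return (double)daemons * unreachable_each * probe_bytes / period_sec;
}
```

For 2 daemons that each miss 1 peer with 150-byte probes, the estimate is
1 byte/s at a 300 s period and 10 bytes/s at 30 s: the cost scales
inversely with the period and, under this assumption, is only paid while
some daemons are unreachable.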
> 
> If you do not see any merge after about 90 seconds, then something is
> wrong; please email me back.
> 
> Cheers,
> 
> Jonathan
> 
> On Wed, May 18, 2005 at 06:33:16PM -0400, Scott Barvick wrote:
> > What is the expected operation after a network partition heals?  I get
> > the membership change messages when the partition occurs so that each of
> > 2 nodes thinks it is the only one, but when the partition heals (I plug
> > the ethernet cable back in), I don't see any awareness of the other node
> > on either side.  Shouldn't the multicast/broadcast messages find their
> > way into Spread and kick something off?
> > 
> > More specifically, I have 2 nodes and I pull the network interface cable
> > out of one of them.  When I plug it back in, I can ping.  I haven't
> > looked at the network traffic yet, but I can do that.  I have MEMBERSHIP
> > logging turned on and I see the token and membership operations at the
> > partition, but there is no output when the partition heals.
> > 
> > Thanks for any advice,
> > Scott
> > 
> > 
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users




