[Spread-users] proper way to shutdown a spread daemon?

Tue Nov 14 11:36:41 EST 2006

Spread is designed to handle any crash failure, so it definitely should be able to 
be "killed" and restarted easily. Now there have been occasional bugs in the past 
where the membership algorithm would not complete -- which could cause the 
symptoms you describe (daemons in the same set not merging their state when a 
fault is repaired). As always :-) we think they are all fixed in the current 
source tree - but new bugs are occasionally found...

What version of Spread was this with? Was there anything in the log of either the 
'restarted' daemon or its peers at the time they wouldn't merge?

One known limitation: if a daemon has clients connected to it and the daemon is 
killed before the clients, then because of the way TCP works, the network port the 
daemon needs to bind to is considered "busy" (in the TIME_WAIT state) for a minute 
or so, and during that time the daemon restart will fail. This is a limitation of 
TCP, not Spread, and can be overcome by using the SO_REUSEADDR socket option. We 
support that automatically when possible (allowing its use is insecure on some 
OSes). You can find some documentation on this feature in the sample.spread.conf 
file; look for the SocketPortReuse option.
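
For readers unfamiliar with SO_REUSEADDR, here is a minimal sketch of what the 
option does at the socket level. It is not Spread's code: the function name 
open_listener, the loopback address, and port 4803 are example values chosen for 
illustration. It only shows the general technique of setting the option before 
bind() so that a restart is not refused while old connections sit in TIME_WAIT.

```python
import socket


def open_listener(host="127.0.0.1", port=4803, reuse=True):
    """Open a listening TCP socket (hypothetical helper, not part of Spread).

    With reuse=True, SO_REUSEADDR is set before bind(), so the bind is not
    refused just because earlier connections on this port are in TIME_WAIT.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse:
            # Must be set before bind() to have any effect.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen(5)
    except OSError:
        # Do not leak the descriptor if bind() or listen() fails.
        sock.close()
        raise
    return sock
```

Without the option, the same bind() would typically fail with an "address already 
in use" error until the TIME_WAIT period has passed.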

Now I know that doesn't have anything to do with the problem you reported, but I 
wanted to mention it for completeness, as it is the one known limit on restarting. 
It also justifies the only 'advice' on shutting down Spread cleanly: kill the 
clients first, then the daemon (if possible), as that avoids the TCP waiting time.
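
The "clients first, then the daemon" order could be scripted roughly as follows. 
This is a sketch under assumptions, not a tool shipped with Spread: the function 
names are made up, the caller is assumed to already know the PIDs, and the 
processes are assumed not to be children of the script (a child that has exited 
but has not been waited on still looks alive to the existence check). SIGTERM is 
used because that is what a plain kill sends by default.

```python
import os
import signal
import time


def wait_for_exit(pid, timeout, kill=os.kill):
    """Return True once `pid` no longer exists, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            kill(pid, 0)  # signal 0 sends nothing; it only checks existence
        except ProcessLookupError:
            return True
        except PermissionError:
            pass  # the process exists but belongs to another user
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.1)


def stop_clients_then_daemon(client_pids, daemon_pid, timeout=10.0, kill=os.kill):
    """Send SIGTERM to every client, wait for them to exit, then stop the daemon.

    Returns True if all clients were gone before the daemon was signalled.
    The `kill` parameter exists only so the function can be tested without
    real processes.
    """
    for pid in client_pids:
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # client already gone
    all_gone = all([wait_for_exit(pid, timeout, kill) for pid in client_pids])
    # The daemon is signalled last, whether or not every client exited in time.
    kill(daemon_pid, signal.SIGTERM)
    return all_gone
```

A call might look like stop_clients_then_daemon([1234, 1235], 1200), where the 
numbers stand in for real process IDs.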

Cheers,

Jonathan

On Tue, Nov 14, 2006 at 08:09:02AM -0600, matthew.garman at gmail.com wrote:
> 
> One of our machines running (several instances of) spread went down,
> and we had to reboot.  When the machine came back online, I
> re-started all the daemons.  All of them, except one, re-joined
> their respective spread networks.
> 
> Further restarts of the problematic daemon did not bring it back
> into its network.  I had to restart all the other "peer" daemons
> (i.e. daemons on different machines but in the same segment).
> 
> I'm wondering if we have not been shutting spread down properly.
> Generally, we just use the kill command (which, on the Linux distro
> I'm using, sends the TERM signal by default).
> 
> Granted, this was a special situation, because we had a problem with
> the hardware itself.  But I think it's worth knowing, what is the
> best way to make sure spread shuts down gracefully?
> 
> Also, what circumstances might lead to the situation described
> above, i.e. a daemon does not correctly re-join its network?
> 
> Thank you!
> Matt
> 
> 

-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------