[Spread-users] daemon death and timing parameters

Fri Dec 2 10:00:16 EST 2005

Hello,

We have been running an experiment in which we run a Spread daemon
and an application on each of several hosts in the same segment. Once
the system is up and running, we kill the Spread daemon and the
application on one of the hosts. It takes about 7-9 seconds from the
time the daemon and application are killed until another member of
the group receives a membership notification.
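
For reference, here is a minimal sketch of how the detection latency
could be computed from logged timestamps. The function name and the
sample values are hypothetical, chosen only to illustrate the 7-9
second range; they are not measured data, and no Spread API is used.
If the kill and the notification are timestamped on different hosts,
this also assumes the clocks are synchronized (e.g. via NTP).

```python
from statistics import mean


def detection_latencies(kill_times, notify_times):
    """Return the per-trial latency, in seconds, between the kill and
    the membership notification seen by a surviving group member."""
    if len(kill_times) != len(notify_times):
        raise ValueError("need one notification time per kill time")
    latencies = []
    for killed, notified in zip(kill_times, notify_times):
        if notified < killed:
            # Usually a sign of unsynchronized clocks between hosts.
            raise ValueError("notification precedes kill; check clock sync")
        latencies.append(notified - killed)
    return latencies


# Hypothetical timestamps in seconds, NOT measured data.
kills = [0.0, 100.0, 200.0]
notifies = [7.5, 108.0, 209.0]

lat = detection_latencies(kills, notifies)
print("min/max/mean latency (s):", min(lat), max(lat), mean(lat))
```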

We would really like to push this notification time down to
sub-second, though how far under is an open question. Reading through
the manual, I see a number of parameters that can be tweaked. I was
hoping that someone on the list might have a good feel for some
settings I could try, or for what I would have to accept in terms of
throughput, false positives, and so on in order to get the desired
level of detection. I will experiment, but was hoping for a quicker,
better-informed answer. Also, are there any LAN/WAN parameters that
could be tweaked other than the timing values?

The target environment is three LANs, with at least one and at most
four hosts per LAN. Each host would run 1-3 applications. Throughput
is not particularly high. Does it sound reasonable to push failure
detection down to under a second in this setting?

If I have multiple segments, I've heard that one of the optimizations
that can happen is to have multiple rings running, with a "leader" in
each segment interacting with the other segments. If this happens,
would detection be quicker when a non-leader dies than when a leader
dies?

Also, is there much difference in detection time between losing a
whole LAN's worth of daemons and losing a single daemon? Do the
timeouts run in parallel, or is detecting the failures an iterative
process?

          thank you for all your help, 

                 Paul Rubel