[Spread-users] Circular token over spread: 2 seconds lap time?

Jonathan Stanton jonathan at cnds.jhu.edu
Wed Jul 28 16:31:49 EDT 2004


As Ryan hinted, there is a reason for the Hurry_timeout being 'high'.
You may want to lower it as you describe, but the negative consequence is
an increase in network overhead (packets/sec).

The Hurry timeout is how long the 'leader' of the token ring will hold
the token when NO other daemon wants the token. To emphasize this, if at
least some of the daemons have messages to send, then the token will keep
circling with no additional delay (i.e. the hurry timeout has no effect).
If the token circles several times around the ring and NO messages are
sent (the counters do not increase) then the leader will grab the token
and hold it for the Hurry_timeout interval. Then it will send it around
the ring again in case daemons now have messages.
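
To make the rule above concrete, here is a minimal sketch in Python. It
is not Spread's code (the real logic is in protocol.c). The names
LeaderState and hold_time are hypothetical, the threshold of 2 idle
rounds is an assumption standing in for "several", and so is the choice
to keep holding on every later idle visit until traffic resumes.

```python
from dataclasses import dataclass


@dataclass
class LeaderState:
    hurry_timeout: float          # seconds to hold an idle token (Hurry_timeout)
    idle_rounds_needed: int = 2   # ASSUMPTION: the thread only says "several"
    last_counter: int = 0         # message counter seen on the previous visit
    idle_rounds: int = 0          # consecutive visits with no new messages


def hold_time(state: LeaderState, counter: int) -> float:
    """Return how long the leader holds the token on this visit, in seconds.

    `counter` is the running count of messages sent on the ring, as
    carried by the token when it arrives back at the leader.
    """
    if counter > state.last_counter:
        # Somebody sent messages during the last rotation: ring is busy.
        state.idle_rounds = 0
    else:
        state.idle_rounds += 1
    state.last_counter = counter

    if state.idle_rounds >= state.idle_rounds_needed:
        # Ring has been idle long enough: hold before the next rotation.
        return state.hurry_timeout
    # Busy (or not yet idle for long enough): pass the token on at once.
    return 0.0
```

With hurry_timeout=2.0, a token that comes back with an unchanged
counter twice in a row is held for 2.0 seconds, and the first visit that
shows a higher counter drops the hold back to zero.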

So if your daemons are always busy the hurry timeout should not affect
them. If you have a "light" load (a few messages per second per daemon)
then the hurry timeout will increase the latency per message, but will
avoid the token circling as fast as possible around the ring wasting
bandwidth.
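
The tradeoff can be put in rough numbers. The sketch below is a
back-of-the-envelope model, not a measurement. It assumes one token
packet per hop, and a 1 ms rotation time for a 3-daemon ring (a made-up
figure). It also assumes the worst-case extra wait for a message on an
idle ring equals the hold time. The function name idle_ring_costs is
hypothetical. The two Hurry_timeout values, 2 s and 40 ms, are the ones
discussed in this thread.

```python
def idle_ring_costs(hurry_timeout_s, rotation_s, n_daemons):
    """Return (idle token packets per second, worst-case added latency in s).

    Model: on an idle ring each cycle is one hold at the leader plus one
    trip around the ring, and each trip costs one packet per daemon.
    """
    cycle_s = hurry_timeout_s + rotation_s
    packets_per_sec = n_daemons / cycle_s
    # A message that arrives just after the leader starts holding waits
    # for (roughly) the whole hold before the token reaches its daemon.
    worst_added_latency_s = hurry_timeout_s
    return packets_per_sec, worst_added_latency_s


if __name__ == "__main__":
    for hurry in (2.0, 0.040):
        pps, wait = idle_ring_costs(hurry, rotation_s=0.001, n_daemons=3)
        print(f"Hurry_timeout={hurry * 1000:.0f} ms: "
              f"~{pps:.1f} idle token packets/s, "
              f"up to {wait * 1000:.0f} ms added latency")
```

Under these assumptions, lowering the timeout from 2 s to 40 ms cuts the
worst-case added latency by a factor of 50 and raises the idle token
traffic by roughly the same factor.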

Because of this behaviour, the hurry_timeout often triggers on benchmark
code that sends "ping" style messages to test latency, but it will not
trigger on a busy production system because the message activity will
keep the token circling (and latency will be dominated by intra-OS
scheduling delays).

Cheers,

Jonathan


On Wed, Jul 28, 2004 at 07:01:33PM +0200, Andreu Moreno i Vendrell wrote:
> Hello,
> 
> We have changed only Hurry_timeout to 40 ms and the lap time drops to
> the millisecond range.
> 
> Thanks for your help.
> 
> So the decision is to select a suitable value of Hurry_timeout!
> 
> Thanks,
> 
> Andreu Moreno
> 
> 
> >The protocol layer code isn't the area of Spread that I'm most
> >familiar with, but as far as I understand it, if the network is doing
> >great and there aren't a lot of packets being sent/lost, the network
> >leader will hold the token for Hurry_timeout.  To see the code I just
> >scanned through to try to figure this out, look at
> >Prot_handle_token(), To_hold_token(), and Prot_token_hurry() in
> >protocol.c.
> >
> >This is (I think) a performance decision, to avoid wasting too many
> >resources rotating the token when it isn't necessary... the goal of
> >the system is more throughput, than latency.
> >
> >The other timeouts in membership.c are definitely unrelated...
> >collectively, they represent the time at which Spread assumes the
> >token is lost, and the times of several phases of the daemon
> >membership algorithm.  You may want to update them (carefully) in
> >order to improve the performance of the daemon membership algorithm on
> >a low-latency network.  In general, only do so proportionally.
> >
> >I suspect that decreasing Hurry_timeout should make your problem go
> >away, although there are reasons not to do so if you're using Spread
> >for real.  Let the list know if this works for you.
> >
> >Cheers,
> >Ryan
> >
> >
> >On Tue, 27 Jul 2004 14:15:47 -0700, Steven Dake <sdake at mvista.com> wrote:
> > 
> >
> >>Gautam
> >>
> >>I have tried a similar protocol (http://developer.osdl.org/dev/openais)
> >>with 12 processors and find the token rotation time under full network
> >>load to be about 10msec.  Thus, I'd not suggest setting your token
> >>timeout to lower than this value and you may even want a much larger
> >>value.  I settled on 100msec, although I do not support WAN
> >>configurations which would warrant much larger timeouts.
> >>
> >>I find each node takes about 0.5msec to handle the token.  Under full
> >>network load (10MB/sec throughput on a 100mbit network, 1472-byte
> >>packets) it takes about 6msec for one node to send the messages and
> >>other nodes to process them per token rotation.
> >>
> >>If Spread uses the same algorithms as Yair Amir's PhD thesis suggests,
> >>none of those timer values should have any effect on performance of the
> >>token rotation.  These timeouts are only for determining a configuration
> >>and determining a faulty processor.
> >>btw, I am not familiar with some of the timeouts below so I could be
> >>wrong :).
> >>
> >>Thanks
> >>-steve
> >>
> >>
> >>
> >>On Tue, 2004-07-27 at 08:19, Gautam H. Thaker wrote:
> >>   
> >>
> >>>The "2 second" value is a result of the default Spread timing
> >>>parameters, which are:
> >>>
> >>>Default Spread parameters:
> >>>
> >>>Token_timeout.sec  =   5; Token_timeout.usec  = 0;
> >>>Hurry_timeout.sec  =   2; Hurry_timeout.usec  = 0;
> >>>Alive_timeout.sec  =   1; Alive_timeout.usec  = 0;
> >>>Join_timeout.sec   =   1; Join_timeout.usec   = 0;
> >>>Rep_timeout.sec    =   2; Rep_timeout.usec    = 500000;
> >>>Seg_timeout.sec    =   2; Seg_timeout.usec    = 0;
> >>>Gather_timeout.sec =   5; Gather_timeout.usec = 0;
> >>>Form_timeout.sec   =   5; Form_timeout.usec   = 0;
> >>>Lookup_timeout.sec =  60; Lookup_timeout.usec = 0;
> >>>
> >>>In my tests I have noted that these values result in Spread
> >>>communications suffering a maximum latency of 2 seconds. When I change
> >>>these parameters to the values below, the maximum latencies I observe
> >>>are much lower.
> >>>
> >>>"Very Fast" Spread parameters:
> >>>
> >>>Token_timeout.sec  =   0; Token_timeout.usec  = 100000;
> >>>Hurry_timeout.sec  =   0; Hurry_timeout.usec  =  40000;
> >>>Alive_timeout.sec  =   0; Alive_timeout.usec  =  20000;
> >>>Join_timeout.sec   =   0; Join_timeout.usec   =  20000;
> >>>Rep_timeout.sec    =   0; Rep_timeout.usec    =  60000;
> >>>Seg_timeout.sec    =   0; Seg_timeout.usec    =  40000;
> >>>Gather_timeout.sec =   0; Gather_timeout.usec = 100000;
> >>>Form_timeout.sec   =   0; Form_timeout.usec   = 100000;
> >>>Lookup_timeout.sec =   1; Lookup_timeout.usec = 200000;
> >>>
> >>>
> >>>The latency ranges observed for a variety of message sizes for these
> >>>two sets of parameter values are shown in the attached graphic. All
> >>>our test results are also available online at:
> >>>
> >>>http://www.atl.external.lmco.com/projects/QoS/compare/cgi-bin/left2_part1.cgi?filter=emulab.*%28spread%7Ctcp%29
> >>>
> >>>I was wondering if anyone has pushed the Spread parameters even lower
> >>>than the "very fast" values. Certainly on a Linux 2.6 kernel or on
> >>>Solaris, both of which have 1000 HZ clocks, the lowest value of a
> >>>parameter should be settable at about 2 msec (rather than the 20 msec
> >>>in "very fast" above).
> >>>
> >>>Gautam
> >>>
> >>>Andreu Moreno i Vendrell wrote:
> >>>     
> >>>
> >>>>Hello,
> >>>>
> >>>>We have a 2 second lap time in a circular token over Spread. Do you
> >>>>know what's wrong?
> >>>>
> >>>>Test description:
> >>>>
> >>>>a) 3 computers in an isolated LAN: Machine 1, Machine 2 and Machine 3.
> >>>>b) Spread 3.17.2 version installed in every machine.
> >>>>c) RedHat 8.0 Linux installed in every machine.
> >>>>d) Machine 1: runs a program that joins group "1" and on reception of a
> >>>>message it sends a message to group "2".
> >>>>e) Machine 2: runs a program that joins group "2" and on reception of a
> >>>>message it sends a message to group "3".
> >>>>f) Machine 3: runs a program that joins group "3" and on reception of a
> >>>>message it sends a message to group "3". This program is the last to be
> >>>>executed and also sends a message to group "1" to start the token to
> >>>>circulate.
> >>>>
> >>>>Results:
> >>>>
> >>>>The lap time is about 2 seconds?????
> >>>>
> >>>>Thanks,
> >>>>
> >>>>Andreu
> >>>>
> >>>>       
> >>>>
> 
> 
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users

-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------



