[Spread-users] Spread performance tuning: retransmissions and bandwidth usage
Tudor Dumitras
tdumitra at ece.cmu.edu
Fri Feb 25 21:01:23 EST 2005
Hello everybody,
This is a long message. I have done some performance testing of an
application built on top of Spread and discovered that, while Spread's
latency is quite good and scales well with the amount of data being
transmitted, the number of retransmissions is fairly high and the
bandwidth utilization does not exceed 70%. I would like to know whether
there are any optimizations I could make -- in addition to the ones
described here -- to drive the end-to-end latency down by reducing the
number of retransmissions and by using more of the bandwidth. I am
presenting three examples, which are different runs of the same
application. At the end of this message I summarize the most important
(in my opinion) observations and draw some conclusions from them, so
you can skip directly to that point if you think the details are not
very important.
The difference between the three examples below is that the message
payloads are 4 KB, 16 KB and 64 KB. The application is sending other
messages as well, but their size is negligible when compared to the
"big" messages. Basically, with each new experiment I am sending 4
times more data.
A simplified, but correct description of the application is that only 3
hosts are sending the "big" messages, after receiving requests from one
of the other 17 hosts. In the rest of this message, by "round-trip time"
I mean the time elapsed between sending such a request and receiving the
"big" message. The spmonitor screenshots below are from one of the hosts
sending "big" messages.
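In outline, the round-trip figures below come from timestamping the
request when it goes out and the matching "big" message when it arrives.
A minimal sketch of that measurement (send_request and
wait_for_big_message are stand-ins for the application's actual Spread
calls, not real API functions):

```python
import time

def measure_rtt(send_request, wait_for_big_message):
    """Time between sending a request and receiving the 'big' reply, in ms."""
    t0 = time.monotonic()
    send_request()
    wait_for_big_message()
    return (time.monotonic() - t0) * 1000.0

# Toy usage with stand-in callables; the sleep simulates a ~54 ms reply:
rtt_ms = measure_rtt(lambda: None, lambda: time.sleep(0.054))
print(f"{rtt_ms:.0f} ms")  # ~54 ms, as in the "4K" case
```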
I am running Spread on Emulab in a 100 Mbps LAN with 20 hosts. Each host
is a Pentium III @ 850 MHz (1697 bogomips), running Linux RedHat 9.0
with the TimeSys 3.1 kernel:
-bash-2.05b$ uname -a
Linux fourofnine.tsl25.pces.emulab.net 2.4.7-timesys-3.1.214 #1 Tue Oct
21 20:42:36 MDT 2003 i686 unknown
I should also note that we have tuned the parameters of Spread to get
the best performance in this setting. The timeouts are significantly
reduced, and MAX_SESSION_MESSAGES is about 30,000. The Spread daemons
have a real-time FIFO scheduling policy, with priority 1. On each node,
the only other process that has a real-time scheduling policy is the ssh
daemon (with the same priority). Also, due to some particularities of
Emulab, the multicast address and the IP address of a host belong to
different network interfaces. The results for the three examples are
listed below.
"4K" EXAMPLE
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 77 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 5304 tok_hurry : 36 memb change: 2
sent pack: 48921 recv pack : 189302 retrans : 471
u retrans: 244 s retrans : 227 b retrans : 0
My_aru : 228644 Aru : 228574 Highest seq: 228644
Sessions : 1 Groups : 19 Window : 60
Deliver M: 131667 Deliver Pk: 228662 Pers Window: 15
Delta Mes: 19619 Delta Pack: 34180 Delta sec : 10
==================================
Monitor> Monitor: send status query
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 87 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 6080 tok_hurry : 36 memb change: 2
sent pack: 56158 recv pack : 217299 retrans : 522
u retrans: 245 s retrans : 277 b retrans : 0
My_aru : 262220 Aru : 262130 Highest seq: 262220
Sessions : 1 Groups : 19 Window : 60
Deliver M: 150978 Deliver Pk: 262238 Pers Window: 15
Delta Mes: 19311 Delta Pack: 33556 Delta sec : 10
==================================
Monitor>
In this case, the average round-trip time is 54 ms. Running iftop on the
network interface that has the multicast address shows that this
experiment uses roughly 30 Mb/s of bandwidth. The Spread daemons on my
20 machines utilize between 14% - 21% of the CPU time. In the 10 seconds
between the two information displays, the token has rotated 776 times,
which means that Spread does about 78 rounds / s. During this time,
about 34000 packets have been delivered, at a rate of 3400 packets / s.
(Since the application messages have variable sizes, I believe the
packet delivery rate is the right figure for reasoning about throughput.)
There have been 0.066 retransmissions per round and 5.1 retransmissions
per second.
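All of the per-second figures above can be reproduced from any two
spmonitor snapshots by differencing the counters. A small sketch (the
dict keys are my own shorthand for the monitor's "rounds", "Deliver Pk"
and "retrans" fields):

```python
def delta_metrics(s1, s2):
    """Per-second rates from two spmonitor status snapshots taken dt apart."""
    dt = s2["seconds"] - s1["seconds"]
    rounds = s2["rounds"] - s1["rounds"]
    packets = s2["deliver_pk"] - s1["deliver_pk"]
    retrans = s2["retrans"] - s1["retrans"]
    return {
        "rounds_per_s": rounds / dt,
        "packets_per_s": packets / dt,
        "retrans_per_round": retrans / rounds,
        "retrans_per_s": retrans / dt,
    }

# The two "4K" snapshots from above:
snap1 = dict(seconds=77, rounds=5304, deliver_pk=228662, retrans=471)
snap2 = dict(seconds=87, rounds=6080, deliver_pk=262238, retrans=522)
m = delta_metrics(snap1, snap2)
print(m)  # rounds/s ~77.6, packets/s ~3358, retrans/round ~0.066, retrans/s 5.1
```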
"16K" EXAMPLE
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 81 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 7837 tok_hurry : 33 memb change: 2
sent pack: 67319 recv pack : 293244 retrans : 2970
u retrans: 114 s retrans : 2856 b retrans : 0
My_aru : 268805 Aru : 268685 Highest seq: 268805
Sessions : 1 Groups : 19 Window : 60
Deliver M: 54114 Deliver Pk: 268823 Pers Window: 15
Delta Mes: 7569 Delta Pack: 37665 Delta sec : 10
==================================
Monitor> Monitor: send status query
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 91 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 8925 tok_hurry : 33 memb change: 2
sent pack: 76765 recv pack : 333725 retrans : 3275
u retrans: 119 s retrans : 3156 b retrans : 0
My_aru : 306503 Aru : 306442 Highest seq: 306503
Sessions : 1 Groups : 19 Window : 60
Deliver M: 61700 Deliver Pk: 306521 Pers Window: 15
Delta Mes: 7586 Delta Pack: 37757 Delta sec : 10
==================================
Monitor>
Now, the average round-trip time is 116 ms. Iftop shows that in this
case I am using about 52 Mb/s of bandwidth. The Spread daemons utilize
between 20% - 28% of the CPU time. In this case, Spread does about 109
rounds / s, and delivers about 37000 packets (3700 packets / s). There
have been 0.28 retransmissions per round and 30.5 retransmissions per
second.
"64K" EXAMPLE
============================
Status at nineofnine V 3.17. 3 (state 1, gstate 1) after 88 seconds :
Membership : 20 procs in 1 segments, leader is nineofnine
rounds : 9565 tok_hurry : 35 memb change: 2
sent pack: 79248 recv pack : 359068 retrans : 4892
u retrans: 124 s retrans : 4768 b retrans : 0
My_aru : 283356 Aru : 283268 Highest seq: 283370
Sessions : 1 Groups : 19 Window : 60
Deliver M: 16472 Deliver Pk: 283375 Pers Window: 15
Delta Mes: 2077 Delta Pack: 36195 Delta sec : 10
==================================
Monitor> Monitor: send status query
============================
Status at nineofnine V 3.17. 3 (state 1, gstate 1) after 98 seconds :
Membership : 20 procs in 1 segments, leader is nineofnine
rounds : 10786 tok_hurry : 35 memb change: 2
sent pack: 89484 recv pack : 405229 retrans : 5514
u retrans: 139 s retrans : 5375 b retrans : 0
My_aru : 319817 Aru : 319618 Highest seq: 319817
Sessions : 1 Groups : 19 Window : 60
Deliver M: 18584 Deliver Pk: 319836 Pers Window: 15
Delta Mes: 2112 Delta Pack: 36350 Delta sec : 10
==================================
Monitor>
Round-trip delay is 585 ms, bandwidth is 64 Mb/s and CPU utilization is
between 25% - 35%. Spread does about 122 rounds / s and it delivers 3600
packets / s. There have been 0.51 retransmissions / round and 62
retransmissions / s.
In summary:
--------------------------------------------------------------
                       "4K"        "16K"       "64K"
--------------------------------------------------------------
Rounds/s               78          109         122
Packets/s              3400        3700        3600
Retransmissions/s      5.1         30.5        62
Bandwidth              30 Mb/s     52 Mb/s     64 Mb/s
CPU                    14% - 21%   20% - 28%   25% - 35%
Average round-trip     54 ms       116 ms      585 ms
--------------------------------------------------------------
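As a sanity check, the link utilization and the latency scaling can be
derived directly from the table (a small sketch; the 100 Mb/s figure is
the Emulab LAN capacity stated earlier):

```python
LINK_MBPS = 100  # capacity of the Emulab LAN

table = {  # payload: (bandwidth in Mb/s, average round-trip in ms)
    "4K":  (30, 54),
    "16K": (52, 116),
    "64K": (64, 585),
}

for name, (bw, rtt) in table.items():
    print(f"{name}: {bw / LINK_MBPS:.0%} of link capacity, rtt {rtt} ms")

# 16x the data (4K -> 64K) costs only ~11x the round-trip time, i.e.
# the latency scales sub-linearly overall:
scaling = table["64K"][1] / table["4K"][1]
print(round(scaling, 1))  # 10.8
```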
From these (not very rigorous) observations, I can derive the following
conclusions:
* Overall, the round-trip latency scales sub-linearly with the amount
of data being transmitted (16 times the data costs about 11 times the
latency), although the 16 KB -> 64 KB step is super-linear
* The throughput is roughly the same for these three examples
* The token rotates faster if there is more data to send
* The rate of retransmission increases if there is more data to send
* Spread doesn't utilize more than 70% of the available bandwidth
I believe these results speak well of the performance of the Spread
package. I am wondering if it couldn't do even better. Can you think of
any optimizations that I could do to reduce the retransmission rate and
to improve the bandwidth utilization, with the ultimate purpose of
reducing the end-to-end latency? What exactly causes the
retransmissions? (Since the token protocol enforces mutual exclusion
between Spread daemons, there shouldn't be any collisions in the
Ethernet LAN). What prevents Spread from using more bandwidth to make
sure that messages are delivered faster? Is it the size of the UDP
packets exchanged by the daemons? Can this size be tuned?
Thank you in advance for your thoughts on this matter.
Tudor
--
______________________________________
Tudor A. Dumitras
ECE Department
Carnegie Mellon University
(412) 268-5005
http://www.ece.cmu.edu/~tdumitra