[Spread-users] Spread performance tuning: retransmissions and bandwidth usage
Tudor Dumitras
tdumitra at ece.cmu.edu
Fri Feb 25 21:01:23 EST 2005
Hello everybody,
This is a long message. I have done some performance testing of an
application built on top of Spread and discovered that, while Spread's
latency is quite good and scales well with the amount of data being
transmitted, the number of retransmissions is fairly high and the
bandwidth utilization does not exceed 70%. I would like to know whether
there are any optimizations I could make -- in addition to the ones
described here -- to drive the end-to-end latency down by reducing the
number of retransmissions and by using more of the bandwidth. I am
presenting three examples, which are different runs of the same
application. At the end of this message I summarize the most important
(in my opinion) observations and draw some conclusions from them, so
you can skip directly to that point if you think the details are not
very important.
The difference between the three examples below is that the message
payloads are 4 KB, 16 KB and 64 KB. The application is sending other
messages as well, but their size is negligible when compared to the
"big" messages. Basically, with each new experiment I am sending 4
times more data.
A simplified, but correct description of the application is that only 3
hosts are sending the "big" messages, after receiving requests from one
of the other 17 hosts. In the rest of this message, by "round-trip time"
I mean the time elapsed between sending such a request and receiving the
"big" message. The spmonitor screenshots below are from one of the hosts
sending "big" messages.
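In outline, the round-trip figures below come from timestamping the
request when it goes out and the matching "big" message when it arrives.
A minimal sketch of that measurement (send_request and
wait_for_big_message are stand-ins for the application's actual Spread
calls, not real API functions):

```python
import time

def measure_rtt(send_request, wait_for_big_message):
    """Time between sending a request and receiving the 'big' reply, in ms."""
    t0 = time.monotonic()
    send_request()
    wait_for_big_message()
    return (time.monotonic() - t0) * 1000.0

# Toy usage with stand-in callables; the sleep simulates a ~54 ms reply:
rtt_ms = measure_rtt(lambda: None, lambda: time.sleep(0.054))
print(f"{rtt_ms:.0f} ms")  # ~54 ms, as in the "4K" case
```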
I am running Spread on Emulab in a 100 Mbps LAN with 20 hosts. Each host
is a Pentium III @ 850 MHz (1697 bogomips), running Linux RedHat 9.0
with the TimeSys 3.1 kernel:
-bash-2.05b$ uname -a
Linux fourofnine.tsl25.pces.emulab.net 2.4.7-timesys-3.1.214 #1 Tue Oct
21 20:42:36 MDT 2003 i686 unknown
I should also note that we have tuned the parameters of Spread to get
the best performance in this setting. The timeouts are significantly
reduced, and MAX_SESSION_MESSAGES is about 30,000. The Spread daemons
have a real-time FIFO scheduling policy, with priority 1. On each node,
the only other process that has a real-time scheduling policy is the ssh
daemon (with the same priority). Also, due to some particularities of
Emulab, the multicast address and the IP address of a host belong to
different network interfaces. The results for the three examples are
listed below.
"4K" EXAMPLE
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 77 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 5304 tok_hurry : 36 memb change: 2
sent pack: 48921 recv pack : 189302 retrans : 471
u retrans: 244 s retrans : 227 b retrans : 0
My_aru : 228644 Aru : 228574 Highest seq: 228644
Sessions : 1 Groups : 19 Window : 60
Deliver M: 131667 Deliver Pk: 228662 Pers Window: 15
Delta Mes: 19619 Delta Pack: 34180 Delta sec : 10
==================================
Monitor> Monitor: send status query
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 87 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 6080 tok_hurry : 36 memb change: 2
sent pack: 56158 recv pack : 217299 retrans : 522
u retrans: 245 s retrans : 277 b retrans : 0
My_aru : 262220 Aru : 262130 Highest seq: 262220
Sessions : 1 Groups : 19 Window : 60
Deliver M: 150978 Deliver Pk: 262238 Pers Window: 15
Delta Mes: 19311 Delta Pack: 33556 Delta sec : 10
==================================
Monitor>
In this case, the average round-trip time is 54 ms. Running iftop on the
network interface that has the multicast address shows that this
experiment uses roughly 30 Mb/s of bandwidth. The Spread daemons on my
20 machines utilize between 14% - 21% of the CPU time. In the 10 seconds
between the two information displays, the token has rotated 776 times,
which means that Spread does about 78 rounds / s. During this time,
about 34000 packets have been delivered, at a rate of 3400 packets / s.
(Since the application messages have variable sizes, I believe the
packet delivery rate is the right figure for reasoning about throughput.)
There have been 0.066 retransmissions per round and 5.1 retransmissions
per second.
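All of the per-second figures above can be reproduced from any two
spmonitor snapshots by differencing the counters. A small sketch (the
dict keys are my own shorthand for the monitor's "rounds", "Deliver Pk"
and "retrans" fields):

```python
def delta_metrics(s1, s2):
    """Per-second rates from two spmonitor status snapshots taken dt apart."""
    dt = s2["seconds"] - s1["seconds"]
    rounds = s2["rounds"] - s1["rounds"]
    packets = s2["deliver_pk"] - s1["deliver_pk"]
    retrans = s2["retrans"] - s1["retrans"]
    return {
        "rounds_per_s": rounds / dt,
        "packets_per_s": packets / dt,
        "retrans_per_round": retrans / rounds,
        "retrans_per_s": retrans / dt,
    }

# The two "4K" snapshots from above:
snap1 = dict(seconds=77, rounds=5304, deliver_pk=228662, retrans=471)
snap2 = dict(seconds=87, rounds=6080, deliver_pk=262238, retrans=522)
m = delta_metrics(snap1, snap2)
print(m)  # rounds/s ~77.6, packets/s ~3358, retrans/round ~0.066, retrans/s 5.1
```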
"16K" EXAMPLE
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 81 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 7837 tok_hurry : 33 memb change: 2
sent pack: 67319 recv pack : 293244 retrans : 2970
u retrans: 114 s retrans : 2856 b retrans : 0
My_aru : 268805 Aru : 268685 Highest seq: 268805
Sessions : 1 Groups : 19 Window : 60
Deliver M: 54114 Deliver Pk: 268823 Pers Window: 15
Delta Mes: 7569 Delta Pack: 37665 Delta sec : 10
==================================
Monitor> Monitor: send status query
============================
Status at fourofnine V 3.17. 3 (state 1, gstate 1) after 91 seconds :
Membership : 20 procs in 1 segments, leader is fourofnine
rounds : 8925 tok_hurry : 33 memb change: 2
sent pack: 76765 recv pack : 333725 retrans : 3275
u retrans: 119 s retrans : 3156 b retrans : 0
My_aru : 306503 Aru : 306442 Highest seq: 306503
Sessions : 1 Groups : 19 Window : 60
Deliver M: 61700 Deliver Pk: 306521 Pers Window: 15
Delta Mes: 7586 Delta Pack: 37757 Delta sec : 10
==================================
Monitor>
Now, the average round-trip time is 116 ms. Iftop shows that in this
case I am using about 52 Mb/s of bandwidth. The Spread daemons utilize
between 20% - 28% of the CPU time. In this case, Spread does about 109
rounds / s, and delivers about 37000 packets (3700 packets / s). There
have been 0.28 retransmissions per round and 30.5 retransmissions per
second.
"64K" EXAMPLE
============================
Status at nineofnine V 3.17. 3 (state 1, gstate 1) after 88 seconds :
Membership : 20 procs in 1 segments, leader is nineofnine
rounds : 9565 tok_hurry : 35 memb change: 2
sent pack: 79248 recv pack : 359068 retrans : 4892
u retrans: 124 s retrans : 4768 b retrans : 0
My_aru : 283356 Aru : 283268 Highest seq: 283370
Sessions : 1 Groups : 19 Window : 60
Deliver M: 16472 Deliver Pk: 283375 Pers Window: 15
Delta Mes: 2077 Delta Pack: 36195 Delta sec : 10
==================================
Monitor> Monitor: send status query
============================
Status at nineofnine V 3.17. 3 (state 1, gstate 1) after 98 seconds :
Membership : 20 procs in 1 segments, leader is nineofnine
rounds : 10786 tok_hurry : 35 memb change: 2
sent pack: 89484 recv pack : 405229 retrans : 5514
u retrans: 139 s retrans : 5375 b retrans : 0
My_aru : 319817 Aru : 319618 Highest seq: 319817
Sessions : 1 Groups : 19 Window : 60
Deliver M: 18584 Deliver Pk: 319836 Pers Window: 15
Delta Mes: 2112 Delta Pack: 36350 Delta sec : 10
==================================
Monitor>
Round-trip delay is 585 ms, bandwidth is 64 Mb/s and CPU utilization is
between 25% - 35%. Spread does about 122 rounds / s and it delivers 3600
packets / s. There have been 0.51 retransmissions / round and 62
retransmissions / s.
In summary:
--------------------------------------------------------------
                       "4K"        "16K"       "64K"
--------------------------------------------------------------
Rounds/s               78          109         122
Packets/s              3400        3700        3600
Retransmissions/s      5.1         30.5        62
Bandwidth              30 Mb/s     52 Mb/s     64 Mb/s
CPU                    14% - 21%   20% - 28%   25% - 35%
Average round-trip     54 ms       116 ms      585 ms
--------------------------------------------------------------
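As a sanity check, the link utilization and the latency scaling can be
derived directly from the table (a small sketch; the 100 Mb/s figure is
the Emulab LAN capacity stated earlier):

```python
LINK_MBPS = 100  # capacity of the Emulab LAN

table = {  # payload: (bandwidth in Mb/s, average round-trip in ms)
    "4K":  (30, 54),
    "16K": (52, 116),
    "64K": (64, 585),
}

for name, (bw, rtt) in table.items():
    print(f"{name}: {bw / LINK_MBPS:.0%} of link capacity, rtt {rtt} ms")

# 16x the data (4K -> 64K) costs only ~11x the round-trip time, i.e.
# the latency scales sub-linearly overall:
scaling = table["64K"][1] / table["4K"][1]
print(round(scaling, 1))  # 10.8
```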
From these (not very rigorous) observations, I can derive the following
conclusions:
* Overall, the round-trip latency scales sub-linearly with the amount
of data being transmitted (16 times the data costs about 11 times the
latency), although the 16 KB -> 64 KB step is super-linear
* The throughput is roughly the same for these three examples
* The token rotates faster if there is more data to send
* The rate of retransmission increases if there is more data to send
* Spread doesn't utilize more than 70% of the available bandwidth
I believe these results speak well of the performance of the Spread
package. I am wondering if it couldn't do even better. Can you think of
any optimizations that I could do to reduce the retransmission rate and
to improve the bandwidth utilization, with the ultimate purpose of
reducing the end-to-end latency? What exactly causes the
retransmissions? (Since the token protocol enforces mutual exclusion
between Spread daemons, there shouldn't be any collisions in the
Ethernet LAN). What prevents Spread from using more bandwidth to make
sure that messages are delivered faster? Is it the size of the UDP
packets exchanged by the daemons? Can this size be tuned?
Thank you in advance for your thoughts on this matter.
Tudor
--
______________________________________
Tudor A. Dumitras
ECE Department
Carnegie Mellon University
(412) 268-5005
http://www.ece.cmu.edu/~tdumitra