[Spread-users] throughput degradation

Yair Amir yairamir at cnds.jhu.edu
Thu Dec 26 08:16:44 EST 2002


Hi Martin,

I am pretty sure I know some of what is going on, but since the spmonitor
reports you provided were partial rather than complete, I cannot
pinpoint whether there are additional issues in your network.

The "r" and "s" tests are not that good. A 5 to 10% loss is pretty
high assuming your computers are similar and the receiver is not busy
with something else. Today's computers should be able to receive close
to 100Mbits/sec with s and r with almost no losses (e.g. 94Mbits/sec.)
Of course, I cannot tell why you have these losses. Generally, an
issue of the network, the operating system or the network interface
card on at least one of the machines. In your case, probably something
in the operating system. (I have seem past operating system kernels
behaving bad when there is more than once CPU, but I guess Solaris 8
should be extremely good handling these things).
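
For reference, the r test is essentially a raw UDP multicast receiver,
so you can sanity-check the OS path completely outside of Spread. A
minimal sketch of the same idea in C (the group address 225.0.1.1 and
port 4803 below are placeholders, not your actual settings):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        struct ip_mreq mreq;
        char buf[1400];
        long count = 0;

        /* bind to the port the sender transmits on */
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(4803);                 /* placeholder */
        bind(s, (struct sockaddr *)&addr, sizeof(addr));

        /* join the multicast group; 225.0.1.1 is a placeholder */
        mreq.imr_multiaddr.s_addr = inet_addr("225.0.1.1");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        /* count datagrams; compare the total against what was sent */
        for (;;)
            if (recv(s, buf, sizeof(buf), 0) > 0 && ++count % 1000 == 0)
                printf("received %ld packets\n", count);
    }

Comparing the final count against the number of packets the sender
transmitted gives you the loss rate with no Spread code involved.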

The Spread part: Spread has its own very conservative flow control in
the vanilla version (the one you download off the web with no changes).
Therefore, you should generally see almost no losses with Spread
running at the full speed of the vanilla version if your computers are
homogeneous and clean. The fact that you had to increase your buffers
to get that effect seems suspicious to me, though I am not an expert on
Solaris 8. If the performance of your computers is similar to or better
than that of cheap PCs you could have bought two years ago running
Linux or FreeBSD, I would not expect that to be necessary. In any case,
the reason you see no losses in Spread is that Spread has built-in
flow control.
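
One concrete thing you can check (a suggestion only - as I said, I am
not a Solaris expert): verify what buffer size the kernel actually
grants a UDP socket, instead of trusting the ndd setting alone. A
small sketch:

    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int wanted = 32768;      /* the value you set with ndd */
        int granted = 0;
        socklen_t len = sizeof(granted);

        /* request the buffer, then read back what we actually got */
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &wanted, sizeof(wanted));
        getsockopt(s, SOL_SOCKET, SO_RCVBUF, &granted, &len);
        printf("requested %d bytes, granted %d bytes\n", wanted, granted);
        return 0;
    }

If the granted value is much smaller than what you requested, the
losses you see in r and s are easy to explain.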

Now, to the point of performance degradation: again, I am not sure this
is the only reason (due to the incomplete spmonitor reports), but any
reliable multicast protocol will have some overhead as the number of
receiving machines increases. In the case of Spread, we optimized for
the possibility of many senders and for running close to the potential
speed of the network (which requires more accurate flow control). That
is one of the design goals of Spread (research-wise, it is easier to
build a one-sender, many-receivers protocol, and we did not focus on
that). At least some of the degradation you see is a consequence of
that.
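
A toy model for intuition (this is just arithmetic, not a measurement
of Spread itself): the sender can only advance its window as fast as
the slowest receiver allows. If each receiver is independently "slow"
some fraction p of the time, the chance that at least one of n
receivers is slow at any given moment is 1 - (1 - p)^n. With p = 5%
that is 5% for one receiver but already about 40% for ten, so
aggregate throughput degrades even though every individual machine
looks fine.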

Just for perspective: your base number is 30Mbits/sec with 2 machines,
and you go down to about half of that with 10 machines. Assuming a
100Mbits/sec LAN, I recall measuring over 60Mbits/sec (some people in
my lab claimed closer to 80Mbits/sec); I don't remember the parameters
exactly. But 30Mbits/sec down to 16Mbits/sec is probably less than half
of what you should be getting.
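
As a quick sanity check, your arithmetic is consistent: 10240 bytes x
100000 messages is roughly 1 GByte, and 10^9 bytes x 8 bits / 260
seconds is about 30.8Mbits/sec, matching the 30.7Mbits/sec you report.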

Vanilla Spread (what you download from spread.org) is tuned so that
very few people need to contact us. Therefore, it is fairly conservative
in its flow control and membership parameters, and people rarely complain.
You can tune it for better performance by changing the two flow control
numbers using spmonitor. After you find the numbers you like for your
system, you can build them into the source. The main reason these are
not tunable from the config file is that less than 0.5% of the people
who download Spread would really care, and the potential for errors in
setting these numbers would increase both the number of unhappy people
and the number of people having trouble setting up Spread.
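
Roughly, the tuning procedure is (this is from memory, so please
verify against the menu your spmonitor actually prints): run spmonitor
against your live configuration, use its flow control entries to
define new values for the two parameters (the window and the personal
window, if I remember the names correctly), send them to the running
daemons, rerun spflooder, and iterate. Once you converge on values you
like, hard-code them in the daemon source and rebuild.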

I figure that the few people who have a performance problem are also
tackling an interesting problem :) and we
(either CNDS or Spread Concepts or both, depending on the problem and the need)
would love to talk with them about it.

      Happy new year!

      :) Yair.    http://www.cs.jhu.edu/~yairamir

Martin> Hello,

Martin> In our scenario, we use one sender and 11 receivers. We have a multicast
Martin> address defined in our switch and the spread daemon configured with a
Martin> single Spread_Segment.

Martin> 1) test with ./r and ./s
Martin> ========================

Martin> These tests showed that we have packet misses between 5 and 10%. The
Martin> misses seem to be random - neither host-dependent nor switch-module
Martin> dependent.

Martin> 2) test with spflooder
Martin> ======================

Martin> As mentioned earlier in this user-group, setting the kernel tuneables of
Martin> Solaris 8 to

Martin> ndd -set /dev/udp udp_recv_hiwat 32768
Martin> ndd -set /dev/udp udp_xmit_hiwat 32768

Martin> resulted in zero retrans, which is hard to believe.

Martin> flooder: completed multicast of 10000 messages, 1000 bytes each.

Martin> Status at host1 V 3.17. 0 (state 1, gstate 1) after 105 seconds :
Martin> sent pack:       6      recv pack :   10047     retrans    :       0
Martin> u retrans:       0      s retrans :       0     b retrans  :       0
Martin> ============================
Martin> Status at host2 V 3.17. 0 (state 1, gstate 1) after 102 seconds :
Martin> sent pack:       6      recv pack :   10047     retrans    :       0
Martin> u retrans:       0      s retrans :       0     b retrans  :       0

Martin> ....

Martin> However, what really troubles us is that if we increase the number of
Martin> receivers, the time needed to transmit data increases.

Martin> With one sender and one receiver, we can transmit 1 GB of data using
Martin> spflooder (time ./bin/spflooder -ro -s 4804 -b 10240 -m 100000) in
Martin> about 260 seconds, which equals 30.7 MBit/s.

Martin> If we increase the number of receivers to 10, the throughput drops to 16.9
Martin> MBit/s.

Martin> Receiver hosts          Transfer time           Throughput
Martin> 1                       260 sec                 30.7 MBit/s
Martin> 2                       276 sec                 29.0 MBit/s
Martin> 3                       295 sec                 27.1 MBit/s
Martin> 4                       315 sec                 25.4 MBit/s
Martin> 5                       326 sec                 24.5 MBit/s
Martin> 6                       351 sec                 22.8 MBit/s
Martin> 7                       364 sec                 22.0 MBit/s
Martin> 8                       449 sec                 17.8 MBit/s
Martin> 9                       380 sec                 21.1 MBit/s
Martin> 10                      474 sec                 16.9 MBit/s

Martin> The big question is how the throughput will develop if we use 20 or 30
Martin> receiver hosts.

Martin> Can you explain this throughput degradation? Can you explain why ./r and
Martin> ./s report misses, but ./spmonitor doesn't?

Martin> Best regards,

Martin> Martin



Martin> _______________________________________________
Martin> Spread-users mailing list
Martin> Spread-users at lists.spread.org
Martin> http://lists.spread.org/mailman/listinfo/spread-users




