[Spread-users] is sprecv generating an incorrect report?
John Schultz
jschultz at spreadconcepts.com
Thu Mar 13 15:08:57 EDT 2008
I looked at r.c's code and I'm not quite sure I understand what it is
trying to do. Maybe Jonathan or Yair can comment on my analysis below.
> --> count is 1, i is 0, missed 1 total missed 1, corrupt 0
> --> count is 30, i is 10, missed 20 total missed 21, corrupt 0
> --> count is 52, i is 39, missed 13 total missed 34, corrupt 0
> --> count is 65, i is 61, missed 4 total missed 38, corrupt 0
> --> count is 78, i is 74, missed 4 total missed 42, corrupt 0
> --> count is 88, i is 87, missed 1 total missed 43, corrupt 0
> --> count is 4152, i is 4151, missed 1 total missed 44, corrupt 0
This part is pretty obvious and makes sense. Basically, when the receiver
gets a packet # that is AHEAD of where the receiver is, it counts the ones
it didn't get as "missed." It sums each "missed" into "total missed."
These numbers and their relationship make sense together.
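As a rough illustration of that accounting (this is not r.c's actual code; the function and variable names are made up for the example), the jump-ahead case can be modelled in a few lines of Python and replayed against the numbers quoted above:

```python
def account_for_gaps(observations):
    """Replay (count, i) pairs, where count is the packet # just received
    and i is the packet # the receiver expected next (count > i is a gap)."""
    total_missed = 0
    rows = []
    for count, expected in observations:
        missed = count - expected      # packets skipped by jumping ahead
        total_missed += missed         # running sum, i.e. "total missed"
        rows.append((count, expected, missed, total_missed))
    return rows

# The (count, i) pairs taken from the quoted output.
quoted = [(1, 0), (30, 10), (52, 39), (65, 61), (78, 74), (88, 87), (4152, 4151)]

for count, expected, missed, total in account_for_gaps(quoted):
    print(f"count is {count}, i is {expected}, missed {missed} total missed {total}")
```

The replay reproduces the "missed" and "total missed" columns of the quoted lines, ending at a total of 44.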
> -------
> Report: total packets at least 4152, total missed 44, total corrupted 0
> Initiating count from 4152 to 4151
> -------
This kind of report occurs when the receiver gets a packet # that is
BEHIND where the receiver is. The thing I don't understand, and maybe it
is a (reporting) bug, is that in this case the code sets "total missed"
equal to the packet #.
The code could be trying to handle the scenario where the last packet of a
flood (which triggers a final report) is lost and a different flood is
started. The code tries to reset and process the new flood separately
while accounting for any messages missed in the new flood.
However, this heuristic doesn't work as intended if packets arrive
out of order within a single flood, which seems to be what is
happening in your runs. If all of this reporting occurred within a
single run, which I assume it did, then we need to ignore the reported
"total missed" and simply sum the individual "missed" values to get the
correct total. This will still double count some losses, though, because
after a reset the code counts the same gap again when it jumps ahead in
the sequence.
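The manual summing can be scripted. The sketch below assumes the receiver's output has been saved as text and that the lines look like the ones quoted in this thread; the regular expression and the function name are my own and are not part of Spread:

```python
import re

# Matches lines such as
# "--> count is 30, i is 10, missed 20 total missed 21, corrupt 0"
MISSED_LINE = re.compile(r"count is (\d+), i is (\d+), missed (\d+) total missed (\d+)")

def sum_missed(log_text):
    """Sum the per-line "missed" values, ignoring the printed "total missed"."""
    return sum(int(m.group(3)) for m in MISSED_LINE.finditer(log_text))

sample = """\
--> count is 88, i is 87, missed 1 total missed 43, corrupt 0
--> count is 4152, i is 4151, missed 1 total missed 44, corrupt 0
-------
Report: total packets at least 4152, total missed 44, total corrupted 0
Initiating count from 4152 to 4151
-------
"""

print(sum_missed(sample))  # 2
```

Like the manual sum, the result is only an estimate, since gaps counted again after a reset are included twice.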
When I do that manually for your output I get about 140 losses out of
10000 packets. So, your losses aren't nearly as bad as spsend + sprecv
are portraying them. So long as a run doesn't have one of these reports:
> -------
> Report: total packets at least 4152, total missed 44, total corrupted 0
> Initiating count from 4152 to 4151
> -------
Then you can trust its reporting of "total missed." If it does have such
a report then you need to manually sum the "missed" values to get an
estimate of the losses. These reports do indicate, though, that your
network is reordering packets. That isn't a "sin" but it is unusual
behavior in a LAN.
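That rule of thumb can be written down as a small check. This is only a sketch: it assumes the output format quoted in this thread, the function name is hypothetical, and the line after the reset in the second sample is made up purely to exercise the code:

```python
import re

# Captures the per-line "missed" and the running "total missed".
MISSED = re.compile(r"missed (\d+) total missed (\d+)")

def estimate_losses(log_text):
    """Return (losses, trusted).

    With no reset report in the log, the last printed "total missed" is
    trusted as-is. Otherwise the per-line "missed" values are summed and
    the result is flagged as an estimate (trusted == False)."""
    rows = [(int(m), int(t)) for m, t in MISSED.findall(log_text)]
    if not rows:
        return 0, True
    if "Initiating count from" in log_text:
        return sum(m for m, _ in rows), False
    return rows[-1][1], True

clean = (
    "--> count is 1, i is 0, missed 1 total missed 1, corrupt 0\n"
    "--> count is 30, i is 10, missed 20 total missed 21, corrupt 0\n"
)
# Hypothetical continuation: a reset report followed by an invented line.
with_reset = clean + (
    "Initiating count from 30 to 29\n"
    "--> count is 40, i is 35, missed 5 total missed 5, corrupt 0\n"
)

print(estimate_losses(clean))       # (21, True)
print(estimate_losses(with_reset))  # (26, False)
```

A False flag in the result also serves as a hint that packets were reordered during the run.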
Spread should certainly be able to handle such situations and should never
become completely non-responsive, but your network is acting strangely for
a LAN. One question to answer would be why your LAN is reordering packets
and losing them at a rate of roughly 1%.
Cheers!
---
John Schultz
Spread Concepts
Phn: 443 838 2200