[Spread-users] is sprecv generating an incorrect report?

John Schultz jschultz at spreadconcepts.com
Thu Mar 13 15:08:57 EDT 2008


I looked at r.c's code and I'm not quite sure I understand what it is 
trying to do.  Maybe Jonathan or Yair can comment on my analysis below.

> --> count is 1, i is 0, missed 1 total missed 1, corrupt 0
> --> count is 30, i is 10, missed 20 total missed 21, corrupt 0
> --> count is 52, i is 39, missed 13 total missed 34, corrupt 0
> --> count is 65, i is 61, missed 4 total missed 38, corrupt 0
> --> count is 78, i is 74, missed 4 total missed 42, corrupt 0
> --> count is 88, i is 87, missed 1 total missed 43, corrupt 0
> --> count is 4152, i is 4151, missed 1 total missed 44, corrupt 0

This part is pretty obvious and makes sense.  Basically, when the receiver 
gets a packet # that is AHEAD of where the receiver is, it counts the ones 
it didn't get as "missed."  It sums each "missed" into "total missed." 
These numbers and their relationship make sense together.
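
As a rough illustration of that bookkeeping, here is a minimal Python sketch. It is not r.c's actual code, and the function and variable names are made up for the example. It assumes "i" is the next packet # the receiver expects and "count" is the packet # that actually arrived, which is consistent with the numbers in the quoted lines.

```python
def count_forward_gaps(packet_numbers):
    """Model only the AHEAD case: count packets skipped by each forward jump.

    Returns (reports, total_missed), where each report is the tuple
    (count, i, missed, total_missed) like one "-->" line of the output.
    """
    expected = 0        # "i": next packet # the receiver expects (assumed meaning)
    total_missed = 0
    reports = []
    for count in packet_numbers:
        if count < expected:
            # The BEHIND case is deliberately not modelled in this sketch.
            continue
        if count > expected:
            missed = count - expected
            total_missed += missed
            reports.append((count, expected, missed, total_missed))
        expected = count + 1
    return reports, total_missed


# Hypothetical arrival sequence chosen to reproduce the first four quoted lines.
arrivals = list(range(1, 10)) + list(range(30, 39)) + list(range(52, 61)) + [65]
reports, total = count_forward_gaps(arrivals)
for count, i, missed, running in reports:
    print(f"count is {count}, i is {i}, missed {missed} total missed {running}")
```

With that made-up arrival sequence the sketch prints the same count/i/missed/total values as the first four lines quoted above.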

> -------
> Report: total packets at least 4152, total missed 44, total corrupted 0
> Initiating count from 4152 to 4151
> -------

This kind of report occurs when the receiver gets a packet # that is 
BEHIND where the receiver is.  The thing I don't understand, and maybe it 
is a (reporting) bug, is that in this case the code sets "total missed" 
equal to the packet #.

The code could be trying to handle the scenario where the last packet of a 
flood (which triggers a final report) is lost and a different flood is 
started.  The code tries to reset and process the new flood seperately 
while accounting for any messages missed in the new flood.

However, this heuristic doesn't work as intended if packets arrive 
out-of-order within a single flood, which seems to be what is 
happening in your runs.  If all of this reporting occurred within a 
single run, which I assume it did, then we need to simply sum the "missed" 
values to get the correct "total missed" and ignore the reported "total 
missed."  This will double count some losses though: after the code 
resets, it counts some packets as missed a second time when it jumps 
ahead in the sequence again.
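
To make the effect concrete, here is a small Python sketch of the heuristic as I described it above. The reset behaviour (setting "total missed" to the packet # and restarting the count from there) is my reading of the output, not a copy of r.c, and all names are hypothetical.

```python
def simulate_receiver(packet_numbers):
    """Return (reported_total_missed, summed_missed) for an arrival sequence.

    reported_total_missed follows the assumed reset heuristic;
    summed_missed is the plain sum of every per-gap "missed" value.
    """
    expected = 0          # next packet # the receiver expects
    total_missed = 0      # what the receiver would report as "total missed"
    summed_missed = 0     # manual sum of "missed", never reset
    for count in packet_numbers:
        if count > expected:
            # AHEAD: everything between expected and count is counted as missed.
            missed = count - expected
            total_missed += missed
            summed_missed += missed
        elif count < expected:
            # BEHIND: assumed reset, "total missed" becomes the packet #.
            total_missed = count
        expected = count + 1
    return total_missed, summed_missed


# Ten packets, none lost, but packets 5 and 6 arrive swapped.
print(simulate_receiver([0, 1, 2, 3, 4, 6, 5, 7, 8, 9]))  # -> (6, 2)
```

In this toy run nothing was actually lost, yet the reported total is 6 and the manual sum is 2: packet 5 is counted as missed when 6 arrives early, and packet 6 is counted again when the receiver jumps ahead from 5 to 7. So the manual sum is an upper-bound style estimate, but it stays far closer to reality than the reported total.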

When I do that manually for your output I get about 140 losses out of 
10000 packets.  So, your losses aren't nearly as bad as spsend + sprecv 
are portraying them.  So long as a run doesn't have one of these reports:

> -------
> Report: total packets at least 4152, total missed 44, total corrupted 0
> Initiating count from 4152 to 4151
> -------

Then you can trust its reporting of "total missed."  If it does have such 
a report then you need to manually sum the "missed" values to get an 
estimate of the losses.  These reports do indicate though that your 
network is reordering packets.  That isn't a "sin" but it is unusual 
behavior in a LAN.
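
If you don't want to do the sum by hand, something like the following Python sketch would do it. It assumes the output lines look exactly like the ones quoted in this thread; the function name and the sample text are just for illustration.

```python
import re

# Matches the per-gap lines, e.g.
# "--> count is 30, i is 10, missed 20 total missed 21, corrupt 0"
GAP_LINE = re.compile(r"count is \d+, i is \d+, missed (\d+) total missed \d+")


def sum_missed(log_text):
    """Return (sum of per-gap "missed" values, number of reset reports)."""
    summed = 0
    resets = 0
    for line in log_text.splitlines():
        match = GAP_LINE.search(line)
        if match:
            summed += int(match.group(1))
        elif "Initiating count from" in line:
            resets += 1   # a reset means the reported "total missed" is unreliable
    return summed, resets


SAMPLE = """\
--> count is 1, i is 0, missed 1 total missed 1, corrupt 0
--> count is 30, i is 10, missed 20 total missed 21, corrupt 0
--> count is 52, i is 39, missed 13 total missed 34, corrupt 0
--> count is 65, i is 61, missed 4 total missed 38, corrupt 0
--> count is 78, i is 74, missed 4 total missed 42, corrupt 0
--> count is 88, i is 87, missed 1 total missed 43, corrupt 0
--> count is 4152, i is 4151, missed 1 total missed 44, corrupt 0
-------
Report: total packets at least 4152, total missed 44, total corrupted 0
Initiating count from 4152 to 4151
-------
"""

print(sum_missed(SAMPLE))  # -> (44, 1)
```

A non-zero reset count is the signal that the run contains one of the reports above, in which case the summed value is the number to look at rather than the last reported "total missed."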

Spread should certainly be able to handle such situations and should never 
become completely non-responsive, but your network is acting strangely for 
a LAN.  One question to answer would be why your LAN is reordering 
packets and losing them at a ~1% rate.

Cheers!

---
John Schultz
Spread Concepts
Phn: 443 838 2200



