[Spread-users] high delay of messages

Wed Feb 1 03:34:55 EST 2006

On Tue, 2006-01-31 at 20:32, Cristina Nita-Rotaru wrote:
> We use Spread in a local area network for an
> application that is relatively bandwidth intensive
> in a many-to-many setting. However we noticed
> a huge delay in delivery of the packets (up
> to 94 seconds sometimes). We are sending about
> 20-30 Mbits/sec. All communication is FIFO. Platform
> is Windows, configuration is IP Multicast. In general
> the servers recover after a while and packets are
> delivered within acceptable delays, then again some
> big delay is noticeable.
> 
> Did anybody else experience this issue?
> We suspect that this may be related with the
> way Spread implements flow control.
> 
> We have several questions:
> 
> 1) What is the difference between the
> Window and Personal_Window variables
> used in the flow control protocol in
> flow_control.c.
> 
> 2) How were their values chosen (Window
> is 60 and Personal_Window is 15).
> 

If I recall, the window is how many messages may be sent per token
rotation and the personal window is how many messages may be sent by the
token holder at each possession of the token.  The global window rotates
around the ring, so I doubt you are having flow control problems, unless
it is related to the socket interface in Windows.
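
A minimal sketch of the two limits as described above, not of Spread's
actual flow_control.c.  The function name and the backlog list are
hypothetical; the defaults 60 and 15 are the values quoted in the
question.  It assumes the global budget is simply consumed in ring order
during one rotation.

```python
def sends_in_one_rotation(backlog, window=60, personal_window=15):
    """Return how many messages each ring member sends in one token rotation.

    backlog[i] is the number of messages member i has queued.
    window caps the total sent by the whole ring per rotation;
    personal_window caps what one member sends per token possession.
    """
    remaining = window              # budget shared by the whole ring
    sent = []
    for queued in backlog:          # the token visits members in ring order
        allowed = min(queued, personal_window, remaining)
        sent.append(allowed)
        remaining -= allowed
    return sent


# Five busy members: the first four use 15 each, the global window of 60
# is then exhausted and the fifth must wait for the next rotation.
print(sends_in_one_rotation([40, 40, 40, 40, 40]))  # [15, 15, 15, 15, 0]
```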

Another likely cause of the problem is overflow of the socket transmit
buffers on Windows.  If these multicasts overflow the socket transmit
buffer, messages will be lost and the recovery algorithm will have to
recover them.  But if the socket message queues are constantly being
overfilled, the protocol will make only slow forward progress under
heavy load, and long delays in delivery could be seen.  We more or less
solved this problem in totem in openais by "braking" the protocol when
the window between lost messages became large, i.e. if there was a
missing message 256 messages ago, we disallow sending new messages and
only allow sending of recovery multicasts.  I'm not sure if Spread does
this, but we found that if the sort queue used to store the messages to
be retransmitted was small, this queue could be overflowed in practical
settings.  The braking formally prevents this from happening.
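
The braking rule can be sketched roughly as follows.  This is an
illustration of the idea only, not the openais code: the names
may_send_new and BRAKE_GAP are hypothetical, and only the gap of 256
comes from the description above.

```python
BRAKE_GAP = 256  # gap quoted in the text; the constant name is hypothetical


def may_send_new(highest_seq_seen, oldest_missing_seq, brake_gap=BRAKE_GAP):
    """Return True if the token holder may send new messages.

    oldest_missing_seq is the lowest sequence number still awaiting
    retransmission, or None when nothing is missing.  Once the gap
    reaches brake_gap, only recovery multicasts should be sent.
    """
    if oldest_missing_seq is None:
        return True
    return highest_seq_seen - oldest_missing_seq < brake_gap


print(may_send_new(1000, None))  # True: nothing to recover
print(may_send_new(1000, 900))   # True: gap of 100 is tolerated
print(may_send_new(1000, 744))   # False: gap of 256, brake
```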

Another likely cause is overflow of the receive buffers at each node,
resulting in lost messages and activation of the recovery of messages
via the retransmit list.

Buffer overflow is the main cause of lost messages in ring protocols,
since Ethernet loses perhaps 1 out of 1 million UDP packets when traffic
is properly limited to the bandwidth of the Ethernet medium.

In Linux, it is possible to set the transmit buffer to 256 KB or larger
(so a full personal window of 15 frames of about 1400 bytes each fits in
the transmit buffer).  Take a look at the totem protocol in openais to
see how these I/O buffer sizes are set.
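
As a rough illustration, the standard socket options for this are
SO_SNDBUF and SO_RCVBUF.  The sketch below uses Python's socket module
rather than the C code in openais, and the function name is
hypothetical.  The kernel may clamp the request (on Linux the limits are
net.core.wmem_max and net.core.rmem_max) and Linux reports back double
the value it stored, so it is worth reading the granted size back.

```python
import socket


def open_udp_socket(bufsize=256 * 1024):
    """Create a UDP socket and request bufsize bytes for both buffers.

    Returns the socket plus the send and receive sizes the kernel
    actually granted, which may differ from the request.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted_snd, granted_rcv


sock, snd, rcv = open_udp_socket()
print("send buffer:", snd, "receive buffer:", rcv)
sock.close()
```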

I don't know how Windows handles its transmit and receive buffers.

Spread has other flow control characteristics, related to its multi-ring
capabilities, that I am unaware of; the Spread maintainers may know
more.

I would check for recovered messages.  Their presence indicates that
flow control is not working on the target (Windows), or that your
Ethernet medium is broken or overloaded independently of the flow
control mechanisms in the ring protocol.

In our implementation of totem in openais we choose the number of
messages that may be transmitted by the token holder (personal window)
by dividing the I/O buffer queue size of the socket by the MTU of the
packets.  Spread, as I recall, supports only a 1500-byte MTU, but totem
in openais supports jumbo frame sizes, so the window must be calculated
dynamically.  Then we flush the inbound queues on receipt of the token.
This ensures the receive buffers at each node are never overflowed
(overflow being a main cause of message loss).  I'm not sure if Spread
does this, but it is recommended if it doesn't, as flow control problems
will emerge under load.
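
The window arithmetic above amounts to an integer division.  A minimal
sketch, with a hypothetical function name and a 256 KB buffer assumed
for the example:

```python
def personal_window(socket_buffer_bytes, mtu_bytes):
    """Number of full-size packets that fit in the socket buffer."""
    if mtu_bytes <= 0:
        raise ValueError("mtu_bytes must be positive")
    return socket_buffer_bytes // mtu_bytes


buffer_bytes = 256 * 1024
print(personal_window(buffer_bytes, 1500))  # 174 packets at standard MTU
print(personal_window(buffer_bytes, 9000))  # 29 packets with jumbo frames
```

With jumbo frames the same buffer holds far fewer packets, which is why
a fixed window does not work once the MTU is configurable.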

Another possible problem is scheduling starvation, i.e. the Spread
daemon is running at a lower priority than other processes and is
scheduled only rarely.  In an ideal world, the protocol would run at the
highest round robin priority level, above standard user processes.  I
don't know if Windows has a way to specify this, but in Linux we use the
highest priority level and schedule at round robin through the
sched_setscheduler syscall.  We found this to be a serious problem in
telecommunications applications, which would create priority inversions
between the totem protocol in openais and telecommunications
applications that generally preset their scheduling priorities above
that of the standard Linux scheduler.
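
A minimal sketch of that scheduling request, using Python's os wrappers
around the same Linux syscall rather than the C code in openais.  The
function name is hypothetical.  The call normally needs root or
CAP_SYS_NICE, so the sketch reports failure instead of raising.

```python
import os


def request_round_robin_priority():
    """Try to move this process to the top SCHED_RR priority.

    Returns True on success, False if the platform lacks the call or
    the process is not permitted to change its scheduling policy.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False                      # not available on this platform
    top = os.sched_get_priority_max(os.SCHED_RR)
    try:
        # pid 0 means "the calling process"
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(top))
    except OSError:
        return False                      # typically insufficient privilege
    return True


if __name__ == "__main__":
    print("round robin scheduling granted:", request_round_robin_priority())
```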

One final problem - your switch may not properly duplicate multicast
messages across all ports and instead drop many messages.  This can be
tested by using a hub on a private network since hubs always send
multicast messages to all nodes without store/forwarding.
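
One way to compare the switch against a hub is to have a single sender
multicast sequence-numbered datagrams and let each receiver count what
never arrived.  The sketch below covers only the counting step; the
function name and the numbering scheme starting at 0 are assumptions.

```python
def count_lost(received_seqs, sent_count):
    """Return how many of the sequence numbers 0..sent_count-1 never arrived.

    Duplicates and out-of-range numbers in received_seqs are ignored.
    """
    seen = {s for s in received_seqs if 0 <= s < sent_count}
    return sent_count - len(seen)


# Six datagrams sent (0..5); 2 and 5 never arrived, 3 arrived twice.
print(count_lost([0, 1, 3, 3, 4], 6))  # 2
```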

All things to look at.

Regards
-steve
> 
> thank you for your help,
> -- Cristina