[Spread-users] sporadic latencies with SP_receive

Mon Jan 9 16:04:49 EST 2012

Johannes + everybody,

I've looked into this report (thanks for the demonstration app!) and have figured out both complaints.

The first complaint is that a receiver can sometimes see latencies of up to 100ms from SP_receive even when lots of traffic has been sent to it.

This happens because of the way Spread currently deals with writes to client sockets that block.  When a write to a client socket blocks, Spread schedules a Badger_timeout (by default 100ms) callback to try writing to the socket again in the future.  If, in the meantime, another message is queued on the connection, then it proactively tries to send immediately.  What this means is that if the daemon has a backlog of messages to send to a receiver, its TCP socket blocks, and no more traffic is subsequently queued to the connection, then you will see up to a 100ms pause until the daemon tries to send to it again.  That retry can quickly fill the TCP buffers and block again, especially with very large messages like in the example, which schedules another 100ms timeout.  So, you can trigger exactly the kind of stop-and-go performance that you reported.

One simple way to mitigate this issue is to reduce Badger_timeout (top of session.c) significantly (e.g., by a factor of 10 to 100).  The drawback is the overhead of "polling" truly blocked sockets every Badger_timeout.

The "better way" to mitigate this issue is to have Spread monitor the writability of client sockets and push out traffic as the sockets become able to accept it.  We need to think about whether doing this would have any adverse effects versus the way Spread currently handles client I/O.
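To make the writability idea concrete, here is a minimal sketch in plain POSIX C.  It is not Spread's code: struct out_queue, queue_flush and queue_flush_when_writable are hypothetical names invented for illustration, and a real daemon would register the socket with its existing event loop rather than call poll() on a single descriptor.  The point is only that the retry is driven by the socket reporting POLLOUT instead of by a fixed timer.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-client output queue; NOT Spread's session structure. */
struct out_queue {
    const char *data;  /* bytes owed to the client */
    size_t      len;   /* total number of bytes in data */
    size_t      sent;  /* bytes already written to the socket */
};

/* Put fd into non-blocking mode.  Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Write as much as the socket accepts right now.
 * Returns 1 if the queue is drained, 0 if the socket would block,
 * -1 on any other error. */
static int queue_flush(int fd, struct out_queue *q)
{
    assert(q != NULL);
    while (q->sent < q->len) {
        ssize_t n = send(fd, q->data + q->sent, q->len - q->sent, 0);
        if (n > 0) {
            q->sent += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;
        return -1;
    }
    return 1;
}

/* Wait up to timeout_ms for fd to become writable, then flush again.
 * Returns the same codes as queue_flush; 0 also covers a poll timeout. */
static int queue_flush_when_writable(int fd, struct out_queue *q, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int rc = poll(&p, 1, timeout_ms);

    if (rc < 0)
        return errno == EINTR ? 0 : -1;
    if (rc == 0)
        return 0;  /* still blocked, nothing was written */
    if (p.revents & (POLLERR | POLLHUP | POLLNVAL))
        return -1;
    return queue_flush(fd, q);
}
```

With this shape, a blocked connection costs nothing until the kernel reports it writable, and the backlog resumes as soon as the receiver drains its side rather than at the next timer tick.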

The second complaint was that you got a stack corruption bug when you passed a privateGroup array of only MAX_PRIVATE_NAME characters.  This is expected, as the privateGroup array must be MAX_GROUP_NAME characters long.
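A small stand-alone illustration of why the shorter array overflows.  The constants below are assumed to mirror sp.h (please check your installed header), and compose_private_group is a hypothetical stand-in that only mimics the "#private_name#daemon_name" shape of a private group; it is not the library's code.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed to match sp.h; verify against your installed header. */
#define MAX_PRIVATE_NAME 10
#define MAX_PROC_NAME    20
#define MAX_GROUP_NAME   32

/* Hypothetical helper: builds a name shaped like a private group,
 * "#<private_name>#<daemon_name>".  Returns the length the full name
 * needs (excluding the terminating NUL), as snprintf does. */
static int compose_private_group(char *out, size_t out_size,
                                 const char *private_name,
                                 const char *daemon_name)
{
    return snprintf(out, out_size, "#%s#%s", private_name, daemon_name);
}
```

For example, the private name "sender" on a daemon named "localhost" needs 17 characters plus the terminator, which already exceeds MAX_PRIVATE_NAME but fits in a char privateGroup[MAX_GROUP_NAME] buffer.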

Sorry it took so long to get around to diagnosing these issues.

Cheers!

-----
John Lane Schultz
Spread Concepts LLC
Phn: 301 830 8100
Cell: 443 838 2200

On Jul 25, 2011, at 7:47 AM, Johannes Wienke wrote:

Dear all,

we encountered some latency issues in our applications using Spread.
Today we tried to isolate the problem and came up with a test program
that demonstrates the behavior.

Generally, the observation is that in a threaded setup, using local
communication, and a small sleep between calls to SP_receive, these
receive calls sometimes take up to 100 ms, e.g. generating this log:

receive took 55 us
receive took 68 us
receive took 97071 us
receive took 54 us
receive took 67 us
receive took 97060 us
receive took 54 us
receive took 68 us
receive took 97086 us
receive took 56 us
receive took 69 us
receive took 97091 us
receive took 56 us
receive took 67 us
receive took 97071 us

The attached program produces exactly this output. Please note that this
only happens if the sleep call in line 108 is present.

We have also verified that this is not related to the architecture we
are running on (Linux 32 and 64 bit), nevertheless we got a stack
corruption on 32 bit in the sender thread with a privateGroup array only
MAX_PRIVATE_NAME characters long. Thus the increased size. Is this also
a known problem?

We would be happy to get some insights or fixes in how to prevent this
issue. In a real application the sleep is usually not required to
trigger the problem as the receiving thread is still doing other things
in its loop.

Regards,
Johannes
<spreadtest.cpp>