[Spread-users] Java SpreadConnection SoTimeout bug
sotola at gmail.com
Wed Feb 9 16:50:01 EST 2011
The 100 ms SpreadConnection socket timeout (in the Java interface) causes problems when latency is introduced into a network and/or messages are very large.
The error occurs when the delay between chunks of a fragmented Spread message is greater than 100 ms. When this happens, one of the four socket reads in SpreadConnection's internal_receive() method throws an InterruptedIOException. The next time internal_receive() is called, it incorrectly assumes that the next few bytes are the start of a new header, not the remainder of the previous message.
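To illustrate the failure mode, here is a minimal sketch of a read loop that resumes at the same offset after a timeout instead of dropping the partially read message. This is not the code in SpreadConnection.java; the class name ResumableRead, the readFully helper and the maxTimeouts parameter are all made up for the example, and the "laggy" stream only simulates a socket whose read timed out once.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.util.Arrays;

public class ResumableRead {

    /**
     * Fills buf completely, resuming at the same offset after a read timeout
     * rather than abandoning the partially read message.
     * maxTimeouts (hypothetical) is the number of consecutive timeouts tolerated.
     */
    public static void readFully(InputStream in, byte[] buf, int maxTimeouts) throws IOException {
        int off = 0;
        int timeouts = 0;
        while (off < buf.length) {
            try {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) {
                    throw new EOFException("stream closed after " + off + " bytes");
                }
                off += n;
                timeouts = 0; // progress was made, so reset the counter
            } catch (InterruptedIOException e) {
                // Keep any bytes the interrupted read reports as delivered.
                off += e.bytesTransferred;
                if (++timeouts > maxTimeouts) {
                    // The stream is now mid-message; the caller must not parse
                    // the next bytes as a fresh header.
                    throw new IOException("gave up mid-message at offset " + off, e);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        final byte[] payload = {1, 2, 3, 4, 5, 6, 7, 8};
        // Simulated socket stream: at most 3 bytes per read, one timeout on the second call.
        InputStream laggy = new InputStream() {
            private final ByteArrayInputStream src = new ByteArrayInputStream(payload);
            private int calls = 0;

            @Override
            public int read() throws IOException {
                byte[] one = new byte[1];
                int n = read(one, 0, 1);
                return n < 0 ? -1 : one[0] & 0xff;
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                calls++;
                if (calls == 2) {
                    throw new InterruptedIOException("simulated timeout");
                }
                return src.read(b, off, Math.min(len, 3));
            }
        };
        byte[] buf = new byte[payload.length];
        readFully(laggy, buf, 5);
        System.out.println(Arrays.equals(buf, payload)); // prints true
    }
}
```

The point is only that the offset survives the exception. If the retry budget runs out, the connection should be treated as broken, since the stream position no longer lines up with a header boundary.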
Parsing this garbage header results in a NegativeArraySizeException (if the multiplication overflows) or an OutOfMemoryError (if numGroups is sufficiently large) at line 1197 in SpreadConnection.java (shown below), and/or "Illegal Message: Message Dropped" Spread exceptions (if numGroups or datalen is < 0, line 1119).
byte[] buffer = new byte[numGroups * MAX_GROUP_NAME];
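For anyone wondering how that one line can fail in two different ways, here is a small sketch of the arithmetic plus a defensive size check. It is not taken from the Spread source: the value 32 for MAX_GROUP_NAME is my assumption, and groupBufferSize and its maxGroups limit are hypothetical names for the example.

```java
public class GroupBufferSize {
    // Assumption: 32 stands in for Spread's MAX_GROUP_NAME; adjust if the library differs.
    public static final int MAX_GROUP_NAME = 32;

    /**
     * Returns the buffer size for numGroups group names, or throws if the
     * header value cannot be genuine. maxGroups is a hypothetical caller-chosen cap.
     */
    public static int groupBufferSize(int numGroups, int maxGroups) {
        if (numGroups < 0 || numGroups > maxGroups) {
            throw new IllegalArgumentException("implausible numGroups: " + numGroups);
        }
        // Multiply in long so an overflow is detected instead of wrapping negative.
        long size = (long) numGroups * MAX_GROUP_NAME;
        if (size > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("group buffer too large: " + size);
        }
        return (int) size;
    }

    public static void main(String[] args) {
        int garbage = 0x04000000; // 2^26, the kind of value a misread header can yield
        // 2^26 * 32 = 2^31, which wraps to a negative int.
        System.out.println(garbage * MAX_GROUP_NAME); // prints -2147483648
        System.out.println(groupBufferSize(3, 1000)); // prints 96
        try {
            groupBufferSize(garbage, 1000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A check like this only turns the crash into a clean error. It does not fix the underlying misalignment, because the bytes being parsed are still not a real header.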
This behavior is consistent with quite a few other threads, but I *think* they all came to the wrong conclusion/solution:
I was able to trigger the bug by introducing lag with tc/netem on the Spread daemon's host machine (I'm running a Java client that implements the BasicMessageListener interface on a different machine):
tc qdisc add dev eth0 root netem delay 200ms 10ms distribution normal
(To remove the lag: tc qdisc del dev eth0 root)
My temporary solution is to increase the timeout to 5 seconds (SpreadConnection.java, line 1637):
I'm no network guru, but my quick fix doesn't smell right (I think we'll just run into the problem again on a really laggy network). Is the timeout even necessary? Is there any reason why the timeout was set to 100 ms? Currently, I'm leaning towards a configurable timeout option.
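A rough sketch of what a configurable timeout could look like, assuming it is read from a system property and falls back to the current 100 ms. The property name spread.socket.timeout.ms and the TimeoutConfig class are invented for this example and are not part of the Spread Java API; Socket.setSoTimeout is the standard java.net call.

```java
import java.net.Socket;
import java.net.SocketException;

public class TimeoutConfig {
    // Hypothetical property name, not something Spread defines.
    public static final String TIMEOUT_PROPERTY = "spread.socket.timeout.ms";
    // The 100 ms value discussed in this thread.
    public static final int DEFAULT_TIMEOUT_MS = 100;

    /** Returns the configured timeout, or the default if unset, negative or malformed. */
    public static int resolveTimeoutMs() {
        String raw = System.getProperty(TIMEOUT_PROPERTY);
        if (raw == null) {
            return DEFAULT_TIMEOUT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // setSoTimeout rejects negative values; 0 means "block forever".
            return value >= 0 ? value : DEFAULT_TIMEOUT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT_MS;
        }
    }

    /** Applies the resolved timeout to a socket. */
    public static void apply(Socket socket) throws SocketException {
        socket.setSoTimeout(resolveTimeoutMs());
    }

    public static void main(String[] args) {
        System.setProperty(TIMEOUT_PROPERTY, "5000");
        System.out.println(resolveTimeoutMs()); // prints 5000
    }
}
```

With something like this, a client on a laggy network could be started with -Dspread.socket.timeout.ms=5000. A larger timeout only makes the bug less likely, though; the receive path would still need to cope with a timeout in the middle of a message.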