[Spread-users] Spread Reliability
gordan at bobich.net
Tue Mar 27 09:44:53 EDT 2007
I am seeing very odd things happening with Spread, at least when used
through the Perl library (in addition to the downright dangerous signal
behaviour artefacts I mentioned in my previous email).
When running under a very heavy load, in a single machine / single daemon
case (so the daemon should only be passing messages internally, and thus
suffer no network losses between daemons), messages disappear. Literally.
They get sent with a service type that should be reliable (e.g.
RELIABLE_MESS or SAFE_MESS), and yet out of 100,000 messages sent, when I
start flooding, I can catch at most half, usually nearer a third.
This could be a problem in the spread library that the Perl library is
linked against, rather than the daemon itself. What happens is that
Spread::receive() will suddenly start returning an empty data set. Once
this happens, stopping the sender/flooder will not have any effect - the
Spread::receive() call will keep returning nothing immediately (rather
than blocking for a message).
The only way to recover from the situation is to call Spread::disconnect
and then re-connect. In a flooding situation (i.e. stress testing), this
happens approximately 20-30 times in the 60 seconds or so that it takes to
send 100,000 test messages. That means that each disconnect/reconnect cycle
costs about 1,000 messages sent while the listener is reconnecting, plus about
1,000 queued messages that are lost when the listener disconnects. Multiplied
by 20-30 cycles, this roughly accounts for the 50,000-60,000 messages that get
lost out of 100,000. Message size was
arbitrarily picked to be about 10,000 bytes. With 1400 byte messages (i.e.
such that they fit into a single UDP packet), the messages do seem to get
reliably delivered without silent connection breakages.
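As a rough sanity check of the loss estimate above, the arithmetic can be written out. This is only a back-of-envelope sketch in Python: the two 1,000-message figures are the approximate values quoted above, and the function name is made up for illustration.

```python
def estimated_loss(cycles, missed_per_cycle=1000, queued_per_cycle=1000):
    """Messages lost if each silent break costs the messages sent while the
    receiver is reconnecting plus the messages queued for it at disconnect."""
    return cycles * (missed_per_cycle + queued_per_cycle)


for cycles in (20, 25, 30):
    print(cycles, "breaks ->", estimated_loss(cycles), "of 100,000 messages lost")
```

With these round figures, 25-30 breaks give the 50,000-60,000 lost messages observed, while 20 breaks would give 40,000.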
Now, reliability of 30-50% for RELIABLE_MESS/SAFE_MESS 10KB messages seems
pretty appalling, especially since it seems to happen completely silently.
Nothing indicates that messages got lost. It seems even worse that the
receiving connection silently and permanently breaks and ends up requiring
a re-connect to the daemon to start working again.
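Since nothing signals the loss, one way to at least detect it at the application level would be to put a sequence number in each test message and check for gaps on the receiving side. A minimal sketch in Python follows. It is independent of Spread itself: the class name and the per-sender counter embedded in the payload are assumptions for illustration, not something Spread provides.

```python
class GapDetector:
    """Track per-sender sequence numbers and count missing messages."""

    def __init__(self):
        self.next_expected = {}  # sender -> next sequence number expected
        self.lost = 0            # running total of messages never seen

    def observe(self, sender, seq):
        """Record one received message; return how many were skipped before it."""
        # The first message from a sender sets the baseline, so losses
        # before it cannot be detected.
        expected = self.next_expected.get(sender, seq)
        gap = max(0, seq - expected)
        self.lost += gap
        # Late or duplicate messages (seq < expected) do not move the counter back.
        self.next_expected[sender] = max(expected, seq + 1)
        return gap


detector = GapDetector()
for seq in (0, 1, 2, 5, 6, 9):
    skipped = detector.observe("flooder", seq)
    if skipped:
        print(f"lost {skipped} message(s) before #{seq}")
print("total lost:", detector.lost)
```

This does not prevent the loss or the broken connection. It only makes the gaps visible, so that the receiver can tell a quiet sender apart from dropped messages.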
Can anybody suggest a workaround, a fix or an alternative library to
use? At the moment, Spread doesn't look like it is really usable with
all the silent failures. :-(