[Spread-users] Spread Reliability

Gordan Bobic gordan at bobich.net
Tue Mar 27 09:44:53 EDT 2007


I am seeing very odd things happening with Spread, at least when used 
through the Perl library (in addition to the downright dangerous signal 
behaviour artefacts I mentioned in my previous email).

When running under a very heavy load, in a single machine / single daemon  
case (so the daemon should only be passing messages internally, and thus 
suffer no network losses between daemons), messages disappear. Literally. 
They get sent with type/service level that should be reliable (e.g. 
RELIABLE_MESS or SAFE_MESS), and yet out of 100,000 messages sent, when I 
start flooding, I can catch at most half, usually nearer a third.

This could be a problem in the spread library that the Perl library is 
linked against, rather than the daemon itself. What happens is that 
Spread::receive() will suddenly start returning an empty data set. Once 
this happens, stopping the sender/flooder will not have any effect - the 
Spread::receive() call will keep returning nothing immediately (rather 
than blocking for a message).

The only way to recover from the situation is to call Spread::disconnect 
and then re-connect. In a flooding situation (i.e. stress testing), this 
happens approximately 20-30 times in the 60 seconds or so that it takes to 
send 100,000 test messages. That means that the disconnect/reconnect cycle 
lasts about 1,000 messages, plus the 1,000 messages that are queued that 
get lost when the listener disconnects. This, times 20-30, accounts for 
the 50,000-60,000 messages that get lost out of 100,000. Message size was 
arbitrarily picked to be about 10,000 bytes. With 1400 byte messages (i.e. 
such that they fit into a single UDP packet), the messages do seem to get 
reliably delivered without silent connection breakages.
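
For anyone who wants to check the arithmetic above, here is a rough 
sketch in Python. The function names are made up for illustration, and 
the figures are the approximate ones quoted in this message (about 1,000 
messages missed per reconnect cycle, about 1,000 queued messages dropped, 
roughly 1400 bytes of payload per UDP packet). Nothing here is measured 
from Spread itself.

```python
import math

# Hypothetical helper: estimated losses for a given number of reconnect cycles.
def estimated_loss(cycles, missed_per_cycle=1_000, queued_dropped=1_000):
    """Each cycle loses the messages sent while reconnecting plus the queued ones."""
    return cycles * (missed_per_cycle + queued_dropped)

# Hypothetical helper: how many packets a message needs, assuming a fixed
# usable payload per UDP packet (1400 bytes is the figure used in the post).
def packets_needed(message_bytes, payload_per_packet=1_400):
    return math.ceil(message_bytes / payload_per_packet)

if __name__ == "__main__":
    for cycles in (20, 25, 30):
        lost = estimated_loss(cycles)
        print(f"{cycles} reconnects: about {lost:,} of 100,000 messages lost")
    print("packets for a 10,000 byte message:", packets_needed(10_000))
    print("packets for a 1,400 byte message:", packets_needed(1_400))
```

By this estimate, 25-30 reconnects give the 50,000-60,000 lost messages 
quoted above, while 20 reconnects would give about 40,000. It also shows 
that a 10,000 byte message needs several packets under the 1400 byte 
assumption, whereas a 1400 byte message needs only one.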

Now, reliability of 30-50% for RELIABLE_MESS/SAFE_MESS 10KB messages seems 
pretty appalling, especially since it seems to happen completely silently. 
Nothing indicates that messages got lost. It seems even worse that the 
receiving connection silently and permanently breaks and ends up requiring 
a re-connect to the daemon to start working again.
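
One possible way to at least make the loss visible would be for the 
sender to put a sequence number in each message and for the receiver to 
count the gaps. The following is only a sketch of that idea in Python, not 
something Spread or its Perl library provides. The class and method names 
are hypothetical, and it assumes that messages from a given sender arrive 
in the order they were sent. Without that ordering, a late message would 
be miscounted as a gap.

```python
class GapDetector:
    """Counts messages that never arrived, based on per-sender sequence numbers."""

    def __init__(self):
        self.next_expected = {}  # sender name -> next sequence number expected
        self.missing = 0         # running total of sequence numbers skipped

    def observe(self, sender, seq):
        # The first message seen from a sender sets the baseline.
        expected = self.next_expected.get(sender, seq)
        if seq > expected:
            # Everything between expected and seq was never delivered.
            self.missing += seq - expected
        # Never move the expectation backwards on a duplicate or late message.
        self.next_expected[sender] = max(expected, seq + 1)
        return self.missing


if __name__ == "__main__":
    detector = GapDetector()
    for seq in [0, 1, 2, 5, 6]:  # sequence numbers 3 and 4 were lost
        detector.observe("flooder", seq)
    print("missing so far:", detector.missing)
```

This does not stop the receiving connection from breaking, but it would 
turn a silent loss into a countable one.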

Can anybody suggest a workaround, a fix or an alternative library to 
use? At the moment, Spread doesn't look like it is really usable with 
all the silent failures. :-(

Gordan