[Spread-users] Spread Reliability

Wed Mar 28 10:55:14 EDT 2007

On Wed, 28 Mar 2007, Jim Vickroy wrote:

> >>>> You describe reading
> >>>> successfully up to a certain point and then always getting nothing
> >>>> back.  This is quite a bit more specific than "losing messages" in
> >>>> that it seems like you lost the stream.
> >>>>         
> >>> Indeed, but it still strikes me as a pretty poor design. Dropping a
> >>> client that can't keep up isn't exactly a "reliable" messaging
> >>> feature. It is particularly poor that it does this even with
> >>> UNRELIABLE_MESS messages. Surely, at least those should be dropped
> >>> when the queue overflows, rather than disconnecting the client?
> >>>       
> >> This is Spread's design.  Having a slow reading client slow down all
> >> operations to a crawl isn't usable either.  So, Spread places the
> >> responsibility of flow control on the user.  It has nothing to do
> >> with poor design.  Disconnecting slow clients so that other clients
> >> can continue to operate reliably is intended behaviour, so the design
> >> meets those intentions quite well.
> >>     
> >
> > Not really. In the case where there is only one reader, dropping that 
> > reader because the writers are overwhelming it isn't a side-effect I can 
> > imagine being useful or desirable.
> >   
> What is to prevent other subscribers (i.e., "readers") from joining the 
> group at any time?  Are you asking for a Spread configuration option to 
> essentially provide one-way communication to a single subscriber?

Nothing at all, but the problem remains - a single very fast sender will
overwhelm all the receivers equally, and all of them will be struggling to
keep up if the machines have the same spec, which is reasonably likely in
a real environment.

This means that there would need to be a sideband/auxiliary method of
establishing the performance of the receivers, and then applying rate
limiting on the sender(s). This is something that one might expect the
message queue (and Spread does try to carry out the job of a
publish/subscribe message queue) to handle a little more gracefully.
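
To illustrate the sender-side half of that, here is a minimal token-bucket
sketch in Python. It assumes the permitted rate has already been agreed by
some out-of-band means, which is the hard part and is not shown. Nothing
here is part of the Spread API; TokenBucket, rate, burst and try_send are
hypothetical names used only for illustration.

```python
import time


class TokenBucket:
    """Allow at most `rate` sends per second, with bursts of up to `burst`.

    Hypothetical helper, not part of Spread. The clock is injectable so the
    behaviour can be checked without sleeping.
    """

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens that can accumulate
        self.clock = clock
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def try_send(self):
        """Return True if a message may be sent now, False otherwise."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A sender would call try_send() before each multicast and back off (or
discard the message) when it returns False.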

The existing method seems to offer little advantage in terms of
reliability under high load compared to simple multicast UDP flooding. At
least in that case you'd only lose the messages the listener is too slow
to catch, as opposed to having to waste CPU time (which is already likely
to be in short supply if the server is so overwhelmed that it's starting
to drop packets) on re-connecting, and throwing away all the messages in
the meantime. Not to mention that the Spread daemon isn't all that cheap
in terms of CPU time, which also won't help the listener keep up when it
is already falling behind.
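
The behaviour I would have expected for unreliable messages is roughly the
following: a bounded buffer that discards the oldest entry on overflow and
keeps the reader connected. This is only a sketch of the idea in Python,
not how Spread works; LossyBuffer and its methods are made-up names.

```python
from collections import deque


class LossyBuffer:
    """Bounded buffer that drops the oldest message when full.

    Hypothetical illustration only; it is not taken from Spread.
    """

    def __init__(self, capacity):
        # A deque with maxlen evicts from the left on append when full.
        self.items = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, msg):
        if len(self.items) == self.items.maxlen:
            self.dropped += 1  # the append below discards one message
        self.items.append(msg)

    def pop(self):
        """Return the oldest buffered message, or None if empty."""
        return self.items.popleft() if self.items else None
```

A slow reader then loses only the messages it could not keep up with, and
the dropped counter tells it how many that was.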

> > It also means that if there is a single flooder flooding the network 
> > faster than any of the readers can read, e.g. due to a bug in the sending 
> > code somewhere, or even just having a sending machine that is so much 
> > faster than the others that it is generating messages faster than the 
> > listeners can process them, all the listeners get dropped. By the basic 
> > rule of concatenation being easier/faster than parsing/splitting, for most 
> > processes, readers are likely to be at a disadvantage.
> >
> >   
> That is why subscribers should only receive/queue messages for 
> processing elsewhere (perhaps in another thread).  My view is that it is 
> generally less robust to have subscribers performing message receiving 
> AND message processing.

So, you are proposing a solution where there is yet another queueing layer
on top of the queueing that Spread does? That will add more overhead for
one. You might still come out ahead once you have 4 CPUs in the server,
but that isn't exactly standard hardware. Of course, a thread-safe library
would help with that (it would remove the need for a separate secondary
queue), but sadly, the one available for Perl crashes pretty solidly when
you try to use it after connecting and forking. Either way, I cannot see
how that would help things keep up on a single-CPU server that is CPU
bound when processing the messages - it would just add more process
switching overhead. :-(
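
For reference, the arrangement being proposed looks roughly like the
sketch below (Python rather than Perl, purely for illustration). The
receive and process callables are hypothetical stand-ins: receive() would
wrap the blocking Spread read and return None when the stream ends, and
process() is the slow per-message work. Note that the secondary queue only
moves the backlog; it does not create any CPU capacity.

```python
import queue
import threading


def run_pipeline(receive, process, maxsize=1000):
    """Read messages on the calling thread, process them on a worker.

    receive(): returns the next message, or None at end of stream
               (hypothetical stand-in for a blocking receive call).
    process(msg): the slow per-message work.
    Returns the number of messages dropped because the backlog was full.
    """
    backlog = queue.Queue(maxsize=maxsize)
    dropped = 0

    def worker():
        while True:
            msg = backlog.get()
            if msg is None:          # sentinel: reader has finished
                return
            process(msg)

    t = threading.Thread(target=worker)
    t.start()

    while True:
        msg = receive()
        if msg is None:
            break
        try:
            backlog.put_nowait(msg)  # never block the reader
        except queue.Full:
            dropped += 1             # shed load locally instead

    backlog.put(None)                # blocking put so the sentinel arrives
    t.join()
    return dropped
```

With one CPU that is already saturated by process(), the backlog simply
fills and the drop count climbs, which is the point made above.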

Gordan