[Spread-users] Delays in receiving messages

Tue Jan 25 17:11:37 EST 2011

Hi Yair,

Thanks for the quick response.

> To me it seems you have some loss on your network and possibly a control
> message (either a token or a hurry request) is lost for some reason.

How are the hurry requests sent?  Is it UDP on port 5333 in this
case, or on 5333+1 = 5334?

From my understanding, if the token is lost the spread daemon on
host_b will, after the Hurry_timeout, submit a hurry request to the
leader (host_a's spread daemon) and continue to buffer the messages
until the token comes around again.  Is this correct?

If so, and if it were the hurry request that was getting lost, would
we not then see the larger Token timeout (or some other timeout) as
the delay time?  Is there another loss scenario that would coincide
with delays of up to Hurry_timeout?
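
To make the two scenarios in this question concrete, here is a minimal
sketch of the worst-case delay each would produce, assuming the
understanding described above is correct (that is exactly what is being
asked, so treat it as unconfirmed).  The timeout values are placeholders
and not Spread's defaults, and worst_case_delay is a hypothetical
helper, not anything in the Spread source.

```python
# Placeholder values in seconds; NOT taken from Spread's configuration.
HURRY_TIMEOUT = 2.0
TOKEN_TIMEOUT = 5.0


def worst_case_delay(token_lost: bool, hurry_lost: bool) -> float:
    """Upper bound on how long a buffered message waits.

    Encodes the (unconfirmed) model from the text: a lost token is
    recovered via a hurry request after Hurry_timeout, and a lost hurry
    request is only recovered after the larger Token timeout.
    Network round-trip time is ignored.
    """
    if not token_lost:
        return 0.0  # token arrives normally, nothing to wait for
    if not hurry_lost:
        return HURRY_TIMEOUT  # hurry request brings the token back
    return TOKEN_TIMEOUT  # nothing recovers until the larger timeout


for token_lost, hurry_lost in [(False, False), (True, False), (True, True)]:
    print(token_lost, hurry_lost, worst_case_delay(token_lost, hurry_lost))
```

Under this model, delays that top out at Hurry_timeout would point to
the token being lost rather than the hurry request.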

This is with Spread 4.1.0 on Solaris hosts.  I'm not sure whether
that makes any difference, though there was earlier discussion of what
sounded like a similar problem on FreeBSD 8 that wasn't experienced on
Linux.

> Now, 10 messages per second for Spread is equivalent to an inactive network
> for most of the time (it would take a millisecond or so to propagate the
> one message and then there are 100ms with quiet).

I tried raising this to 500 per second, and the delays are still
seen.  What would be considered an "active" network in this sense (that
is, what controls when the slow-down feature takes effect)?
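
For scale, the arithmetic behind the quoted estimate can be sketched as
follows, assuming roughly 1 ms of network activity per message (the
"millisecond or so" quoted above) and evenly spaced messages.
idle_fraction is an illustrative helper, not part of Spread.

```python
def idle_fraction(msgs_per_sec: float, busy_ms_per_msg: float = 1.0) -> float:
    """Fraction of time the network is quiet, assuming evenly spaced
    messages that each occupy the network for busy_ms_per_msg."""
    period_ms = 1000.0 / msgs_per_sec  # time between message starts
    return max(0.0, 1.0 - busy_ms_per_msg / period_ms)


for rate in (10, 500):
    print(f"{rate} msg/s: {idle_fraction(rate):.0%} idle")
```

Under these assumptions the network is about 99% idle at 10 messages
per second and still about 50% idle at 500 per second.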

> The question is why the control message is lost occasionally.

It's a good question.  These are all hosts on the same local network,
connected at either 100 Mbps or GigE.

> Changing the hurry timer is actually a fine solution if it solves your
> problem. You can actually eliminate the problem altogether if you
> void the slow-down feature, but that will have a price of the token
> rotating even without new messages.

That's an interesting option... how do I void the slow-down feature?
Do I just set the Hurry_timeout to 0 secs, 0 usecs, or is there
another method?

It might be reasonable in this case to just always have the token go
around, as the servers are all local and if they are busy, then the
spread network will also be similarly busy.  If non-hurried token
passing is just burning idle IO or cpu, then that's pretty cheap
insurance to make sure messages don't get delayed.

> If you tell more about what your goal is, perhaps better comments can
> be made.

The goal is to get the random (large) latency down (or ideally
eliminated), while keeping CPU and IO usage to a reasonable level even
in very busy times.  Turning off the slow-down feature seems like a
good way of achieving that (once I know how).

I was wondering if reconfiguring the network might be more beneficial
as well.  Eventually the plan is to have 3 hosts on the outside of a
firewall (but on the same network segment), and two hosts inside the
firewall (again, sharing a network segment).  For now, each host is
its own Spread segment.  Would it be better to have a single spread
daemon running on each network segment, or perhaps combine the various
hosts on a network segment into a spread_segment?

How is the leader elected in that case?  Is it, again, the first host
listed in the Spread segment?
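
For reference, the two layouts being compared could be written out
roughly as below.  This is only a sketch: the host names and the
192.0.2.x / 198.51.100.x addresses are placeholders, the broadcast
addresses are assumed, and port 5333 is the one mentioned earlier.  The
script just prints spread.conf-style Spread_Segment stanzas so the
difference is visible side by side.

```python
# Placeholder hosts: three outside the firewall, two inside.
OUTSIDE = {"out1": "192.0.2.1", "out2": "192.0.2.2", "out3": "192.0.2.3"}
INSIDE = {"in1": "198.51.100.1", "in2": "198.51.100.2"}
PORT = 5333


def segment(address, hosts):
    """Render one Spread_Segment stanza for the given hosts."""
    body = "\n".join(f"    {name} {ip}" for name, ip in hosts.items())
    return f"Spread_Segment {address}:{PORT} {{\n{body}\n}}"


all_hosts = {**OUTSIDE, **INSIDE}

# Layout 1 (current setup): every host is its own Spread segment.
per_host = "\n\n".join(segment(ip, {name: ip}) for name, ip in all_hosts.items())

# Layout 2: hosts sharing a network segment share one Spread segment,
# addressed here by an assumed broadcast address.
grouped = "\n\n".join([
    segment("192.0.2.255", OUTSIDE),
    segment("198.51.100.255", INSIDE),
])

print("# one segment per host\n" + per_host)
print("\n# one segment per network segment\n" + grouped)
```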

Thanks again for your help,

K.