[Spread-users] More on message freezes

John Schultz jschultz at spreadconcepts.com
Tue Jun 19 12:21:21 EDT 2007


It seems to me that your configuration is losing token packets for some 
reason.

Spread uses a token ring to communicate its control traffic between the 
daemons.  The control traffic flows around the ring in one direction, 
defined by the top-down order of the daemon listing in your configuration 
file.  If a token packet is lost, then the system will appear to freeze 
until the hurry timeout regenerates the token.  A daemon sends its (user) 
data traffic before it forwards the token onto the next daemon in the 
ring.
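
To make the mechanism concrete, here is a toy Python model of the ring.  
It is only a sketch, not Spread's actual code: the function name, the 1 ms 
hop time and the idea of naming lost hops by index are all assumptions 
made for illustration.  The 2 second timeout matches the freezes reported 
in the quoted message.

```python
def simulate_ring(daemons, rotations, lost_hops=(), hurry_timeout=2.0,
                  hop_time=0.001):
    """Toy model of a token circulating a ring of daemons.

    daemons   -- names in ring order (top-down order of the config file)
    lost_hops -- 0-based hop indices, counted across all rotations, at
                 which the token packet is dropped (hypothetical input)
    Returns (elapsed_seconds, events).
    """
    lost = set(lost_hops)
    elapsed = 0.0
    events = []
    hop = 0
    for _ in range(rotations):
        for i, name in enumerate(daemons):
            nxt = daemons[(i + 1) % len(daemons)]
            # A daemon sends its data first, then forwards the token.
            events.append(f"{name} sends data, then token -> {nxt}")
            if hop in lost:
                # Nothing moves until the hurry timeout fires.
                events.append(f"token lost; ring stalls {hurry_timeout}s")
                elapsed += hurry_timeout
            # Assumption: the (re)sent token then arrives normally.
            elapsed += hop_time
            hop += 1
    return elapsed, events


if __name__ == "__main__":
    elapsed, events = simulate_ring(["d1", "d2", "d3"], 2, lost_hops={4})
    print("\n".join(events))
    print(f"elapsed: {elapsed:.3f}s")
```

In this model a single lost token dominates the total time, which is why 
even rare token loss is visible as a freeze.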

It doesn't necessarily take high loss rates to trigger a token loss.  
Either you get unlucky in a busy network (in which case freezes are 
relatively rare), or there is a systematic bias in your network towards 
losing token packets.  We have seen the latter scenario caused by routers 
and switches where a daemon's data traffic burst causes the router/switch 
to drop the tail end of that burst.  This causes problems because the tail 
end of the burst contains the token!
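
A minimal sketch of why tail drop is biased against the token, in Python.  
It assumes a simplified switch queue that starts empty, holds a fixed 
number of packets and does not drain during the burst.  The function 
names and the capacity of 12 are hypothetical, not measured values.

```python
def forward_burst(data_packets, queue_capacity):
    """Return the packets a tail-drop queue forwards from one burst.

    The burst is the daemon's data packets followed by the token.
    Assumes the queue starts empty and does not drain during the burst.
    """
    burst = [f"data{i}" for i in range(data_packets)] + ["token"]
    # Tail drop: everything past the queue capacity is discarded.
    return burst[:queue_capacity]


def token_survives(data_packets, queue_capacity):
    """True if the token at the end of the burst makes it through."""
    return "token" in forward_burst(data_packets, queue_capacity)


if __name__ == "__main__":
    for burst_window in (15, 10, 5):
        print(burst_window, token_survives(burst_window, queue_capacity=12))
```

Because the token is always the last packet, any burst that overflows the 
queue loses the token first, while a shorter burst gets through intact.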

My hunch is that you have a router or switch that is cutting short one of 
your daemons' traffic bursts, which drops the token.  I suspect this 
because you only see freezes when certain daemons send.  My guess is that, 
when you see freezes, the sending daemon has to go across a router/switch 
to get the token to the next daemon in the ring, and that router/switch is 
occasionally cutting short its bursts.

One way to test this hypothesis is to gradually lower the size of your 
daemons' (data) burst windows and see if the freezes go away.  You can 
lower the daemons' burst windows interactively using spmonitor.  In 
spmonitor's shell, execute command #5 (flow control), fill in the burst 
window sizes you want, and then send the updated flow control parameters 
to the daemons by executing command #7.  By default, the global window 
size is 60 (packets) and the individual burst window size is 15 
(packets).

For starters, keep the global window at 60 and lower the daemons' burst 
window sizes to 5.  This will raise overhead, because the ratio of data 
to control traffic decreases, but it may eliminate the freezes you are 
seeing.
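
As a rough worked example of that overhead, the Python sketch below 
assumes every token visit carries a full burst of data packets followed 
by one token packet.  Real traffic will rarely fill every burst, so treat 
the numbers as a best case; the function name is made up for 
illustration.

```python
from fractions import Fraction


def token_fraction(burst_window):
    """Fraction of packets that are tokens when every burst is full.

    Assumes each token visit sends exactly burst_window data packets
    followed by one token packet.
    """
    return Fraction(1, burst_window + 1)


if __name__ == "__main__":
    for burst_window in (15, 5):
        share = token_fraction(burst_window)
        print(f"burst window {burst_window}: "
              f"{burst_window} data packets per token, "
              f"token share {share}")
```

Under that assumption, going from a burst window of 15 to 5 takes the 
data-to-token ratio from 15:1 to 5:1.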

If this is the problem, then an alternate solution is to reconfigure the 
troublesome switch/router to allow larger bursts from sending machines.

Cheers!

---
John Schultz
Spread Concepts
Phn: 443 838 2200

On Tue, 19 Jun 2007, Doug Palmer wrote:

> I've been experimenting more, with the help of some lower-level
> networking experts, trying to find the cause of the odd freezes that
> we've been seeing.
>
> If I have four spread daemons running, with only two systems actually
> producing spread traffic, then I will get smooth communication in one
> direction and freezes in the other direction. (All freezes are 2
> seconds; we've experimented with reducing the hurry timeout and the
> freezes reduce accordingly.)
>
> If I pull the network cable out of the non-traffic systems, then
> communication works fine both ways. If I then connect either system,
> then the freeze starts re-appearing.
>
> If I configure the systems so that my spread segment only contains the
> two machines, with the other systems plugged in, then everything works
> correctly in both directions.
>
> If I introduce a third spread daemon into the mix, then one
> sender1->receiver pair works correctly, but the other sender2->receiver
> pair freezes.
>
> If I try to split the two systems up into two different segments on
> different port numbers, then things work OK for a while, but I
> eventually see freezes start to appear and things become unstable.
>
> I don't think that it's possible to blame a network problem for this,
> particularly as everything else seems to work perfectly correctly. I've
> had some network experts look at the network packet flow and they seem
> satisfied with the results. To my untutored eye, it looks like
> introducing a third spread daemon into the mix, when it's not being
> directly used, causes it to hold onto things for a while.
>
> Doug



