[Spread-users] More on message freezes
jschultz at spreadconcepts.com
Tue Jun 19 12:21:21 EDT 2007
It seems to me that your configuration is losing token packets for some reason.
Spread uses a token ring to communicate its control traffic between the
daemons. The control traffic flows around the ring in one direction,
defined by the top-down order of the daemon listing in your configuration
file. If a token packet is lost, then the system will appear to freeze
until the hurry timeout regenerates the token. A daemon sends its (user)
data traffic before it forwards the token onto the next daemon in the ring.
It doesn't necessarily take high loss rates to trigger a token loss. You
can either get unlucky (freezes are relatively rare) in a busy network or
there can be a systematic bias in your network towards losing token
packets. We have seen the latter scenario caused by routers and switches
where a daemon's data traffic burst will cause the router/switch to drop
the tail end of that burst. This causes problems because the tail end of
the burst contains the token!
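To make the tail-drop scenario above concrete, here is a minimal sketch in Python. It is a toy model, not Spread code: `switch_buffer` is a hypothetical number of packets the switch accepts from one burst before it starts dropping, and the function names are made up for illustration.

```python
def send_burst(burst_window, switch_buffer):
    """Return the packets that get through a tail-dropping switch.

    The daemon emits `burst_window` data packets followed by one token.
    `switch_buffer` is a hypothetical per-burst capacity of the switch,
    not a Spread parameter.
    """
    packets = ["data"] * burst_window + ["token"]
    # Everything past the switch's capacity is dropped (tail drop).
    return packets[:switch_buffer]


def token_survives(burst_window, switch_buffer):
    """True if the token at the end of the burst makes it through."""
    return "token" in send_burst(burst_window, switch_buffer)


if __name__ == "__main__":
    for window in (15, 10, 5):
        print(window, token_survives(window, switch_buffer=12))
```

In this model the token is always the first packet to be lost once the burst exceeds what the switch accepts, which is why even a low overall loss rate can freeze the ring.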
My hunch is that you have a router or switch that is cutting short one of
your daemon's traffic bursts, which drops the token. I think this is the
problem because you only see the problem when certain daemons send. My
guess is that when you see freezes the daemon that is sending has to go
across a router/switch to get the token to the next daemon in the ring and
that router/switch is occasionally cutting short its bursts.
One way to test this hypothesis is to gradually lower the size of your
daemons' (data) burst windows and see if the freezes go away. You can
lower the daemons' burst windows interactively using spmonitor. In
spmonitor's shell, do this by executing command #5 (flow control), fill in
the burst window sizes you want and then send the updated flow control
parameters to the daemons by executing command #7. By default, the global
window size is 60 (packets) and the individual burst window size is 15 (packets).
For starters, keep the global window at 60 and lower the daemons' burst
window sizes down to 5. This will cause higher overhead as the ratio of
data to control traffic will decrease, but it may eliminate the freezes
you are seeing.
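As a rough illustration of that overhead trade-off, the sketch below computes what fraction of the packets a daemon sends in one turn is the token. It assumes a saturated sender that fills its whole burst window every time it holds the token, and it counts packets only, ignoring their sizes; `control_fraction` is a made-up helper, not part of Spread.

```python
def control_fraction(burst_window):
    """Fraction of packets per turn that is control traffic (the token).

    Assumes the daemon sends exactly `burst_window` data packets and
    then forwards one token packet.
    """
    return 1 / (burst_window + 1)


if __name__ == "__main__":
    for window in (15, 5):
        share = control_fraction(window)
        print(f"burst window {window}: token share per turn = {share:.3f}")
```

Under these assumptions the token's share of the traffic goes up as the burst window shrinks, which is the higher overhead mentioned above.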
If this is the problem, then an alternate solution is to reconfigure the
troublesome switch/router to allow larger bursts from sending machines.
On Tue, 19 Jun 2007, Doug Palmer wrote:
> I've been experimenting more, with the help of some lower-level
> networking experts, trying to find the cause of the odd freezes that
> we've been seeing.
> If I have four spread daemons running, with only two systems actually
> producing spread traffic, then I will get smooth communication in one
> direction and freezes in the other direction. (All freezes are 2
> seconds; we've experimented with reducing the hurry timeout and the
> freezes reduce accordingly.)
> If I pull the network cable out of the non-traffic systems, then
> communication works fine both ways. If I then connect either system,
> then the freeze starts re-appearing.
> If I configure the systems so that my spread segment only contains the
> two machines, with the other systems plugged in, then everything works
> correctly in both directions.
> If I introduce a third spread daemon into the mix, then one
> sender1->receiver pair works correctly, but the other sender2->receiver
> pair freezes.
> If I try to split the two systems up into two different segments on
> different port numbers, then things work OK for a while, but I
> eventually see freezes start to appear and things become unstable.
> I don't think that it's possible to blame a network problem for this,
> particularly as everything else seems to work perfectly correctly. I've
> had some network experts look at the network packet flow and they seem
> satisfied with the results. To my untutored eye, it looks like
> introducing a third spread daemon into the mix, when it's not being
> directly used, causes it to hold onto things for a while.