[Spread-users] More on message freezes

Tue Jun 19 09:03:12 EDT 2007

Doug Palmer wrote:
> I've been experimenting more, with the help of some lower-level
> networking experts, trying to find the cause of the odd freezes that
> we've been seeing.
> 
> If I have four spread daemons running, with only two systems actually
> producing spread traffic, then I will get smooth communication in one
> direction and freezes in the other direction. (All freezes are 2
> seconds; we've experimented with reducing the hurry timeout and the
> freezes reduce accordingly.)
> 
> If I pull the network cable out of the non-traffic systems, then
> communication works fine both ways. If I then connect either system,
> then the freeze starts re-appearing.
> 
> If I configure the systems so that my spread segment only contains the
> two machines, with the other systems plugged in, then everything works
> correctly in both directions.
> 
> If I introduce a third spread daemon into the mix, then one
> sender1->receiver pair works correctly, but the other sender2->receiver
> pair freezes.
> 
> If I try to split the two systems up into two different segments on
> different port numbers, then things work OK for a while, but I
> eventually see freezes start to appear and things become unstable.
> 
> I don't think that it's possible to blame a network problem for this,
> particularly as everything else seems to work perfectly correctly. I've
> had some network experts look at the network packet flow and they seem
> satisfied with the results. To my untutored eye, it looks like
> introducing a third spread daemon into the mix, when it's not being
> directly used, causes it to hold onto things for a while.

There could be a few things going on.  If one of the machines is under heavy
load from processes that also run at high priority, the spread daemon may not
get any CPU time to execute, and I imagine you would see behavior like you
describe (i.e. traffic in one direction has very low latency, the other very
high).
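
As a rough way to check the load theory on each machine, something like the
following could be run next to the daemon.  This is only a sketch: the
function names and the 1.0 threshold are my own assumptions, and load average
says nothing about process priority, so treat a "True" as a hint to look
closer (e.g. with top), not as proof.

```python
import os


def load_per_cpu(load_1min: float, cpu_count: int) -> float:
    """Normalize the 1-minute load average by the number of CPUs."""
    return load_1min / max(cpu_count, 1)


def looks_starved(load_1min: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """Return True when there are more runnable tasks than CPUs.

    The threshold of 1.0 is an assumed rule of thumb, not a Spread setting.
    """
    return load_per_cpu(load_1min, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print("load per CPU:", round(load_per_cpu(load_1min, cpus), 2))
    print("possibly starved:", looks_starved(load_1min, cpus))
```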

You might double check that the broadcast and/or multicast addresses are
set up correctly on each machine (depending on which you're using).
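
A quick sanity check of the address can be sketched with Python's standard
ipaddress module.  The addresses below are made-up examples, not values from
your configuration, and check_segment_address is a hypothetical helper: it
only tells you whether the address in spread.conf is a multicast address or
matches the broadcast address implied by a machine's interface and netmask.

```python
import ipaddress


def expected_broadcast(interface_cidr: str) -> str:
    """Broadcast address implied by an interface given as 'addr/prefix'."""
    return str(ipaddress.ip_interface(interface_cidr).network.broadcast_address)


def check_segment_address(segment_addr: str, interface_cidr: str) -> str:
    """Classify the segment address configured for Spread.

    Returns "multicast" for a multicast address, "broadcast ok" if it equals
    the interface's broadcast address, and "mismatch" otherwise.
    """
    addr = ipaddress.ip_address(segment_addr)
    if addr.is_multicast:
        return "multicast"
    if str(addr) == expected_broadcast(interface_cidr):
        return "broadcast ok"
    return "mismatch"


if __name__ == "__main__":
    # Hypothetical values: substitute the segment address from spread.conf
    # and the address/prefix reported by ifconfig on each machine.
    print(check_segment_address("192.168.1.255", "192.168.1.10/24"))
    print(check_segment_address("192.168.0.255", "192.168.1.10/24"))
```

Running this on each machine with its own interface address should give the
same answer everywhere; a "mismatch" on one box would point at a netmask or
broadcast setting that differs from the others.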

Finally, try raising the debug output level to see what's going on with
respect to membership messages.  If you notice that a certain node keeps
getting kicked out, then rejoining, and so on, you can be reasonably sure that
node is the problem.  For example, try
"DebugFlags =  { PRINT EXIT STATUS FLOW_CONTROL MEMBERSHIP }" in your
spread.conf.

HTH,
Matt