[Spread-users] Spread hanging in logging application

Yair Amir yairamir at cnds.jhu.edu
Thu Apr 4 13:42:47 EST 2002

Hi Tom,

I think that, with medium to high likelihood, I have already figured out the
hanging problem and why only very few installations experience it.
If I am correct, it will be very simple to fix, but the fix will break
backward compatibility. Therefore, it will only go into the next major
version. Once the release of 3.16.2 is done and the situation has
stabilized, I thought we could issue a test daemon for interested people
to deploy and see whether that was indeed the problem. That test daemon will be
able to report whether it detected such a problem and to overcome it.

Here are some details about what I think happens:

All of the monitor reports we got were showing one of two symptoms:

1. All of the daemons had all of the messages, but each of them thought
that at least one daemon was missing a certain message, say the one with
sequence x. Once x became too far from the current sequence, no new messages
were allowed to be generated until everyone could proceed. The result is that
the token continues to rotate, no new messages are generated, all old messages
have been delivered, and the system appears to be stuck, although it works at
full speed doing nothing.

2. The second symptom was probably similar to the first, although it
happened during a state transfer that occurred immediately after a membership
change, synchronizing the daemons' groups data structure. That synchronization
process could not terminate, although all state messages were probably
delivered. As a result, Spread works at full speed doing nothing.
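
To illustrate the first symptom, here is a minimal sketch of a sender that
stops generating messages once the next sequence number is too far ahead of
the ARU carried on the token. The names (may_send, simulate) and the window
of 1000 are assumptions made for illustration only; they are not taken from
the Spread source.

```python
WINDOW = 1000  # hypothetical limit, not Spread's actual value


def may_send(next_seq: int, token_aru: int, window: int = WINDOW) -> bool:
    """Allow a new message only while it stays within `window` of the ARU."""
    return next_seq - token_aru <= window


def simulate(stuck_aru: int, start_seq: int, rotations: int,
             per_rotation: int = 100, window: int = WINDOW) -> int:
    """Return the highest sequence number reached after `rotations` token
    rotations while the ARU never advances past `stuck_aru`."""
    seq = start_seq
    for _ in range(rotations):
        for _ in range(per_rotation):
            if not may_send(seq + 1, stuck_aru, window):
                break  # the token keeps rotating but carries no new messages
            seq += 1
    return seq
```

In this sketch, simulate(41, 41, 10) and simulate(41, 41, 50) return the same
value: once the window is exhausted, further rotations change nothing, which
matches the "full speed doing nothing" behaviour described above.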

In both cases the nature of the problem is that the daemons think that at least
one daemon lost some packet, but no daemon actually lost it. This can theoretically
happen if an old unicast token (older than the last 8 cycles or so) somehow arrives
late, combined with the way Spread 3.x handles negative and positive acks (using
an ARU field instead of all positive acks).

The solution is to add another field to the token that reports the id of the
daemon that set the ARU and caused it to go down. This will allow that member to
raise it and solve the problem. Adding that field will slightly change the
protocol between the daemons and will have no effect on clients.
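
A rough sketch of how such a field could be used follows. It is a simplified
illustration, not the Spread implementation: the Token class, the aru_id
field name, and the handle_token rule are all assumptions. The idea is that a
daemon lowering the ARU signs the token with its id, and only the daemon whose
id is on the token may raise the ARU again.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    seq: int                      # highest sequence number assigned so far
    aru: int                      # all daemons are believed to hold messages up to here
    aru_id: Optional[int] = None  # hypothetical new field: who last lowered aru


def handle_token(token: Token, my_id: int, my_aru: int) -> None:
    """Update the token's ARU as it passes through daemon `my_id`."""
    if my_aru < token.aru:
        # I really am missing something: lower the ARU and sign it.
        token.aru = my_aru
        token.aru_id = my_id
    elif token.aru_id == my_id:
        # The low value is mine; I hold more now, so I may raise it.
        token.aru = my_aru
        token.aru_id = None if my_aru >= token.seq else my_id
    # Otherwise the low value belongs to another daemon (or to nobody,
    # when aru_id is None), so it is left untouched.
```

With a token that carries aru=50 signed by daemon 2 while every daemon
actually holds everything up to 100, one rotation restores the ARU to 100.
If the same low ARU arrives with no id attached, no daemon in this sketch can
tell that the value is stale, and it stays at 50.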

Why this happens rarely and at just a few installations: I think
it is because there is some problematic network component, probably
a switch. A few years ago we saw a switch (I think it was Altheon but I am not
sure) that from time to time delayed a message by up to 500 milliseconds in a
local area configuration while letting other messages proceed without delay.

Tom, would you be willing to test such an experimental daemon?


	:) Yair.

Tom Mornini wrote:

> Has there been any headway made on the Spread hanging bug that we 
> recently reported that seems to be related to long periods of membership 
> stability?
> I heard on this list that 3.16.2 is nearly ready. Does it fix the bug?
> Is there a bug tracker available online?
> -- 
> -- Tom Mornini
> -- InfoMania Printing and Prepress
