[Spread-users] Three members, one group, different group IDs

John Lane Schultz jschultz at spreadconcepts.com
Thu Aug 24 11:17:15 EDT 2006


Alec H. Peterson wrote:
> Hi all,
> 
> I have an application that uses Spread AGREED messaging.  There are 
> three members of this test group.  When one of the members' Spread 
> daemons is killed and immediately restarted, the gid[1] fields differ 
> by 1 and never appear to reconcile.  I found this thread (which, 
> interestingly, was started a year ago by Theo, with whom I work), and in 
> it Yair suggests a fix:
> 
> http://commedia.cnds.jhu.edu/pipermail/spread-users/2005-April/002478.html
> 
> However, I implemented that fix and it didn't change the behavior.  I am 
> running this in Spread 3.17.3 (I think Yair's patch was in 3.17.2 but it 
> seemed OK in 3.17.3).  It's worth noting that I don't think that 
> initialization is necessary as the new() call uses calloc() which 
> initializes its memory...
> 
> Any thoughts on how I can keep the GID fields synchronized?
> 
> Thanks,
> 
> Alec
> 

Alec,

That sure sounds like a bug in the way GIDs are calculated and/or reported to 
users.  What I find interesting, though, is that your events seem to be related 
only to a client failing and then restarting.  Usually, that should affect only 
the gid[2] field, which is a counter that reflects lightweight (client) group 
changes.  The gid[1] field is the number of seconds since the epoch, taken at 
the ring representative, when the last heavyweight (daemon) membership was 
formed.  We did add a fix that sometimes artificially advances that time field 
by one when there are cascading heavyweight membership attempts.
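
To make the field layout concrete, here is a minimal C sketch that treats a 
group ID as three integers and reports which field two members disagree on.  
The enum and function names are hypothetical, chosen for illustration; they are 
not taken from the Spread headers.  The role of gid[0] is not covered above, so 
the sketch only labels it as the first field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the three integers an application sees as gid[0..2].
 * These names are illustrative and do not come from Spread's headers. */
enum { GID_FIRST = 0, GID_TIME = 1, GID_INDEX = 2, GID_LEN = 3 };

/* Return the index of the first field where two group IDs differ,
 * or -1 if all three fields are equal. */
int gid_first_mismatch(const int a[GID_LEN], const int b[GID_LEN])
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < GID_LEN; i++) {
        if (a[i] != b[i])
            return i;
    }
    return -1;
}

/* Describe a mismatch in terms of the fields discussed above. */
const char *gid_mismatch_kind(int field)
{
    switch (field) {
    case GID_FIRST: return "first field differs";
    case GID_TIME:  return "heavyweight (daemon) membership time differs";
    case GID_INDEX: return "lightweight (client) change counter differs";
    default:        return "group IDs match";
    }
}
```

With made-up values, comparing {1, 1156432635, 4} against {1, 1156432636, 4} 
yields field 1, which is the symptom in the original report: the time fields 
are off by one while the other fields agree.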

My hunch is that there is a bug in the code: once a heavyweight membership is 
actually installed, it doesn't forget about all of the previous cascading 
heavyweight changes.  Then, somehow, a later lightweight membership change 
triggers the above mechanism, and one of the daemons unilaterally raises its 
membership ID's time field by one.
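
The hunch can be illustrated with a toy model.  This is not the code in 
groups.c; every name below is hypothetical, and the single "cascade_pending" 
flag is only an assumption about how stale cascade state might survive an 
install.  It shows how one daemon could end up one second ahead of its peers 
after a purely lightweight change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state for one daemon.  Every name here is hypothetical. */
struct toy_daemon {
    int  memb_time;        /* stands in for gid[1] */
    int  change_index;     /* stands in for gid[2] */
    bool cascade_pending;  /* remembers a cascading heavyweight attempt */
};

/* Record that a heavyweight attempt was interrupted by another one. */
void toy_note_cascade(struct toy_daemon *d)
{
    assert(d != NULL);
    d->cascade_pending = true;
}

/* Install a heavyweight membership stamped with the representative's time.
 * Passing clear_on_install == false models the suspected bug: the cascade
 * flag survives the install.  The counter is left alone in this toy. */
void toy_install_heavy(struct toy_daemon *d, int rep_time, bool clear_on_install)
{
    assert(d != NULL);
    d->memb_time = rep_time;
    if (clear_on_install)
        d->cascade_pending = false;
}

/* A client joins or leaves: only the counter should move. */
void toy_light_change(struct toy_daemon *d)
{
    assert(d != NULL);
    if (d->cascade_pending) {
        /* Stale flag: this daemon bumps its time field on its own. */
        d->memb_time += 1;
        d->cascade_pending = false;
    }
    d->change_index += 1;
}
```

In this model, two daemons that both saw a cascade and both installed a 
membership stamped 1000 stay in agreement after a lightweight change only if 
the flag was cleared at install time.  The daemon that kept the stale flag 
reports 1001, while the lightweight counter advances identically on both.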

Does this always occur, or is it an intermittent problem?  Does it also occur if 
you run your scenario from scratch (restart all the daemons, then try)?

The bug is surely in groups.c, most likely in the code that unilaterally raises 
the time field in response to cascading heavyweight membership attempts, and we 
will have to track it down.

-- 
John Schultz
Spread Concepts LLC
Phn: 443 838 2200
Fax: 301 560 8875



