[Spread-users] Total freeze of all daemons after long running

Fri Mar 14 17:50:47 EDT 2008

Just a quick note -- spmonitor 'may' work even if client connections 
were closed as it does not 'connect' like a client and has lower level 
access to the daemon. 

John's suggestion about using spmonitor (if the daemons accept the 
commands) should work, and your clients should stay connected during the 
membership changes: a partition and remerge should not close client 
connections, although the client apps will receive membership messages 
notifying them about 2 memberships (one where the daemon goes to a 
singleton and a second where the daemons remerge).
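
On the counter values in Nico's log quoted below: here is a minimal sketch 
of how the reported Aru and Highest_seq relate if the counters are treated 
as 32-bit signed integers that wrap around. That width is an assumption, 
suggested only by the values sitting near +/- 2^31. The helper names 
to_int32 and serial_diff are hypothetical and are not part of Spread.

```python
# Assumption: sequence counters are 32-bit signed integers that wrap.
MASK = 0xFFFFFFFF


def to_int32(value):
    """Wrap an arbitrary integer into the signed 32-bit range."""
    value &= MASK
    return value - (1 << 32) if value >= (1 << 31) else value


def serial_diff(a, b):
    """Signed distance a - b, computed modulo 2**32 (hypothetical helper)."""
    return to_int32(a - b)


# Values copied from the first log excerpt (15:04:28).
aru = -2147481909
highest_seq = 2147482054

# A plain signed comparison says Aru is far behind Highest_seq.
plain_ahead = aru > highest_seq              # False

# A wraparound-aware comparison says Aru is slightly ahead of it.
wrapped_gap = serial_diff(aru, highest_seq)  # 3333

# Adding that gap to Highest_seq and wrapping reproduces the logged Aru.
rebuilt_aru = to_int32(highest_seq + wrapped_gap)
```

Under that assumption the two counters are 3333 apart across the rollover 
point, which is also the Aru value printed in the later excerpt (15:06:19). 
This only restates the arithmetic of the log; it does not show how the 
daemon itself compares these counters.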

Cheers,

Jonathan

On Thu, Mar 13, 2008 at 09:13:57PM +0100, Nico Meyer wrote:
> Hi John,
> 
> thanks for the quick answer. Unfortunately your suggestion will not work: 
> as I wrote in my first post, client connections are also immediately closed 
> when this happens, so spmonitor will not work either.
> 
> I hope you can come up with a solution soon.
> 
> Nico
> 
> > same issue as Chanh Hua's or not.
> >
> > Looking at your report, I would say the Aru and various sequence numbers
> > look extremely fishy.  They are all right around +/- 2^31, which could
> > indicate some kind of bug in handling rollover of those counters.
> > Furthermore, as I understand it, Aru should never be ahead of Highest_seq
> > as it seems to be here.  It looks like Aru rolled over before Highest_seq
> > and the other sequence numbers, which I believe shouldn't happen.
> >
> > The printout was an attempt to work around this freezing bug by
> > swallowing the token and forcing a restart of the synchronization between
> > the daemons.  Apparently, that isn't enough to handle your situation.
> >
> > My guess is that there is probably a bug with the synchronization protocol
> > when the counters are right around the rollover point that daemons simply
> > can't get past.
> >
> > Another potential work around would be to segment each of the daemons into
> > its own partition using spmonitor when this happens.  They should be able
> > to finish their protocol and install a membership by themselves.  Then
> > unpartition them.  Your way is probably more automated and has lower
> > latency, though.
> >
> > We'll look into this.
> >
> > Cheers!
> >
> > ---
> > John Schultz
> > Spread Concepts
> > Phn: 443 838 2200
> >
> > On Thu, 13 Mar 2008, Nico Meyer wrote:
> > > Hi,
> > >
> > > I reported the exact same problem two months ago, but never got any
> > > > answer. Please see
> > > > http://commedia.cnds.jhu.edu/pipermail/spread-users/2008-January/003653.html
> > > > for the original post.
> > >
> > > > Today it happened again (exactly 2 months later, but this is most likely
> > > > a coincidence). The logs also show the same numbers (compare with my
> > > > original post):
> > > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > > rounds in EVS state; swallowing token; state:
> > > [Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
> > > [Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 21344
> > > [Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
> > > [Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
> > > [Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
> > > [Thu 13 Mar 2008 15:04:28] Last Token:
> > > [Thu 13 Mar 2008 15:04:28]      type:             0x80050080
> > > [Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731508
> > > [Thu 13 Mar 2008 15:04:28]      seq:              0
> > > [Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731508
> > > [Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
> > > [Thu 13 Mar 2008 15:04:28]      flow_control:     0
> > > [Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
> > > [Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
> > >
> > > repeated every few seconds.
> > > and a little later:
> > >
> > > [Thu 13 Mar 2008 15:06:19] Prot_handle_token: BUG WORKAROUND: Too many
> > > rounds in EVS state; swallowing token; state:
> > > [Thu 13 Mar 2008 15:06:19]      Aru:              3333
> > > [Thu 13 Mar 2008 15:06:19]      My_aru:           3333
> > > [Thu 13 Mar 2008 15:06:19]      Highest_seq:      2147482054
> > > [Thu 13 Mar 2008 15:06:19]      Highest_fifo_seq: 21344
> > > [Thu 13 Mar 2008 15:06:19]      Last_discarded:   2147482054
> > > [Thu 13 Mar 2008 15:06:19]      Last_delivered:   2147482054
> > > [Thu 13 Mar 2008 15:06:19]      Last_seq:         3333
> > > [Thu 13 Mar 2008 15:06:19]      Token_rounds:     501
> > > [Thu 13 Mar 2008 15:06:19] Last Token:
> > > [Thu 13 Mar 2008 15:06:19]      type:             0x80050080
> > > [Thu 13 Mar 2008 15:06:19]      transmiter_id:    -1062731508
> > > [Thu 13 Mar 2008 15:06:19]      seq:              1
> > > [Thu 13 Mar 2008 15:06:19]      proc_id:          -1062731508
> > > [Thu 13 Mar 2008 15:06:19]      aru:              3333
> > > [Thu 13 Mar 2008 15:06:19]      aru_last_id:      -1062731505
> > > [Thu 13 Mar 2008 15:06:19]      flow_control:     0
> > > [Thu 13 Mar 2008 15:06:19]      rtr_len:          1440
> > > [Thu 13 Mar 2008 15:06:19]      conf_hash:        -2002019299
> > >
> > > also repeatedly.
> > >
> > > and on another server in the same spread segment:
> > >
> > > [Thu 13 Mar 2008 15:04:28] Prot_handle_token: BUG WORKAROUND: Too many
> > > rounds in EVS state; swallowing token; state:
> > > [Thu 13 Mar 2008 15:04:28]      Aru:              -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      My_aru:           -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      Highest_seq:      2147482054
> > > [Thu 13 Mar 2008 15:04:28]      Highest_fifo_seq: 746858665
> > > [Thu 13 Mar 2008 15:04:28]      Last_discarded:   2147482054
> > > [Thu 13 Mar 2008 15:04:28]      Last_delivered:   2147482054
> > > [Thu 13 Mar 2008 15:04:28]      Last_seq:         -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      Token_rounds:     501
> > > [Thu 13 Mar 2008 15:04:28] Last Token:
> > > [Thu 13 Mar 2008 15:04:28]      type:             0x80050080
> > > [Thu 13 Mar 2008 15:04:28]      transmiter_id:    -1062731497
> > > [Thu 13 Mar 2008 15:04:28]      seq:              0
> > > [Thu 13 Mar 2008 15:04:28]      proc_id:          -1062731497
> > > [Thu 13 Mar 2008 15:04:28]      aru:              -2147481909
> > > [Thu 13 Mar 2008 15:04:28]      aru_last_id:      0
> > > [Thu 13 Mar 2008 15:04:28]      flow_control:     0
> > > [Thu 13 Mar 2008 15:04:28]      rtr_len:          1440
> > > [Thu 13 Mar 2008 15:04:28]      conf_hash:        -2002019299
> > >
> > >
> > > > After the last incident I raised the timeout in membership.c a little, so
> > > my assumption that the low timeouts are the culprit seems to be wrong.
> > >
> > > > Please have a look this time, as this is a really serious problem.
> > > > My workaround for now will be to put an exit(0) in the code where the
> > > > message is generated, and to restart the daemons after 5s. Hopefully this
> > > > will at least recover the spread ring.
> > >
> > > > Let me know if you need any additional info.
> > >
> > > Bye,
> > >
> > > Nico
> > >
> > >
> > >
> > > _______________________________________________
> > > Spread-users mailing list
> > > Spread-users at lists.spread.org
> > > http://lists.spread.org/mailman/listinfo/spread-users
> 

-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------