[Spread-users] Message sequence counter wrap bug fixed
martin4321234 at googlemail.com
Mon Feb 2 13:19:24 EST 2009
In the meantime I did some stress testing of that patch and I
am no longer sure about it.
Instead of a 2^31 limit I tried a limit of 4096. With our application
this forces a counter re-initialisation (by dropping the token) about
every 4 hours. This works fine 3 to 5 times, but afterwards
one of the spread daemons exits with a message like this:
2009-01-19 21:28:08 GMT Send_new_packets: created packet 3589 already exist 3
Exit caused by Alarm(EXIT)
I'm using 3.17.4 with only the small patch below (taken from your SVN
What is the meaning of the log message above?
Any hint would be helpful,
*** protocol.c.org Mon Nov 20 21:17:29 2006
--- protocol.c Tue Nov 4 15:01:12 2008
*** 458,463 ****
--- 458,471 ----
if( !Same_endian( Token->type ) )
Flip_token_body( New_token.elements.buf, Token );
+ /* Deal with wrapping seq values (2^31) by triggering a membership by dropping the token */
+ if( (Memb_state() != EVS ) && (Token->seq > (1<<30) ) )
+ Alarm( PROTOCOL, "Prot_handle_token: Token Sequence number (%ld) approaching 2^31 so trigger membership to reset it.\n",
if( Conf_leader( Memb_active_ptr() ) == My.id )
if( Get_arq(Token->type) != Get_arq(Last_token->type) )
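
For anyone who wants to reproduce the stress test, here is a minimal standalone sketch of the threshold comparison the patch relies on. It is not Spread code: the macro and function names are hypothetical, and it leaves out the Memb_state() != EVS condition and the actual token drop. It only shows how lowering the limit from 1<<30 to 4096 changes when the reset condition fires, and how much headroom remains below the signed 32-bit maximum.

```c
#include <stdint.h>

/* Hypothetical names for illustration; these are not Spread identifiers. */
#define SEQ_RESET_LIMIT_DEFAULT ((int32_t)1 << 30) /* threshold used in the patch */
#define SEQ_RESET_LIMIT_TEST    ((int32_t)4096)    /* lowered limit for stress testing */

/* Returns 1 when the sequence number has passed the limit, 0 otherwise.
 * Mirrors the "Token->seq > limit" comparison in the patch. */
static int seq_needs_reset(int32_t seq, int32_t limit)
{
    return seq > limit;
}

/* Sequence numbers left between the limit and INT32_MAX.
 * Assumes limit is non-negative. */
static int32_t seq_headroom(int32_t limit)
{
    return INT32_MAX - limit;
}
```

With the test limit, seq_needs_reset(3589, SEQ_RESET_LIMIT_TEST) is 0, so the sequence number 3589 from the log line above is still below the point at which this comparison would fire.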
On Tue, Nov 11, 2008 at 11:44 AM, M S <martin4321234 at googlemail.com> wrote:
> Hi Jonathan,
> thank you for your fix for the message sequence counter wrap bug.
> I've tested this patch for protocol.c with our spread-3.17.4
> installation and found it useful. The counter reset works as expected
> before the total malicious blocking limit is reached. There is no
> message loss.
> The only thing which I find a little bit annoying is the very long
> pause in daemon throughput during the reset process. I always observed
> a 12 (twelve!) second pause in throughput. I assume this is twice the
> Token_timeout value of 5 seconds plus something else (our spread
> configuration is not wide_network). For wide_networks it would be
> Of course it would be better, if this temporary blocking could be
> avoided, but for me that solution is currently sufficient.
> A minor recommendation: The logging classification of that new reset
> event should be Alarm(PRINT,"...Token seq number approaching...reset
> it") instead of Alarm(PROTOCOL,...) because I want to see that event
> in the spread.log without turning on other voluminous logging.
> On Sat, Oct 11, 2008 at 4:13 PM, Jonathan Stanton <jonathan at spread.org> wrote:
>> I've committed a fix to svn trunk for the problem where sequence numbers
>> used by the daemons cause a hung daemon when they reach 2^32 (the max
>> value of the counter). This fix works in my tests, but I would be very
>> interested in anyone who has had this problem verifying that it also
>> solves the problem for them. If you have not used the svn trunk before,
>> you can find instructions at www.spread.org/devel.html
>> This fix does not change the packet formats of Spread and so is fully
>> compatible with Spread 4.0 systems. However because it does not increase
>> the counter size, what it does do is trigger a spurious membership
>> change among the daemons when the counter gets close to wrapping which
>> resets it back to 0. This membership will NOT be seen by any of your
>> client applications, but will cause a short (few second) pause in the
>> daemon throughput of messages.
>> Let me know what you think of this.
>> Jonathan Stanton jonathan at spread.org
>> Spread Group Messaging www.spread.org
>> Spread Concepts LLC www.spreadconcepts.com