[Spread-users] Bug in spread - message counter

John Lane Schultz jschultz at spreadconcepts.com
Mon Apr 7 11:05:28 EDT 2008


There does seem to be a problem with the wrap-around calculations when
the sequence numbers approach the signed 32-bit limit.  We are working on it,
but I cannot forecast when a fix will be forthcoming.  Part of the reason
Spread is open source is so that the user community can help fix problems
and develop it further.

I can think of two short-term workarounds that are less invasive
than changing the counters to 64 bits.  The first, and easiest, would be
to simply force a network membership change, administratively using
spmonitor, if/when the counters approach approximately 2^30, or if you
like to live dangerously, approximately 2^31.  Doing this should be far
less work than waiting for the lockup and then administratively having to
reboot your entire system.
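
As a rough illustration of the first workaround, here is a minimal C
sketch of the threshold check.  It is not part of Spread: the function
names and the SEQ_* constants are hypothetical, and how you obtain the
"highest seq" value (for example by reading spmonitor's status output)
is left to you.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Thresholds mentioned in the message; the names are hypothetical. */
#define SEQ_WARN_THRESHOLD ((int64_t)1 << 30) /* conservative: 2^30 */
#define SEQ_LIMIT          ((int64_t)1 << 31) /* signed 32-bit limit: 2^31 */

/* True once the observed counter has reached the chosen threshold. */
bool seq_needs_reset(int32_t highest_seq, int64_t threshold)
{
    return (int64_t)highest_seq >= threshold;
}

/* Rough time left before the counter reaches 2^31, assuming a steady
 * message rate.  Arithmetic is done in 64 bits to avoid overflow. */
double seconds_until_limit(int32_t highest_seq, double msgs_per_sec)
{
    assert(msgs_per_sec > 0.0);
    return (double)(SEQ_LIMIT - (int64_t)highest_seq) / msgs_per_sec;
}
```

At the ~10000 messages per second reported in the original message, a
counter starting from zero would take roughly two and a half days to
reach 2^31, so a periodic check against 2^30 leaves a comfortable
margin.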

The second would be to start the counters from -2^31 rather than zero
after every membership change.  This should roughly double the amount
of time before you enter the "danger zone."  I'm not sure if this would
cause any issues and I haven't tested it at all.  It's just an idea with
which we all can experiment.
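
The "roughly double" estimate can be checked with a little arithmetic.
The following C sketch is illustrative only and is not taken from the
Spread source; the function names are hypothetical, and it assumes the
counter is a signed 32-bit integer that simply increments by one per
message.

```c
#include <assert.h>
#include <stdint.h>

/* Number of increments available before a signed 32-bit counter that
 * starts at `start` reaches INT32_MAX.  Computed in 64 bits so the
 * subtraction cannot overflow. */
uint32_t seq_headroom(int32_t start)
{
    return (uint32_t)((int64_t)INT32_MAX - (int64_t)start);
}

/* Hours until the counter reaches INT32_MAX at a steady message rate. */
double hours_until_limit(int32_t start, double msgs_per_sec)
{
    assert(msgs_per_sec > 0.0);
    return (double)seq_headroom(start) / msgs_per_sec / 3600.0;
}
```

Starting from zero gives 2^31 - 1 increments; starting from -2^31
gives 2^32 - 1.  At a steady 10000 messages per second that is roughly
60 hours versus roughly 119 hours.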

Nico is correct on enlarging the counters' sizes.  It would require
changes in the packet layouts and I'm not sure how much work that would
be.
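
To make the packet layout point concrete, here is a small C sketch of
why widening a counter changes the wire format.  It does not reflect
Spread's actual packet layout; the function names and the big-endian
encoding are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire sizes; not Spread's real layout. */
enum { SEQ32_WIRE_BYTES = 4, SEQ64_WIRE_BYTES = 8 };

/* Write a 32-bit sequence number in big-endian order.
 * Returns the number of bytes written. */
size_t put_seq32(unsigned char *buf, uint32_t seq)
{
    assert(buf != NULL);
    for (int i = 0; i < SEQ32_WIRE_BYTES; i++)
        buf[i] = (unsigned char)(seq >> (8 * (SEQ32_WIRE_BYTES - 1 - i)));
    return SEQ32_WIRE_BYTES;
}

/* Write a 64-bit sequence number in big-endian order.
 * Returns the number of bytes written. */
size_t put_seq64(unsigned char *buf, uint64_t seq)
{
    assert(buf != NULL);
    for (int i = 0; i < SEQ64_WIRE_BYTES; i++)
        buf[i] = (unsigned char)(seq >> (8 * (SEQ64_WIRE_BYTES - 1 - i)));
    return SEQ64_WIRE_BYTES;
}

/* Read back a 64-bit big-endian sequence number. */
uint64_t get_seq64(const unsigned char *buf)
{
    assert(buf != NULL);
    uint64_t seq = 0;
    for (int i = 0; i < SEQ64_WIRE_BYTES; i++)
        seq = (seq << 8) | buf[i];
    return seq;
}
```

The field grows from 4 to 8 bytes, so every field after it shifts, and
a daemon still expecting the old layout would misread the packet.  That
is why all daemons would have to be changed together.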

Cheers!
John

---
John Lane Schultz
Spread Concepts LLC
Phn: 443 838 2200 
Fax: 301 560 8875

Monday, April 7, 2008, 5:10:18 AM, you wrote:

> Hello Witold,

> I think a few people would be interested in this fix, since this affects other
> people too. I first reported this problem 3 months ago, but it fell through
> the cracks. Last month I tried again, and John told me they would look
> into it. There hasn't been any news since, which is a little disturbing
> given how bad this problem is when it bites you.

> But just changing the counters might not work, because the sequence number is
> also transmitted to other daemons, so you would probably need to change
> the message format as well.

> Bye,
> Nico

> On Monday 07 April 2008 10:39:06 Witold Kręcicki wrote:
>> I have 4 spread daemons running high (~10000 messages per second) traffic,
>> and I have been encountering 'full lockups' of the whole system (no new
>> connections accepted, no messages passed). After testing, it turned out that
>> this lockup happens when the 'highest seq' value (and all other similar
>> counters) in spread monitor comes close to the signed 32-bit int limit
>> (2**31). As a temporary workaround, I'm planning to change these counters to
>> 64-bit - do I have to change it in the spread daemon only, or also in the
>> client library? Which ones do I have to change?



> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users




