[Spread-users] strange 2 second hiccups

John Schultz jschultz at spreadconcepts.com
Sun Aug 20 23:31:53 EDT 2006


In certain configurations, when the token is lost between two daemons, 
Spread waits out a two second timeout and then restarts the token, rather 
than declaring a new daemon membership.  If the token packet is being 
lost, you will see exactly this kind of occasional hang.  2% is an awfully 
high drop rate though, unless you are on a nasty WAN.  More likely you 
have a flaky NIC in one or more of your machines and/or a flaky router or 
switch that is dropping the token.  A daemon sends its token packet after 
all of its data packets, so if a router starts dropping packets because 
its buffers are full, the token is highly likely to be among the packets 
dropped.
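
To illustrate why the last packet of a burst is the one most exposed to a 
full buffer, here is a toy tail-drop model in Python.  It is only a sketch 
under stated assumptions, not Spread's code and not a model of any 
particular router: the buffer simply accepts the first free_slots packets 
of each burst and drops the rest, and the names tail_drop_round, simulate, 
max_burst and free_slots are all hypothetical.

```python
import random

def tail_drop_round(data_packets, free_slots):
    """One burst into a buffer with free_slots openings.

    The burst is data_packets data packets followed by one token.
    Assumption: pure tail drop, so the first free_slots packets are
    accepted and everything after them is dropped.
    Returns (data_packets_dropped, token_dropped).
    """
    data_dropped = max(0, data_packets - free_slots)
    # The token is the last packet, so it is lost whenever the whole
    # burst (data plus token) does not fit.
    token_dropped = data_packets + 1 > free_slots
    return data_dropped, token_dropped

def simulate(rounds=10000, max_burst=20, free_slots=18, seed=1):
    """Return (data_drop_rate, token_drop_rate) over random burst sizes."""
    rng = random.Random(seed)
    data_sent = data_lost = tokens_lost = 0
    for _ in range(rounds):
        burst = rng.randint(1, max_burst)
        dropped, token_dropped = tail_drop_round(burst, free_slots)
        data_sent += burst
        data_lost += dropped
        tokens_lost += token_dropped
    return data_lost / data_sent, tokens_lost / rounds

if __name__ == "__main__":
    data_rate, token_rate = simulate()
    print(f"data drop rate:  {data_rate:.2%}")
    print(f"token drop rate: {token_rate:.2%}")
```

In this model the token is lost in every round where the burst overflows 
the buffer, while only the few data packets at the tail of that burst are 
lost, so the token drop rate comes out well above the data drop rate.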

To begin diagnosing what is happening, I would run spmonitor and see how 
many token_hurry events and retransmissions you are getting.  More than a 
couple of either usually indicates one of the problems I mentioned above, 
and the counts can help track down where the flakiness is.
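
As a cross-check from the application side, the round-trip samples can be 
split by direction to see which path is hitting the two second timeout.  
The Python sketch below is only an illustration: the function name 
hiccup_rates, the (direction, rtt_seconds) sample format, and the 1.9 
second threshold are all assumptions, and it does not read anything from 
spmonitor.

```python
from collections import defaultdict

def hiccup_rates(samples, threshold=1.9):
    """Fraction of slow round trips per direction.

    samples: iterable of (direction, rtt_seconds) pairs, where direction
    is any hashable label such as "hostA->hostB" (hypothetical format).
    threshold: round trips at or above this many seconds are counted as
    hiccups; 1.9 is an assumed value just under the two second timeout.
    """
    totals = defaultdict(int)
    slow = defaultdict(int)
    for direction, rtt in samples:
        totals[direction] += 1
        if rtt >= threshold:
            slow[direction] += 1
    return {d: slow[d] / totals[d] for d in totals}

if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    demo = [("hostA->hostB", 0.0015)] * 98 + [("hostA->hostB", 2.001)] * 2
    demo += [("hostB->hostA", 0.0018)] * 100
    for direction, rate in hiccup_rates(demo).items():
        print(f"{direction}: {rate:.1%} of round trips hit the threshold")
```

A direction with a clearly higher rate points at the sender, receiver, or 
network path on that side as the place to look first.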

Cheers!

---
John Schultz
Spread Concepts
Phn: 443 838 2200

On Sun, 20 Aug 2006, Kim Barrett wrote:

> I've been developing an application using Spread (4.0.0rc2) and have
> run into a very strange performance problem. I have a simple
> round-trip test between two machines, and I'm seeing occasional
> (perhaps 2%) 2 second round trips, where the vast majority are around
> 1-2 milliseconds. Instrumenting the application led me to suspect that
> the 2 second hiccups were occurring within the Spread daemon,
> although I haven't been able to reproduce the problem with a simple
> test case. I modified the application to use bare UDP/multicast
> rather than Spread, as an additional check, and no hiccups occur
> there either. Another oddity is that, for any given pair of machines,
> the delays always occur in only one direction. All messages are using
> UNRELIABLE_MESS. I've spent some time looking over the spread code,
> but haven't found anything that would account for this yet. (The bug
> report I sent a week or so ago regarding timeout setups was a result
> of that examination, but isn't the cause of this problem.) At this
> point, I'm pretty much stumped. Any suggestions for how to track this
> down would be appreciated.