[Spread-users] spread daemon hangs after running for a few days

John Schultz jschultz at spreadconcepts.com
Tue Mar 11 16:35:13 EDT 2008


On Tue, 11 Mar 2008, chanh hua wrote:

> Which brings me to my question: what network property is the misses data 
> supposed to tell us?

The misses data tells you how many of the sent packets the receiver missed 
(i.e., didn't receive).  From your report it looks like the sender sent 
10000 packets but the receiver heard only 1999 (10000 - 8001) of them 
before it got the last packet.  If that is correct, it would be about an 
80% loss rate for your configuration.  A typical loss rate for LAN 
broadcast/multicast is well below 1%.
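
As a quick check of that arithmetic, here is a minimal Python sketch.  The 
function name loss_rate is made up for illustration; only the two counts 
(10000 sent, 8001 missed) come from the report above.

```python
def loss_rate(sent: int, missed: int) -> float:
    """Return the fraction of sent packets that the receiver never heard."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    if not 0 <= missed <= sent:
        raise ValueError("missed must be between 0 and sent")
    return missed / sent


# Counts taken from the report discussed above.
sent, missed = 10000, 8001
received = sent - missed

print(f"received {received} of {sent} packets")
print(f"loss rate: {loss_rate(sent, missed):.1%}")
```

Anything in the tens of percent, as here, is far outside the "well below 
1%" range expected on a healthy LAN.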

> The explanation he gave for why we might have observed all
> these misses was because the broadcast address used contains all
> network machines (i.e., desktops, printers, etc.) and not
> just servers, and most of those machines ignore broadcast.
> But since he doesn't know what these results mean, he can't
> say for sure.

He is correct that broadcast will bother (i.e., potentially increase 
load on) all the machines on the associated subnet.  If you instead use 
multicast, then either your switch/router or your NICs should filter out 
the packets before an interrupt is generated on non-participating 
machines.  Multicast is preferable; however, some switches and routers 
don't implement multicast well, or their multicast is misconfigured.  In 
such situations, broadcast sometimes works better due to its simplicity.
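
If it is unclear which kind of address a test is actually using, a rough 
Python sketch like the one below can classify it with the standard 
ipaddress module.  The addresses and the /24 subnet are hypothetical 
examples, not values from this thread, and classify_address is a made-up 
helper name.

```python
import ipaddress


def classify_address(addr: str, subnet: str) -> str:
    """Classify addr as multicast, subnet broadcast, or unicast.

    subnet is the local network in CIDR form, e.g. "192.168.1.0/24".
    """
    ip = ipaddress.ip_address(addr)
    net = ipaddress.ip_network(subnet, strict=False)
    if ip.is_multicast:
        return "multicast"
    if ip == net.broadcast_address:
        return "subnet broadcast"
    return "unicast"


# Hypothetical example values, chosen only for illustration.
for candidate in ("225.0.1.1", "192.168.1.255", "192.168.1.10"):
    print(candidate, "->", classify_address(candidate, "192.168.1.0/24"))
```

Note that the broadcast address depends on the subnet mask, so the same 
address can be a broadcast address on one subnet and an ordinary host 
address on another.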

> If this is an issue, would using a multicast address be better? 
> However, when I used a multicast address for the test, I still saw a lot 
> of misses.

Typically, broadcast should not increase loss versus multicast unless your 
switch/router is somehow biased against broadcast.

> I talked to the network admin, and he's not seeing any drops
> between the servers on the segments.  And he confirmed the
> broadcast address I used was correct.

Well, it definitely seems like something is wrong from your reports.  Try 
using spmonitor to view the status of the daemons as they are running. 
As I said, if you see their retrans counts going up by more than a 
couple per second, then something is probably wrong in your network.
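
To make that rule of thumb concrete, here is a rough Python sketch.  It 
assumes you note a daemon's retrans count by hand at two points in time; 
the function names and the threshold of 2 per second (one reading of "a 
couple") are assumptions for illustration, not anything spmonitor 
provides.

```python
# Assumed reading of "a couple per second"; adjust to taste.
RETRANS_THRESHOLD_PER_SECOND = 2.0


def retrans_per_second(count_before: int, count_after: int,
                       interval_seconds: float) -> float:
    """Average growth of a retrans counter between two manual readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if count_after < count_before:
        raise ValueError("counter went backwards (daemon restarted?)")
    return (count_after - count_before) / interval_seconds


def looks_unhealthy(rate: float,
                    threshold: float = RETRANS_THRESHOLD_PER_SECOND) -> bool:
    """True if the retrans rate exceeds the rule-of-thumb threshold."""
    return rate > threshold


# Hypothetical readings taken 60 seconds apart.
rate = retrans_per_second(100, 400, 60)
print(f"{rate:.1f} retrans/sec, unhealthy: {looks_unhealthy(rate)}")
```

A sustained rate above the threshold on several daemons would point 
toward the network rather than any single machine.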

> Would having a lot of drops cause the daemon to be unresponsive?

Theoretically, it could.  If the daemons got stuck in a loop of trying to 
establish a membership due to intermittent / flaky communications with 
other daemons, then the system would appear to freeze as the daemons stop 
processing client communications in this state.  Usually, the "freeze"
wouldn't persist forever but rather you would see lots of daemon 
membership changes and progress would stutter forward.

Cheers!

---
John Schultz
Spread Concepts
Phn: 443 838 2200



