John, <br><br>When a daemon goes into a bad state, it doesn't just stutter; it stops working completely. However, the other daemons in the cluster still function correctly.<br><br>I did some further testing, and the massive drops only occur between these two servers. If I use spsend and sprecv between either of these servers and another machine not defined in the segment, the drops come down to the amount you specified, so the drops are somehow confined to this segment. I will need to do some more digging.<br><br>Looking at spmonitor output, the "retrans" to "sent pack" ratio seems high. Do you think there is a network issue or a configuration problem?<br><br><br><br>============================<br>Status at as-phl-cbiis4 V 3.17.3 (state 1, gstate 1) after 74354 seconds :<br>Membership : 4 procs in 2 segments, leader is as-ny-cbapps2<br>rounds : 42454 tok_hurry :
38124 memb change: 9<br>sent pack: 6 recv pack : 3040 retrans : 0<br>u retrans: 0 s retrans : 0 b retrans : 0<br>My_aru : 2745 Aru : 2745 Highest seq: 2745<br>Sessions : 0 Groups : 33 Window : 60<br>Deliver M:
1595 Deliver Pk: 3048 Pers Window: 15<br>Delta Mes: 1595 Delta Pack: 2745 Delta sec : 74354<br>==================================<br><br>Monitor><br>============================<br>Status at as-phl-cbiis1 V 3.17.4 (state 1, gstate 1) after 73908 seconds :<br>Membership : 4 procs in 2 segments, leader is as-ny-cbapps2<br>rounds : 21796 tok_hurry : 37928 memb change: 4<br>sent pack: 1708 recv pack : 1137 retrans : 1291<br>u retrans: 37 s
retrans : 0 b retrans : 1254<br>My_aru : 2745 Aru : 2745 Highest seq: 2745<br>Sessions : 9 Groups : 56 Window : 60<br>Deliver M: 1595 Deliver Pk: 2944 Pers Window: 15<br>Delta Mes: 0 Delta Pack: 0 Delta sec :
-446<br>==================================<br><br>Monitor><br>============================<br>Status at as-ny-cbsql3 V 3.17.3 (state 1, gstate 1) after 74106 seconds :<br>Membership : 4 procs in 2 segments, leader is as-ny-cbapps2<br>rounds : 42313 tok_hurry : 37997 memb change: 6<br>sent pack: 5 recv pack : 3072 retrans : 0<br>u retrans: 0 s retrans : 0 b retrans : 0<br>My_aru : 2745 Aru :
2745 Highest seq: 2745<br>Sessions : 0 Groups : 19 Window : 60<br>Deliver M: 1595 Deliver Pk: 3040 Pers Window: 15<br>Delta Mes: 0 Delta Pack: 0 Delta sec : 198<br>==================================<br><br>Monitor><br>============================<br>Status at as-ny-cbapps2 V 3.17.4 (state 1, gstate 1) after 73982 seconds :<br>Membership : 4 procs in 2 segments, leader is as-ny-cbapps2<br>rounds :
21797 tok_hurry : 37970 memb change: 4<br>sent pack: 1145 recv pack : 1714 retrans : 739<br>u retrans: 109 s retrans : 37 b retrans : 593<br>My_aru : 2745 Aru : 2745 Highest seq: 2745<br>Sessions : 20 Groups : 56 Window : 60<br>Deliver M:
1595 Deliver Pk: 3028 Pers Window: 15<br>Delta Mes: 0 Delta Pack: 0 Delta sec : -124<br>==================================<br><br><br><b><i>John Schultz <jschultz@spreadconcepts.com></i></b> wrote:<blockquote class="replbq" style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;"> On Tue, 11 Mar 2008, chanh hua wrote:<br><br>> Bring me to my question, what network property is the misses data <br>> suppose to tell us?<br><br>The misses data tells you how many of the sent packets the receiver missed <br>(i.e. - didn't receive). From your report it looks like the sender sent <br>10000 packets but the receiver only heard 1999 (10000 - 8001) of them <br>before it got the last packet. If
correct, then that would be about an <br>80% loss rate for your configuration. A typical loss rate for LAN <br>broad/multicast is well below 1%.<br><br>> The explanation he gave for why we might have observed all<br>> these misses was b/c the broadcast address used contains all<br>> network machines(i.e. desktops, printers, etc...) and not<br>> just servers and most of those machines ignore broadcast.<br>> But since he doesn't know what these results mean, he can't<br>> say for sure.<br><br>He is correct that broadcast will bother (i.e. - potentially increase <br>load) all the machines on the associated subnet. If you instead use <br>multicast, then either your switch/router or your NICs should filter out <br>the packets before an interrupt is generated on non-participating <br>machines. Multicast is preferable; however, occasionally some switches <br>and routers don't implement multicast well or their multicast is <br>misconfigured. In such
situations, broadcast sometimes works better due <br>to its simplicity.<br><br>> If this is an issue, would using a multicast address be better? <br>> However, when i used a multicast address for the test, i still saw a lot <br>> of misses.<br><br>Typically, broadcast should not increase loss versus multicast unless your <br>switch/router is biased against broadcast somehow for some weird reason.<br><br>> I talked to the network admin, and he's not seeing any drops<br>> btw the servers on the segments. And he confirmed the<br>> broadcast address i used was correct.<br><br>Well, it definitely seems like something is wrong from your reports. Try <br>using spmonitor to view the status of the daemons as they are running. <br>Like I said, if you see their retrans counts going up by more than a <br>couple a second, then something is probably wrong in your network.<br><br>> would having a lot of drops cause a daemon to be
unresponsive?<br><br>Theoretically, it could. If the daemons got stuck in a loop of trying to <br>establish a membership due to intermittent / flaky communications with <br>other daemons, then the system would appear to freeze as the daemons stop <br>processing client communications in this state. Usually, the "freeze"<br>wouldn't persist forever but rather you would see lots of daemon <br>membership changes and progress would stutter forward.<br><br>Cheers!<br><br>---<br>John Schultz<br>Spread Concepts<br>Phn: 443 838 2200<br></blockquote><br>
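<br><br>One way to sanity-check the counters in the status dumps above is to compare "retrans" against "sent pack": the ratio approximates how often a sent packet had to be resent. Here is a small Python sketch that does this arithmetic; the helper names are mine, and the parsing is a guess at the spmonitor status layout shown above rather than an official format.<br>

```python
import re

def parse_counters(status_text):
    """Pull packet counters out of a spmonitor-style status block."""
    counters = {}
    for key in ("sent pack", "recv pack", "retrans"):
        # Match the first occurrence, e.g. "sent pack: 1708" or "retrans : 1291".
        m = re.search(re.escape(key) + r"\s*:\s*(\d+)", status_text)
        if m:
            counters[key] = int(m.group(1))
    return counters

def retrans_ratio(counters):
    """Retransmissions per sent packet; John's rule of thumb says a healthy
    LAN should sit well below 0.01 (1%)."""
    sent = counters.get("sent pack", 0)
    return counters.get("retrans", 0) / sent if sent else 0.0

# Counters reported by as-phl-cbiis1 in the dump above.
status = "sent pack: 1708 recv pack : 1137 retrans : 1291"
counters = parse_counters(status)
print(round(retrans_ratio(counters), 2))  # roughly 0.76, i.e. ~76% of sent packets retransmitted
```

<br>By contrast, as-phl-cbiis4 and as-ny-cbsql3 report zero retransmissions in the same dumps, which is consistent with the drops being confined to one part of the network rather than cluster-wide.<br>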