[Spread-users] spread clients don't receive messages

Shlomi Yaakobovich Shlomi at exanet.com
Wed Sep 1 13:26:28 EDT 2004


Hi all,

We have been testing Spread in 4-node and 6-node configurations, and we are running into problems. Let's begin with the 4-node tests:

Our basic configuration consists of 4 servers, each running one Spread daemon and one or more Spread clients. Spread runs over an "Ethernet channel bonding" interface; the bonding is done with Intel's iANS module. We're using Red Hat's patched 2.4.21 kernel, including asynchronous I/O. Both physical interfaces are 1000 Mb/s, using the e1000 module. Our default setting uses the IP multicast multihomed configuration in spread.conf. We are using the latest Spread (3.17.2).

Things seemed to run well for some time, but then, for no apparent reason, the clients connected to each local Spread daemon suddenly stopped receiving replies from it. For example, spflooder got stuck. This state was unrecoverable, no matter how long we waited. We became suspicious of iANS and disabled it, and all of a sudden things began to work well. That is strange behavior; has anyone seen this before? I tend to attribute this to a kernel/module issue rather than a Spread problem. What do you guys think?

Later we tested with 6 nodes. This time we were smarter and did not use any special module, just the standard interfaces. Things went much better, up to a point. When we generated a heavy load on the network, with lots of packets flying between the nodes, the Spread daemons became "confused" and stopped responding to their clients. To be more accurate, they responded so slowly that it was practically impossible to work: a message got acknowledged by the client after 7.5 minutes! Running strace on one of the Spread daemons showed that the last message a client had sent kept being sent and received over and over again. The daemons did not recover from this; we had to restart all of them. We still do not have a solution, and we are quite puzzled. I hear people here talking about dozens of Spread daemons, but we're having problems with as few as 6. We tried using broadcast instead of multicast, with the same result. There were no messages in the log file, nothing that could indicate Spread was having problems.
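
To put a number on the "over and over again" observation, something like the following rough sketch could be run over a saved strace log. It is only an illustration and not what we actually ran: the function name count_repeated_payloads, the list of syscalls, and the assumed line layout (optional PID, optional timestamp, then "call(args) = result") are all assumptions that would need adjusting to the real trace.

```python
import re
from collections import Counter

# Assumed strace line layout (hypothetical, adjust to the real trace):
#   [pid] [timestamp] syscall(args...) = result
SYSCALL_RE = re.compile(
    r'^(?:\d+\s+)?(?:[\d:.]+\s+)?(\w+)\((.*)\)\s+=\s+(-?\d+)'
)

# First double-quoted string in the argument list, honoring backslash escapes.
PAYLOAD_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def count_repeated_payloads(lines,
                            syscalls=("sendto", "recvfrom", "sendmsg", "recvmsg")):
    """Count how often each (syscall, payload) pair appears in the trace.

    A pair with a very high count suggests the same message is being
    sent or received repeatedly.
    """
    counts = Counter()
    for line in lines:
        m = SYSCALL_RE.match(line.strip())
        if not m or m.group(1) not in syscalls:
            continue  # not a network send/receive call we care about
        payload = PAYLOAD_RE.search(m.group(2))
        if payload:
            counts[(m.group(1), payload.group(1))] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python count_repeats.py < strace.log
    for (call, data), n in count_repeated_payloads(sys.stdin).most_common(5):
        print(n, call, data[:40])
```

Note that strace truncates printed buffers by default, so two different messages with the same prefix would be counted together; capturing with a larger string size would make the counts more trustworthy.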

Is anyone using a configuration similar to ours? Has anyone experienced similar problems?

We can provide more detailed debugging data; just point me in the right direction.

Shlomi Yaakobovich


 




