[Spread-users] spread clients don't receive messages

Shlomi Yaakobovich Shlomi at exanet.com
Sun Sep 26 05:00:22 EDT 2004


Hi Jonathan and all,

Sorry for not updating you on this issue. I am not sure we have finished testing yet, but here is what we have found so far.

Short version: spread seems to work well in a stable network environment, but it is very fragile when network errors (including hardware) are encountered.

Long version:
We've been testing spread with 4 nodes and 6 nodes, and focused on making our system run on 6 nodes (i.e. 6 spread daemons), with each node connected to the network through a single physical interface. For a while things worked OK, but then they suddenly stopped working. This was a very tricky problem; the cause turned out to be (at least) one faulty network cable. We replaced the cables, and things improved as far as spread was concerned. Still, we saw some alarming behavior: we use spflooder to periodically monitor spread's responses, and it turned out that messages of 1000+ bytes sent through the flooder got stuck, while 600-byte messages had no problem. We changed the message size to 600.

In the next test we put heavy load on the spread daemons to see how things behave. With a single interface on each node, things worked very well, although we were bound by the network's bandwidth. However, when we started using two interfaces on each node with Ethernet channel bonding (the bonding module in the Linux kernel) to improve performance, we started to see problems: nodes got disconnected by the spread daemons. At first we suspected the bonding module and played with it a bit (this even resulted in a patch we sent to its maintainers), but eventually it turned out that 2 of the 6 machines were simply slower than the rest of the group. They were so slow that after several minutes of heavy load they lagged about 1000 messages behind the rest of the group. 1000 is the maximum, since MAX_SESSION_MESSAGES was set to this value, so apparently spread worked "as designed" in this test.
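
To put rough numbers on the lag described above, here is a back-of-the-envelope sketch in Python. It is not how spread is implemented; it only assumes that a receiver is cut off once its backlog reaches a fixed cap (1000, the MAX_SESSION_MESSAGES value mentioned above). The function name and the message rates are made up for illustration.

```python
def seconds_until_cutoff(send_rate, drain_rate, max_backlog=1000):
    """Estimate how long a slow receiver lasts before its backlog hits the cap.

    send_rate   -- messages per second produced by the group (hypothetical)
    drain_rate  -- messages per second the slow node can handle (hypothetical)
    max_backlog -- backlog at which the node is cut off (1000 in our setup)

    Returns None when the receiver keeps up and the backlog never grows.
    """
    if drain_rate >= send_rate:
        return None
    # The backlog grows by the difference between the two rates every second.
    return max_backlog / (send_rate - drain_rate)


if __name__ == "__main__":
    # Hypothetical rates: a node that is only slightly slower than the group.
    print(seconds_until_cutoff(send_rate=500, drain_rate=497))
```

With these illustrative rates, a node that is less than 1% slower than the group reaches the cap in about five and a half minutes, which is in line with the "several minutes of heavy load" we observed.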

Next, we ran a test using 2 switches, with each of a node's interfaces connected to a different switch: for example, eth0 connected to SW-0 and eth1 connected to SW-1. We used bonding in this setup too. However, this test failed miserably... We had spread daemons crashing all over the place, with effects similar to those in the following post:

http://lists.spread.org/pipermail/spread-users/2003-December/001741.html

I did not see any replies/solutions to that problem. It was quite obvious that something in the network was not working well; we just needed to find out what. The annoying thing was that the rest of the applications on our nodes worked very well, and only spread had problems. One of the more apparent differences between them and spread is that they do not use multicast/broadcast. Yet again, it turned out to be a hardware problem, this time with the two switches we used: they did not support load balancing between two ports on two different switches (the switches could not be "stacked"). The switches probably got confused when they saw frames with the same src MAC address coming from two different locations (ports) and could not handle it. When using active-backup mode, i.e. only one cable sends packets, things worked well. When we connected both interfaces to the same switch, things also worked well. The bottom line is that the spread errors resulted from a bad network (switch) configuration/setup. Regardless, the spread crashes (as also described in the link I pasted) should be fixed, IMHO.

Another problem that we saw, and have no solution for at the moment, is that changing a network parameter in a way that requires a network restart (e.g. changing the MTU or disabling an interface) gives spread a big headache. The spread daemons seem to disconnect and fail to see each other. This resolves itself after a few minutes, but in our system that is too long, so we restart the spread daemons every time we restart the network, and this solves the problem. This is, however, a workaround at best, since we need to actively control each network change.
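
For illustration only, the workaround amounts to something like the Python sketch below: every network restart goes through one wrapper that restarts the spread daemon right afterwards. The function name is hypothetical, and the two commands in the usage example are placeholders rather than the actual commands on our nodes.

```python
import subprocess


def restart_network_then_spread(network_cmd, spread_cmd, run=subprocess.run):
    """Restart the network, then restart the spread daemon.

    network_cmd and spread_cmd are argument lists for whatever restart
    commands the system provides (placeholders here). With check=True,
    subprocess.run raises CalledProcessError on a non-zero exit status,
    so spread is not restarted if the network restart itself failed.
    """
    for cmd in (network_cmd, spread_cmd):
        run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder commands; substitute the init scripts used on your nodes.
    restart_network_then_spread(
        ["service", "network", "restart"],
        ["service", "spread", "restart"],
    )
```

The drawback is the one already mentioned: every network change has to go through a wrapper like this, so a change made by any other route still leaves the daemons disconnected for a few minutes.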

The results from our tests indicate that the problems do not originate from spread, but rather from the network: topology, configuration, hardware, etc. However, spread is VERY sensitive to these settings, too sensitive, I believe. I would expect a much more resilient approach from such an important piece of infrastructure, whose essence is to provide reliability in a network environment, even if there are sometimes network errors.


> -----Original Message-----
> From: Jonathan Stanton [mailto:jonathan at cnds.jhu.edu]
> Sent: Friday, September 24, 2004 8:09 AM
> To: Shlomi Yaakobovich
> Subject: Re: [Spread-users] spread clients don't receive messages
> 
> 
> Hi,
> 
> I was wondering if you were ever able to resolve the 6 node issue you
> described here? I was checking through the outstanding spread issues I'm
> aware of and noticed this discussion had trailed off on the list without
> any obvious resolution.
> 
> If you were able to duplicate it, was there any additional information you
> can provide about what the Spread daemon was doing when it kept
> re-receiving the messages from the client?
> 
> Thanks,
> 
> Jonathan



