[Spread-users] Spread daemons crashing on a wireless network among windows machines.
gherron at islandtraining.com
Mon Dec 22 13:04:40 EST 2003
I'm getting a consistent crash of the spread server. Before
attempting to debug (or deciding to scrap spread -- something I'd
rather not do), I thought I'd describe the situation here and see what
advice people can offer.
I'm running spread daemons on each of a number of (up to 11) laptops
running windows and connected via a wireless network. On one machine
runs a small master program which sends out a 100KB packet every 30
seconds or so, while running on the other machines is a small client
which merely accepts the packet and sends back a response to the
master. (This communication vaguely simulates my real application.)
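To make the traffic pattern concrete, here is a minimal Python sketch of
it. It is not the actual test program: the Spread calls are replaced by
hypothetical send and recv callbacks, and the function names, the 4-byte
sequence header, and the response format are all assumptions made for
illustration.

```python
import struct
import time

PAYLOAD_SIZE = 100 * 1024   # roughly the 100KB packet described above
SEND_INTERVAL = 30.0        # seconds between packets from the master


def make_packet(seq, size=PAYLOAD_SIZE):
    """Build a fixed-size packet: 4-byte sequence number plus filler bytes."""
    header = struct.pack("!I", seq)
    return header + b"x" * (size - len(header))


def make_response(packet, client_name):
    """Client side: answer a packet with its sequence number and our name."""
    (seq,) = struct.unpack("!I", packet[:4])
    return struct.pack("!I", seq) + client_name.encode("ascii")


def parse_response(response):
    """Master side: recover (sequence number, client name) from a response."""
    (seq,) = struct.unpack("!I", response[:4])
    return seq, response[4:].decode("ascii")


def run_master(send, recv, clients, rounds,
               interval=SEND_INTERVAL, sleep=time.sleep):
    """Send one packet per interval and record which clients never answered.

    send(data) and recv(timeout) are hypothetical transport callbacks;
    recv returns the next response as bytes, or None on timeout.
    Returns {sequence number: set of client names that did not respond}.
    """
    missing = {}
    for seq in range(rounds):
        send(make_packet(seq))
        pending = set(clients)
        deadline = time.monotonic() + interval
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            response = recv(remaining)
            if response is None:
                break
            got_seq, name = parse_response(response)
            if got_seq == seq:
                pending.discard(name)
        if pending:
            missing[seq] = pending
        rest = deadline - time.monotonic()
        if rest > 0:
            sleep(rest)   # wait out the remainder of the interval
    return missing
```

Recording which sequence numbers went unanswered, and by whom, gives
something to line up against the daemon output when things go wrong.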
The whole thing runs more-or-less well for a while but invariably one
or more of the spread servers crashes with the message
Discard_packets: packet X not exist
Exit caused by Alarm(EXIT)
(Where X varies: 3, 4, 83, ...)
Here are some observations:
* When I start up several daemons (at a command prompt so I can watch
the output), it can be several minutes before they all establish
communication with each other.
* Every 10-40 seconds or so, the daemons all report the
"Configuration at XXX is:". Sometimes the only thing that has
changed is the "Membership id is ( 168820993, 1072044166)", but
often the membership is erroneously reported as changed, with one
or more machines being dropped from the list. Usually after a round or
two the membership is back up to full count. This constant
membership loss and re-establishment occurs even before the test
programs are run, when only the daemons are communicating among
themselves.
* If I run on a wired network, things seem to work better, but I
can't be sure of that because I can only wire up about four
machines.
* Often while running the test program, communication gets backed-up
with none of the packets getting sent out to the client programs
for a minute or two. Eventually the log-jam is broken and a bunch
of packets go out from the master to the clients. I think (but
have not established this for sure) that the spread server crashes
occur during these log-jams. I also think (but have not
established this either) that the log-jams occur near the times
where the membership among the servers is erroneously changed.
It's been my observation that wireless networks stress networking
applications. However, I hope I can get spread to work in this
environment with my application, because it really appears to be the
right tool for what I'm doing.
Any thoughts or advice?