[Spread-users] Spread daemons crashing on a wireless network among windows machines.

Gary Herron gherron at islandtraining.com
Mon Dec 22 13:04:40 EST 2003


Hi,

I'm getting a consistent crash of the spread server.  Before
attempting to debug (or deciding to scrap spread -- something I'd
rather not do), I thought I'd describe the situation here and see what
advice people can offer.

I'm running a spread daemon on each of up to 11 laptops, all running
Windows and connected via a wireless network.  One machine runs a
small master program that sends out a 100KB packet every 30 seconds
or so; each of the other machines runs a small client that merely
accepts the packet and sends a response back to the master.  (This
communication vaguely simulates my real application.)

The whole thing runs more-or-less well for a while but invariably one
or more of the spread servers crashes with the message

    Discard_packets: packet X not exist
    Exit caused by Alarm(EXIT)

    (Where X varies: 3, 4, 83, ...)

Here are some observations:

 * When I start up several daemons (at a command prompt so I can watch
   the output), it can be several minutes before they all establish
   communication with each other.

 * Every 10-40 seconds or so, the daemons all report the
   "Configuration at XXX is:".  Sometimes the only thing that has
   changed is the "Membership id is ( 168820993, 1072044166)", but
   often the membership is erroneously reported as changed, with one
   or more machines being dropped from the list.  Usually after a round or
   two the membership is back up to full count.  This constant
   membership loss and re-establishment occurs even before the test
   programs are run, when only the daemons are communicating among
   themselves.

 * If I run on a wired network, things seem to work better, but I
   can't be sure of that because I can only wire up about four
   machines.

 * Often while running the test program, communication gets backed up
   with none of the packets getting sent out to the client programs
   for a minute or two.  Eventually the log-jam is broken and a bunch
   of packets go out from the master to the clients.  I think (but
   have not established this for sure) that the spread server crashes
   occur during these log-jams.  I also think (but have not
   established this either) that the log-jams occur near the times
   where the membership among the servers is erroneously changed.
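
The two unconfirmed suspicions in the last item could be checked
against captured daemon output.  What follows is only a minimal sketch
in Python, under these assumptions: each output line has been paired
with a capture time in seconds (the daemon output quoted above carries
no timestamps, so that pairing has to be added by whatever captures
the output), and the messages look like the two quoted above.  The
function names and the 60-second window are hypothetical.

```python
import re

# Patterns taken from the messages quoted in this post.
MEMBERSHIP_RE = re.compile(r"Membership id is \(\s*(\d+),\s*(\d+)\s*\)")
CRASH_RE = re.compile(r"Discard_packets: packet (\d+) not exist")


def scan_log(lines):
    """Scan (timestamp_seconds, text) pairs from one daemon's output.

    Returns (changes, crashes):
      changes: (timestamp, membership_id) each time the id differs from
               the previous one seen (the first id seen is included)
      crashes: (timestamp, packet_number) for each Discard_packets line
    """
    changes, crashes = [], []
    last_id = None
    for ts, text in lines:
        m = MEMBERSHIP_RE.search(text)
        if m:
            current = (int(m.group(1)), int(m.group(2)))
            if current != last_id:
                changes.append((ts, current))
                last_id = current
            continue
        c = CRASH_RE.search(text)
        if c:
            crashes.append((ts, int(c.group(1))))
    return changes, crashes


def crashes_near_changes(changes, crashes, window_s=60.0):
    """Return the crashes that fall within window_s seconds of any
    membership change (the window size is an arbitrary assumption)."""
    return [
        (ts, pkt)
        for ts, pkt in crashes
        if any(abs(ts - cts) <= window_s for cts, _ in changes)
    ]
```

If most crashes land inside the window, that would support the idea
that the crashes cluster around membership changes; it would not by
itself show which one causes the other.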


It's been my observation that wireless networks stress networking
applications.  However, I hope I can get spread to work in this
environment with my application, because it really appears to be the
right tool for what I'm doing.

Any thoughts or advice?

Thanks much,

Gary Herron
