[Spread-users] Fault Resilience?

Mon Nov 11 19:50:46 EST 2002

I have a question about Spread and fault resilience. I have two separate
Windows boxes on my LAN, each running a Spread daemon and a client written in
Python. I plan on deploying this setup across an unreliable link once I am
finished developing. With the idea of "resilient to faults across external
or internal networks" in mind, I decided to see how Spread handles the loss
of the network between the two client/daemon pairs. I am testing this by,
quite simply, momentarily pulling the Ethernet cable out of the back of my
system and plugging it back in. If I pull the plug for a short time, on the
order of 1-3 seconds, all is well: my clients pause, then resume once the two
daemons find each other again and message traffic restarts. But for faults
longer than about 3 seconds, both daemons seem to go through some sort of
reset and never send any more data. Here is the sequence:

Daemon on xxx:

<< pull the plug >>
Memb_token_loss: I lost my token, state is 1
Scast_alive: State is 2
Scast_alive: State is 2 << replug here is OK >>
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Memb_handle_token: handling form2 token
Handle_form2 in FORM
Memb_transitional
G_handle_trans_memb:
G_handle_trans_memb in GOP
Memb_regular
Membership id is ( 268183097, 1037061340)
--------------------
Configuration at xxx is:
Num Segments 2
        1       15.252.38.255     4803
                yyy              15.252.38.57
        0       15.252.39.255     4803
====================
G_handle_reg_memb:  with (15.252.38.57, 1037061340) id
G_handle_reg_memb in GTRANS
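
In case it helps anyone compare traces from the two daemons, here is a rough
Python sketch that collapses the repeated "State is N" lines above into a
short list of transitions. The helper name state_transitions is made up for
this post and is not part of Spread or the Python bindings; it assumes only
the line format shown in the trace and makes no claim about what the state
numbers mean.

```python
import re

# Matches daemon lines such as "Send_join: State is 4" or
# "Memb_token_loss: I lost my token, state is 1".
STATE_LINE = re.compile(r"^(\w+):.*\bstate is (\d+)", re.IGNORECASE)


def state_transitions(log_text):
    """Return (routine, state, count) tuples, collapsing consecutive repeats."""
    runs = []
    for line in log_text.splitlines():
        match = STATE_LINE.match(line.strip())
        if not match:
            continue  # not a state line (e.g. membership or config output)
        routine, state = match.group(1), int(match.group(2))
        if runs and runs[-1][0] == routine and runs[-1][1] == state:
            runs[-1][2] += 1
        else:
            runs.append([routine, state, 1])
    return [tuple(run) for run in runs]


# The state lines copied from the trace above.
SAMPLE = """\
Memb_token_loss: I lost my token, state is 1
Scast_alive: State is 2
Scast_alive: State is 2 << replug here is OK >>
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Memb_handle_token: handling form2 token
"""

if __name__ == "__main__":
    for routine, state, count in state_transitions(SAMPLE):
        print(f"{routine:20s} state={state} x{count}")
```

Running the same function over the trace from the other daemon should make it
easier to see at which step the two sides diverge.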

So, am I misunderstanding what "fault resilience" means, do I have
something configured incorrectly, or is there a problem in the way I am using
the daemon/client interface? (I can supply more info if needed.) I am using
the Python bindings v1.3 and Spread v3.17; both machines are running
Windows 2000 SP3.

Thanks,

don