[Spread-users] Fault Resilience?

WELCH,DONALD J (HP-Vancouver,ex1) donald_welch at hp.com
Wed Nov 13 12:13:20 EST 2002


Yes, on xxx, I have the (master) client join group 'master' and on yyy, I
have the (slave) client join group 'slave'. Both receive membership messages
when they start. When I want to send a message from 'master' to 'slave', the
client on xxx uses multicast() and sends to the 'slave' group, and vice
versa for the slave.

I can't really tell whether they get a membership message after the
disconnect, because they are blocked in either a multicast() or a receive()
call. Is this possibly the problem (that they aren't getting a membership
message)? Should I be using poll()? What code flow is required to handle
disconnect/reconnect faults?
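
For illustration, one way to avoid blocking forever in receive() is to poll
first and give up after a deadline. This is only a minimal sketch, not taken
from the Spread documentation. It assumes the mailbox object exposes poll()
(returning a true value when a message is waiting) and receive(), as named
above. The helper receive_with_timeout and its parameters are hypothetical.

```python
import time


def receive_with_timeout(mbox, timeout=5.0, interval=0.05):
    """Return the next message from mbox, or None if none arrives in time.

    Assumption: mbox.poll() is true when a message is waiting, so the
    following mbox.receive() call returns without blocking.
    """
    deadline = time.monotonic() + timeout
    while True:
        if mbox.poll():
            return mbox.receive()
        if time.monotonic() >= deadline:
            return None  # caller decides how to treat the silence
        time.sleep(interval)  # do not spin the CPU between polls
```

A None result gives the client a point at which to log the silence, check
for a membership change, or decide to reconnect, instead of hanging.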

I have had trouble programming to the Spread interface because the semantics
of the API are not very clear. The syntax is well documented, but the order
and use of the API calls as they relate to each other is not. There is no
mention (that I could find) in the docs about how to provide code pathways
for faults and other anomalies.
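
As an example of the kind of fault pathway I mean, a client could wrap its
connect-and-join sequence in a retry loop. The sketch below is an assumption
about how such a path might look, not something the Spread docs prescribe.
The connect argument is a hypothetical zero-argument callable standing in
for the real connect call, and the object it returns is assumed to have a
join(group) method.

```python
import time


def connect_with_retry(connect, groups, attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, then rejoin every group.

    connect: hypothetical zero-argument callable returning a mailbox-like
    object with a join(group) method.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            mbox = connect()
        except Exception as exc:  # real bindings raise their own error type
            last_error = exc
            sleep(delay * (2 ** attempt))  # back off: delay, 2*delay, ...
            continue
        for group in groups:
            mbox.join(group)  # group membership does not survive a reconnect
        return mbox
    raise ConnectionError(
        "could not reach daemon after %d attempts" % attempts
    ) from last_error
```

The master client would call this with groups=['master'] and the slave with
groups=['slave'], both at startup and whenever its session is lost.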

Thanks,

/don/

Here is a more complete log:

<< pulled the plug >>
Memb_token_loss: I lost my token, state is 1
Scast_alive: State is 2
Scast_alive: State is 2
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4 
Memb_handle_token: handling form2 token
Handle_form2 in FORM
Memb_transitional
G_handle_trans_memb:
G_handle_trans_memb in GOP
Memb_regular
Membership id is ( 268183097, 1037061340)
--------------------
Configuration at xxx is:
Num Segments 2
        1       15.252.38.255     4803
                xxx              15.252.38.57
        0       15.252.39.255     4803
====================
G_handle_reg_memb:  with (15.252.38.57, 1037061340) id
G_handle_reg_memb in GTRANS
Send_join: State is 4
Memb_handle_message: handling join message from 268183316, State is 4
Send_join: State is 4
Memb_handle_message: handling join message from 268183316, State is 4
Memb_handle_message: handling join message from 268183316, State is 4
Send_join: State is 4
Memb_handle_message: handling join message from 268183316, State is 4
Send_join: State is 4
Memb_handle_message: handling join message from 268183316, State is 4
Send_join: State is 4
Memb_handle_token: handling form2 token
Handle_form2 in FORM
Memb_handle_token: handling form2 token
Handle_form2 in EVS
Memb_transitional
G_handle_trans_memb:
G_handle_trans_memb in GOP
Memb_regular
Membership id is ( 268183097, 1037061405)
--------------------
Configuration at xxx is:
Num Segments 2
        1       15.252.38.255     4803
                xxx              15.252.38.57
        1       15.252.39.255     4803
                yyy              15.252.39.20
====================
G_handle_reg_memb:  with (15.252.38.57, 1037061405) id
G_handle_reg_memb in GTRANS
G_handle_reg_memb: (  AGREED) GROUPS message sent in GTRANS with 86 bytes
G_handle_groups:
G_handle_groups in GGATHER
G_handle_groups: GROUPS message received from xxx - msgs 1, daemons 1
G_handle_groups:
G_handle_groups in GGATHER
G_handle_groups: GROUPS message received from yyy - msgs 2, daemons 2
G_handle_groups: Last GROUPS message received - msgs 2, daemons 2
G_compute_and_notify:
++++++++++++++++++++++
Num of groups: 2
[1] group master with 1 members:
        [1] #master#xxx
----------------------
[2] group slave with 1 members:
        [1] #slave#yyy
----------------------
<< killed the xxx master client due to hang >>>
Sess_read: failed receiving header on session 320: ret -1: error:
WSAECONNRESET:
 The virtual circuit was reset by the remote side executing a hard or
abortive close. For UPD sockets, the remote host was unable to deliver a
previously sent UDP datagram and responded with a Port Unreachable ICMP
packet. The application should close the socket as it is no longer usable.
Sess_kill: killing session master ( mailbox 320 )
G_handle_kill: #master#xxx is killed
G_handle_kill in GOP
^C << killed spread >>


-----Original Message-----
From: Yair Amir [mailto:yairamir at cnds.jhu.edu]
Sent: Monday, November 11, 2002 5:26 PM
To: WELCH,DONALD J (HP-Vancouver,ex1)
Cc: spread-users at lists.spread.org
Subject: Re: [Spread-users] Fault Resilience?


Are your clients getting a membership message once you disconnect?
(Are you having your clients join a group?)

    :) Yair.
    
ex1)> I have a question about Spread and fault resilience. I have two separate
ex1)> Windows boxes on my LAN, each running a Spread daemon and a client
ex1)> written in Python. I plan on deploying this setup across an unreliable
ex1)> link once I am finished developing. With the idea of "resilient to
ex1)> faults across external or internal networks" in mind, I decided to see
ex1)> how Spread handles the loss of the network between the two
ex1)> client/daemon pairs. I am testing this by, quite simply, momentarily
ex1)> pulling the ethernet cable out of the back of my system and plugging it
ex1)> back in. If I pull the plug for a short time, on the order of 1-3
ex1)> seconds, all is well. My clients pause and resume once the two daemons
ex1)> seem to find each other again and resume sending message traffic. But
ex1)> for faults of >~3 secs, the daemons both seem to go through some sort
ex1)> of reset and never send any more data. Here is the sequence:

ex1)> Daemon on xxx:

ex1)> << pull the plug >>
ex1)> Memb_token_loss: I lost my token, state is 1
ex1)> Scast_alive: State is 2
ex1)> Scast_alive: State is 2 << replug here is OK >>
ex1)> Send_join: State is 4
ex1)> Send_join: State is 4
ex1)> Send_join: State is 4
ex1)> Send_join: State is 4
ex1)> Send_join: State is 4
ex1)> Memb_handle_token: handling form2 token
ex1)> Handle_form2 in FORM
ex1)> Memb_transitional
ex1)> G_handle_trans_memb:
ex1)> G_handle_trans_memb in GOP
ex1)> Memb_regular
ex1)> Membership id is ( 268183097, 1037061340)
ex1)> --------------------
ex1)> Configuration at xxx is:
ex1)> Num Segments 2
ex1)>         1       15.252.38.255     4803
ex1)>                 yyy              15.252.38.57
ex1)>         0       15.252.39.255     4803
ex1)> ====================
ex1)> G_handle_reg_memb:  with (15.252.38.57, 1037061340) id
ex1)> G_handle_reg_memb in GTRANS

ex1)> So, am I incorrect in understanding what "fault resilience" means, do I
ex1)> have something configured incorrectly, or is there a problem in the way
ex1)> I am using the daemon/client interface (I can supply more info if
ex1)> needed)? I am using the Python bindings v1.3 and Spread v3.17, and both
ex1)> machines are running Windows 2000 SP3.

ex1)> Thanks,

ex1)> don





