[Spread-users] Tracking network down event [SEC=UNCLASSIFIED]

Pilling, Michael Michael.Pilling at dsto.defence.gov.au
Thu Feb 18 20:49:50 EST 2010


UNCLASSIFIED 

Yanchao, 

My understanding is that you will be notified of this event, but only when you attempt to use the network, because Spread reduces the network load of its failure detection algorithm by detecting failures only on use. If necessary, you could implement some kind of ping process group that sends periodic messages through Spread to force early detection. This, of course, is a balancing act between detection latency and extra network traffic. If your application uses Spread frequently, this may not be necessary.
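
A minimal sketch of such a ping process, in Python, under stated assumptions: `send` is a hypothetical callable that multicasts a small message to a dedicated ping group through whatever Spread binding you use, and `Heartbeat` and `on_error` are names invented for this illustration, not part of Spread.

```python
import threading


class Heartbeat:
    """Call `send` every `interval` seconds until stopped.

    `send` is a hypothetical wrapper around your Spread binding's
    multicast call; it is an assumption, not a Spread API.
    """

    def __init__(self, send, interval, on_error=None):
        self._send = send
        self._interval = interval
        self._on_error = on_error
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() has
        # been called, so this sends one ping per interval and exits
        # promptly when stopped.
        while not self._stop.wait(self._interval):
            try:
                self._send()
            except Exception as exc:
                # A failed send is itself a hint that connectivity is lost.
                if self._on_error is not None:
                    self._on_error(exc)
```

A shorter interval detects an outage sooner but adds traffic, which is the balancing act mentioned above. Whether the connection may be shared between this thread and the rest of the application depends on the binding, so check that before using it this way.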

Also, it will not in any way diagnose the cause of the problem. It will simply tell each party that it can now communicate with only a subset of the parties it could previously reach, and it will say that the partitioning was caused by a network failure, not by application instances crashing or by application parties voluntarily resigning from the group.
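
As a rough illustration of telling these causes apart, the sketch below decodes the service type that accompanies a received message. The flag values are what I recall from Spread's `sp.h` and should be treated as assumptions to verify against your own header; `membership_cause` is a hypothetical helper, not a Spread function.

```python
# Flag values as recalled from Spread's sp.h. Assumptions: verify them
# against the header shipped with your Spread version.
CAUSED_BY_JOIN = 0x0100
CAUSED_BY_LEAVE = 0x0200
CAUSED_BY_DISCONNECT = 0x0400
CAUSED_BY_NETWORK = 0x0800
TRANSITION_MESS = 0x2000
MEMBERSHIP_MESS = 0x3F00


def membership_cause(service_type):
    """Return why the group membership changed, or None for a data message."""
    if not service_type & MEMBERSHIP_MESS:
        return None  # regular application message
    if service_type & CAUSED_BY_NETWORK:
        return "network"  # partition or merge
    if service_type & CAUSED_BY_DISCONNECT:
        return "disconnect"  # a member's connection was lost, e.g. a crash
    if service_type & CAUSED_BY_LEAVE:
        return "leave"  # a member left voluntarily
    if service_type & CAUSED_BY_JOIN:
        return "join"
    if service_type & TRANSITION_MESS:
        return "transition"
    return "unknown"
```

An application would call this on the service type of each received message and start its reconnect or alerting logic only when the result is "network".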

Each party would therefore get a partial view of the failure. Since the network is, by definition, down, it cannot be used to integrate these separate views for strong failure location or diagnostics. Once the network is re-established, however, you could program your application to swap information and do fault localisation for past incidents. This can be useful for fault characterisation and for obtaining a system reliability history.
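
One way the after-the-fact swap could look, as a sketch only: each node keeps a local log of (timestamp, members it could still see) entries, and after reconnection the logs are exchanged and merged into one timeline. The log format and the name `merge_views` are assumptions made for this example, and the timestamps are only as comparable as the nodes' clocks are synchronised.

```python
def merge_views(logs):
    """Merge per-node membership logs into a single timeline.

    `logs` maps a node name to a list of (timestamp, visible_members)
    entries recorded locally while the network was partitioned.
    Returns (timestamp, node, members) tuples ordered by time, then node.
    """
    timeline = [
        (ts, node, frozenset(members))
        for node, events in logs.items()
        for ts, members in events
    ]
    timeline.sort(key=lambda entry: (entry[0], entry[1]))
    return timeline
```

Reading the merged timeline shows which nodes lost sight of which peers and in what order, which is the raw material for a reliability history.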

Therefore, when one of your application processes notices a network partition, and fast fault repair is important to you, your Spread application should notify a human or some other system that can communicate by other means, so that a more global view can be gathered to locate, diagnose and correct the fault. This more global view may include information gathered locally by each Spread application node.

Once the network fault is corrected, Spread will automatically notice that the network is reconnected and issue further group membership change messages, allowing the application to restart and recover as appropriate. The beauty of extended virtual synchrony is that it gives all parties a consistent view of the network failure, so that the path to recovery is clear (although it depends on application semantics).

Regards,
Michael 


DSTO
PO BOX 1500
Att: Dr Michael Pilling
C3ID
Building 205
Edinburgh SA 5111
Ph +61 8 8259 7017
Fx +61 8 8259 5589 


Important: This document remains the property of the Australian Defence
Organisation and is subject to the jurisdiction of the Crimes Act Section
70. If you have received this document in error, you are requested to
contact the sender and delete the document.


_____________________________________________
From:   spread-users-bounces at lists.spread.org [mailto:spread-users-bounces at lists.spread.org]  On Behalf Of Guo, Yanchao
Sent:   Friday, 19 February 2010 11:35
To:     spread-users at lists.spread.org
Subject:        Tracking network down event 

Hi all, 

I have a server application and a client application that communicate via the Spread framework, and they are meant to run for long periods of time (>7 days). As they are connected via VPN, I am wondering whether Spread is able to detect a network outage. That is, if one of the routers along the path goes down, will the members be notified of this event, so that I can start some auto-reconnect process?

Thanks.
Yanchao


