[Spread-users] Fault Tolerant Server

Yair Amir yairamir at cnds.jhu.edu
Sun Dec 12 21:05:39 EST 2004


David,

A lot of what you do is not needed when using Spread. The group 
membership notifications give you the whole discovery process for 
free. It is also guaranteed that the group membership lists will 
contain the members in the same order for all members.
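
That ordering guarantee is what makes leaderless election trivial. A minimal Python sketch of the idea (the member names and the plain list are stand-ins; a real client would receive the ordered list in a Spread membership message):

```python
# Sketch of master election from a Spread-style membership list.
# Spread delivers the same, identically ordered member list to every
# group member, so each member can pick the master deterministically
# with no extra discovery protocol.

def elect_master(members):
    """Every member runs this on the same list and agrees on the result."""
    if not members:
        return None
    return members[0]  # first member in the delivered order

def my_rank(members, me):
    """Rank of this member within the current membership."""
    return members.index(me)

# All members receive the same ordered list, so all agree:
membership = ["#svc-a#daemon1", "#svc-b#daemon2", "#svc-c#daemon1"]
print(elect_master(membership))               # "#svc-a#daemon1"
print(my_rank(membership, "#svc-b#daemon2"))  # 1
```

No messages need to be exchanged beyond what Spread already delivers; agreement falls out of every member computing the same function over the same list.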

Play a bit with sptuser or spuser: try it with several members
and several groups and see how it works.

If I missed something, let me know.

Cheers,

	:) Yair.

David Avraamides wrote:
> The basic approach I use is what I call "discovery". In our world each
> type of service would publish on a different spread group and whenever a
> service comes up it broadcasts a discovery request message on the group.
> Any other instances of the service respond with a discovery reply and
> the requesting service adds their private group name to a list. This
> service then adds itself to the list. The list is sorted alphabetically
> and this determines the ranking of the peers. Additionally, each service
> sends out heartbeats and a listening service updates the last heartbeat
> time of that peer. A background timer removes stale peers from the peer
> list. The net effect is that each instance of a service should maintain
> an identical list of the active peers in the peer group.
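
The heartbeat bookkeeping described above can be sketched as follows (the class and names, and the fixed 5-second window, are assumptions for illustration; the post says the timeout is still being tuned):

```python
import time

# Each instance records the last heartbeat time per peer, a background
# timer prunes stale entries, and rank is the alphabetical position in
# the surviving peer list, so all instances converge on the same view.

STALE_AFTER = 5.0  # seconds; the post uses ~5s as a starting point

class PeerTable:
    def __init__(self):
        self.last_seen = {}  # private group name -> last heartbeat time

    def heartbeat(self, peer, now=None):
        self.last_seen[peer] = time.time() if now is None else now

    def prune(self, now=None):
        now = time.time() if now is None else now
        for peer, seen in list(self.last_seen.items()):
            if now - seen > STALE_AFTER:
                del self.last_seen[peer]

    def ranked_peers(self):
        # Alphabetical sort gives every instance the same ranking.
        return sorted(self.last_seen)

table = PeerTable()
table.heartbeat("#calc-b#host2", now=100.0)
table.heartbeat("#calc-a#host1", now=100.0)
table.prune(now=104.0)          # both heartbeats 4s old: still fresh
print(table.ranked_peers())     # ['#calc-a#host1', '#calc-b#host2']
table.heartbeat("#calc-a#host1", now=106.0)
table.prune(now=109.0)          # #calc-b now 9s stale, #calc-a 3s fresh
print(table.ranked_peers())     # ['#calc-a#host1']
```

Passing `now` explicitly keeps the sketch deterministic; a real service would let `heartbeat` and `prune` default to the wall clock.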
> 
> All services go through this process and the decision of whether a
> service should implement fault-tolerance, horizontal scaling, or both is
> up to the derived class. In the fault-tolerant only case, the peer with
> rank 0 is the only one that publishes messages. All of the peers will
> "hear" requests but only the 0-th peer in the ranking is active (i.e.
> the master). If it goes down, planned or unplanned, the rest of the
> peers will adjust their list and one of them will become the new master
> and start publishing. This works whether the service is a request-reply
> model or a pub-sub model. In the pub-sub model, after discovery, the
> publisher sends out a topic list request so all clients will let it know
> what topics they are listening to. This way it should maintain the same
> subscription list as the master publisher.
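
The fault-tolerant-only mode reduces to a one-line check, sketched here with made-up peer names: every peer sees the same sorted list, only rank 0 publishes, and when the master drops out of the list the next peer takes over automatically.

```python
def is_active(peers, me):
    """True if this instance is the master (rank 0) and should publish."""
    return sorted(peers)[0] == me

peers = ["svc-a", "svc-b", "svc-c"]
print(is_active(peers, "svc-a"))  # True: svc-a is the master
print(is_active(peers, "svc-b"))  # False: hears requests but stays quiet

peers.remove("svc-a")             # master dies, planned or unplanned
print(is_active(peers, "svc-b"))  # True: svc-b becomes the new master
```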
> 
> For the horizontal scaling case, I simply "distribute" the requests
> among the peers by modding the request ID (set uniquely by the client)
> with the number of peers in the peer list. If the remainder matches this
> instance's rank, it processes the message; otherwise it discards it. For
> subscriptions, I don't have a request ID so I mod the hash of the topic
> ID. Same diff - each client request is handled by one service.
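
A sketch of that partitioning rule: mod the client-assigned request ID (or a hash of the topic, when there is no ID) by the peer count and compare with this instance's rank, so exactly one peer handles each request. `zlib.crc32` stands in for whatever hash the real service uses; unlike Python's built-in `hash()`, it is stable across processes.

```python
import zlib

def handles_request(request_id, peers, me):
    """True if this instance owns the request, by rank == id mod peers."""
    rank = sorted(peers).index(me)
    return request_id % len(peers) == rank

def handles_topic(topic, peers, me):
    """Same rule for subscriptions, keyed on a stable hash of the topic."""
    rank = sorted(peers).index(me)
    return zlib.crc32(topic.encode()) % len(peers) == rank

peers = ["svc-a", "svc-b", "svc-c"]
# Each request ID maps to exactly one peer:
for req in range(5):
    owners = [p for p in peers if handles_request(req, peers, p)]
    assert len(owners) == 1
print([p for p in peers if handles_request(7, peers, p)])  # 7 % 3 == 1 -> ['svc-b']
```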
> 
> Issues:
> - I don't support true load balancing; it's really just dynamic request
> partitioning. But that's fine for our needs (so far).
> - There are possible race conditions (server dies while processing a
> request and no other service will pick up the request). I just let the
> client handle it by retrying the request later. A typical client will
> blast out a large number of requests, monitor the replies, and after
> some time period, time out and examine/retry any missing replies. If two
> replies are received by a client, the last one wins. In our world (hedge
> fund) the latest one is the best one to use.
> - I'm still playing with good timeouts for how long to wait before
> marking a service as stale or waiting for discovery replies (right now 5
> seconds).
> - Some services partition better on a less random basis; for example, our
> calculation service is best partitioned by the type of model
> (convertible bond, option, CDS, etc.), so it can look at the instrument
> type of the request rather than modding the request ID. The point is
> that it's up to the specific implementation of the service to make this
> decision.
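
Domain-aware partitioning of that kind might look like the sketch below. The assignment table and peer names are assumptions for illustration; the post deliberately leaves the policy to each service implementation.

```python
# Key on the instrument type so each model family lives on one instance,
# falling back to a cheap stable hash of the type for anything unlisted.

def owner_rank(instrument_type, peer_count):
    table = {"convertible_bond": 0, "option": 1, "cds": 2}
    return table.get(instrument_type, sum(map(ord, instrument_type))) % peer_count

def handles(instrument_type, peers, me):
    return owner_rank(instrument_type, len(peers)) == sorted(peers).index(me)

peers = ["calc-a", "calc-b", "calc-c"]
print([p for p in peers if handles("option", peers, p)])  # ['calc-b']
print([p for p in peers if handles("cds", peers, p)])     # ['calc-c']
```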
> 
> I've also written a launcher service that can start/stop other services
> proactively or reactively, thus our risk scenario process could ask the
> launcher service to start up the calculation service on every machine on
> the LAN, run our risk scenarios, then ask them to be stopped (that
> hasn't really been tested yet, but it's designed in). My plan is to
> deploy the launcher and "service" assembly (this is all in C#) on every
> machine in the firm (client or server) and make them available as worker
> machines if/when needed.
> 
> -Dave
> 
> -----Original Message-----
> From: spread-users-admin at lists.spread.org
> [mailto:spread-users-admin at lists.spread.org] On Behalf Of Mike Perik
> Sent: Thursday, December 09, 2004 11:54 AM
> To: spread-users at lists.spread.org
> Subject: RE: [Spread-users] Fault Tolerant Server
> 
> I would be interested.  
> 
> Scaling is another thing that I'm interested in or load balancing with
> failover.  Ultimately, I'd like to split the load across servers and if
> one of them goes down, the other server(s) would pick up its load.  
> 
> I've implemented a very simple algorithm in which the servers join a
> group and send a simple message requesting who is the current publisher.
> It then backs off a random amount of time (< 1 sec), and if after a few
> requests it hasn't gotten a response, it claims to be the publisher and
> starts to publish.
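
That back-off/claim protocol can be sketched as below. The transport is stubbed out as a callable; a real implementation would send the query on the Spread group and listen for replies.

```python
import random
import time

MAX_TRIES = 3

def claim_publisher(ask, tries=MAX_TRIES, backoff=False):
    """ask() returns the current publisher's name, or None on no reply."""
    for _ in range(tries):
        answer = ask()
        if answer is not None:
            return answer                # someone else already publishes
        if backoff:
            time.sleep(random.random())  # random back-off, < 1 second
    return "me"                          # nobody answered: claim the role

print(claim_publisher(lambda: None))     # silent group: claim it ('me')
print(claim_publisher(lambda: "svc-b"))  # defer to the existing publisher
```

Note this scheme can still elect two publishers if both ask during the same window, which is exactly the race the group-membership approach above avoids.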
> 
> In order to do load balancing I would have to move this type of logic
> from the group level to the subject level.
> 
> Any other ideas/methods for handling this?
> 
> Thanks,
> Mike
> 
> --- David Avraamides
> <David.Avraamides at SevernRiverCapital.com> wrote:
> 
> 
>>We have implemented a redundant server model here to handle both 
>>server failures and to provide server scale-out where appropriate. It's
>>built on top of Spread, thus at this time it doesn't address failures 
>>of the Spread daemon itself, rather our notion of a messaging service 
>>that sits on top of the Spread network. I'm not sure if that's what 
>>you meant. I can give more details if you are interested.
>>-Dave
>>
>>________________________________
>>
>>From: spread-users-admin at lists.spread.org on behalf of Mike Perik
>>Sent: Thu 12/9/2004 9:37 AM
>>To: spread-users at lists.spread.org
>>Subject: [Spread-users] Fault Tolerant Server
>>
>>
>>
>>Has anyone implemented fault tolerance into their system? I'm 
>>planning on implementing something that would allow multiple servers 
>>to back each other up. If the primary goes down, then one of the 
>>backups picks up the broadcasting of data.
>>
>>I'd be interested in any designs you've used and how they have worked.
>>
>>Thanks,
>>Mike
>>
>>
>>
>>_______________________________________________
>>Spread-users mailing list
>>Spread-users at lists.spread.org
>>http://lists.spread.org/mailman/listinfo/spread-users
> 
> 




