[Spread-users] distributed financial application model - Spread group list access?

Tue Dec 7 20:54:25 EST 2004

My responses are inline below.  If you'd like to discuss more details,
or you have more questions, let me know.

Cheers,
Ryan

On Tue, 07 Dec 2004 13:02:08 -0500, Damon Hart <dhcom at sundial.com> wrote:
> Hi all -
> 
> I'm considering using Spread to maintain synchronicity across a
> distributed set of financial data producers and consumers. I have a
> design in mind based on an introductory understanding of Spread, but
> it has weaknesses for which perhaps more expert Spread users can make
> useful suggestions.
> 
> FRAMEWORK
> 
> A central database contains financial data, mainly various time
> series, on a large number of different instruments (a minimum of
> perhaps 10000 instruments and quite possibly many multiples of this.)
> 
> Producers (sources) are continually adding new data to the database at
> a fairly high rate (typically several hundred data items per second.)
> Each item is an addition, update or deletion of data associated with a
> particular instrument. There are relatively few producers at any given
> time (typically 2-4.)
> 
> Consumers (sinks) query the database to retrieve data for a single
> instrument or small set of instruments (up to around 50.) There are
> typically many more consumers than producers simultaneously running
> (perhaps 20 - 30.)  There is usually a high degree of overlap in the
> instrument sets of simultaneous consumers, such that the aggregate set
> of active instruments likely totals only 100-150 at any given time.
> 
> Producers and consumers are independent applications running on
> networked machines. Since queries to the database generally involve
> retrieval and construction of long time series of data, they are
> relatively expensive and consumers should cache the objects they
> construct from database queries wherever possible.
> 
> ISSUE/MODEL
> 
> The issue at hand is how best to alert consumers when their cached
> objects are invalidated by new data from producers. This could be
> accomplished by having all consumers poll the central database for
> changes each time before using their cached data. However, this
> largely negates the value of caching the data and does not scale well
> with many consumers making numerous repeated database queries.
> 
> The model I think makes sense would map an active instrument (i.e. at
> least one consumer holds cached data of this instrument) to a Spread
> group named based on the unique id of the instrument. A consumer would
> join Spread groups for each instrument it has in active use (1-50
> groups.)  Producers would continue to add data to the database, but
> would in addition issue Spread messages for new data in an instrument
> for which there has been a group established by a consumer join. In
> this way, consumers can get updates only for instruments in which they
> have an active interest without polling the database and producers can
> issue messages only for instruments which have some interested
> consumer.
> 

This makes sense.

> PROBLEM
> 
> The pitfall seems to be that there is no easy means for a producer to
> discover the list of Spread groups upon startup. I have scanned the
> mailing list archives, which seem to confirm that there is no API function
> to get a list of groups and members. Establishing a control group to
> which membership updates are replicated can serve to maintain a list
> of active groups, but does not appear to help build the list in the
> first place.
> 
> - can a producer somehow build an active group list, short of joining
>   every possible group and examining the resulting membership
>   messages? In this model, there are many possible groups of which
>   only a relative handful will have an existing member (consumer), so
>   joining and leaving each one seems a very poor way to discover the
>   active ones. Since it can issue messages to any group regardless of
>   membership, a producer should not need even to join each active
>   group, presuming there is a control group for membership updates
>   with which to maintain the producer's group list.
> 

Yes, you seem to understand Spread well.  I think the key is that each
consumer should send an AGREED message to the control group before
joining an instrument-subscription group, and another AGREED message
to the control group right before leaving it.  That lets the producers
maintain an active subscriber count or list for each group.

This doesn't handle consumers that disconnect without sending the
leave message, but for that you could add a periodic re-notification
by all interested members, and have the producers garbage-collect
groups that they think are empty after some number of missed
re-notification intervals.  The problem I see is that there isn't a
good way to bootstrap this if all of the producers fail.  As per my
notes below, you don't cause any real problems by sending
notifications to empty groups; it's just not handled efficiently.
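
A minimal sketch of the producer-side bookkeeping described above, in
Python.  It does not call the Spread API at all: the message kinds
("JOIN", "RENEW", "LEAVE"), the class name and the timing parameters are
assumptions made up for illustration.  The producer would feed it
whatever it receives on the control group and call collect_garbage()
periodically.

```python
import time


class ActiveGroupRegistry:
    """Tracks which instrument groups currently have subscribers.

    Hypothetical helper, fed by the producer with control-group messages.
    """

    def __init__(self, renew_interval=10.0, max_missed=3, clock=time.monotonic):
        self.renew_interval = renew_interval  # seconds between re-notifications
        self.max_missed = max_missed          # missed intervals before expiry
        self.clock = clock                    # injectable for testing
        # group name -> {member name -> time of last JOIN/RENEW}
        self.last_seen = {}

    def on_control_message(self, kind, member, group):
        """Apply one control message; kind is 'JOIN', 'RENEW' or 'LEAVE'."""
        if kind in ("JOIN", "RENEW"):
            self.last_seen.setdefault(group, {})[member] = self.clock()
        elif kind == "LEAVE":
            members = self.last_seen.get(group, {})
            members.pop(member, None)
            if not members:
                self.last_seen.pop(group, None)
        else:
            raise ValueError(f"unknown control message kind: {kind!r}")

    def collect_garbage(self):
        """Expire members that missed too many renewals.

        Returns the list of groups that became empty as a result.
        """
        deadline = self.clock() - self.renew_interval * self.max_missed
        removed = []
        for group in list(self.last_seen):
            members = self.last_seen[group]
            for member in [m for m, t in members.items() if t < deadline]:
                del members[member]
            if not members:
                del self.last_seen[group]
                removed.append(group)
        return removed

    def is_active(self, group):
        return group in self.last_seen

    def active_groups(self):
        return sorted(self.last_seen)
```

Note that a freshly started producer begins with an empty registry and
only learns about existing subscribers as their RENEW messages arrive,
which is the bootstrap gap mentioned above.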

>   There was discussion on the list (Jan 2004) about using specific
>   Spread groups in a similar fashion (based on key values from a
>   database), but that discussion largely concerned itself with the
>   resulting scalability concerns if ALL possible groups were indeed
>   joined/established. In this case, the producer wants to create and
>   maintain the (very limited) list of active groups and issue messages
>   only to these.
> 

I don't recall the original discussion, but I'm not sure that joining
all possible groups is necessarily impossible.  If you end up only
needing around 10,000 groups, or a few times that, this might work,
although I would encourage you to use the current CVS head if you're
going to do so.  My primary concern in this case wouldn't be normal
operation, but the cost of heavyweight membership changes if one of
your daemons (or its machine) goes down.

> - alternately, can a producer simply issue messages to all possible
>   groups and expect Spread to dispose efficiently of messages to
>   groups which don't exist (i.e. issue messages to each potential
>   group for which new instrument data is encountered, regardless of
>   whether some consumer has established the group through a join?)
>   This would side-step the problem of producer discovery of the group
>   list, but would pass the work of filtering out the large percentage
>   of uninteresting updates to Spread. While clearly less efficient
>   than the producer performing this filtering, hopefully this would
>   impact only the producers themselves and their connected Spread
>   daemons.
> 

There are significant inefficiencies here.  Spread doesn't filter out
messages that have no recipients until after they're ready to be
delivered, so a message to an empty group costs about as much as any
other message up to that point.  This is necessary because of the way
lightweight memberships work.
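
Given that cost, it is cheaper for the producer to drop updates for
inactive instruments before they ever reach the daemon.  A small Python
sketch, assuming the producer already holds a set of active group names
(for example, built from the control group).  Here `send` is a
placeholder for whatever multicast call the producer's Spread binding
provides, and `group_for` is a hypothetical naming scheme; neither is
part of Spread itself.

```python
def group_for(instrument_id):
    """Hypothetical mapping from instrument id to Spread group name."""
    return f"inst-{instrument_id}"


def publish_updates(updates, active_groups, send):
    """Send each update only if its instrument group has subscribers.

    updates: iterable of (instrument_id, payload) pairs.
    active_groups: set of group names known to have at least one member.
    send: callable(group, payload); stands in for the real multicast call.
    Returns a (sent, skipped) pair of counts.
    """
    sent = skipped = 0
    for instrument_id, payload in updates:
        group = group_for(instrument_id)
        if group in active_groups:
            send(group, payload)
            sent += 1
        else:
            skipped += 1  # no known subscriber, so don't bother the daemon
    return sent, skipped
```

With only 100-150 active instruments out of 10,000 or more, most
updates would fall into the skipped branch.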

> Apologies for the long post and thanks in advance for any
> insights/advice!
> 
> Damon

-- 
---------------------------------------------------------------------
Ryan W. Caudy
<rcaudy at gmail.com>
---------------------------------------------------------------------
Bloomberg L.P.
<rcaudy1 at bloomberg.net>
---------------------------------------------------------------------
[Alumnus]
<caudy at cnds.jhu.edu>         
Center for Networking and Distributed Systems
Department of Computer Science
Johns Hopkins University          
---------------------------------------------------------------------