[Spread-users] distributed financial application model - Spread group list access?

Tue Dec 7 13:02:08 EST 2004

Hi all -

I'm considering using Spread to keep a distributed set of financial
data producers and consumers synchronized. I have a design in mind
based on an introductory understanding of Spread, but it has
weaknesses, and perhaps more expert Spread users can make useful
suggestions.

FRAMEWORK

A central database contains financial data, mainly various time
series, on a large number of different instruments (a minimum of
perhaps 10000 instruments and quite possibly many multiples of this.)

Producers (sources) are continually adding new data to the database at
a fairly high rate (typically several hundred data items per second.)
Each item is an addition, update or deletion of data associated with a
particular instrument. There are relatively few producers at any given
time (typically 2-4.)

Consumers (sinks) query the database to retrieve data for a single
instrument or small set of instruments (up to around 50.) There are
typically many more consumers than producers simultaneously running
(perhaps 20 - 30.)  There is usually a high degree of overlap in the
instrument sets of simultaneous consumers, such that the aggregate set
of active instruments likely totals only 100-150 at any given time.

Producers and consumers are independent applications running on
networked machines. Since queries to the database generally involve
retrieval and construction of long time series of data, they are
relatively expensive and consumers should cache the objects they
construct from database queries wherever possible.

ISSUE/MODEL

The issue at hand is how best to alert consumers when their cached
objects are invalidated by new data from producers. This could be
accomplished by having all consumers poll the central database for
changes each time before using their cached data. However, this
largely negates the value of caching the data and does not scale well
with many consumers making numerous repeated database queries.

The model I think makes sense would map each active instrument (i.e.
one for which at least one consumer holds cached data) to a Spread
group named after the instrument's unique id. A consumer would join
the Spread group for each instrument it has in active use (1-50
groups.)  Producers would continue to add data to the database, but
would in addition issue a Spread message for new data on any
instrument whose group has been established by a consumer join. In
this way, consumers get updates only for instruments in which they
have an active interest, without polling the database, and producers
issue messages only for instruments which have some interested
consumer.
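
To make the model concrete, here is a rough Python sketch of what I
have in mind. It does not use the Spread API at all: Bus is an
in-memory stand-in for group messaging, and the "inst-" prefix,
group_for(), Bus and Consumer are names I made up for illustration.
The cached value is a placeholder for the expensive database query.

```python
from collections import defaultdict

GROUP_PREFIX = "inst-"  # hypothetical naming convention, not part of Spread


def group_for(instrument_id):
    """Map an instrument's unique id to the name of its group."""
    return f"{GROUP_PREFIX}{instrument_id}"


class Bus:
    """In-memory stand-in for group messaging (NOT the Spread API)."""

    def __init__(self):
        self.members = defaultdict(set)  # group name -> set of consumers

    def join(self, group, consumer):
        self.members[group].add(consumer)

    def leave(self, group, consumer):
        self.members[group].discard(consumer)
        if not self.members[group]:
            del self.members[group]  # group vanishes with its last member

    def multicast(self, group, message):
        # A message to a group with no members is simply dropped.
        receivers = self.members.get(group, ())
        for consumer in receivers:
            consumer.on_message(group, message)
        return len(receivers)


class Consumer:
    """Caches one object per instrument and joins that instrument's group."""

    def __init__(self, bus, instrument_ids):
        self.cache = {}
        for iid in instrument_ids:
            # Placeholder for the expensive time series query.
            self.cache[iid] = f"series-for-{iid}"
            bus.join(group_for(iid), self)

    def on_message(self, group, message):
        # Any update on the instrument invalidates the cached object.
        iid = group[len(GROUP_PREFIX):]
        self.cache.pop(iid, None)
```

A producer in this sketch would call bus.multicast(group_for(iid),
update) after writing to the database, and each consumer would rebuild
an instrument's object only after its cache entry has been dropped.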

PROBLEM

The pitfall seems to be that there is no easy means for a producer to
discover the list of Spread groups upon startup. I have scanned the
mailing list archives, which seem to confirm that there is no API
function to get a list of groups and members. Establishing a control
group to which membership updates are replicated can serve to maintain
a list of active groups, but does not appear to help build the list in
the first place.

- can a producer somehow build an active group list, short of joining
  every possible group and examining the resulting membership
  messages? In this model, there are many possible groups of which
  only a relative handful will have an existing member (consumer), so
  joining and leaving each one seems a very poor way to discover the
  active ones. Since it can issue messages to any group regardless of
  membership, a producer should not need even to join each active
  group, presuming there is a control group for membership updates
  with which to maintain the producer's group list.

  There was discussion on the list (Jan 2004) about using specific
  Spread groups in a similar fashion (based on key values from a
  database), but that discussion largely concerned itself with the
  resulting scalability concerns if ALL possible groups were indeed
  joined/established. In this case, the producer wants to create and
  maintain the (very limited) list of active groups and issue messages
  only to these.

- alternatively, can a producer simply issue messages to all possible
  groups and expect Spread to dispose efficiently of messages to
  groups which don't exist (i.e. issue messages to each potential
  group for which new instrument data is encountered, regardless of
  whether some consumer has established the group through a join?)
  This would side-step the problem of producer discovery of the group
  list, but would pass the work of filtering out the large percentage
  of uninteresting updates to Spread. While clearly less efficient
  than the producer performing this filtering, hopefully this would
  impact only the producers themselves and their connected Spread
  daemons.
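
The producer-side bookkeeping behind the two options above can be
sketched roughly as follows. Again this is plain Python, not Spread
code: the ("join" / "leave", group) notices on the control group are a
message format I am assuming, and ActiveGroups and route() are
hypothetical names.

```python
from collections import Counter


class ActiveGroups:
    """Producer-side view of which instrument groups currently have members.

    Fed by (event, group) notices that consumers are assumed to send to a
    control group; the notice format is hypothetical.
    """

    def __init__(self):
        self.counts = Counter()  # group name -> number of known members

    def on_control(self, event, group):
        if event == "join":
            self.counts[group] += 1
        elif event == "leave" and self.counts[group] > 0:
            self.counts[group] -= 1
            if self.counts[group] == 0:
                del self.counts[group]  # last member gone, group inactive

    def should_send(self, group):
        return group in self.counts


def route(updates, active=None):
    """Return the (group, update) pairs a producer would actually send.

    With an ActiveGroups view, the producer filters (first option).
    With active=None, everything is sent and the messaging layer is left
    to drop messages for groups that have no members (second option).
    """
    if active is None:
        return list(updates)
    return [(g, u) for g, u in updates if active.should_send(g)]
```

The startup problem shows up directly: an ActiveGroups created after
the consumers have already joined starts empty, so route() with that
view sends nothing until some snapshot of the existing groups is
obtained or further notices arrive.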

Apologies for the long post and thanks in advance for any
insights/advice!

Damon