[Spread-users] Daemon/Segment Stability

Thu Sep 23 20:13:41 EDT 2004

My responses are inlined below.

Cheers,
Ryan

On Fri, 24 Sep 2004 09:54:00 +1000, Craig Foley <craig.foley at gmail.com> wrote:
> Hi,
> 
>   We've been looking at using spread for some financial messaging
> work, however, we've been concerned by the apparent willingness of the
> daemons to crash.
> 
>   We've created a system with around 10,000 groups with a single
> publisher sending individual messages to those groups, with a maximum
> load of around 500 messages per second.
> 
>   As has been mentioned before on this list, having so many groups can
> be a problem when a new daemon joins, as it needs to be told about
> the current status of each group.  That's okay, but we seem to be
> able to get both Linux (Mandrake 10.0) and Windows based daemons to
> crash when they first join, if the system as a whole is under heavy
> load at the time.  That crash then also brings down the other
> daemons already running in the segment.
> 

On the bright side, there will soon be a new version of Spread that
should really improve the behavior for large numbers of groups.  It's
hard to know what's going on without more information, of course...
what version are you using, by the way?  I'd recommend that you use
the most recent version of Spread, and that you log (see DebugFlags in
spread.conf) GROUPS, MEMBERSHIP, PROTOCOL, and SESSION, in addition to
the typical ERROR and PRINT.  That should provide a *lot* more detail
to debug from.  Actually, you might consider not logging PRINT,
because I believe there is a (very large, in your case) dump of groups
information printed at the end of the groups state exchange.
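
As a rough illustration of the flag selection described above, here is a
small Python sketch that builds the line for spread.conf.  The flag names
are the ones listed in this message; the "DebugFlags = { ... }" layout is
an assumption based on the sample spread.conf, and debug_flags_line is a
hypothetical helper, not part of Spread.  Check both against the
spread.conf shipped with your version.

```python
# Flag names as listed in this message; the exact set accepted by a given
# Spread release is an assumption and should be checked in its sample conf.
SUGGESTED_FLAGS = ["ERROR", "GROUPS", "MEMBERSHIP", "PROTOCOL", "SESSION"]


def debug_flags_line(include_print: bool = False) -> str:
    """Build a DebugFlags line for spread.conf (hypothetical helper).

    PRINT is left out by default, since it may trigger a very large dump
    of groups information when there are many groups.
    """
    flags = list(SUGGESTED_FLAGS)
    if include_print:
        flags.insert(0, "PRINT")
    # Assumed layout: DebugFlags = { FLAG1 FLAG2 ... }
    return "DebugFlags = { " + " ".join(flags) + " }"


if __name__ == "__main__":
    print(debug_flags_line())
```

Calling debug_flags_line(include_print=True) adds PRINT back in if the
groups dump turns out to be useful.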

>   We've also noted that having a misconfigured .conf file, one whose
> list of participating daemons is mismatched between systems, can
> cause all the daemons to crash.
> 

Yes.  The configuration files have to be consistent, or else Spread
will not function properly.  I could see a negotiation to compare conf
files (or a hash of them) during membership, to try to address this
problem in a friendlier manner, but to securely address this problem
would require much more.
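
Until something like that exists in the daemon, the same idea can be
approximated outside of Spread before starting the daemons.  The Python
sketch below is only an illustration under assumptions: conf_digest and
mismatched_daemons are hypothetical helpers, the normalization (dropping
"#" comments and collapsing whitespace) is my guess at what counts as
"the same" configuration, and how the digests get collected from each
host is left to you.  It is not secure against a malicious peer, only a
guard against honest mistakes.

```python
import hashlib
from collections import Counter


def conf_digest(conf_text: str) -> str:
    """Hash a spread.conf body, ignoring comments, blank lines and spacing."""
    lines = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]      # drop "#" comments (assumed syntax)
        line = " ".join(line.split())    # collapse runs of whitespace
        if line:
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def mismatched_daemons(digests: dict[str, str]) -> list[str]:
    """Return the hosts whose digest differs from the most common digest."""
    if not digests:
        return []
    majority, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, d in digests.items() if d != majority)


if __name__ == "__main__":
    # Hypothetical host names and addresses, for illustration only.
    good = "Spread_Segment 10.0.0.255:4803 {\n  node1 10.0.0.1\n  node2 10.0.0.2\n}\n"
    stale = "Spread_Segment 10.0.0.255:4803 {\n  node1 10.0.0.1\n}\n"
    digests = {
        "node1": conf_digest(good),
        "node2": conf_digest("# copied by hand\n" + good),
        "node3": conf_digest(stale),
    }
    print(mismatched_daemons(digests))
```

Any host reported by mismatched_daemons should have its spread.conf
brought back in line before its daemon is allowed to join the segment.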

>   We've not compiled any daemons, we're just using the versions
> available from www.spread.org.  Is there a particular set of flags,
> for logging, that we could use to track this?  Is this a known
> problem?  We would like the daemons to be rock solid, or at
> least, the segment as a whole.
> 

My experience with the last daemon I worked on (not actually in CVS,
yet) was that it was rock solid when configured properly.  However, I
do not run a production, long-term Spread configuration.  The problem
you *may* be facing is that Spread tends to abort when confronted with
a real error condition, rather than try to robustly handle it. 
Usually, this is because the error condition reflects a
mis-configuration or networking issue.  If you see something like a
segmentation fault or bus error, on the other hand (or an abort in the
memory subsystem), that is likely to reflect a serious bug for us to
address.  Currently, I don't know of any outstanding issues of the
latter sort.

>   As is typical with anything like this, the above issues don't occur
> 100% of the time, but in the past week of testing, they've occurred
> often enough to be a concern.
> 
> Regards,
> Craig Foley.
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users
> 

-- 
---------------------------------------------------------------------
Ryan W. Caudy
<rcaudy at gmail.com>
---------------------------------------------------------------------
Bloomberg L.P.
<rcaudy1 at bloomberg.net>
---------------------------------------------------------------------
[Alumnus]
<caudy at cnds.jhu.edu>         
Center for Networking and Distributed Systems
Department of Computer Science
Johns Hopkins University          
---------------------------------------------------------------------