[Spread-users] 'Connection closed by spread' ...

Jim Vickroy Jim.Vickroy at noaa.gov
Wed Sep 8 11:42:41 EDT 2004


Thanks for your valuable feedback, Ryan.

Regarding the use of a fresh connection each time a message is to be
published, that is my (overly-cautious?) attempt to ensure I never fail to
publish because the mailbox has been dropped/corrupted.  This is a 24x7
application that simply must not fail, but I will change the implementation,
as you suggest, and monitor to see if there are reliability issues.  As a
side note, I never recorded any publication error except for "Connection
closed by Spread".
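
One way to keep a single long-lived connection and still guard against a dropped mailbox is to reconnect only when a publish actually fails.  What follows is a minimal sketch under that assumption, written in current Python syntax rather than the Python 2.3 of the original script.  `ReconnectingPublisher` and the `connect` callable are hypothetical names; in the real application `connect` would presumably wrap `spread.connect(...)`.

```python
class ReconnectingPublisher:
    """Keeps one mailbox open and reconnects only after a failed publish.

    `connect` is a hypothetical zero-argument callable returning an object
    with multicast(service, group, message, message_type) and disconnect(),
    for example a small wrapper around spread.connect(...).
    """

    def __init__(self, connect, max_attempts=2):
        self._connect = connect
        self._max_attempts = max_attempts
        self._mailbox = None
        self.failures = 0  # number of failed connect/multicast attempts

    def publish(self, service, group, message):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._mailbox is None:
                    self._mailbox = self._connect()
                return self._mailbox.multicast(service, group, message, 0)
            except Exception as error:  # e.g. "Connection closed by Spread"
                last_error = error
                self.failures += 1
                self._drop_mailbox()
        raise last_error

    def _drop_mailbox(self):
        # Forget the broken mailbox so the next attempt connects afresh.
        mailbox, self._mailbox = self._mailbox, None
        if mailbox is not None:
            try:
                mailbox.disconnect()
            except Exception:
                pass  # the connection is already unusable
```

A caveat with this design: if the first multicast reached the daemon before the error surfaced, the retry could publish the same message twice, so subscribers may need to tolerate duplicates.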

As for the reported publication failures, I ran the identical test scenario,
overnight, using our OPERATIONAL Spread server (I had been using our TEST
Spread server), and after over 350,000 successful message publications
without a single failure, I stopped the test.  Our OPERATIONAL and TEST
Spread servers are supposed to be identical, but they do run on different
computers and different networks.  Apparently, something is wrong with our
TEST environment.

In case you are curious, our OPERATIONAL Spread server is unused at this
point, which is why I was able to use it for this test.  My application is to
become our first operational deployment using Spread.

I apologize for wasting your time on this issue.

-- jv


-----Original Message-----
From: spread-users-admin at lists.spread.org
[mailto:spread-users-admin at lists.spread.org]On Behalf Of Ryan Caudy
Sent: Tuesday, September 07, 2004 10:38 PM
To: Jim Vickroy
Cc: SPREAD-USERS
Subject: Re: [Spread-users] 'Connection closed by spread' ...


I don't speak fluent Python, but this doesn't seem like anything I
can't guess at from other languages.  In answer to a few of your
questions or comments from the code, (a) 3.17.1 prints because we
probably didn't update the version string with this release, and (b)
membership messages shouldn't be the problem... I don't see any reason
for your clients to be receiving any.

I see one thing that is very unusual, and that could potentially cause
race conditions.  For every publish, your client application is
connecting to Spread, multicasting to Spread, and disconnecting.  This
is a very inefficient way to use Spread, although it shouldn't be a
huge performance problem with the kind of load you mentioned.  I would
expect the sleeps you have in your code to avoid race conditions at
the client side, but not necessarily at the daemon side.

My guess about what's happening is that a race condition at the
daemon, in session.c, is being triggered.  I know we've had problems
of this sort in the past, but I'm not sure of the current status
(i.e., if there are any known bugs, or solutions in CVS but not
released).  Hopefully Jonathan can fill us in about this.

Either way, I would recommend that you try moving the connect and
disconnect outside the while loop (i.e., only do them once each).
There isn't any good reason to do otherwise, that I'm aware of.  Also,
I would recommend catching exceptions for the three calls separately,
to see which is failing.  Were you getting any other errors besides
connection closed?

Cheers,
Ryan
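
The two recommendations above (connect and disconnect once, and trap each call separately) could look roughly like the sketch below.  It is only an illustration in current Python syntax: `run_publisher`, `connect` and `make_message` are hypothetical names, and `connect` stands in for a call to `spread.connect(...)` so that the failing step can be identified without a daemon at hand.

```python
import time


def run_publisher(connect, service, group, make_message, rounds, pause=0.0):
    """Connect once, publish `rounds` messages, then disconnect once.

    Returns a dict counting successes and, separately, failures of each
    step, so the failing call (connect, multicast or disconnect) is visible.
    """
    counts = {'success': 0, 'connect': 0, 'multicast': 0, 'disconnect': 0}

    try:
        mailbox = connect()
    except Exception as cause:
        counts['connect'] += 1
        print('connect failed:', cause)
        return counts

    for _ in range(rounds):
        message = make_message()
        try:
            sent = mailbox.multicast(service, group, message, 0)
        except Exception as cause:
            counts['multicast'] += 1
            print('multicast failed:', cause)
        else:
            # Same check as the original script: all bytes must be sent.
            if sent == len(message):
                counts['success'] += 1
            else:
                counts['multicast'] += 1
        time.sleep(pause)

    try:
        mailbox.disconnect()
    except Exception as cause:
        counts['disconnect'] += 1
        print('disconnect failed:', cause)
    return counts
```

Unlike the original loop, a multicast failure here does not trigger a new connection; whether the mailbox is still usable after such an error is something the counts should help reveal.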


On Tue, 7 Sep 2004 05:27:46 -0600, Jim Vickroy <jim.vickroy at noaa.gov> wrote:
> Thanks again, Ryan.  Here is the source code; I hope you speak Python.
> <smile>
>
> # START of file ----------------------------------------------------
>
> '''
> A Spread publisher, intended to be run for "long" periods of time, that
> counts publication successes and failures.
>
> NOTES
>    o When run for hours by itself, no publishing failures are detected
>       -- 100% success rate.
>    o When run with multiple (i.e., 10) copies of itself, simultaneously,
>       the success rate drops to ~ 99.8%.
>
> LANGUAGE
>    O Python
>       http://www.python.org/
>
> REFERENCES
>    o The Python/Spread API is documented at:
>       http://www.python.org/other/spread/doc.html
>
> AUTHOR
>    jim.vickroy at noaa.gov
> '''
>
> import os, spread, time
> from   random import randint
>
> print 'Spread version:', spread.version() # prints (3, 17, 1) on my system
>
> host       = '... host server name goes here ...'
> port       = 4803
> address    = '%d@%s' % (port, host)
> sender     = 'Spread publishing failures checker on %s' % os.environ['COMPUTERNAME']
> group      = 'SEC.publishing.failures.statistics'
> template   = 'timestamp: %s'
> service    = spread.SAFE_MESS
> successes  = 0
> failures   = 0
>
> while True:
>    now     = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(time.time()))
>    message = template % now
>    count   = randint(2,10)
>    for this in range(1, count):
>       try:
>          mailbox = spread.connect(address, '', priority=0, membership=True)
>             # membership=True -> receive membership messages
>             # Is it possible that membership messages are what is filling the mailbox?
>          bytes_transmitted = mailbox.multicast(service, group, message, 0) # message_type is zero
>          time.sleep(1) # second -- a precaution to ensure message makes it to Spread server
>          mailbox.disconnect()
>          assert bytes_transmitted == len(message), \
>             'expected %d bytes to be transmitted -- actual = %d' % (len(message), bytes_transmitted)
>       except Exception, cause:
>          failures += 1
>          print cause
>       else:
>          successes += 1
>    print '%s:  success: %d  failures: %d' % (now, successes, failures)
>    time.sleep(5) # seconds
>
> # END of file ----------------------------------------------------
>
>
>
> -----Original Message-----
> From: spread-users-admin at lists.spread.org
> [mailto:spread-users-admin at lists.spread.org]On Behalf Of Ryan Caudy
> Sent: Monday, September 06, 2004 8:08 PM
> To: Jim Vickroy
> Cc: SPREAD-USERS
> Subject: Re: [Spread-users] 'Connection closed by spread' ...
>
> Could you post some sample source code for your simplified program?  I
> can't think of anything from your description to explain this behavior
> from Spread.
>
> Cheers,
> Ryan
>
> On Mon, 6 Sep 2004 19:03:50 -0600, Jim Vickroy <jim.vickroy at noaa.gov> wrote:
> > OK, I let the simple publisher program (described below) run until it had
> > completed ~ 4500 successful publications with zero failures.  So then, I
> > introduced more realism by running 10 copies of it simultaneously in
> > separate processes (separate console windows) and within 10 minutes all of
> > the processes started experiencing (intermittent) publishing failures.  This
> > is the behavior I'm seeing in the real application (which comprises 10
> > independent processes running on the same machine).
> >
> > I probably did not make it clear that these simple simulation programs are
> > pure publishers (not publishers/subscribers).
> >
> > Any ideas about what is causing these failures?
> >
> > Thanks,
> >
> > -- jv
> >
> >
> >
> > -----Original Message-----
> > From: spread-users-admin at lists.spread.org
> > [mailto:spread-users-admin at lists.spread.org]On Behalf Of Jim Vickroy
> > Sent: Monday, September 06, 2004 3:08 PM
> > To: Ryan Caudy
> > Cc: SPREAD-USERS
> > Subject: RE: [Spread-users] 'Connection closed by spread' ...
> >
> > Thanks again for the feedback, Ryan -- and for your patience in providing a
> > detailed explanation.
> >
> > I have checked the application and the publisher does not join the group it
> > publishes to -- only subscribers join groups.
> >
> > I have created a highly simplified version of the application that hopefully
> > will capture the (errant) behavior I reported.  One difference between the
> > simplified version and the real application is that the simplified version
> > is a single process that periodically publishes 1-10 messages while the real
> > application is 10 separate processes that periodically publish 1 or 2
> > messages.
> >
> > The simulation is running now, and I will post a follow-up when it has run
> > for a sufficient period of time.
> >
> > Thanks,
> >
> > -- jv
> >
> > -----Original Message-----
> > From: spread-users-admin at lists.spread.org
> > [mailto:spread-users-admin at lists.spread.org]On Behalf Of Ryan Caudy
> > Sent: Saturday, September 04, 2004 10:03 PM
> > To: Jim Vickroy
> > Cc: SPREAD-USERS
> > Subject: Re: [Spread-users] 'Connection closed by spread' ...
> >
> > To clarify what I said earlier, the -8, CONNECTION CLOSED, return code
> > isn't specific to receiving.  It will be returned by any of the
> > library functions if they try to send or recv on the mbox socket, and
> > get an error besides EAGAIN, EINTR, or EWOULDBLOCK.  You may not be
> > using the C API, but the behavior of any of the APIs should be
> > similar.  Although there are other possible errors that could cause
> > this to happen, the most likely one in this situation (no real network
> > problem, etc) is that Spread closed the socket for failing to receive.
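
The rule described above can be restated as a small sketch.  This is not the library's code, only an illustration of the stated behavior: `result_for_socket_error` is a hypothetical helper, and the value -8 is the return code quoted in the message.

```python
import errno

CONNECTION_CLOSED = -8  # return code quoted in the message above

# Transient errors after which, per the description above, the call is not
# reported as a closed connection.
TRANSIENT_ERRORS = {errno.EAGAIN, errno.EINTR, errno.EWOULDBLOCK}


def result_for_socket_error(err):
    """Return None if the error is transient, else CONNECTION_CLOSED."""
    if err in TRANSIENT_ERRORS:
        return None
    return CONNECTION_CLOSED
```

So a reset or broken pipe on the mbox socket, for instance after the daemon has closed it, would surface to the caller as CONNECTION_CLOSED whichever library function happened to touch the socket next.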
> >
> > Part of the reason I think that the cause is what I described is that
> > you said "about once every 1000 publishing attempts."  It probably
> > isn't coincidental that this is the defined value for
> > MAX_SESSION_MESSAGES (see spread_params.h), which dictates the number
> > of messages Spread will allow to pile up for a session before
> > disconnecting it for failing to receive.
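
As a toy model of the limit just described (not Spread's actual session.c logic), the daemon can be pictured as keeping a bounded backlog per session.  `SessionModel` is a hypothetical name, and the exact boundary (more than 1000 versus exactly 1000) is an assumption here.

```python
from collections import deque

MAX_SESSION_MESSAGES = 1000  # value cited above for a "vanilla" Spread


class SessionModel:
    """Toy model of one client session's backlog at the daemon."""

    def __init__(self, limit=MAX_SESSION_MESSAGES):
        self.limit = limit
        self.pending = deque()
        self.connected = True

    def deliver(self, message):
        """Queue a message for the client; drop the session if the backlog
        grows past the limit.  Returns whether the session is still open."""
        if not self.connected:
            return False
        self.pending.append(message)
        if len(self.pending) > self.limit:  # assumed boundary condition
            self.pending.clear()
            self.connected = False
        return self.connected

    def receive(self):
        """Client reads one message, shrinking the backlog."""
        return self.pending.popleft() if self.pending else None
```

In this model a client that joins a group but never calls receive is disconnected once enough messages have been delivered to it, regardless of how slowly they arrive.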
> >
> > Could you tell me a little bit more about your application?  What you
> > described should be absolutely fine, since you don't rely on Spread's
> > internal queuing any more than absolutely necessary.
> >
> > When do the applications that are having trouble connect to Spread?
> > The normal paradigm for something like what you've described is to
> > have them connect before spinning off the receiving thread, and share
> > the mbox (with some sort of synchronization).  If I had to guess what
> > was going wrong from what you said before, I would guess that for the
> > applications that are both publishers and subscribers, you have opened
> > two mbox's, joined the relevant groups on both, and are only receiving
> > on one of them.
> >
> > If this is the case, I would recommend that you do one of the
> > following: (a) Have only one mbox.  Depending on the library
> > implementation you're working with, you may or may not need additional
> > synchronization.  OR (b) Have two mboxes, but for the
> > sending/publishing thread, do NOT join the groups.  Spread supports
> > open-group semantics, which means that you can send to a group without
> > being a member of it.
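
Option (a) above, one mbox shared between the sending and receiving threads, might be sketched as follows.  This is a hedged illustration only: `SharedMailbox` is a hypothetical wrapper, it assumes the wrapped mailbox offers `multicast`, `poll` and `receive` methods like those in the Python Spread module, and whether a lock is needed at all depends on the library implementation.

```python
import threading


class SharedMailbox:
    """Serializes access to one mailbox used by a sender and a receiver thread."""

    def __init__(self, mailbox):
        self._mailbox = mailbox
        self._lock = threading.Lock()

    def multicast(self, service, group, message, message_type=0):
        with self._lock:
            return self._mailbox.multicast(service, group, message, message_type)

    def receive_if_ready(self):
        """Return one message, or None when nothing is waiting.

        Polling first keeps a blocking receive() from holding the lock
        while the sending thread waits to publish.
        """
        with self._lock:
            if self._mailbox.poll() > 0:
                return self._mailbox.receive()
            return None
```

The receiver thread would call `receive_if_ready()` in a loop with a short sleep when it returns None.  Option (b) needs no wrapper: the publishing mailbox simply never joins a group.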
> >
> > I hope this helps.  If it doesn't, please give the list whatever other
> > information you can provide.
> >
> > Cheers,
> > Ryan
> >
> > On Sat, 4 Sep 2004 07:55:15 -0600, Jim Vickroy <jim.vickroy at noaa.gov> wrote:
> > > Thanks for your response, Ryan.
> > >
> > > I did not make it clear in my original posting that these are publishing
> > > errors -- not subscriber errors.  The errors are being trapped by try-catch
> > > blocks wrapping publishing requests.
> > >
> > > Most of the publishers are also subscribers to the same message group (they
> > > must be), but each subscriber operates in its own dedicated thread that does
> > > nothing but receive and queue messages for subsequent processing.  I doubt
> > > the receiving/queuing thread is failing to keep up with the publishers,
> > > especially since the burst rate is only on the order of 10 messages per
> > > second for one second.  The applications keep rather detailed logs of the
> > > messages received/published, and I see no evidence of any subscriber failing
> > > to keep up with the publishing rate.
> > >
> > > It is curious, however, that the one publisher which is not also a (Spread)
> > > subscriber is the only component that, so far, has not experienced a
> > > publishing error.  This component does have a receiver thread, but it is
> > > monitoring a simple socket connection for message traffic.
> > >
> > > That said, I am a novice user of Spread and certainly may have an
> > > implementation problem; it is just not clear what is wrong.
> > >
> > > I will ask our administrator to upgrade to the current, stable version of
> > > Spread.
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: spread-users-admin at lists.spread.org
> > > [mailto:spread-users-admin at lists.spread.org]On Behalf Of Ryan Caudy
> > > Sent: Friday, September 03, 2004 8:54 PM
> > > To: Jim Vickroy
> > > Cc: SPREAD-USERS
> > > Subject: Re: [Spread-users] 'Connection closed by spread' ...
> > >
> > > Hi,
> > >
> > > This error is usually caused by clients failing to receive their messages
> > > from Spread.  If your clients let more than a certain number of messages,
> > > 1000 with a "vanilla" Spread, pile up at the daemon without receiving
> > > them, then Spread will disconnect them with that error code.
> > >
> > > You may want to look at past posts on this list about flow control.
> > >
> > > Also, on a side note, I would encourage you to use the most recent
> > > stable release of Spread.
> > >
> > > Cheers,
> > > Ryan
> > >
> > > On Fri, 3 Sep 2004 12:12:54 -0600, Jim Vickroy <jim.vickroy at noaa.gov> wrote:
> > > > ... is the error that is happening more frequently than desirable -- about
> > > > once every 1000 publishing attempts.
> > > >
> > > > Could someone suggest a way to reduce this error rate (at least by a factor
> > > > of 10)?
> > > >
> > > > The platform:
> > > >         Spread: v 3.17.01 (20 June 2003)
> > > >         Spread Host: RedHat Workstation, Kernel: 2.4.21-4.EL
> > > >         Client Host: Microsoft Windows 2000 Server
> > > >         Client Software: Python v 2.3.3
> > > >
> > > > The use case:
> > > >         Messages are published in bursts at 1-minute intervals.
> > > >         Each burst of messages comprises 5-10 messages; each message is generated
> > > >         by a distinct process.
> > > >         Each message is about 100 bytes.
> > > >         Publication service type is set to spread.SAFE_MESS.
> > > >         Each message is about 100 bytes.
> > > >         Publication service type is set to spread.SAFE_MESS.
> > > >
> > > > Thanks,
> > > >
> > > > -- jv
> > > >
> > > > _______________________________________________
> > > > Spread-users mailing list
> > > > Spread-users at lists.spread.org
> > > > http://lists.spread.org/mailman/listinfo/spread-users
> > > >
> > >
> > > --
> > > ---------------------------------------------------------------------
> > > Ryan W. Caudy
> > > <rcaudy at gmail.com>
> > > ---------------------------------------------------------------------
> > > Bloomberg L.P.
> > > <rcaudy1 at bloomberg.net>
> > > ---------------------------------------------------------------------
> > > [Alumnus]
> > > <caudy at cnds.jhu.edu>
> > > Center for Networking and Distributed Systems
> > > Department of Computer Science
> > > Johns Hopkins University
> > > ---------------------------------------------------------------------
> > >
> > >
> >
> >
>
>


