[Spread-users] 'Connection closed by spread' ...

Jim Vickroy Jim.Vickroy at noaa.gov
Mon Sep 6 21:03:50 EDT 2004


OK, I let the simple publisher program (described below) run until it had
completed ~4500 successful publications with zero failures.  So then, I
introduced more realism by running 10 copies of it simultaneously in
separate processes (separate console windows) and within 10 minutes all of
the processes started experiencing (intermittent) publishing failures.  This
is the behavior I'm seeing in the real application (which comprises 10
independent processes running on the same machine).

I probably did not make it clear that these simple simulation programs are
pure publishers (not publishers/subscribers).

Any ideas about what is causing these failures?

Thanks,

-- jv

-----Original Message-----
From: spread-users-admin at lists.spread.org
[mailto:spread-users-admin at lists.spread.org]On Behalf Of Jim Vickroy
Sent: Monday, September 06, 2004 3:08 PM
To: Ryan Caudy
Cc: SPREAD-USERS
Subject: RE: [Spread-users] 'Connection closed by spread' ...


Thanks again for the feedback, Ryan -- and for your patience in providing a
detailed explanation.

I have checked the application and the publisher does not join the group it
publishes to -- only subscribers join groups.

I have created a highly simplified version of the application that hopefully
will capture the (errant) behavior I reported.  One difference between the
simplified version and the real application is that the simplified version
is a single process that periodically publishes 1-10 messages while the real
application is 10 separate processes that periodically publish 1 or 2
messages.

The simulation is running now, and I will post a follow-up when it has run
for a sufficient period of time.

Thanks,

-- jv

-----Original Message-----
From: spread-users-admin at lists.spread.org
[mailto:spread-users-admin at lists.spread.org]On Behalf Of Ryan Caudy
Sent: Saturday, September 04, 2004 10:03 PM
To: Jim Vickroy
Cc: SPREAD-USERS
Subject: Re: [Spread-users] 'Connection closed by spread' ...


To clarify what I said earlier, the -8, CONNECTION CLOSED, return code
isn't specific to receiving.  It will be returned by any of the
library functions if they try to send or recv on the mbox socket, and
get an error besides EAGAIN, EINTR, or EWOULDBLOCK.  You may not be
using the C API, but the behavior of any of the APIs should be
similar.  Although there are other possible errors that could cause
this to happen, the most likely one in this situation (no real network
problem, etc) is that Spread closed the socket for failing to receive.
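
The rule described above (transient socket errors are retried, anything else
surfaces as -8) can be sketched roughly as follows. This is only an
illustration in Python, not the Spread library's actual code; the names
`RETRYABLE` and `is_connection_fatal` are made up for the example.

```python
import errno

# Return code quoted in the message above for CONNECTION CLOSED.
CONNECTION_CLOSED = -8

# Transient conditions: the caller should simply try the call again.
# EAGAIN and EWOULDBLOCK share one value on many platforms.
RETRYABLE = {errno.EAGAIN, errno.EINTR, errno.EWOULDBLOCK}


def is_connection_fatal(err_code):
    """Return True if a socket error should be reported as CONNECTION_CLOSED.

    Hypothetical helper: it only encodes the rule stated in the message,
    i.e. any error other than EAGAIN, EINTR, or EWOULDBLOCK is fatal.
    """
    return err_code not in RETRYABLE
```

Under this rule a reset or broken pipe on the mbox socket (for example
because the daemon closed it) is indistinguishable, from the client's point
of view, from any other fatal socket error.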

Part of the reason I think that the cause is what I described is that
you said "about once every 1000 publishing attempts."  It probably
isn't coincidental that this is the defined value for
MAX_SESSION_MESSAGES (see spread_params.h), which dictates the number
of messages Spread will allow to pile up for a session before
disconnecting it for failing to receive.
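
To make the backlog limit concrete, here is a toy model of a single session
in Python. It is a sketch under stated assumptions, not the daemon's real
logic: the class `ToySession` is invented for illustration, and whether the
daemon disconnects at exactly the limit or one message past it is an
assumption of the model.

```python
# Value cited above from spread_params.h.
MAX_SESSION_MESSAGES = 1000


class ToySession:
    """Hypothetical model of one client session as seen by the daemon."""

    def __init__(self, limit=MAX_SESSION_MESSAGES):
        self.limit = limit
        self.backlog = 0       # messages queued but not yet received
        self.connected = True

    def deliver(self):
        """Queue one message; disconnect the session if the backlog overflows."""
        if not self.connected:
            return False
        self.backlog += 1
        if self.backlog > self.limit:   # assumption: strict "greater than"
            self.connected = False
        return self.connected

    def receive(self):
        """Client reads one queued message, shrinking the backlog."""
        if self.connected and self.backlog > 0:
            self.backlog -= 1


# A session that never receives is dropped shortly after the limit is reached.
idle = ToySession()
for _ in range(MAX_SESSION_MESSAGES + 1):
    idle.deliver()

# A session that receives as fast as messages arrive stays connected.
busy = ToySession()
for _ in range(5 * MAX_SESSION_MESSAGES):
    busy.deliver()
    busy.receive()
```

In this model `idle.connected` ends up False while `busy.connected` stays
True, which is why a failure rate of roughly one per 1000 messages points
at a connection that is being sent messages it never reads.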

Could you tell me a little bit more about your application?  What you
described should be absolutely fine, since you don't rely on Spread's
internal queuing any more than absolutely necessary.

When do the applications that are having trouble connect to Spread?
The normal paradigm for something like what you've described is to
have them connect before spinning off the receiving thread, and share
the mbox (with some sort of synchronization).  If I had to guess what
was going wrong from what you said before, I would guess that for the
applications that are both publishers and subscribers, you have opened
two mboxes, joined the relevant groups on both, and are only receiving
on one of them.

If this is the case, I would recommend that you do one of the
following: (a) Have only one mbox.  Depending on the library
implementation you're working with, you may or may not need additional
synchronization.  OR (b) Have two mboxes, but for the
sending/publishing thread, do NOT join the groups.  Spread supports
open-group semantics, which means that you can send to a group without
being a member of it.
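
The difference between the suspected misconfiguration and option (b) can be
illustrated with a small simulation. This is a hedged sketch in Python, not
Spread itself: `ToyDaemon` and `publishes_before_failure` are hypothetical
names, the limit of 1000 is the value mentioned earlier, and the exact
disconnect threshold is an assumption of the model.

```python
class ToyDaemon:
    """Hypothetical daemon model with open-group semantics."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.members = {}   # group name -> set of session names
        self.backlog = {}   # session name -> unread count (absent = disconnected)

    def connect(self, name):
        self.backlog[name] = 0

    def join(self, name, group):
        self.members.setdefault(group, set()).add(name)

    def multicast(self, sender, group):
        # Open-group semantics: the sender need not be a member of the group.
        if sender not in self.backlog:
            raise ConnectionError("connection closed by daemon")
        for name in self.members.get(group, ()):
            if name in self.backlog:
                self.backlog[name] += 1
                if self.backlog[name] > self.limit:
                    del self.backlog[name]   # drop a session that is not receiving

    def receive(self, name):
        if self.backlog.get(name, 0) > 0:
            self.backlog[name] -= 1


def publishes_before_failure(join_sender_mbox, attempts=5000):
    """Count successful publishes on a send-only mbox that never receives."""
    daemon = ToyDaemon()
    daemon.connect("pub")      # mbox used only for sending
    daemon.connect("sub")      # mbox serviced by the receiving thread
    daemon.join("sub", "g")
    if join_sender_mbox:
        daemon.join("pub", "g")   # the suspected mistake
    for i in range(attempts):
        try:
            daemon.multicast("pub", "g")
        except ConnectionError:
            return i
        daemon.receive("sub")     # the subscriber keeps up
    return attempts
```

In this model, a sending mbox that has joined the group but is never read
fails shortly after 1000 publishes, while a sending mbox that has not joined
completes every attempt.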

I hope this helps.  If it doesn't, please give the list whatever other
information you can provide.

Cheers,
Ryan

On Sat, 4 Sep 2004 07:55:15 -0600, Jim Vickroy <jim.vickroy at noaa.gov> wrote:
> Thanks for your response, Ryan.
>
> I did not make it clear in my original posting, that these are publishing
> errors -- not subscriber errors.  The errors are being trapped by try-catch
> blocks wrapping publishing requests.
>
> Most of the publishers are also subscribers to the same message group (they
> must be), but each subscriber operates in its own dedicated thread that does
> nothing but receive and queue messages for subsequent processing.  I doubt
> the receiving/queuing thread is failing to keep up with the publishers,
> especially since the burst rate is only on the order of 10 messages per
> second for one second.  The applications keep rather detailed logs of the
> messages received/published, and I see no evidence of any subscriber failing
> to keep up with the publishing rate.
>
> It is curious, however, that the one publisher which is not also a (Spread)
> subscriber is the only component that, so far, has not experienced a
> publishing error.  This component does have a receiver thread, but it is
> monitoring a simple socket connection for message traffic.
>
> That said, I am a novice user of Spread and certainly may have an
> implementation problem; it is just not clear what is wrong.
>
> I will ask our administrator to upgrade to the current, stable version of
> Spread.
>
>
>
>
> -----Original Message-----
> From: spread-users-admin at lists.spread.org
> [mailto:spread-users-admin at lists.spread.org]On Behalf Of Ryan Caudy
> Sent: Friday, September 03, 2004 8:54 PM
> To: Jim Vickroy
> Cc: SPREAD-USERS
> Subject: Re: [Spread-users] 'Connection closed by spread' ...
>
> Hi,
>
> This error is usually caused by a failure to receive by clients to
> Spread.  If your clients let more than a certain number of messages,
> 1000 with a "vanilla" Spread, pile up at the daemon without receiving
> them, then Spread will disconnect them with that error code.
>
> You may want to look at past posts on this list about flow control.
>
> Also, on a side note, I would encourage you to use the most recent
> stable release of Spread.
>
> Cheers,
> Ryan
>
> On Fri, 3 Sep 2004 12:12:54 -0600, Jim Vickroy <jim.vickroy at noaa.gov> wrote:
> > ... is the error that is happening more frequently than desirable -- about
> > once every 1000 publishing attempts.
> >
> > Could someone suggest a way to reduce this error rate (at least by a factor
> > of 10)?
> >
> > The platform:
> >         Spread: v 3.17.01 (20 June 2003)
> >         Spread Host: RedHat Workstation, Kernel: 2.4.21-4.EL
> >         Client Host: Microsoft Windows 2000 Server
> >         Client Software: Python v 2.3.3
> >
> > The use case:
> >         Messages are published in bursts at 1-minute intervals.
> >         Each burst of messages comprises 5-10 messages; each message is
> >         generated by a distinct process.
> >         Each message is about 100 bytes.
> >         Publication service type is set to spread.SAFE_MESS.
> >
> > Thanks,
> >
> > -- jv
> >
> > _______________________________________________
> > Spread-users mailing list
> > Spread-users at lists.spread.org
> > http://lists.spread.org/mailman/listinfo/spread-users
> >
>
> --
> ---------------------------------------------------------------------
> Ryan W. Caudy
> <rcaudy at gmail.com>
> ---------------------------------------------------------------------
> Bloomberg L.P.
> <rcaudy1 at bloomberg.net>
> ---------------------------------------------------------------------
> [Alumnus]
> <caudy at cnds.jhu.edu>
> Center for Networking and Distributed Systems
> Department of Computer Science
> Johns Hopkins University
> ---------------------------------------------------------------------
>
>
>



