[Spread-users] Question about thread-safety

White Stuart - stwhit Stuart.White at acxiom.com
Fri Jun 13 18:12:14 EDT 2003

Thanks again John.  This discussion has been very helpful!

In your note, you state "the #1 reason for a closed connection is either
your app is sending too many messages or not receiving fast enough".

I understand how to control the "sending too many messages" - by enforcing a
maximum number of outstanding messages as you described below.

That solves half the problem.  How do I solve the other half?  (if my app is
not receiving fast enough).  It is undesirable for my processes to simply
get kicked out of the group because they can't keep up with the other
processes.  Is there a way I can handle this more gracefully?  Perhaps
detect that one of the processes is getting overwhelmed and instruct the
other processes to stop sending until the slow process catches up?

How do other spread applications handle this?

Thanks again!

-----Original Message-----
From: John Schultz [mailto:jschultz at d-fusion.net]
Sent: Friday, June 13, 2003 3:15 PM
To: spread-users at lists.spread.org
Subject: Re: [Spread-users] Question about thread-safety

You are correct that you cannot put a lock around all SP* calls due to 
blocking recvs and sends.  Spread handles the necessary synchronization
internally.

From your description, I believe you are synchronizing with the library 
properly.  I do not know of any example threaded C Spread programs.

If you are receiving CONNECTION_CLOSED (or ILLEGAL_SESSION -- same error 
in your case) this means either (very unlikely) there was a socket 
failure or (almost surely) that the Spread daemon forcefully closed your 
connection on the other end.
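
A minimal sketch of treating the two errors alike, as described above. The helper name is_connection_lost() is hypothetical, and the numeric values are assumed to mirror those in sp.h; they are defined here only so the fragment builds without Spread installed. In real code, include <sp.h> and drop the #defines.

```c
/* Assumed to match sp.h; defined only so the sketch builds without Spread. */
#ifndef CONNECTION_CLOSED
#define CONNECTION_CLOSED (-8)
#endif
#ifndef ILLEGAL_SESSION
#define ILLEGAL_SESSION   (-11)
#endif

/* Hypothetical helper: nonzero if an SP_* return code means the
   connection is gone, regardless of which thread saw the failure first. */
static int is_connection_lost(int ret)
{
   return ret == CONNECTION_CLOSED || ret == ILLEGAL_SESSION;
}
```

Both the sender and the receiver thread could run their SP_* return codes through a check like this and take the same shutdown or reconnect path.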

The #1 reason for the Spread daemon to close your connection is that you 
are not implementing any form of flow control -- that your application 
is sending too many msgs and/or not receiving fast enough.  Eventually, 
the daemon gets sick of you slowing down the system and disconnects you. 
There has been a lot of discussion about this common problem on the 
mailing list: poke around to learn more.

A simple solution for your problem is the following flow control: keep a 
count of the number of outstanding msgs this process has sent that have 
not yet been delivered to the group.  Before calling SP_*multicast check 
if the count of outstanding msgs is less than some maximum outstanding 
msg count. If it isn't, block until that condition is met. Then 
increment the outstanding msg count and send the msg. On the recv side, 
each time you receive a msg from yourself, decrement the outstanding msg 
count and signal any blocked senders.

The maximum outstanding msg count can either be a fixed small number 
(30-50) or based on the size of the membership of the group (500 / 
membership_size). This should solve your flow control problem 99.9% of 
the time.
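
As a rough illustration of the membership-based limit, the sketch below computes 500 / membership_size. The function name max_outstanding() and the clamping to a minimum of 1 are assumptions added for safety, not part of the rule of thumb above.

```c
/* Hypothetical helper: choose the outstanding-message limit from the
   current group size using the 500 / membership_size rule of thumb. */
static int max_outstanding(int membership_size)
{
   int limit;

   if (membership_size < 1) {
      membership_size = 1;          /* guard against division by zero */
   }
   limit = 500 / membership_size;

   return limit > 0 ? limit : 1;    /* always allow at least one send */
}
```

With 10 members this gives 50, which lines up with the fixed 30-50 range. One option is to recompute the limit whenever a membership message reports a new group size.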

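I've included a little pthread pseudo-code of what I mean:

#include <assert.h>
#include <string.h>
#include <pthread.h>
#include <sp.h>

typedef struct {
   int             m_Mbox;                  /* conn's mailbox */
   char            m_Name[MAX_GROUP_NAME];  /* conn's priv name */

   pthread_mutex_t m_Lock;                  /* protects the two counts */
   pthread_cond_t  m_Cond;                  /* signaled when a slot frees up */
   int             m_MaxOutstandingCnt;     /* must be > 0 */
   int             m_OutstandingCnt;

} sp_flow;

/* call my_send() instead of directly calling SP_multicast() */

int my_send(sp_flow *flow, service stype,
             const char *group, int16 mtype, int mlen, const char *msg)
{
   pthread_mutex_lock(&flow->m_Lock);
   {
     /* block while the window is full (count is not less than the max) */
     while (flow->m_OutstandingCnt >= flow->m_MaxOutstandingCnt) {
       assert(flow->m_OutstandingCnt > 0);
       pthread_cond_wait(&flow->m_Cond, &flow->m_Lock);
     }
     assert(flow->m_OutstandingCnt >= 0);
     ++flow->m_OutstandingCnt;
   }
   pthread_mutex_unlock(&flow->m_Lock);

   /* send outside the lock so a blocking send cannot stall the receiver */
   return SP_multicast(flow->m_Mbox, stype, group, mtype, mlen, msg);
}

/* recv thread's main fcn */

void *recv_thread(void *arg)
{
   sp_flow *flow = (sp_flow*) arg; /* init'ed by recv thread's spawner */
   service service;
   char    sender[MAX_GROUP_NAME];
   int     ret;

   while (1) {
     service = 0;
     ret     = SP_receive(flow->m_Mbox, &service, sender, ...);

     if (ret < 0) {
       /* handle the error (e.g. CONNECTION_CLOSED), then stop */
       break;
     }

     /* a regular msg from ourselves: one outstanding msg was delivered */
     if (Is_regular_mess(service) &&
         strncmp(flow->m_Name, sender, MAX_GROUP_NAME) == 0) {

       pthread_mutex_lock(&flow->m_Lock);
       {
         assert(flow->m_OutstandingCnt > 0);
         --flow->m_OutstandingCnt;
         pthread_cond_signal(&flow->m_Cond);  /* wake a blocked sender */
       }
       pthread_mutex_unlock(&flow->m_Lock);
     }

     ... process message ...
   }

   /* don't forget to clean up flow! */

   return NULL;
}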
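
The pseudo-code above leaves the initialization and cleanup of the flow structure to the caller. The following is a minimal, Spread-free sketch of just the counting part, so that it can be built and tried on its own. The type flow_window and the flow_* function names are hypothetical and are not part of the Spread API.

```c
#include <pthread.h>

/* Hypothetical, Spread-free version of the counting part of sp_flow. */
typedef struct {
   pthread_mutex_t lock;
   pthread_cond_t  cond;
   int             max_outstanding;
   int             outstanding;
} flow_window;

/* Returns 0 on success, else a pthread error code. */
static int flow_init(flow_window *w, int max_outstanding)
{
   int err = pthread_mutex_init(&w->lock, NULL);

   if (err != 0) {
      return err;
   }
   err = pthread_cond_init(&w->cond, NULL);
   if (err != 0) {
      pthread_mutex_destroy(&w->lock);
      return err;
   }
   w->max_outstanding = max_outstanding > 0 ? max_outstanding : 1;
   w->outstanding     = 0;
   return 0;
}

/* Sender side: block until a slot is free, then claim it. */
static void flow_acquire(flow_window *w)
{
   pthread_mutex_lock(&w->lock);
   while (w->outstanding >= w->max_outstanding) {
      pthread_cond_wait(&w->cond, &w->lock);
   }
   ++w->outstanding;
   pthread_mutex_unlock(&w->lock);
}

/* Receiver side: one of our own messages came back; free its slot. */
static void flow_release(flow_window *w)
{
   pthread_mutex_lock(&w->lock);
   if (w->outstanding > 0) {
      --w->outstanding;
   }
   pthread_cond_signal(&w->cond);
   pthread_mutex_unlock(&w->lock);
}

/* Call only after the sender and receiver threads have stopped. */
static void flow_destroy(flow_window *w)
{
   pthread_cond_destroy(&w->cond);
   pthread_mutex_destroy(&w->lock);
}
```

In this sketch, flow_acquire() would be called just before SP_multicast() and flow_release() each time the receive thread sees a regular message whose sender is the connection's own private name.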

Hope that helps!

John Schultz
Co-Founder, Lead Engineer
D-Fusion, Inc. (http://www.d-fusion.net)
Phn: 443-838-2200 Fax: 707-885-1055

White Stuart - stwhit wrote:

> Hi John,
>
> Thanks for your comments.  I am #defining _REENTRANT and linking
> libtspread.so.  I am not closing/opening multiple connections, so I don't
> think I'm running into your file descriptor-reuse issue.
>
> I decided to try locking a mutex around all SP_* calls to try to resolve
> the problem, but it occurs to me that I cannot, because calls to SP_receive()
> will block until a message is available.  If I lock a mutex around all
> calls, you can see that this could easily create a deadlock situation.  A
> call to SP_receive will block (with the mutex locked) until a message
> becomes available, but none ever will, because the "sender" thread
> tries to lock the same mutex (and blocks) before he calls SP_multicast().
>
> Perhaps I'm experiencing one of the non-user errors you mentioned, such as
> socket failure.  Bummer.
>
> Are there any resources/example programs which demonstrate correct spread
> usage in threaded C applications?
>
> Thanks!
>
> -----Original Message-----
> From: John Schultz [mailto:jschultz at d-fusion.net]
> Sent: Friday, June 13, 2003 1:31 PM
> To: spread-users at lists.spread.org
> Subject: Re: [Spread-users] Question about thread-safety
>
> NOTE: if anyone knows how to instruct Unix/Linux systems not to reuse 
> file descriptor IDs in a process, could you please email me or the list 
> a good reference (with page #)? Thanks!
>
> Hi Stuart,
>
> What you are proposing is sound provided that you #define _REENTRANT 
> when compiling and link with Spread's thread safe library libtspread.a.
>
> Currently, you can get spurious errors from Spread for the following 
> reason: whenever there is a non-user error (socket failure, etc.) on a 
> mailbox/socket, the Spread user library immediately closes and 
> invalidates the mailbox/socket and returns CONNECTION_CLOSED.
>
> Any subsequent SP calls on that mailbox/socket will return 
> ILLEGAL_SESSION.  So, if your sender thread gets a CONNECTION_CLOSED 
> your recv'er thread would very likely get an ILLEGAL_SESSION (or maybe a 
> CONNECTION_CLOSED), and vice versa.  Just treat any such ILLEGAL_SESSION 
> error as if it were a CONNECTION_CLOSED error.
>
> Personally, I think that the Spread library should be modified to record 
> any such error and return it for all subsequent SP calls on that 
> mailbox.  Furthermore, the mailbox/socket should be invalidated/closed 
> only upon the user calling SP_disconnect on it.
>
> If your program is opening and closing multiple Spread connections then 
> there is also an OS file descriptor reuse race condition that could be 
> causing problems.  This race condition is best explained by example:
>
> Imagine you have a sender thread (x) and a receiver thread (y) for 
> mailbox/socket A and another thread (z) which is going to call 
> SP_connect to create a mailbox/socket B.  Just before x starts writing a 
> msg on A, y receives an error on A and therefore immediately 
> closes/invalidates it. Next, z successfully performs SP_connect and is 
> assigned mailbox/socket B, which happens to have the same value as A due 
> to the OS reusing file descriptor IDs. Finally, x happily (and 
> successfully) writes its msg for A on B not realizing that it is 
> actually writing to a different Spread connection!
>
> This behavior is obviously not correct! I'm not sure if this race 
> condition exists on Windows but it definitely exists in Unix/Linux. I 
> don't know if this problem can be reliably handled on the daemon side 
> and I doubt if currently the daemon even tries to detect it.
>
> The only way I can think of to avoid this race condition with the 
> current Spread library is to instruct your OS not to reuse file 
> descriptor IDs (see NOTE above).
>
> If the Spread library is modified as I suggested above, then I think the 
> race condition could be avoided by synchronizing calls to SP_connect and 
> SP_disconnect.
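
The descriptor reuse behind the race described in the quoted message can be observed directly. The sketch below is illustrative only: it uses plain open() on /dev/null rather than Spread connections, and the function name fd_number_is_reused() is hypothetical. POSIX requires open() to return the lowest-numbered descriptor not currently open in the process, which is why a freshly closed number tends to come straight back.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the OS handed back the same descriptor number after a
   close, 0 if not, -1 if /dev/null could not be opened. */
static int fd_number_is_reused(void)
{
   int first, second, same;

   first = open("/dev/null", O_RDONLY);
   if (first < 0) {
      return -1;
   }
   close(first);                          /* "A" is now invalid ... */

   second = open("/dev/null", O_RDONLY);  /* ... and "B" is created */
   if (second < 0) {
      return -1;
   }
   same = (second == first);
   close(second);

   return same;
}
```

A thread still holding the old number after the close would be writing to whatever the second open() returned, which is the situation thread x ends up in.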
