[Spread-users] mbox corruption and timeouts in membership.c

Jonathan Stanton jonathan at cnds.jhu.edu
Tue Nov 14 04:15:47 EST 2006


On Fri, Nov 03, 2006 at 04:52:45PM -0600, Matt Garman wrote:
> On 11/3/06, John Schultz <jschultz at spreadconcepts.com> wrote:
> >-8 (Connection Closed) errors for a client are almost invariably traced to
> >a lack of flow control amongst sending applications.  In a multicast
> >environment it is very easy for aggressive senders to overrun receivers'
> >buffers.  Spread tries not to allow slow readers to exhaust its memory and
> >cause the daemon to crash so it disconnects readers that aren't keeping up
> >with their flow of traffic.
> 
> Thank you for the quick feedback.  I set up a test spread
> configuration as follows:
> 
>    - Two segments (in different VLANs)
>    - One machine on each segment
>    - One spread daemon on each machine (i.e. two total spread daemons)
>    - Both daemons have the SESSION logging flag
>    - A client on machine A sends an enormous amount of messages (1000
> Hz, each message is about 1 kB)
>    - A receiver on machine B sleeps 1 sec between each received message
> 
> >One recommendation I can give is to turn on the SESSION logging flag and
> >then search the log for "sess_kill" and see why it is disconnecting your
> >clients.
> 
> In the spread log on machine A (sender side) I have tons of these messages:
> 
> [Fri 03 Nov 2006 22:16:32] Sess_read: Message has type field 0x80000082
> [Fri 03 Nov 2006 22:16:32] Sess_read: queueing message of type 2 with
> len 0 to the protocol
> 
> Why does it say that it's queuing a message of length zero?

That is unusual. It should be the length of the data message read from the 
application. The 80000082 type is a RELIABLE regular data message.
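
For anyone decoding these type fields by hand, here is a minimal sketch of the 
masking involved. The constant values are assumed from Spread's headers 
(ENDIAN_TYPE and RELIABLE_MESS) and should be checked against your own source 
tree; the SKETCH_ names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against the headers in your Spread tree. */
#define SKETCH_ENDIAN_TYPE   0x80000080u  /* byte-order marker bits */
#define SKETCH_RELIABLE_MESS 0x00000002u  /* RELIABLE service type  */

/* Remove the byte-order marker bits, leaving the service-type bits. */
uint32_t sketch_strip_endian(uint32_t type_field)
{
    return type_field & ~SKETCH_ENDIAN_TYPE;
}

/* Non-zero when the RELIABLE service bit is set in the type field. */
int sketch_is_reliable(uint32_t type_field)
{
    return (sketch_strip_endian(type_field) & SKETCH_RELIABLE_MESS) != 0;
}
```

Under those assumptions, 0x80000082 reduces to 0x2 once the marker bits are 
masked off, which matches the RELIABLE reading above.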

> 
> In the spread log on machine B (receiver side) I have many of the
> following messages:
> 
> [Fri 03 Nov 2006 22:27:29] Sess_badger: for mbox 9
> 
> After those, next in the log is as follows:
> 
> [Fri 03 Nov 2006 22:27:29] Sess_write: killing mbox 9 for not reading
> [Fri 03 Nov 2006 22:27:29] Sess_kill: killing session 29818 ( mailbox 9 )
> [Fri 03 Nov 2006 22:29:54] Sess_accept: set sndbuf/rcvbuf to 204800
> [Fri 03 Nov 2006 22:29:54] Sess_recv_client_auth: Client requested
> NULL type authentication
> [Fri 03 Nov 2006 22:29:54] Sess_session_authorized: Accepting from
> 0.0.0.0 with private name 29818 on mailbox 9
> [Fri 03 Nov 2006 22:29:54] Sess_read: Message has type field 0x80010080
> [Fri 03 Nov 2006 22:29:54] Sess_read: queueing message of type 8 with
> len 0 to the protocol
> 
> My cursory glance through the code suggests that "sess_badger" is a
> "nag"-type function that keeps trying to send the message through to
> the sender.  Is this correct?  As you suggested, the log pretty

The badger function manages the output queue to the receiver and keeps trying to 
write data to the receiver whenever possible (using select to test for writability).
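
As a rough illustration of that kind of writability test (this is not the 
actual Sess_badger code; the function names are hypothetical), a zero-timeout 
select() can be used to poll a descriptor before attempting a write:

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if fd can accept a write right now, 0 if not, -1 on error. */
int sketch_fd_is_writable(int fd)
{
    fd_set wfds;
    struct timeval tv;
    int rc;

    assert(fd >= 0 && fd < FD_SETSIZE);
    do {
        /* Rebuild the set and timeout each time: select() may modify both. */
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = 0;
        tv.tv_usec = 0;            /* zero timeout: poll, do not block */
        rc = select(fd + 1, NULL, &wfds, NULL, &tv);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return FD_ISSET(fd, &wfds) ? 1 : 0;
}

/* Tries to write pending bytes. Returns the byte count written, 0 if the
 * descriptor is not writable yet, or -1 on error. The descriptor should be
 * non-blocking if len may exceed the space the kernel has available. */
ssize_t sketch_try_flush(int fd, const char *buf, size_t len)
{
    int ready = sketch_fd_is_writable(fd);

    if (ready <= 0)
        return (ssize_t)ready;
    return write(fd, buf, len);
}
```

A caller would keep the unsent remainder queued and retry later whenever 
sketch_try_flush returns 0 or a short count.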

> plainly said that the connection was killed due to not reading.

Yup. In this test case you definitely caused that. Whether the same thing is 
happening in your real app is harder to say, but if you are able to do session 
logging on it, you will clearly see the 'kill' messages.

To make the logging lighter weight, you can look in session.c for the Sess_write 
Alarm() message about killing for not reading and change the first line from:
		Alarm( SESSION, 

to 

		Alarmp( SPLOG_PRINT, SESSION,

and set the following in your spread.conf file:
EventPriority =  ERROR

This will cause only high priority alarm events (ERROR level or higher) to print 
(SPLOG_PRINT is considered a 'must print' priority) while not printing all of the 
other session messages.

> 
> Finally, where is the sendbuf/rcvbuf size set?

It is set in the libspread/sp.c and daemon/session.c code in the connection 
functions. We raise the send and receive buffers up to 200k by default, so usually 
there isn't much more you can do with that.
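
For reference, raising socket buffers is done with the standard setsockopt() 
call. The following is a minimal sketch of stepping the sizes upward, not the 
code from sp.c or session.c; the function name and the start/step parameters 
are hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Requests progressively larger SO_SNDBUF/SO_RCVBUF sizes up to max_bytes.
 * Returns the last size both options accepted, or -1 if none was accepted.
 * Note: some kernels silently clamp the request instead of failing, so use
 * getsockopt() if you need to know the size actually in effect. */
int sketch_raise_socket_buffers(int fd, int start, int step, int max_bytes)
{
    int accepted = -1;

    assert(start > 0 && step > 0 && max_bytes >= start);
    for (int size = start; size <= max_bytes; size += step) {
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
            break;
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            break;
        accepted = size;
    }
    return accepted;
}
```

Calling sketch_raise_socket_buffers(fd, 10240, 10240, 204800) would aim for the 
204800 figure that appears in the Sess_accept log line quoted earlier.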

If you are sure it is not a long-term flow control problem but just burstiness, you 
can raise the "MAX_SESSION_MESSAGES" value in spread_params.h, which will cause 
spread to buffer more messages before cutting off a client.
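
To see why a larger limit helps with bursts but not with a receiver that is 
permanently too slow, here is a toy model of a bounded per-session queue. It 
is only a sketch of the idea, not the daemon's implementation; the struct, the 
function names, and the limit of 4 are hypothetical and chosen for 
illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical small limit for illustration; not Spread's default value. */
#define SKETCH_MAX_SESSION_MESSAGES 4

struct sketch_session {
    size_t queued;   /* messages waiting to be written to the client */
    int    killed;   /* set once the session has been disconnected   */
};

/* Queues one message. Returns 1 on success, or 0 if the session is (or has
 * just been) killed because the client let its queue reach the limit. */
int sketch_enqueue(struct sketch_session *s)
{
    assert(s != NULL);
    if (s->killed)
        return 0;
    if (s->queued >= SKETCH_MAX_SESSION_MESSAGES) {
        s->killed = 1;   /* client is not reading: cut it off */
        s->queued = 0;   /* drop whatever was still pending   */
        return 0;
    }
    s->queued++;
    return 1;
}

/* Records that the client read one message from its queue. */
void sketch_delivered(struct sketch_session *s)
{
    assert(s != NULL);
    if (s->queued > 0)
        s->queued--;
}
```

A burst no longer than the limit is absorbed and drained later, while a sender 
that stays ahead of the receiver indefinitely reaches the limit no matter how 
large it is set.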

Cheers,

Jonathan

-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------