[Spread-users] Another way to leak (valgrind report)
jonathan at cnds.jhu.edu
Tue Aug 31 14:50:25 EDT 2004
On Tue, Aug 31, 2004 at 02:19:22PM -0400, David Shaw wrote:
> I am now testing with the CVS code and indeed the join/leave leak
> seems to be fixed. Excellent!
Glad to hear it.
> Unfortunately, there is still something that looks like a leak
> somewhere in the CVS code. Specifically, I can get basically all the
> memory on a box if I do:
> Repeat 600 times:
> Connect w/membership messages
> Repeat 500 times:
> Join a group
> Leave a group
This looks like the same basic outline as one of your earlier abuse
programs (mabuse2 I think). Is that correct?
What this should cause is that 1000 messages are queued up for each
connection, and after the disconnect those messages are released as soon
as Spread detects that the connection was closed. Detecting the close can
take a second or so, but it should happen.
Now in all of these cases, even when Spread releases memory by calling
'free', my experience has been that the reported virtual memory size (in
top, for example) does NOT go down. This is because the used memory is
fragmented and libc does not lower the sbrk() break while live
allocations remain near the top of the heap. So even though Spread may
currently have only 12 MB or so allocated, the allocations are scattered
across 200 MB of virtual memory space and so VSS is 200 MB. If another
program causes Spread to swap out unused pages, then the RSS does go down,
as not all physical memory pages are needed.
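
To make the fragmentation point concrete, here is a rough sketch of a toy
heap model. It is not glibc's allocator (which can also use mmap for large
blocks); it only assumes a heap that grows upward and can be trimmed no
lower than the end of the highest block still in use. The names
heap_stats and fragment_model are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT glibc: the heap grows upward, and its "break" can only
 * be lowered to just past the highest block that is still allocated. */
typedef struct {
    size_t live_bytes;   /* bytes still allocated after the frees */
    size_t heap_extent;  /* model break: offset just past the highest live block */
} heap_stats;

/* Allocate n_blocks contiguous blocks of block_size bytes, free all of
 * them except every keep_every-th one, and report what remains. */
heap_stats fragment_model(size_t n_blocks, size_t block_size, size_t keep_every)
{
    assert(keep_every > 0);
    heap_stats s = {0, 0};
    for (size_t i = 0; i < n_blocks; i++) {
        if (i % keep_every == 0) {                 /* this block survives */
            s.live_bytes += block_size;
            s.heap_extent = (i + 1) * block_size;  /* break cannot drop below it */
        }
    }
    return s;
}
```

With 1000 blocks of 1024 bytes and one block in 100 kept, the model
reports about 10 KB live while the break stays at roughly 901 KB, which
is the same shape as the 12 MB versus 200 MB figures above.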
> I've attached a program that does that. Running this program causes
> spread's memory and fd usage to shoot upwards. The fd count stops at
> 511, but by that time, spread has most of the physical ram on the box.
> When the program finishes, spread's fd count eventually falls, but the
> memory is not released. I suppose this could be extraordinarily
> aggressive caching, but leak or cache, spread is ending up with far too
> much memory.
How long does it take in real time to get to this state? (A few seconds?)
Now, I must say that this usage pattern looks quite unrealistic -- i.e.
why are you asking for membership messages if you don't ever read them? If
you do read them (insert an SP_receive() call in the inner loop) then I
don't think you will see excessive memory usage (even if you run multiple
threads in parallel).
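
The effect of reading versus not reading can be sketched with a small
model of one connection's outgoing queue. This is a simplification and
not Spread code: it assumes each join or leave produces exactly one
membership message, and that a reading client consumes it before issuing
the next call. The function peak_queue_depth is hypothetical.

```c
#include <assert.h>

/* Hypothetical model of one connection's outgoing queue in the daemon.
 * Assumption: every join or leave queues one membership message, and a
 * reading client removes one message after each call. */
int peak_queue_depth(int iterations, int client_reads)
{
    assert(iterations >= 0);
    int depth = 0;  /* messages currently queued for the client */
    int peak = 0;   /* largest depth seen so far */
    for (int i = 0; i < iterations; i++) {
        for (int op = 0; op < 2; op++) {   /* op 0 = join, op 1 = leave */
            depth++;                       /* daemon queues a membership message */
            if (depth > peak)
                peak = depth;
            if (client_reads)
                depth--;                   /* client receives it right away */
        }
    }
    return peak;
}
```

Under these assumptions, 500 join/leave iterations with no reads peak at
1000 queued messages per connection, while a client that reads after each
call never has more than one message waiting.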
Part of what I think is happening is that join/leave calls are
asynchronous -- they just send a message to the daemon and do not wait for
a reply. Then the daemon has to process them and insert a reply membership
message into the outgoing queue for that connection. So you are basically
having up to 600 connections each initiating a thousand events (which
Spread has to deliver reliably, so it can't discard them) and noting that
this causes a large amount of memory usage. I do not believe the memory
usage is unbounded (just large), given the standard C sbrk() issue.
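
As a back-of-the-envelope check on "large but bounded", the worst case
can be estimated as connections times messages per connection times bytes
per message. The per-message size is NOT known from this thread; the
value used in the note below is purely a placeholder, and the function
worst_case_queued_bytes is hypothetical. It also assumes every connection
still holds its full queue at the same moment, so it is an upper bound.

```c
#include <assert.h>

/* Upper bound on queued bytes if every connection still holds all of its
 * membership messages. bytes_per_msg is an assumed figure, not a
 * measured Spread value. */
unsigned long long worst_case_queued_bytes(unsigned long long connections,
                                           unsigned long long iterations,
                                           unsigned long long bytes_per_msg)
{
    assert(bytes_per_msg > 0);
    unsigned long long msgs_per_conn = 2ULL * iterations;  /* one join + one leave */
    return connections * msgs_per_conn * bytes_per_msg;
}
```

For 600 connections, 500 iterations and an assumed 200 bytes per queued
message, this gives 600,000 messages and 120,000,000 bytes, before any
allocator overhead or fragmentation.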
Now there could be an additional leak, and maybe we are detecting the
closed connections too slowly -- which makes the problem more visible and
is something we might be able to work on -- but it might just be that we
are using the required amount of memory to handle the load you are
sending. It's just that that load is larger than available memory. (i.e.
a DOS type resource
> From earlier comments on the list, I wonder if the problem is messages
> (in this case, membership messages) that are waiting to be read by the
> client, but instead the client just disconnects. These messages can
> never be delivered at this point, since the client is gone.
> What happens to messages that are waiting to be read by a client if
> that client disconnects?
Once the daemon detects the disconnect (either because of a Disconnect
message or a closed fd) then all of those messages are freed.
These are just some thoughts based on what you reported. I'll try out your
test program tonight at home.
Jonathan R. Stanton jonathan at cs.jhu.edu
Dept. of Computer Science
Johns Hopkins University