For short-running scripts it doesn't really matter if your Python module
is leaking memory; it all gets cleaned up on process exit anyway. However,
if you have a long-running application you really don't want an
ever-increasing memory footprint. So it is helpful to check that your
module isn’t leaking memory, and to plug the leaks if they occur. This
post provides some ideas on how to do that, and describes some of the
implementation of my check_leaks module.
If you suspect a given function foo of leaking memory, then the
simplest way to check is to get the set of objects that exist before
running the function and the set of objects that exist after running
the function. Any object in after that isn’t in before is a leaked
object.
In Python the garbage collector tracks (practically) all the objects that
exist, and the gc module gives us access to the internals of the garbage
collector. In particular, the get_objects function returns a list of all
the objects tracked by the garbage collector. So at a really simple level
we can do:
    import gc
    before_objects = gc.get_objects()
    foo()
    after_objects = gc.get_objects()
It is then a simple matter of working out which objects are in
after_objects but not in before_objects, and those are our leaked
objects! One catch, however, is that we don’t know when the garbage
collector has run, so with the above code it is possible that there are
objects in after_objects that could still be cleaned up by the garbage
collector. Thankfully, the gc module provides the collect function,
which allows us to force a garbage collection. So we end up with
something like:
    before_objects = gc.get_objects()
    foo()
    gc.collect()
    after_objects = gc.get_objects()
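To see why the forced collection matters, here is a small sketch that builds a reference cycle and counts the surviving objects before and after gc.collect(). The Node class and the helper names are made up for illustration, and automatic collection is switched off during the demo so the result does not depend on when the collector happens to run.

```python
import gc


class Node:
    """Hypothetical class used only for this illustration."""


def make_cycle():
    a, b = Node(), Node()
    # Each object refers to the other, so reference counting alone
    # cannot free them once the function returns.
    a.other, b.other = b, a


def count_nodes():
    # Count the Node instances currently tracked by the garbage collector.
    return sum(1 for obj in gc.get_objects() if type(obj) is Node)


def demo_cycle_collection():
    was_enabled = gc.isenabled()
    gc.disable()  # avoid an automatic collection in the middle of the demo
    try:
        make_cycle()
        before_collect = count_nodes()
        gc.collect()
        after_collect = count_nodes()
    finally:
        if was_enabled:
            gc.enable()
    return before_collect, after_collect


print(demo_cycle_collection())
```

Without the gc.collect() call the two Node objects would still show up in gc.get_objects() and look like a leak, even though the collector is able to reclaim them.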
The other small gotcha in this case is that the before_objects list
will end up as an object in after_objects. This isn’t really a leak,
so we can ignore it when displaying the leaked objects.
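Putting the pieces so far together, the comparison might be sketched as below. This is not the check_leaks implementation; find_leaks and its internals are hypothetical names. Objects are compared by id() because many of them are unhashable, and that is safe here because before_objects keeps every earlier object alive, so an address cannot be reused by a new object.

```python
import gc


def find_leaks(func):
    """Return gc-tracked objects that exist after func() but not before."""
    gc.collect()
    before_objects = gc.get_objects()
    before_ids = {id(obj) for obj in before_objects}

    func()

    gc.collect()
    after_objects = gc.get_objects()
    return [
        obj
        for obj in after_objects
        if id(obj) not in before_ids
        # Our own bookkeeping containers are new objects too, but not leaks.
        and obj is not before_objects
        and obj is not before_ids
    ]
```

Only objects tracked by the garbage collector are visible this way, so anything the collector does not track will not be reported.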
So far, this gives us a list of leaked objects, but it doesn’t really
help resolve the issue except in the simplest cases. We need more
information to really fix the issue. One thing that helps is knowing why
an object hasn’t been collected yet, which comes down to knowing which
other objects still hold a reference to it. It can also be very useful to
know where the object was originally allocated. Python gives us some
tools to get this information.
The gc module provides a get_referrers function, which returns a list of
all objects that refer to a given object. This can be used to print out
which objects refer to a leaked object. Unfortunately this is less useful
than you might think, because in most cases the referring object is a
dict, as most class instances and modules store references in __dict__
rather than directly. With some more processing it would probably be
possible to work out the purpose of each dict and provide more
semantically useful information.
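As a rough sketch of that extra processing, one option is to go up one more level: when a referrer is a dict, look for an object whose __dict__ is that dict. The helper names below are hypothetical and this is only one possible approach; it will not find an owner for every dict, and on newer CPython versions an instance may show up as a direct referrer rather than through a dict.

```python
import gc


def _dict_owner(d):
    """Return an object whose __dict__ is d, or None if none is found."""
    for candidate in gc.get_referrers(d):
        try:
            if getattr(candidate, "__dict__", None) is d:
                return candidate
        except Exception:
            # Some objects misbehave on attribute access; skip them.
            continue
    return None


def describe_referrers(obj):
    """Return a short description of each object that refers to obj."""
    descriptions = []
    for referrer in gc.get_referrers(obj):
        if isinstance(referrer, dict):
            owner = _dict_owner(referrer)
            if owner is None:
                descriptions.append("dict (owner unknown)")
            else:
                descriptions.append("__dict__ of " + type(owner).__name__)
        else:
            descriptions.append(type(referrer).__name__)
    return descriptions
```

The output can also include referrers that only exist because of the inspection itself, such as stack frames, so it still needs some reading between the lines.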
To work out where an object was originally allocated we can use
the tracemalloc module (new in Python 3.4). When enabled, the
tracemalloc module stores a traceback for each memory allocation within
Python. The get_object_traceback() function returns a traceback (or
None) for a given object. This makes debugging leaks significantly
easier than just guessing where an object was allocated.
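A minimal sketch of how this can be used follows. Payload, make_payload and allocation_traceback are hypothetical names for illustration; tracing has to be started before the object is allocated, otherwise there is no traceback to return.

```python
import tracemalloc


class Payload:
    """Hypothetical class used only for this illustration."""


def make_payload():
    return Payload()


def allocation_traceback(obj):
    """Return formatted traceback lines for obj's allocation, if known."""
    tb = tracemalloc.get_object_traceback(obj)
    return [] if tb is None else tb.format()


tracemalloc.start(10)  # keep up to 10 frames per allocation
payload = make_payload()
for line in allocation_traceback(payload):
    print(line)
tracemalloc.stop()
```

Keeping more frames per allocation costs extra memory and time, so the depth passed to start() is a trade-off rather than a recommended value.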
Unfortunately, tracemalloc can sometimes provide misleading
information when trying to detect leaks. Many of the Python built-in
types optimise memory allocation by keeping an internal free list. When
an object is freed, rather than calling the normal free routine, the
object is placed on this free list, and a new object allocation can grab
an object directly from the free list rather than going through the
normal memory manager routines. This means that when using
get_object_traceback(), the trace returned will be from the original
time the memory was allocated, which may be for a long-dead object. I
think it would be good if the Python runtime provided a way to
selectively disable this free list approach to allow precise use of
get_object_traceback(). I have an initial branch that does this for some
objects up on GitHub.