Checking for leaks in Python

Thu, 26 Feb 2015 15:29:57 +0000

For short-running scripts it doesn't really matter if your Python module is leaking memory; it all gets cleaned up on process exit anyway. However, if you have a long-running application you really don't want an ever-increasing memory footprint. So it is helpful to check that your module isn’t leaking memory, and to plug any leaks that do occur. This post provides some ideas on how to do that, and describes some of the implementation of my check_leaks module.

If you suspect a given function foo of leaking memory, the simplest way to check is to record the set of objects that exist before running the function and the set of objects that exist after running it. Any object in after that isn’t in before is a leaked object.

In Python the garbage collector tracks (practically) all the objects that exist, and the gc module provides us with access to the internals of the garbage collector. In particular, the get_objects function returns a list of all the objects tracked by the garbage collector. So at a really simple level we can do:

import gc

before_objects = gc.get_objects()
foo()
after_objects = gc.get_objects()

It is then a simple matter of working out which objects are in after_objects but not in before_objects, and those are our leaked objects! One catch, however, is that we don’t know when the garbage collector last ran, so with the above code it is possible that after_objects contains objects that are unreachable and simply waiting to be collected. Thankfully, the gc module provides the collect function, which allows us to force a garbage collection. So we end up with something like:

before_objects = gc.get_objects()
foo()
gc.collect()
after_objects = gc.get_objects()

The other small gotcha in this case is that the before_objects list itself will end up as an object in after_objects. This isn’t really a leak, so we can ignore it when displaying the leaked objects.
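
Putting those pieces together, a minimal sketch of the before/after comparison might look like the following. The names find_new_objects, Node and _registry are made up for this illustration, and this is not necessarily how check_leaks does it. Objects are compared by id() rather than placed in a set directly, since not every object is hashable.

```python
import gc


def find_new_objects(func):
    """Return gc-tracked objects that exist after func() but not before."""
    gc.collect()  # start from a clean slate
    before_objects = gc.get_objects()
    # Keeping before_objects alive means none of these ids can be
    # reused by a new object while we are comparing.
    before_ids = {id(obj) for obj in before_objects}

    func()
    gc.collect()  # drop anything that is merely waiting to be collected
    after_objects = gc.get_objects()

    new_objects = []
    for obj in after_objects:
        # Skip our own bookkeeping, which is not a real leak.
        if obj is before_objects or obj is before_ids:
            continue
        if id(obj) not in before_ids:
            new_objects.append(obj)
    return new_objects


class Node:
    """Hypothetical class used only for this demonstration."""


_registry = []


def foo():
    # The module-level list keeps every Node alive after foo() returns.
    _registry.append(Node())


leaked = find_new_objects(foo)
print([type(obj).__name__ for obj in leaked])
```

The result can contain more than the one Node if foo() leaves other tracked objects behind, so treat the list as a starting point for investigation rather than a definitive answer.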

So far, this can give us a list of leaked objects, but it doesn’t really help resolve the issue except in some of the simplest cases. We need more information to really fix the issue. One thing that helps is knowing why an object hasn’t been collected yet, which comes down to knowing which other objects still hold a reference to the leaked object. Additionally, it can be very useful to know where the object was originally allocated. Python provides some tools that give us this information.

The gc module provides a get_referrers function, which returns a list of all objects that refer to a given object. This can be used to print out which objects refer to a leaked object. Unfortunately this is less useful than you might think, because in most cases the referring object is a dict, as most class instances and modules store references in __dict__ rather than directly. With some more processing it would probably be possible to work out the purpose of each dict and provide more semantically useful information.
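
As a rough sketch of that extra processing, the helper below looks one level further up for dict referrers and tries to find an object whose __dict__ is that dict. The names describe_referrer, Node, holder and registry are hypothetical, and the exact referrers reported vary between Python versions (for example, newer CPython releases may report the instance itself rather than its __dict__).

```python
import gc


def describe_referrer(referrer):
    """Return a one-line, human-readable description of a referrer."""
    if isinstance(referrer, dict):
        # A dict is often the __dict__ of a module or instance; look one
        # level further up to find which object owns it.
        for owner in gc.get_referrers(referrer):
            if getattr(owner, "__dict__", None) is referrer:
                return "__dict__ of " + type(owner).__name__
        return "dict with %d keys" % len(referrer)
    return type(referrer).__name__


class Node:
    """Hypothetical class used only for this demonstration."""


holder = Node()
leaked = Node()
holder.child = leaked   # referenced through an instance attribute
registry = [leaked]     # referenced directly from a list

descriptions = [describe_referrer(r) for r in gc.get_referrers(leaked)]
for description in descriptions:
    print(description)
```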

To work out where an object was originally allocated we can use the tracemalloc module (new in Python 3.4). When enabled, the tracemalloc module stores a traceback for each memory allocation made within Python. The get_object_traceback() function returns a traceback (or None) for a given object. This makes debugging leaks significantly easier than just guessing where an object was allocated.
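
A small example of that lookup, assuming tracing is switched on before the allocation happens (allocations made earlier have no recorded traceback). The names Node, make_node and where_allocated are hypothetical, and the frame count of 10 is an arbitrary choice.

```python
import tracemalloc


class Node:
    """Hypothetical class used only for this demonstration."""


def make_node():
    return Node()


def where_allocated(factory, nframes=10):
    """Call factory() with tracing on; return (object, traceback lines)."""
    tracemalloc.start(nframes)  # record up to nframes frames per allocation
    try:
        obj = factory()
        tb = tracemalloc.get_object_traceback(obj)
    finally:
        tracemalloc.stop()
    # tb is None when tracemalloc has no record of the allocation.
    lines = tb.format() if tb is not None else []
    return obj, lines


node, lines = where_allocated(make_node)
for line in lines:
    print(line)
```

In a real leak hunt you would start tracing once, early in the program, and only call get_object_traceback() on the objects reported as leaked.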

Unfortunately, tracemalloc can sometimes provide misleading information when trying to detect leaks. Many of the Python built-in types optimise memory allocation by keeping an internal free list. When an object is freed, rather than calling the normal free routine, it is placed on this free list, and a new object allocation can grab an object directly from the free list rather than going through the normal memory manager routines. This means that when using get_object_traceback(), the trace returned will be from the time the memory was originally allocated, which may have been for a long-dead object. I think it would be good if the Python runtime provided a way to selectively disable this free list approach, to allow precise use of get_object_traceback(). I have an initial branch that does this for some objects up on GitHub.
