Garbage collection in Python

Garbage collector aims at collecting objects occupying room in memory but that will be unlikely to be used since nothing make references to their address.

Garbage collection is a feature simplifying programming and improving robustness of the code dramatically. As a high-level programming language, Python has a garbage collector and provide a standard module, gc that allows to interact with it.

Prelude, psutil and context manager

The module psutil provides tools to monitor available memory on many different systems. I will have a basic usage of it to monitor memory usage of some Python code.

Context manager are a great way to encapsulate code and ensure that some initialization and termination of the code are well done. Since the system is not idle, we are not really measuring the space taken by the python object. The precision is low (approximately around 10Mb).

import psutil

class memory_usage():
    def __enter__(self): = psutil.virtual_memory().free

    def __exit__(self, *args):
        memory_usage = - psutil.virtual_memory().free
        print(f"Memory usage {10*(memory_usage//10**7)} Mb")

with memory_usage():
    L = list(range(1000000))
Memory usage 30 Mb

We can play a bit with memory_usage to have some idea of what is costly is Python. We compare here respectively the cost of storing 1.000.000 of dict, list, and tuple.

with memory_usage():
    K = [{} for _ in range(1000000)]
Memory usage 20 Mb
with memory_usage():
    R = [[] for _ in range(1000000)]
Memory usage 10 Mb

We can also check the impact of delete pointer reference.

with memory_usage():
    del L
Memory usage -40 Mb

We don’t really have to delete the list itself. It will be automatically delete if we reuse the reference.

with memory_usage():
    K = R = 3
Memory usage -330 Mb

We can also check that reference within the scope of a function are dealt-with appropriately.

def f():
    L = [{} for _ in range(1000000)]

with memory_usage():
Memory usage 0 Mb

Small application of gc

As we have seen, Garbage collector do a substantial amount of work to avoid programmers to think about memory management. However, this work requires to find compromise between automatic memory management and code performance.

The garbage collector collects explicit references to python object and delete object with no references. It is simple enough to be implemented efficiently but cannot detect cycle references. Those cycle references will never be free. It can happen in many way, lets look at some examples.

with memory_usage():
    L = [ [] for _ in range(1000000) ]
    for K in L:
        K.append(K) # self reference!
Memory usage 90 Mb

We have a list of 1.000.000 list self referring to them self. If we delete the list L, nothing is freed.

with memory_usage():
    del L
Memory usage 0 Mb

The memory used by those list is gone and will never be freed except at the termination of the whole script.

If such an example is unlikely to happen in real code, it can appears without noticing it. Let say we want to implement a very simple double linked list with only a append operation (don’t do that, it is a really inefficient implementation)

class DoubleLinkList:
    def __init__(self, v):
        self.pred = None
        self.succ = None
        value = v
        self.useless_memory = list(range(10000))

    def append(self, v):
        succ = self
        while succ.succ:
            succ = succ.succ
        succ.succ = DoubleLinkList(v)
        succ.succ.pred = succ

with memory_usage():
    D = DoubleLinkList(0)
    for i in range(1, 1000):
Memory usage 260 Mb

The memory used by DoubleLinkList is doomed. Even if explicitly free it.

with memory_usage():
    del D
Memory usage 0 Mb

Here, I used some useless field to make the object consume more memory. Tracking memory like this is rather crude. We can’t really see small memory leakage. Here come the gc module. It can explicitly collect all the object within the memory and reference to them.

For instance, I can check how many list are in my current memory.

import gc
print(len(list(filter(lambda e:type(e) is list, gc.get_objects()))))
L = [] # Lets add one more list
print("One more", len(list(filter(lambda e:type(e) is list, gc.get_objects()))))
del L # And delete it
print("It is deleted", len(list(filter(lambda e:type(e) is list, gc.get_objects()))))
One more 1000917
It is deleted 1000916

Lets package a bit gc for convenience:

def typeCount(some_type):
    return len(list(filter(lambda e:type(e) is some_type, gc.get_objects())))

A stupid class

Lets say we create a really stupid class. The only thing every instance does is to point to itself.

class IAmStupid:
    def __init__(self):
        self.self = self

L = [ IAmStupid() for _ in range(1000) ]

print(f"{typeCount(IAmStupidd)} Stupid Class")
del L
print("They can't be remove, so they are still:{typeCount(IAmStupid)}")
Traceback (most recent call last):
  File "/home/cha/conf_files/utils/check_py_md", line 63, in repl
    exec(code, env)
  File "<string>", line 7, in <module>
NameError: name 'IAmStupidd' is not defined

A more surprising example

So far we have seen how circular dependencies could be harmful to automatic management system. Those circular dependencies we have seen so far where obvious.

It can happen surprisingly.

class Closure:
    def __init__(self, v):
        self.v = v
        self.func = lambda e: self.v * e

This class show a pattern that can actually be use in real life context. We define a method on the go at object initialization. It is not a good practice but can be convenient.

As you probably already guess, this class act like the stupid one above.

C = Closure(0)

print(f"{typeCount(Closure)} Closured Class")
del C
print(f"They can't be remove, so they are still: {typeCount(Closure)}")
1 Closured Class
They can't be remove, so they are still: 1

Here the catch came from the lambda closure itself. In its scope it encompass self and thus point toward it. Many more tricky example could exists. I would be happy to add them, so don’t hesitate to mail me.

Some investigation of the closure itself

The catch here is to understand what a lambda closure mean. When you define a function with a lambda, it can access to the whole environment. When you call the function, the environment could have change and change the output of the lambda function.

When the scope of the lambda definition is global, it respect lazyness of Python and as expected, it follows variables name and not their value.

For instance:

x = y = 1 
f = lambda: x*y
x = 0

Within the scope of a function, the environment is no longer global and make the analysis a bit more tricky. For instance,

def some_func():
    x = 0
    y = 1
    f = lambda: x * y
    return f

other_func = some_func()
x = 1

What this code show is that the lambda function has somehow to remember its context. It can be further explore through two methods of the function __closure__ and __code__.

print(other_func.__code__.co_freevars, *other_func.__closure__)
('x', 'y') <cell at 0x7fab47478d38: int object at 0x8a48c0> <cell at 0x7fab47478d08: int object at 0x8a48e0>

If we go back to the Closure class, we can see explicitly the reference within the closure.

C = Closure(0)
print(C.func.__closure__, C.func.__code__.co_freevars)
del C
(<cell at 0x7fab474789a8: Closure object at 0x7fab4740cfd0>,) ('self',)

Automatic cleanup

I have said before that cyclic reference is wasting memory. It is actually not exactly the case. The garbage collector will eventually collect the memory, in particular if the main thread is idle.

It is possible to trigger it explicitly with the method gc.collect.

print(f"Instance of Closure {typeCount(Closure)}")
D = Closure(1)
print(f"Instance of Closure {typeCount(Closure)}")
del D
print(f"Instance of Closure {typeCount(Closure)}")
print(f"Instance of Closure {typeCount(Closure)}")
Instance of Closure 2
Instance of Closure 3
Instance of Closure 3
Instance of Closure 0