Garbage collection in Python

Garbage collector aims at collecting objects occupying room in memory but that will be unlikely to be used since nothing make references to their address.

Garbage collection is a feature simplifying programming and improving robustness of the code dramatically. As a high-level programming language, Python has a garbage collector and provide a standard module, gc that allows to interact with it.

Prelude, `psutil` and context manager

The module psutil provides tools to monitor available memory on many different systems. I will have a basic usage of it to monitor memory usage of some Python code.

Context manager are a great way to encapsulate code and ensure that some initialization and termination of the code are well done. Since the system is not idle, we are not really measuring the space taken by the python object. The precision is low (approximately around 10Mb).

import psutil

class memory_usage():
    def __enter__(self):
        self.free = psutil.virtual_memory().free

    def __exit__(self, *args):
        memory_usage = self.free - psutil.virtual_memory().free
        print(f"Memory usage {10*(memory_usage//10**7)} Mb")

with memory_usage():
    L = list(range(1000000))

We can play a bit with memory_usage to have some idea of what is costly is Python. We compare here respectively the cost of storing 1.000.000 of dict, list, and tuple.

with memory_usage():
    K = [{} for _ in range(1000000)]

with memory_usage():
    R = [[] for _ in range(1000000)]

We can also check the impact of delete pointer reference.

with memory_usage():
    del L

We don’t really have to delete the list itself. It will be automatically delete if we reuse the reference.

with memory_usage():
    K = R = 3

We can also check that reference within the scope of a function are dealt-with appropriately.

def f():
    L = [{} for _ in range(1000000)]

with memory_usage():
    f()

Small application of `gc`

As we have seen, Garbage collector do a substantial amount of work to avoid programmers to think about memory management. However, this work requires to find compromise between automatic memory management and code performance.

The garbage collector collects explicit references to python object and delete object with no references. It is simple enough to be implemented efficiently but cannot detect cycle references. Those cycle references will never be free. It can happen in many way, lets look at some examples.

with memory_usage():
    L = [ [] for _ in range(1000000) ]
    for K in L:
        K.append(K) # self reference!

We have a list of 1.000.000 list self referring to them self. If we delete the list L, nothing is freed.

with memory_usage():
    del L

The memory used by those list is gone and will never be freed except at the termination of the whole script.

If such an example is unlikely to happen in real code, it can appears without noticing it. Let say we want to implement a very simple double linked list with only a append operation (don’t do that, it is a really inefficient implementation)

class DoubleLinkList:
    def __init__(self, v):
        self.pred = None
        self.succ = None
        value = v
        self.useless_memory = list(range(10000))

    def append(self, v):
        succ = self
        while succ.succ:
            succ = succ.succ
        succ.succ = DoubleLinkList(v)
        succ.succ.pred = succ
    

with memory_usage():
    D = DoubleLinkList(0)
    for i in range(1, 1000):
        D.append(i)

The memory used by DoubleLinkList is doomed. Even if explicitly free it.

with memory_usage():
    del D

Here, I used some useless field to make the object consume more memory. Tracking memory like this is rather crude. We can’t really see small memory leakage. Here come the gc module. It can explicitly collect all the object within the memory and reference to them.

For instance, I can check how many list are in my current memory.

import gc
print(len(list(filter(lambda e:type(e) is list, gc.get_objects()))))
L = [] # Lets add one more list
print("One more", len(list(filter(lambda e:type(e) is list, gc.get_objects()))))
del L # And delete it
print("It is deleted", len(list(filter(lambda e:type(e) is list, gc.get_objects()))))

Lets package a bit gc for convenience:

def typeCount(some_type):
    return len(list(filter(lambda e:type(e) is some_type, gc.get_objects())))

A stupid class

Lets say we create a really stupid class. The only thing every instance does is to point to itself.

class IAmStupid:
    def __init__(self):
        self.self = self

L = [ IAmStupid() for _ in range(1000) ]

print(f"{typeCount(IAmStupidd)} Stupid Class")
del L
print("They can't be remove, so they are still:{typeCount(IAmStupid)}")

A more surprising example

So far we have seen how circular dependencies could be harmful to automatic management system. Those circular dependencies we have seen so far where obvious.

It can happen surprisingly.

class Closure:
    def __init__(self, v):
        self.v = v
        self.func = lambda e: self.v * e

This class show a pattern that can actually be use in real life context. We define a method on the go at object initialization. It is not a good practice but can be convenient.

As you probably already guess, this class act like the stupid one above.

C = Closure(0)

print(f"{typeCount(Closure)} Closured Class")
del C
print(f"They can't be remove, so they are still: {typeCount(Closure)}")

Here the catch came from the lambda closure itself. In its scope it encompass self and thus point toward it. Many more tricky example could exists. I would be happy to add them, so don’t hesitate to mail me.

Some investigation of the closure itself

The catch here is to understand what a lambda closure mean. When you define a function with a lambda, it can access to the whole environment. When you call the function, the environment could have change and change the output of the lambda function.

When the scope of the lambda definition is global, it respect lazyness of Python and as expected, it follows variables name and not their value.

For instance:

x = y = 1 
f = lambda: x*y
x = 0
print(f())

Within the scope of a function, the environment is no longer global and make the analysis a bit more tricky. For instance,

def some_func():
    x = 0
    y = 1
    f = lambda: x * y
    return f

other_func = some_func()
print(other_func())
x = 1
print(other_func())

What this code show is that the lambda function has somehow to remember its context. It can be further explore through two methods of the function __closure__ and __code__.

print(other_func.__code__.co_freevars, *other_func.__closure__)

If we go back to the Closure class, we can see explicitly the reference within the closure.

C = Closure(0)
print(C.func.__closure__, C.func.__code__.co_freevars)
del C

Automatic cleanup

I have said before that cyclic reference is wasting memory. It is actually not exactly the case. The garbage collector will eventually collect the memory, in particular if the main thread is idle.

It is possible to trigger it explicitly with the method gc.collect.

print(f"Instance of Closure {typeCount(Closure)}")
D = Closure(1)
print(f"Instance of Closure {typeCount(Closure)}")
del D
print(f"Instance of Closure {typeCount(Closure)}")
gc.collect()
print(f"Instance of Closure {typeCount(Closure)}")