15  Python Without the GIL

The GIL is a major performance bottleneck for multi-threaded code.

We have seen that it means threaded CPU-bound code is no faster than running the same code sequentially.

The biggest change to Python since the Python 3 transition (arguably bigger) is the ongoing transition to free-threaded Python. This transition began in Python 3.13 with an experimental build of Python known as 3.13t.

https://docs.python.org/3/howto/free-threading-python.html

This work is under active development and being rolled out gradually due to the disruptive nature of the change and the vast amount of Python code that depends on the behavior of the GIL.

The Steering Council accepts PEP 703, but with clear proviso: that the rollout be gradual and break as little as possible, and that we can roll back any changes that turn out to be too disruptive – which includes potentially rolling back all of PEP 703 entirely if necessary (however unlikely or undesirable we expect that to be).

-PEP 703 – Making the Global Interpreter Lock Optional in CPython

Motivations

Python is not well-suited to the kind of parallelism that is useful to many modern applications (neural networks, highly parallelized big data algorithms). These algorithms often run in C/C++ with 64-128 threads, but hit GIL bottlenecks at single-digit thread counts. Rewriting this complex code is not desirable since many of the researchers are much more comfortable with Python, and the rest of their ecosystem is already using it.

CPython libraries can release the GIL, but must do so carefully and only when they can offer certain guarantees. This means that NumPy and everything built on it is mostly single-threaded.

Another quote from PEP 703 is illustrative:

In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.

What it takes to remove the GIL

The GIL is heavily involved in many critical parts of Python’s internals:

Reference Counting/Memory Management

As we’ve seen, every object in Python has a count of how many references to it exist. If two Python threads access the same object, these counts can fall victim to a race condition: both threads read the same value, each writes back its own update, and one of the updates is lost.
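The count itself can be observed from Python with sys.getrefcount. The following is a minimal sketch; the absolute numbers vary by interpreter version and build, so only the differences are meaningful:

```python
import sys

obj = object()
before = sys.getrefcount(obj)

holder = [obj]                  # the list now holds one more reference to obj
after_add = sys.getrefcount(obj)

holder.clear()                  # dropping that reference lowers the count again
after_drop = sys.getrefcount(obj)

print(after_add - before)       # 1
print(after_drop - before)      # 0
```

Every one of these increments and decrements is a read-modify-write on shared memory, which is why the count needs protection once several threads can run at the same time.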

The solution is a technique first described in 2018 known as Biased Reference Counting. Each object is associated with an owning thread. That thread uses an optimized (non-atomic) count, while other threads are forced to synchronize their updates through a separate shared count. Since most objects are only used on the thread that created them, this keeps the common case fast.
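The idea can be modelled in pure Python. This is only a conceptual sketch, not CPython's actual object layout: the class name BiasedRefCount and its fields are invented for illustration, and a threading.Lock stands in for the atomic instructions the real implementation would use.

```python
import threading


class BiasedRefCount:
    """Toy model of a biased reference count (hypothetical, not CPython's layout)."""

    def __init__(self):
        self.owner = threading.get_ident()    # thread that created the object
        self.local = 1                        # updated only by the owner, no lock
        self.shared = 0                       # updated by every other thread
        self._shared_lock = threading.Lock()  # stands in for atomic operations

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                   # fast path: plain, unsynchronized add
        else:
            with self._shared_lock:           # slow path: synchronized add
                self.shared += 1

    def total(self):
        return self.local + self.shared


rc = BiasedRefCount()          # created on the main thread, so main is the owner
for _ in range(3):
    rc.incref()                # owner takes the fast path


def borrow():
    for _ in range(1000):
        rc.incref()            # non-owner threads take the slow path


workers = [threading.Thread(target=borrow) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(rc.local, rc.shared, rc.total())   # 4 4000 4004
```

The owner never pays for synchronization, and the true count is the sum of the two fields.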

In addition to this new two-tier reference count system, all objects have a 1-byte mutex: a per-object lock instead of a per-interpreter lock.

Certain objects are now made immortal, something that had already begun in PEP 683. These objects (short strings, small integers, True, False, None) were already immutable and now no longer have a true reference count, freeing them from the GIL entirely.

Garbage Collection

Python’s garbage collector detects cycles: object A refers to B and B refers to A but nothing references A or B outside this loop. These cycles should be freed to avoid memory leaks.
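A small sketch of such a cycle, using the standard gc and weakref modules. The Node class is made up for the example, and automatic collection is disabled only so that the explicit gc.collect() call is what frees the cycle:

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


gc.disable()                   # keep the automatic collector out of the demo

a, b = Node(), Node()
a.other, b.other = b, a        # A refers to B and B refers to A
probe = weakref.ref(a)         # lets us observe whether A is still alive

del a, b                       # no outside references remain, only the cycle
alive_before = probe() is not None

gc.collect()                   # run the cycle detector explicitly
alive_after = probe() is not None

gc.enable()
print(alive_before, alive_after)   # True False
```

Reference counting alone never frees the pair, because each object keeps the other's count above zero.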

This poses a challenge if a reference count changes while the cycle-detection algorithm is running. The fix required introducing thread states: each thread is in one of the ATTACHED, DETACHED, or GC states. ATTACHED and DETACHED are similar to a thread “having the GIL” or not, except that multiple threads can now be ATTACHED at once. During garbage collection all other threads must enter the GC state, in which they are paused and cannot access Python objects; they resume once the collection is complete.

The other change here is a shift from a generational garbage collector to non-generational.

GIL Python’s current generational GC divides objects into three generations: young, medium, and old. Each generation is collected at a different interval. Young objects are checked frequently because many objects are short-lived (loop variables, parameters, etc.); older objects are checked less frequently because long-lived objects tend to persist.

No-GIL Python adopts a non-generational GC as part of this overhaul. This simplification was needed to make the new design work, but the collector is likely to keep improving before no-GIL becomes the default.

Late-breaking update: 3.14 tried an incremental GC, but this is being rolled back in 3.15 and will require a proper PEP process in the future.

Collection Objects

Another place where the GIL is important is in interactions with container types. list and dict, for instance, store metadata about their current size, capacity, and other variables. Under the GIL only one thread can modify these objects at a time, so there is no risk of the internal item count becoming incorrect as one thread appends and another pops.

Some languages, like Java, have separate thread-safe and non-thread-safe containers (HashMap vs. ConcurrentHashMap). To keep Python working as it always has, all collection types instead gain a per-object lock that the operating thread acquires for potentially dangerous operations.
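A short sketch of what this guarantee means in practice: several threads append to one shared list and no append is lost, whether the protection comes from the GIL or from the list's own lock. The names shared, PER_THREAD and WORKERS are just for this example.

```python
import threading

shared = []
PER_THREAD = 10_000
WORKERS = 4


def fill():
    for i in range(PER_THREAD):
        shared.append(i)       # a single append is safe on both kinds of build


threads = [threading.Thread(target=fill) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared) == PER_THREAD * WORKERS)   # True
```

Only the individual operation is protected. The order in which items from different threads land in the list is still unpredictable.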

Avoiding Deadlocks

One of the challenges earlier attempts faced was that these new automatic-locks on Python objects can result in deadlocks. A deadlock occurs when thread 1 requires objects A and B and attempts to acquire locks on both, but thread 2 at the same time tries to lock B and A. Thread 1 gets A, thread 2 gets B, and both freeze forever waiting for the second object.

import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread1():
    with lock_a:
        time.sleep(0.05)   # let thread2 grab lock_b first
        with lock_b:       # deadlock: waiting for lock_b
            print("thread1 done")

def thread2():
    with lock_b:
        time.sleep(0.05)
        with lock_a:       # deadlock: waiting for lock_a
            print("thread2 done")

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start()
t2.start()
t1.join()  # hangs forever
t2.join()  # hangs forever

The fix is to ensure locks are acquired in the same order.
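A sketch of the same two threads with a consistent lock order. Both now take lock_a before lock_b, so neither can end up holding one lock while waiting for the other; the finished list is only there to show that both complete.

```python
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()
finished = []


def worker(name):
    with lock_a:               # every thread takes lock_a first...
        time.sleep(0.05)
        with lock_b:           # ...and lock_b second
            finished.append(name)


t1 = threading.Thread(target=worker, args=("thread1",))
t2 = threading.Thread(target=worker, args=("thread2",))
t1.start()
t2.start()
t1.join()                      # returns: no deadlock
t2.join()

print(sorted(finished))        # ['thread1', 'thread2']
```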

An extension to the C API lets code declare critical sections, in which one or two per-object locks can be held. The locks are acquired in a predictable order, and any other critical sections the thread has open are released in the meantime, so two threads cannot end up waiting on each other.

mimalloc

mimalloc is a replacement for Python’s venerable pymalloc memory allocator. It is thread-safe (each thread has a separate heap) but uses more memory, and it is a factor in both the speed and memory-usage regressions currently holding back adoption. This is another area for future improvement.

Downsides of Removing the GIL

C Extensions/new ABI

Right now, on a free-threaded build, importing an extension that was not built for free-threading re-enables the GIL. This means one must be very careful about which packages are imported in an FT build:

import numpy   # numpy 1.x requires GIL (2.x does not)
import sys
print(sys._is_gil_enabled())   # True: numpy pulled it back

To avoid this, the extension itself must opt in, in C, during module initialization:

// requires a change in the C extension's init function
PyUnstable_Module_SetGIL(module, Py_MOD_GIL_NOT_USED);
// as well as actually using the new C API

Extension libraries are built against what is known as the ABI (the application binary interface between Python and C code). The changes to the GIL mean that C libraries must be compiled separately for GIL and no-GIL mode. This effectively doubles the number of platforms C extension authors must test and build on.

Before: macOS Intel, macOS ARM, Windows Intel, Windows ARM, Linux Intel, Linux ARM, etc.

Now: macOS Intel GIL, macOS Intel FT, macOS ARM GIL, macOS ARM FT, Windows Intel GIL, Windows Intel FT, Windows ARM GIL, Windows ARM FT, Linux Intel GIL, Linux Intel FT, Linux ARM GIL, Linux ARM FT, etc.

Performance overhead for single-threaded Python

Most Python applications are single-threaded. This is perhaps a bit of a chicken-and-egg problem, but the truth is that most applications work fine with a single thread. One of the challenges faced by any GIL-removal plan is that it will almost certainly slow down single-threaded code.

The slowdown here is measured at 5-10%, which was deemed acceptable given the other major performance improvements to the language since 3.10 and those still in progress (such as the ongoing JIT work).

Installing a no-GIL Python

uv python install 3.13t
uv run python -c "import sys; print(sys.version)"

You should see a version string of the form 3.13.x experimental free-threading build.

import sys
print(sys._is_gil_enabled())   # False on 3.13t by default... mostly

In 3.13, the GIL can re-enable itself if a C extension that wasn’t compiled for free-threading is imported:

# Force it off even if a legacy extension tries to re-enable
PYTHON_GIL=0 python myscript.py

Check in code:

import sys
if sys._is_gil_enabled():
    print("GIL is on — probably a legacy extension triggered it")
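The check above fails with an AttributeError on interpreters older than 3.13, where sys._is_gil_enabled does not exist. A slightly more defensive sketch follows; the helper name gil_status is made up here, and it relies on the Py_GIL_DISABLED build variable exposed through sysconfig:

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled_now) for the running interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None elsewhere.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on 3.13+; older versions always have a GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = checker() if checker is not None else True
    return free_threaded_build, gil_enabled_now


build, enabled = gil_status()
print(f"free-threaded build: {build}, GIL currently enabled: {enabled}")
```

A free-threaded build that reports the GIL as enabled is the case described above: something re-enabled it at runtime.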

Demo: CPU-Bound Threading Before and After

Requires a free-threaded build. Run the script with both a standard build and a free-threaded build (for example python3.14 and python3.14t) to compare.


import threading
import time


def cpu_work(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


N = 10_000_000
THREADS = 8


def run_threaded():
    threads = [threading.Thread(target=cpu_work, args=(N,)) for _ in range(THREADS)]
    t = time.perf_counter()
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - t


def run_sequential():
    t = time.perf_counter()
    for _ in range(THREADS):
        cpu_work(N)
    return time.perf_counter() - t


print(f"Sequential: {run_sequential():.3f}s")
print(f"Threaded:   {run_threaded():.3f}s")

python3.14 thread-demo.py

On Python 3.14: threaded ≈ sequential

python3.14t thread-demo.py

On Python 3.14t: threaded is ~4x faster than sequential (the exact factor depends on how many cores are available)
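The demo throws away each thread's return value. When the results are needed, one option is concurrent.futures.ThreadPoolExecutor from the standard library. This is a sketch with a smaller workload so it finishes quickly; run_pool is a name invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def cpu_work(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_pool(n: int, workers: int) -> list[int]:
    # map() hands one call to each worker and returns results in input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cpu_work, [n] * workers))


results = run_pool(100_000, 4)
print(len(results), len(set(results)))   # 4 1
```

The code is identical on both kinds of build. Only on a free-threaded build can the four calls actually run in parallel.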

What’s Still Not Safe

Free-threaded doesn’t mean race-condition-free. Python’s objects have per-object locks but your logic doesn’t:


import threading

counter = 0


def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # read-modify-write, not atomic


threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)

Fixed with a Lock:


import threading

counter = 0
lock = threading.Lock()


def increment():
    global counter
    for _ in range(100_000):
        with lock:
            counter += 1


threads = [threading.Thread(target=increment) for _ in range(4)]

for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)

Bigger picture:

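  • iterators cannot be safely shared across threads.
  • calling next() on the same generator from multiple threads is not safe.
  • check-then-act and read-modify-write sequences (such as +=) are not atomic.

You need to be careful with shared Python objects: C-level integrity is guaranteed, nothing more.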
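A sketch of the check-then-act case: a "get or create" helper on a shared dict. Each dict operation is safe on its own, but the gap between the membership test and the assignment is not, so the whole sequence goes under one lock. The names cache, get_or_create and factory_calls are made up for the example.

```python
import threading

cache = {}
cache_lock = threading.Lock()
factory_calls = []             # records how often the "expensive" creation ran


def get_or_create(key):
    # Without the lock, two threads could both see the key as missing
    # and both create a value, with one silently overwriting the other.
    with cache_lock:
        if key not in cache:               # check
            factory_calls.append(key)
            cache[key] = object()          # act
        return cache[key]


threads = [threading.Thread(target=get_or_create, args=("config",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(factory_calls))      # 1
```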

The Future of no-GIL Python

Free-threaded Python will remain opt-in, shipped as separate builds, for another year or two (at least).

This gives major packages time to catch up (in progress), followed by a migration path for the rest of the ecosystem (potentially long and painful?).

The plan is still for an eventual reunification: perhaps 3.17 or 3.18 will be GIL-free by default.