9  Threads, Processes, and the Notorious GIL

Your computer’s processor most likely has 4-16 cores.

Each core can execute one instruction at a time, so (to a first approximation) the number of cores is how many things your machine can be doing truly simultaneously.

In practice, however, your computer is doing hundreds of things at once: background processes managing and monitoring your system’s hardware, a few dozen browser tabs rendering pages and executing JavaScript, Excel, VS Code, and so on.

This is done by rapidly switching execution between different threads.

btop demo

The operating system is responsible for providing two key interfaces:

Processes

A process is an OS construct that has its own memory space, file handles, and one or more threads.

Most programs are a single process, but some programs spawn subprocesses. For example, each browser tab has its own process.
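For example, one way a Python program can spawn a child process is the standard subprocess module. A minimal sketch, launching a second Python interpreter with its own memory space:

```python
import subprocess
import sys

# Launch a child Python process; it gets its own memory space
# and its own interpreter, fully isolated from this one.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True, text=True,
)
print(result.stdout.strip())  # hello from a child process
```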

Threads

A thread is a sequence of instructions. Many programs are a single thread, but it is possible for a program to spawn additional threads.

Threads share memory and come with lower overhead than processes.

                     Threads                 Processes
Memory               Shared                  Separate
Overhead             Low                     Higher
Crash isolation      No                      Yes
Communication        Easy (shared memory)    Harder (IPC)

What about coroutines?

The synchronous coroutines we saw cooperatively multi-task: a routine suspends itself at explicit points (with yield/send) so another can execute. Threads instead preempt one another, with the interpreter switching roughly every sys.getswitchinterval() seconds:

import sys
sys.getswitchinterval()
0.005
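For contrast, cooperative multitasking can be sketched with plain generators: each task runs until it explicitly yields, and a simple round-robin scheduler decides who runs next. This is a minimal illustration, not the full coroutine machinery from earlier:

```python
def count_to(name, n, log):
    for i in range(n):
        log.append((name, i))
        yield  # explicitly hand control back to the scheduler

# Round-robin scheduler: advance each task one step at a time.
log = []
tasks = [count_to("a", 2, log), count_to("b", 2, log)]
while tasks:
    for t in list(tasks):
        try:
            next(t)
        except StopIteration:
            tasks.remove(t)

print(log)  # [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```

Nothing ever interrupts a task here; control only changes hands at a yield. That is the key difference from threads, which can be preempted between almost any two bytecode instructions.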

threading

In theory, multiple threads can handle different parts of a task in parallel. The threading module provides a simple interface for writing code that runs on separate threads.

import time

def timed(f):
    def inner(*args, **kwargs):
        start = time.time()
        f(*args, **kwargs)
        print(f"took {time.time() - start:.1f}")
    return inner
import threading

def sum_up_to(n):
    return sum(range(n))

def worker(n):
    print(sum_up_to(n))


data = [90_000_000, 30_000_000, 22_000_000, 130_000_000]

@timed
def run_single_threaded():
    for x in data:
        worker(x)

@timed
def run_multi_threaded():
    threads = []
    for x in data:
        t = threading.Thread(target=worker, args=(x,))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()
run_single_threaded()
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5

So this should be ~4x faster…

run_multi_threaded()
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5

Wait… what?

The GIL

Since threads share memory, it is possible for two threads to modify the same object simultaneously.

shared_list = []

t1 = threading.Thread(target=lambda: shared_list.append("A"))
t2 = threading.Thread(target=lambda: shared_list.append("B"))
t1.start()
t2.start()
t1.join()
t2.join()

What if both threads try to modify shared_list at the same time?

The Global Interpreter Lock (GIL) ensures that only one thread at a time can execute Python bytecode.

This is why the threaded example above was no faster than the sequential one.

CPU-bound code (code that spends its time executing Python instructions) therefore gains nothing from threading, and may even get slower. Only one thread can run at a time, so the four functions still effectively execute in sequence, and you pay the thread-management overhead on top.
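Note that the GIL only serializes individual bytecode instructions; a compound update like counter += 1 spans several, so a thread can be preempted mid-update and lose a write. Correct shared state still needs an explicit lock. A minimal sketch:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:  # without the lock, counter += 1 can lose updates
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000, deterministic thanks to the lock
```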

So why have threads at all?

Python C extensions are allowed to release the GIL, and typically do so while blocking on I/O:

  • socket operations for low-level networking
  • file reads/writes
  • urllib

Demo: I/O-bound

import threading
import urllib.request

urls = [
    "http://example.com",
    "http://example.com",
    "http://example.com",
    "http://example.com",
]

def fetch(url):
    with urllib.request.urlopen(url) as r:
        print(f"{url}: {len(r.read())} bytes")

@timed
def run_single_threaded_io():
    for url in urls:
        fetch(url)

@timed
def run_multi_threaded_io():
    threads = []
    for url in urls:
        t = threading.Thread(target=fetch, args=(url,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
run_single_threaded_io()
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
took 0.2
run_multi_threaded_io()
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
took 0.1

Multiprocessing

Python also offers multiple processes, via the multiprocessing module, with an interface that mirrors threading.

import multiprocessing

def sum_up_to(n):
    return sum(range(n))

def worker(n):
    print(sum_up_to(n))

data = [90_000_000, 30_000_000, 22_000_000, 130_000_000]

@timed
def run_single_process():
    for x in data:
        worker(x)

@timed
def run_multi_process():
    processes = []
    for x in data:
        p = multiprocessing.Process(target=worker, args=(x,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
run_single_process()
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
run_multi_process()
241999989000000
449999985000000
4049999955000000
8449999935000000
took 1.3

Downsides of multiprocessing

  • Each process takes time to start and runs its own independent Python interpreter. The overhead can exceed the speedup for short tasks.
  • No shared state: processes do not share memory, so you cannot pass a list/dict/object between them directly. Data must instead be serialized (typically with pickle) and deserialized, which adds overhead and limits support for custom data types.
  • Harder to debug: exceptions in child processes can be swallowed or mangled, and stack traces are harder to follow.
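To illustrate the serialization constraint: multiprocessing sends targets and arguments through pickle, so anything pickle rejects, such as a lambda, cannot cross a process boundary. A quick check:

```python
import pickle

# multiprocessing serializes the target function and its arguments with
# pickle; objects pickle cannot handle (like lambdas) cannot be sent
# to a child process.
def can_pickle(obj):
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(can_pickle([1, 2, 3]))        # plain data pickles fine
print(can_pickle(lambda x: x + 1))  # lambdas do not
```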

concurrent.futures

A backend-agnostic interface for submitting jobs to a pool of threads or processes.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def sum_up_to(n):
    return sum(range(n))

def worker(n):
    print(sum_up_to(n))

data = [90_000_000, 30_000_000, 22_000_000, 130_000_000]

@timed
def run_single_threaded():
    for x in data:
        worker(x)

@timed
def run_concurrent_futures(ExecutorCls):
    with ExecutorCls() as executor:
        executor.map(worker, data)
run_single_threaded()
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
run_concurrent_futures(ProcessPoolExecutor)
241999989000000
449999985000000
4049999955000000
8449999935000000
took 1.5
run_concurrent_futures(ThreadPoolExecutor)
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
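The workers above print their results because Thread and Process discard return values. Executors do not: submit returns a Future that carries the return value (or exception) back to the caller. A sketch using ThreadPoolExecutor and as_completed:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def sum_up_to(n):
    return sum(range(n))

data = [10, 100, 1000]

with ThreadPoolExecutor() as executor:
    # Map each Future back to the input it was computed from.
    futures = {executor.submit(sum_up_to, n): n for n in data}
    # as_completed yields futures in completion order, not submission order.
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(results)  # {10: 45, 100: 4950, 1000: 499500} (key order may vary)
```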

Killing the GIL

There have been many attempts to remove the GIL, but doing so requires adding fine-grained locks to objects throughout the interpreter, which has historically slowed down single-threaded code.

Only recently, in Python 3.13, has it become possible to run CPython with the GIL disabled. We’ll take a closer look at this in a few weeks, but it is still far from the default and not compatible with many C extensions.
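You can check whether your interpreter supports this: CPython 3.13+ free-threaded builds expose sys._is_gil_enabled(), while older versions lack the attribute entirely. A quick probe:

```python
import sys

# Free-threaded CPython 3.13+ builds expose sys._is_gil_enabled();
# on earlier versions the attribute does not exist at all.
gil_can_be_disabled = hasattr(sys, "_is_gil_enabled")
if gil_can_be_disabled:
    print("GIL enabled:", sys._is_gil_enabled())
else:
    print("GIL cannot be disabled on this interpreter")
```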

Additional Context