Your computer’s processor most likely has 4-16 cores.
Each core can execute one instruction at a time, so (to a first approximation) the number of cores is how many things your machine can be doing truly simultaneously.
In practice, however, your computer is doing hundreds of things at once: background processes managing and monitoring your system’s hardware, a few dozen browser tabs rendering pages and executing JavaScript, MS Excel, VS Code, and so on.
This is done by rapidly switching execution between different threads.
btop demo
The operating system is responsible for providing two key interfaces:
Processes
A process is an OS construct that has its own memory space, file handles, and one or more threads.
Most programs are a single process, but some programs spawn subprocesses. For example, each browser tab has its own process.
Threads
A thread is a sequence of instructions. Many programs are a single thread, but it is possible for a program to spawn additional threads.
Threads share memory and come with lower overhead than processes.
| | Threads | Processes |
|---|---|---|
| Memory | Shared | Separate |
| Overhead | Low | Higher |
| Crash isolation | No | Yes |
| Communication | Easy (shared memory) | Harder (IPC) |
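The shared-memory row is easy to see in code. A quick sketch (the `results` dict and `square` function here are purely illustrative): every thread writes into one dict owned by the parent.

```python
import threading

results = {}

def square(n):
    # All threads write into the same dict: threads share memory.
    results[n] = n * n

threads = [threading.Thread(target=square, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

With processes, each child would get its own copy of `results` and the parent's dict would stay empty.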
The synchronous coroutines we saw cooperatively multi-task: a routine suspends itself (with yield/send) at explicit points so another can execute. Threads instead preempt one another: the interpreter forces a switch every sys.getswitchinterval() seconds.

```python
import sys
sys.getswitchinterval()
```
```
0.005
```
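The switch interval is tunable with sys.setswitchinterval; a small sketch (0.01 below is an arbitrary illustrative value):

```python
import sys

print(sys.getswitchinterval())  # 0.005 by default

# Raise the interval: fewer forced switches, at the cost of higher latency
# for other threads waiting to run.
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())
```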
threading

In theory, we can have multiple threads handle different parts of a task so they run in parallel. The threading module provides an easy interface for writing code that should run on different threads.
```python
import time

def timed(f):
    def inner(*args, **kwargs):
        start = time.time()
        f(*args, **kwargs)
        print(f"took {time.time() - start:.1f}")
    return inner
```
```python
import threading
```
```python
def sum_up_to(n):
    return sum(range(n))

def worker(n):
    print(sum_up_to(n))

data = [90_000_000, 30_000_000, 22_000_000, 130_000_000]

@timed
def run_single_threaded():
    for x in data:
        worker(x)

@timed
def run_multi_threaded():
    threads = []
    for x in data:
        t = threading.Thread(target=worker, args=(x,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
```
```python
run_single_threaded()
```
```
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
```
So this should be ~4x faster…
```python
run_multi_threaded()
```
```
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
```
Wait… what?
Since threads have shared memory, it is possible for two threads to simultaneously modify shared memory.
```python
shared_list = []

t1 = threading.Thread(target=lambda: shared_list.append("A"))
t2 = threading.Thread(target=lambda: shared_list.append("B"))

t1.start()
t2.start()
t1.join()
t2.join()
```
What if both threads try to modify shared_list at the same time?
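When two threads really can interleave on shared state, the standard fix is a threading.Lock. A minimal sketch with an illustrative shared counter (the counter, loop counts, and thread count are made up for the example):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, the read-modify-write below could interleave
        # with another thread's, losing updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000
```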
The global interpreter lock (GIL) ensures that only one thread at a time can execute Python bytecode.
This is why the threaded example above was no faster.
CPU-bound code (code that spends its time executing Python instructions) therefore does not benefit from threads. Instead of running in parallel, only one thread can execute at a time, so the four functions still run in sequence, and you additionally pay the overhead of creating and switching between threads.
So why have threads at all?
Python C extensions are allowed to release the GIL, and they typically do so while blocking, i.e. waiting on I/O.
urllib

```python
import threading
import urllib.request

urls = [
    "http://example.com",
    "http://example.com",
    "http://example.com",
    "http://example.com",
]

def fetch(url):
    with urllib.request.urlopen(url) as r:
        print(f"{url}: {len(r.read())} bytes")
```
```python
@timed
def run_single_threaded_io():
    for url in urls:
        fetch(url)

@timed
def run_multi_threaded_io():
    threads = []
    for url in urls:
        t = threading.Thread(target=fetch, args=(url,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
```
```python
run_single_threaded_io()
```
```
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
took 0.2
```
```python
run_multi_threaded_io()
```
```
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
http://example.com: 528 bytes
took 0.1
```
Python also offers subprocesses, via the multiprocessing module, with nearly the same interface.
```python
import multiprocessing

def sum_up_to(n):
    return sum(range(n))

def worker(n):
    print(sum_up_to(n))

data = [90_000_000, 30_000_000, 22_000_000, 130_000_000]

@timed
def run_single_process():
    for x in data:
        worker(x)

@timed
def run_multi_process():
    processes = []
    for x in data:
        p = multiprocessing.Process(target=worker, args=(x,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
```
```python
run_single_process()
```
```
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
```
```python
run_multi_process()
```
```
241999989000000
449999985000000
4049999955000000
8449999935000000
took 1.3
```
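One caveat worth illustrating: because each process has its own memory, arguments are shipped to the child by pickling them, so unpicklable callables such as lambdas cannot be used as process targets or arguments. A small sketch (the `err_name` variable is just for inspection):

```python
import pickle

# A lambda has no importable, module-level name, so pickle (which
# serializes functions by reference to their qualified name) rejects it.
try:
    pickle.dumps(lambda x: x + 1)
    err_name = None
except Exception as e:
    err_name = type(e).__name__

print(err_name)
```

This is why multiprocessing targets are conventionally top-level functions.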
Because processes do not share memory, multiprocessing must serialize each argument (with pickle) and then deserialize it in the child process. This means additional overhead and limited support for custom data types.

concurrent.futures

A backend-agnostic interface for sending jobs to multiple threads/processes.
```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def sum_up_to(n):
    return sum(range(n))

def worker(n):
    print(sum_up_to(n))

data = [90_000_000, 30_000_000, 22_000_000, 130_000_000]

@timed
def run_single_threaded():
    for x in data:
        worker(x)

@timed
def run_concurrent_futures(ExecutorCls):
    with ExecutorCls() as executor:
        executor.map(worker, data)
```
```python
run_single_threaded()
```
```
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
```
```python
run_concurrent_futures(ProcessPoolExecutor)
```
```
241999989000000
449999985000000
4049999955000000
8449999935000000
took 1.5
```
```python
run_concurrent_futures(ThreadPoolExecutor)
```
```
4049999955000000
449999985000000
241999989000000
8449999935000000
took 2.5
```
There have been many attempts to remove the GIL, but removing it requires adding locks to each object in the Python interpreter.
Only recently, in Python 3.13, has it become possible to run CPython with the GIL disabled. We’ll take a closer look at this in a few weeks, but it is still far from the default and not compatible with many C extensions.
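One way to check which build you are running (Py_GIL_DISABLED is the build-config flag for free-threaded CPython; sys._is_gil_enabled only exists on 3.13+, hence the hasattr guard):

```python
import sys
import sysconfig

# 1 on free-threaded ("nogil") builds, 0 or None on ordinary builds.
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# On 3.13+ this reports whether the GIL is actually active at runtime
# (it can be re-enabled even on a free-threaded build).
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())
```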