Python Under the Hood
I avoided looking at CPython's internals for an embarrassingly long time. Every time my training script crawled, I'd throw multiprocessing at it and hope for the best. Every time someone mentioned the GIL in an interview, I'd recite the one-liner I'd memorized — "only one thread runs Python at a time" — and pray they wouldn't ask a follow-up. Eventually the discomfort of not knowing what actually happens when Python runs my code grew too great to ignore. Here is that dive.
CPython — the interpreter you're almost certainly using — is a C program that compiles your Python source into bytecode and then executes that bytecode on a virtual machine. Along the way it manages memory through reference counting, coordinates threads with a global lock, and provides an event loop for concurrent I/O. These aren't obscure implementation details. They're the reason your code is fast or slow, leaks memory or doesn't, scales or chokes.
Before we start, a heads-up. We're going to trace through bytecode, poke at memory allocators, and build a toy event loop from scratch. You don't need to know any of that beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
What happens when you run python script.py
Everything is a PyObject
Reference counting — the first line of defense
The garbage collector — catching what refcounting misses
CPython's secret weapon: the memory allocator
Rest stop
The GIL — one chef, one kitchen
Where the GIL doesn't matter (and where it bites)
Threading, multiprocessing, and picking the right tool
Asyncio — cooperative multitasking from scratch
The future: free-threaded Python
Wrap-up
What Happens When You Run python script.py
Let's say we have a tiny script. Three lines. We want to build a log ingestion pipeline — something that reads log files, counts errors, and reports. Our first version is absurdly small on purpose:
logs = ["ERROR: disk full", "INFO: started", "ERROR: timeout"]
errors = [line for line in logs if line.startswith("ERROR")]
print(len(errors))
When you hit enter, CPython doesn't run this text directly. It goes through a pipeline. First, a parser reads the source and builds an Abstract Syntax Tree — a structured representation of what your code means. Then a compiler walks that tree and produces bytecode — a sequence of low-level instructions for CPython's virtual machine. Finally, the VM executes those instructions one by one in a giant loop written in C (the file is called ceval.c, and it's the beating heart of the interpreter).
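You can drive the first two stages by hand from Python itself, using the built-in ast module and compile(). A minimal sketch (the "<pipeline>" filename is just a label for tracebacks):
import ast
source = '''
logs = ["ERROR: disk full", "INFO: started", "ERROR: timeout"]
errors = [line for line in logs if line.startswith("ERROR")]
print(len(errors))
'''
tree = ast.parse(source)                      # stage 1: source text -> AST
print(ast.dump(tree.body[0]))                 # the first assignment, as a tree of nodes
code = compile(tree, "<pipeline>", "exec")    # stage 2: AST -> code object (bytecode)
print(code.co_consts)                         # constants the VM will load by index
exec(code)                                    # stage 3: the VM runs the bytecode, prints 2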
We can actually see the bytecode. Python ships with a module called dis — short for disassemble — that shows you exactly what instructions the VM will execute:
import dis
def count_errors(logs):
    errors = [line for line in logs if line.startswith("ERROR")]
    return len(errors)
dis.dis(count_errors)
# 2 0 LOAD_CONST 1 (<code object <listcomp>>)
# 2 MAKE_FUNCTION 0
# 4 LOAD_FAST 0 (logs)
# 6 GET_ITER
# 8 CALL_FUNCTION 1
# 10 STORE_FAST 1 (errors)
# 3 12 LOAD_GLOBAL 0 (len)
# 14 LOAD_FAST 1 (errors)
# 16 CALL_FUNCTION 1
# 18 RETURN_VALUE
Each line is one bytecode instruction. LOAD_FAST pushes a local variable onto a stack. CALL_FUNCTION pops arguments off the stack and calls whatever callable is sitting on top. STORE_FAST takes the result and binds it to a local name. (The exact opcodes depend on your Python version — 3.11 renamed CALL_FUNCTION to CALL, and 3.12 inlines list comprehensions entirely — but the shape is the same.) The VM doesn't know about your variable names at execution time — it works with indices into arrays of names and constants, stored in a code object.
That code object is what gets saved to a .pyc file in __pycache__/. The next time you import the same module, Python skips parsing and compiling entirely and loads the cached bytecode directly. That's why the first import is slower than subsequent ones.
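If you're curious where that cache lives for a given source file, the standard library will tell you. A quick check, assuming a hypothetical module file named pipeline.py:
import importlib.util
print(importlib.util.cache_from_source("pipeline.py"))
# e.g. '__pycache__/pipeline.cpython-312.pyc' (the tag encodes the interpreter version)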
Here's why this matters for performance: every iteration of a Python for loop is several bytecode instructions — load the iterator, call __next__, check for StopIteration, branch back. When you call np.sum(array), that's one bytecode instruction (CALL_FUNCTION) that drops into C code and rips through millions of elements without touching the bytecode loop at all. The gap between "Python loop over a million items" and "one C call over a million items" isn't 2x. It's 100x. Now you know why.
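If you'd rather measure that gap than take my word for it, here's a rough benchmark sketch. Exact numbers vary by machine and Python version; the order of magnitude is the point:
import timeit
import numpy as np
data = list(range(1_000_000))
arr = np.arange(1_000_000)
py_loop = timeit.timeit(lambda: sum(x * 2 for x in data), number=10)
np_call = timeit.timeit(lambda: (arr * 2).sum(), number=10)
print(f"Python loop: {py_loop:.3f}s  NumPy: {np_call:.3f}s  ratio ~{py_loop / np_call:.0f}x")
# The loop executes millions of bytecode instructions; the NumPy version is a
# handful of instructions that hand the whole array to compiled C.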
Everything Is a PyObject
At the C level — the level where CPython is actually implemented — every Python value is a C struct called PyObject. Integers, lists, functions, classes, modules, None — they all start with the same two fields:
typedef struct _object {
    Py_ssize_t ob_refcnt;          // how many references point here
    struct _typeobject *ob_type;   // what type am I?
} PyObject;
ob_refcnt is the reference count — we'll get to that in a moment. ob_type is a pointer to another object (a PyTypeObject) that describes what this thing is and what operations it supports. An integer extends this base struct by adding a field for the actual numeric value. A list extends it by adding a pointer to a resizable array of other PyObject pointers. But at the top, they all look the same to the interpreter.
This is what people mean when they say "everything in Python is an object." It's not a philosophical statement. It's a literal C struct. And every variable you create in Python is a pointer to one of these structs living on the heap — a region of memory managed by the allocator, not the stack. When you write a = [1, 2, 3], the list struct gets allocated on the heap and a is a name that points to it. When you write b = a, you don't copy the list. You create a second pointer to the same struct. One list, two names.
I'll be honest — this was the single insight that untangled years of confusion for me. Why does df2 = df not copy your DataFrame? Because assignment never copies. It creates a new pointer to the same heap object. Want an actual copy? Call df.copy(). Same story with NumPy: slicing returns a view — a new object that shares the same underlying memory buffer. Modify a view, and you've modified the original. I still get tripped up by this occasionally, usually at 2am during a deadline.
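Here's the same behavior in miniature, with a plain list and a NumPy slice (the exact trap that bites with DataFrames):
import numpy as np
a = [1, 2, 3]
b = a                        # second pointer to the same heap object
b.append(4)
print(a)                     # [1, 2, 3, 4]: "both" lists changed, because there is only one
arr = np.arange(6)
view = arr[:3]               # a view: new object, same underlying buffer
view[0] = 99
print(arr)                   # [99  1  2  3  4  5]: the original is modified
real_copy = arr[:3].copy()   # an actual copy gets its own buffer
real_copy[0] = -1
print(arr[0])                # still 99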
Reference Counting — The First Line of Defense
Back to that ob_refcnt field. Every PyObject carries an integer tracking how many things point to it. Create a new name pointing to the object — count goes up. Delete a name or reassign it — count goes down. The moment the count hits zero, CPython frees the memory immediately. No waiting, no background process. Instant reclamation.
Let's trace through our log pipeline to see this in action:
import sys
logs = ["ERROR: disk full", "INFO: started", "ERROR: timeout"]
print(sys.getrefcount(logs)) # 2 (one for 'logs', one for the getrefcount argument itself)
backup = logs
print(sys.getrefcount(logs)) # 3 (logs + backup + getrefcount arg)
del backup
print(sys.getrefcount(logs)) # back to 2
del logs
# refcount hits 0 → list is freed immediately, along with any strings
# only it was holding alive
This is remarkably elegant for the common case. Most objects are created, used, and discarded within a single function. Reference counting handles that with zero overhead from a garbage collector — the memory is reclaimed the instant it's no longer needed. That's why del df on a huge DataFrame often frees gigabytes instantly. The count drops to zero, the memory goes back to the allocator, done.
But there's a crack in this system. Imagine two objects that point to each other: object A has a reference to B, and B has a reference to A. Even if nothing else in your entire program uses either of them, both reference counts are stuck at one. They'll never reach zero. They'll sit there consuming memory forever, like two people each waiting for the other to leave a room first.
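Here's that standoff in miniature. A __del__ method is a convenient way to see exactly when an object is actually freed (gc.collect(), which we meet next, is what finally breaks the cycle):
import gc
class Node:
    def __del__(self):
        print("freed")
a, b = Node(), Node()
a.partner = b            # a -> b
b.partner = a            # b -> a: a reference cycle
del a, b                 # nothing printed: both refcounts are stuck at 1
print("names are gone, objects are still alive")
gc.collect()             # the cyclic collector finds the cycle and frees both
# prints "freed" twice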
The Garbage Collector — Catching What Refcounting Misses
That's where CPython's cyclic garbage collector comes in. It exists for one purpose: finding and destroying reference cycles that reference counting can't handle.
The collector divides objects into three generations. Generation 0 holds newly created objects. When the collector runs and an object survives (because it's still reachable), it gets promoted to Generation 1. Survive again, and it moves to Generation 2. The key insight: most objects die young. A temporary list inside a function, a string built for a log message — these are created and destroyed within milliseconds. So Generation 0 is collected frequently (every ~700 allocations by default), while Generation 2 is collected rarely.
Think of it like a mailroom. Letters that arrive today are checked this evening. If they're still unclaimed after a week, they go to a different shelf that's checked monthly. If they survive a month, they go to a storage room that's checked once a year. Most letters are picked up the same day.
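You can inspect the mailroom's schedule directly. The exact defaults have shifted between Python versions, so treat these numbers as typical rather than guaranteed:
import gc
print(gc.get_threshold())   # e.g. (700, 10, 10): collect gen 0 after ~700 net allocations,
                            # gen 1 after 10 gen-0 passes, gen 2 after 10 gen-1 passes
print(gc.get_count())       # how close each generation is to its threshold right now
print(gc.get_stats())       # per-generation totals: collections run, objects collected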
For ML work, there's a real pattern worth knowing: some training frameworks call gc.disable() during training to avoid the small overhead of periodic garbage collection. This is fine as long as you call gc.collect() between epochs. If you don't, any reference cycles that form during training — and frameworks with complex callback systems can create them — will accumulate silently until your process gets killed by the OOM reaper.
I once spent a full afternoon debugging a training run that would crash around epoch 15 — always epoch 15, regardless of batch size or model. The memory profiler showed a slow, steady climb that had nothing to do with the model or data. It was circular references between callback objects that the disabled garbage collector never cleaned up. Three lines of code fixed it: gc.collect() at the top of each epoch. The humbling part is how long it took me to even suspect the GC.
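The pattern itself is short. A minimal sketch with stand-ins for the real model, data loader, and epoch count (none of this is a specific framework's API):
import gc
def training_step(batch):
    return sum(batch)                      # stand-in for the real forward/backward pass
gc.disable()                               # skip GC pauses inside the hot loop
for epoch in range(3):                     # stand-in for the real epoch count
    for batch in ([1, 2, 3], [4, 5, 6]):   # stand-in for a real data loader
        training_step(batch)
    collected = gc.collect()               # break any cycles formed during this epoch
    print(f"epoch {epoch}: collected {collected} objects")
gc.enable()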
CPython's Secret Weapon: The Memory Allocator
Reference counting and the garbage collector decide when to free memory. The allocator decides how to manage the underlying blocks of memory. CPython doesn't use the system's malloc for most objects. It has its own allocator called pymalloc, optimized for the specific pattern Python exhibits: millions of tiny objects created and destroyed in rapid succession.
The architecture has three levels. At the bottom, CPython requests memory from the operating system in large chunks called arenas — 256 KB each for most of CPython's history, larger in recent versions. Each arena is divided into pools of 4 KB. Each pool is carved into fixed-size blocks tuned for a specific object size class. Objects up to 512 bytes go through pymalloc. Anything larger goes directly to the system allocator.
Why does this matter? Because it explains a behavior that confuses everyone the first time they see it: Python processes often don't return memory to the operating system, even after you del everything. The memory is freed back to pymalloc's pools and arenas, but pymalloc holds onto those arenas for future allocations. From the OS perspective, your process is still using the same amount of RAM. From Python's perspective, the memory is available for new objects. This isn't a leak. It's an optimization — reallocating from the OS is slow.
CPython also pulls some clever tricks for common objects. Integers from -5 to 256 are pre-allocated at startup and reused forever — they're so common that creating fresh heap objects for every 0 and 1 would be absurdly wasteful. None, True, and False are singletons, permanently alive. Variable names and short string literals are interned — stored once in a global table and reused by identity, making dictionary lookups faster.
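You can catch these tricks in the act with the is operator, which compares identity (same struct on the heap) rather than value:
import sys
a, b = 256, 256
print(a is b)        # True: 256 comes from the pre-allocated small-int cache
c, d = int("257"), int("257")
print(c is d)        # False (typically): two separate heap objects holding equal values
e = sys.intern("feature_name_with_a_long_suffix")
f = sys.intern("feature_name_with_a_long_suffix")
print(e is f)        # True: interning guarantees a single shared string object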
And here's a subtlety that trips everyone up at least once: sys.getsizeof() lies. It reports the shallow size of the container object itself — the internal pointer array for a list, the hash table for a dict — not the things inside it. sys.getsizeof(my_list) on a list of a million dictionaries returns maybe 8 MB (the pointer array), not the hundreds of megabytes the dictionaries themselves consume. For real memory numbers, use memory_profiler or df.info(memory_usage='deep') for DataFrames.
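Here's the discrepancy in numbers. The recursive sizer below is a crude illustration, not a replacement for a real profiler:
import sys
rows = [{"level": "ERROR", "message": "disk full " * 50} for _ in range(100_000)]
print(sys.getsizeof(rows))       # under a megabyte: just the list's pointer array
def rough_deep_size(obj, seen=None):
    """Crude recursive size estimate, for illustration only."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(rough_deep_size(k, seen) + rough_deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(rough_deep_size(item, seen) for item in obj)
    return size
print(rough_deep_size(rows))     # far larger: the dictionaries the shallow number ignores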
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model of how Python manages memory: every value is a heap-allocated PyObject, reference counting handles the common case of freeing memory instantly, a generational garbage collector catches reference cycles, and pymalloc handles the physical allocation in arena → pool → block layers. That model is genuinely useful — it explains why b = a doesn't copy, why del sometimes frees memory and sometimes doesn't, and why Python processes seem to "hold onto" RAM.
What we haven't touched yet is concurrency — the reason your 8-core machine runs Python on one core, and how to fix that. That's the GIL, threading, multiprocessing, and asyncio. These build directly on the memory model we've established, because the GIL exists specifically to protect the reference counting system we've been looking at.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
The GIL — One Chef, One Kitchen
Imagine a kitchen with one stove. No matter how many chefs you hire, only one can use the stove at any given time. The others have to wait. That's the Global Interpreter Lock, or GIL — a mutex inside CPython that ensures only one thread executes Python bytecode at a time.
The reason it exists goes straight back to ob_refcnt. Remember, every object has a reference count that gets incremented and decremented constantly — on every assignment, every function call, every scope exit. If two threads decrement the same ob_refcnt simultaneously without any coordination, the count could go wrong. Object freed while still in use. Segfault. Memory corruption. The GIL prevents this by serializing all bytecode execution through a single lock.
At the C level, the GIL lives in ceval.c alongside the bytecode evaluation loop. It's a mutex with a condition variable. A thread holds the GIL, executes bytecode, and after a configurable interval — about 5 milliseconds by default, tunable via sys.setswitchinterval() — it checks whether another thread is waiting. If so, it releases the lock, signals the other thread, and waits to reacquire it. The handoff is controlled by a gil_drop_request flag to avoid unnecessary context switches when no other thread cares.
Could CPython use fine-grained per-object locks instead? In theory, yes. In practice, adding a lock to every PyObject slows down the single-threaded case — which is the overwhelmingly common case — by 30-40%. That tradeoff was rejected decades ago and has been re-evaluated several times since. The single coarse lock keeps the common path fast.
Our kitchen analogy extends further than you might expect. The chef can step away from the stove to check if a delivery has arrived (I/O), and while they're away from the stove, another chef can use it. That's exactly what happens: the GIL is released during I/O operations. Network calls, disk reads, database queries — the thread performing I/O drops the GIL, waits for the OS to respond, and reacquires it when the data arrives. This isn't a loophole. It's the design. I/O-bound multithreaded Python works well precisely because the stove is free while chefs are waiting for deliveries.
Where the GIL Doesn't Matter (and Where It Bites)
Here's the thing about the GIL that everyone complains about but that, for most ML work, doesn't actually matter: NumPy, PyTorch, TensorFlow, and XGBoost all release the GIL when they drop into their C/C++/CUDA kernels. When NumPy is crunching a matrix multiplication, the GIL is released. When PyTorch is running a forward pass on the GPU, the GIL is released. A scikit-learn fit with n_jobs=-1 typically hands the parallel work to joblib, which fans it out to worker processes (or to native threads that release the GIL). The GIL is not your bottleneck during training.
Where it does bite: pure Python loops doing computation. If you've written a for loop that processes ten million rows in Python — not vectorized through NumPy or Pandas — threading won't help. Threads will take turns on the GIL, executing one at a time, potentially slower than a single thread due to the overhead of lock acquisition and context switching. It's like hiring three chefs to use one stove sequentially — you're paying three salaries for one stove's throughput, minus the time they spend coordinating who goes next.
The practical test is dead simple. If you're profiling and your CPU usage on a multi-threaded Python program never exceeds 100% (one core), you've hit the GIL. If it does exceed 100%, you're either in C extension code that releases the GIL, or using multiprocessing.
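You can watch that happen in a few lines. A rough sketch; on a standard GIL-enabled CPython build, the two timings come out about the same, because the threads take turns on the lock:
import time
from concurrent.futures import ThreadPoolExecutor
def count_primes(limit):
    """Pure-Python CPU work: no I/O, no C extension releasing the GIL."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count
start = time.perf_counter()
count_primes(50_000); count_primes(50_000)
print(f"sequential: {time.perf_counter() - start:.2f}s")
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(count_primes, [50_000, 50_000]))
print(f"2 threads:  {time.perf_counter() - start:.2f}s")
# Roughly the same wall time, at best. Swap in ProcessPoolExecutor and it halves.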
| Scenario | Bottleneck | Right Tool | Why |
|---|---|---|---|
| Downloading 500 datasets from S3 | I/O (network) | ThreadPoolExecutor or asyncio | GIL released during I/O — threads genuinely overlap |
| Tokenizing 10M documents in Python | CPU (Python loops) | ProcessPoolExecutor | Each process has its own GIL — true parallelism |
| Training an XGBoost model | CPU (C code) | Let the library handle it (n_jobs) | Library releases GIL internally, uses threads or processes |
| Serving 1000 concurrent inference requests | I/O + CPU | asyncio + ProcessPoolExecutor | Async for HTTP concurrency, processes for CPU inference |
| 4 DataFrames in memory, only need 1 | Memory | Delete references, profile with memory_profiler | Not a concurrency problem — a reference management problem |
Threading, Multiprocessing, and Picking the Right Tool
Let's expand our log ingestion pipeline. We now need to process 10,000 log files: download them from a remote server, parse each file for error patterns (CPU-intensive regex work), and push results to a database. Three stages, three different bottleneck profiles. This is where picking the right concurrency tool stops being theoretical.
Stage 1: Downloading — I/O-bound. The GIL is released while waiting for network responses, so threads work perfectly. Twenty threads, twenty concurrent downloads. Each thread spends 95% of its time waiting for data, and the GIL is free during that wait.
from concurrent.futures import ThreadPoolExecutor
import requests
def download_log(url: str) -> str:
    """I/O-bound: network wait dominates. GIL is released during .get()."""
    return requests.get(url, timeout=30).text
urls = [f"https://logs.example.com/files/{i}.log" for i in range(10_000)]
with ThreadPoolExecutor(max_workers=20) as pool:
    raw_logs = list(pool.map(download_log, urls))
# 20 downloads in flight at once. Wall time: roughly 20x better than sequential.
Stage 2: Parsing — CPU-bound. Regex over megabytes of text is pure computation. Threads would take turns on the GIL. We need separate processes, each with its own interpreter and its own GIL.
from concurrent.futures import ProcessPoolExecutor, as_completed
import re
ERROR_PATTERN = re.compile(r"^(?:ERROR|CRITICAL):.*", re.MULTILINE)
def parse_log(text: str) -> list[str]:
    """CPU-bound: regex matching across the full text."""
    return ERROR_PATTERN.findall(text)   # non-capturing group, so findall returns whole lines
with ProcessPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(parse_log, log): i for i, log in enumerate(raw_logs)}
    results = {}
    for future in as_completed(futures):
        idx = futures[future]
        try:
            results[idx] = future.result()
        except Exception as e:
            results[idx] = f"FAILED: {e}"
# 8 processes, true parallelism. One crashed file doesn't take down the others.
Notice a key design choice: submit() with as_completed() instead of map(). With map(), you wait for results in order — one slow file blocks everything behind it. With as_completed(), you handle results as they finish, and individual failures don't sink the batch.
The beautiful thing about concurrent.futures: ThreadPoolExecutor and ProcessPoolExecutor share the exact same interface. Start with threads. If profiling shows CPU is the bottleneck, swap Thread for Process — one word change. This won't be the last time we'll see how practical design decisions in Python's standard library pay off.
Multiprocessing sends data between processes using pickle serialization. If you send a 500MB DataFrame to 8 workers, that's 8 copies × 500MB = 4GB of memory consumed in serialization alone. The fix: don't send data. Send file paths or row indices and let each worker load its own slice. This is the difference between a pipeline that scales and one that gets OOM-killed at 16 workers.
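A sketch of the path-passing version of our parsing stage. The logs/ directory and file layout are hypothetical; the shape is what matters, since the parent now sends a few bytes per task instead of pickling megabytes of text:
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import re
ERROR_PATTERN = re.compile(r"^(?:ERROR|CRITICAL):.*", re.MULTILINE)
def parse_log_file(path: str) -> tuple[str, int]:
    """Each worker loads its own file; no large objects cross the process boundary."""
    text = Path(path).read_text(errors="replace")
    return path, len(ERROR_PATTERN.findall(text))
if __name__ == "__main__":
    paths = [str(p) for p in Path("logs/").glob("*.log")]   # hypothetical directory
    with ProcessPoolExecutor(max_workers=8) as pool:
        for path, n_errors in pool.map(parse_log_file, paths):
            print(f"{path}: {n_errors} errors")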
Asyncio — Cooperative Multitasking from Scratch
Threads work well for tens or even hundreds of concurrent I/O tasks. But what happens when you need thousands? A web server handling 5,000 concurrent inference requests. A scraper hitting 10,000 endpoints. Each OS thread consumes memory (typically ~8 MB of stack space) and context-switching between thousands of threads gets expensive. We need a different approach.
Asyncio is Python's answer: cooperative multitasking on a single thread. Instead of the operating system preemptively switching between threads, coroutines voluntarily yield control at await points. One thread runs an event loop that juggles thousands of tasks by resuming whichever one has data ready.
To understand what's happening under the hood, it helps to know that coroutines evolved from generators. A generator pauses at yield and resumes when you call next(). A coroutine (defined with async def) pauses at await and resumes when the event loop decides it's ready. Under the C hood, both are state machines — the function's local variables and instruction pointer are saved on the heap (not the stack), so resumption is cheap.
The event loop itself is built on the operating system's I/O multiplexing primitives — epoll on Linux, kqueue on macOS, IOCP on Windows. These let a single thread ask the OS: "wake me up when any of these 5,000 sockets has data." The loop calls selector.select(), blocks until events arrive, then resumes the corresponding coroutines. No polling, no spinning, no wasted CPU cycles.
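To make "coroutines voluntarily yield control" concrete, here's a toy event loop built on plain generators. It looks nothing like asyncio's real implementation, but it is the same idea: tasks park themselves with a wake-up time, and a single thread resumes whichever one is due next:
import heapq
import time
def toy_sleep(seconds):
    yield time.monotonic() + seconds          # park: yield the absolute wake-up time
def task(name, delay):
    print(f"{name}: started")
    yield from toy_sleep(delay)               # cooperative pause, no blocking
    print(f"{name}: resumed after {delay}s")
def run(tasks):
    # A priority queue of (wake_time, tie_breaker, generator).
    ready = [(time.monotonic(), i, t) for i, t in enumerate(tasks)]
    heapq.heapify(ready)
    counter = len(tasks)
    while ready:
        wake_at, _, gen = heapq.heappop(ready)
        time.sleep(max(0.0, wake_at - time.monotonic()))   # wait until the task is due
        try:
            next_wake = next(gen)             # run the task up to its next yield
        except StopIteration:
            continue                          # task finished, drop it
        heapq.heappush(ready, (next_wake, counter, gen))
        counter += 1
run([task("download", 1.0), task("parse", 0.5)])
# Output order: download started, parse started, parse resumed, download resumed.
# One thread, interleaved at the yield points, no OS threads involved.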
I'm still developing my intuition for when async beats threading in practice. The rule of thumb I've settled on: if you're managing more than a few hundred concurrent I/O operations, or if you need fine-grained control over cancellation and timeouts, async is worth the cognitive overhead. For simpler workloads, threads are easier to reason about and work fine.
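For comparison, here's roughly what the download stage from earlier looks like in async form. This sketch assumes the third-party aiohttp library; the semaphore caps concurrency so we don't open 10,000 sockets at once:
import asyncio
import aiohttp
async def download_log(session, semaphore, url):
    async with semaphore:                     # limit the number of in-flight requests
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return await resp.text()
async def main():
    urls = [f"https://logs.example.com/files/{i}.log" for i in range(10_000)]
    semaphore = asyncio.Semaphore(100)        # 100 concurrent downloads, one thread
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(download_log(session, semaphore, u) for u in urls))
raw_logs = asyncio.run(main())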
Here's the production pattern you'll see in ML serving — async for HTTP traffic, a process pool for the actual computation:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from fastapi import FastAPI
import numpy as np
app = FastAPI()
executor = ProcessPoolExecutor(max_workers=4)
def predict_sync(features: list[float]) -> float:
    """CPU-bound inference — cannot be awaited directly."""
    X = np.array(features).reshape(1, -1)
    return float(X.sum() * 0.42)  # stand-in for model.predict(X)[0]
@app.post("/predict")
async def predict(features: list[float]):
    loop = asyncio.get_running_loop()   # preferred over get_event_loop() inside a coroutine
    result = await loop.run_in_executor(executor, predict_sync, features)
    return {"prediction": result}
# The event loop handles thousands of concurrent HTTP connections.
# run_in_executor offloads each prediction to a separate process.
# The loop is never blocked. That's the whole trick.
run_in_executor bridges the two worlds: the async event loop stays responsive to incoming requests, while the CPU-bound work happens in separate processes where the GIL doesn't matter. Without this, one slow prediction blocks every other request in the server. With it, the chef is free to take orders while the prep cooks handle the heavy work in their own kitchens.
The Future: Free-Threaded Python
Python 3.13, released in October 2024, ships with an experimental free-threaded build — a version of CPython where the GIL can be disabled entirely. This is the result of PEP 703, years of work to make reference counting thread-safe without a global lock. The key technique is biased reference counting: the thread that created an object updates a fast, non-atomic counter of its own, while references from every other thread go through a separate, atomically updated one.
The tradeoffs are real. Single-threaded code runs about 10% slower in the free-threaded build, because the more complex reference counting operations — atomic increments, deferred decrements — have overhead that the GIL previously avoided. Multi-threaded CPU-bound code, on the other hand, can finally achieve true parallelism, with dramatic speedups on multi-core machines.
As of mid-2025, the ecosystem is catching up. NumPy and pandas are being updated. Many C extensions that relied on the GIL for thread safety need auditing and modification. The build is opt-in, and if an extension hasn't been declared free-threading-safe, importing it re-enables the GIL at runtime and emits a warning.
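If you want to check what you're actually running, a small sketch (the private sys helper may not exist on standard or older builds, hence the getattr guard):
import sys
import sysconfig
# True if this interpreter was compiled as a free-threaded (no-GIL) build.
print(bool(sysconfig.get_config_var("Py_GIL_DISABLED")))
# On free-threaded 3.13+ builds, report whether the GIL is actually active right now;
# importing an un-audited C extension can switch it back on.
is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
print(is_gil_enabled() if is_gil_enabled else "GIL permanently enabled in this build")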
My honest assessment: don't redesign your production systems around it yet. Use multiprocessing for CPU-bound parallelism today — it works, it's battle-tested, and it'll continue to work regardless of what happens with the GIL. Keep an eye on the ecosystem tracking pages. When the libraries you depend on are all certified free-threading-safe, that'll be the time to switch. Probably late 2025 at the earliest for adventurous teams, 2026 for most of us.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with what happens when you type python script.py — source to AST to bytecode to the VM's evaluation loop. We went down to the C level and met PyObject, the struct that every Python value is built on. We traced how reference counting handles the common case of memory reclamation, and how the generational garbage collector catches the cycles that refcounting misses. We looked at pymalloc's arena-pool-block architecture for fast small-object allocation. Then we crossed into concurrency: the GIL and why it exists (to protect ob_refcnt), how threads and processes navigate around it, how asyncio's event loop achieves thousands of concurrent I/O operations on a single thread, and where the free-threaded future is headed.
My hope is that the next time your training script crawls or your inference server leaks memory, instead of guessing and throwing tools at the problem, you'll know where to look. Is it the GIL serializing CPU-bound threads? Dangling references keeping old DataFrames alive? Synchronous I/O blocking an event loop? You now have a solid enough mental model of what's going on under the hood to ask those questions and answer them.
CPython Internals: Your Guide to the Python 3 Interpreter by Anthony Shaw — the most accessible deep dive into CPython's C code. If this section left you wanting more, this is where to go next.
"Python Behind the Scenes" blog series by Victor Skvortsov at tenthousandmeters.com — a 13-part series that traces through the CPython source, from the parser to the evaluation loop. Insightful and wildly thorough.
PEP 703: Making the Global Interpreter Lock Optional — the formal proposal for free-threaded Python. Dense reading, but it's the authoritative source on how biased reference counting works and what changes it requires.
dis module documentation — Python's built-in bytecode disassembler. The fastest way to understand why one piece of code is faster than another.
memory_profiler — line-by-line memory profiling for Python. The tool that actually tells you where your memory is going, unlike sys.getsizeof().