Algorithms and Data Structures for ML
I avoided data structures and algorithms for an embarrassingly long time. Every time someone brought up "KD-trees" or "topological sort" in the context of ML, I'd nod, mumble something about Big-O, and change the subject. It felt like academic gatekeeping — the kind of thing competitive programmers care about, not people building real ML systems. Then one day my nearest-neighbor search took 40 minutes on 2 million embeddings, and my feature store started timing out in production, and I couldn't explain why. The discomfort of not knowing what was actually happening under the hood grew too great. Here is that dive.
Algorithms and data structures are the invisible scaffolding that every ML system sits on. Not the kind you implement from scratch in a whiteboard interview — the kind that's already running inside NumPy, PyTorch, FAISS, and every feature store you've ever used. Understanding them is what separates "it works on my laptop" from "it works on 100 million users."
Before we start, a heads-up. We're going to cover arrays, hash maps, trees, graphs, heaps, bloom filters, and more — touching on topics from computational complexity to approximate nearest-neighbor search. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Arrays and the gospel of contiguous memory
Hash maps — the backbone of serving
Trees — three different lives in ML
The curse of dimensionality and the death of KD-trees
Approximate nearest neighbors and HNSW
Graphs — autograd is a graph algorithm
Heaps, priority queues, and beam search
Tries — what's inside your tokenizer
Dynamic programming — Viterbi to beam search
Bloom filters — deduplicating a billion documents
Rest stop
Algorithmic complexity — the one skill that prevents production fires
The quadratic wall of self-attention
Top-K selection — why you should stop sorting
Wrap-up
Arrays and the Gospel of Contiguous Memory
Let's start with something so fundamental it feels boring. Imagine we're building a tiny movie recommendation engine. We have three movies and five users. Each user rates each movie on a scale of 1 to 5. That's a grid of 15 numbers — a matrix. We could store it as a Python list of lists, or we could store it as a NumPy array.
Here's the thing that tripped me up for years: the math is identical. The algorithm doesn't change. What changes is where those 15 numbers live in your computer's memory.
A Python list of lists stores each number as a separate Python object, scattered across memory like books thrown on a floor. To add two lists element-by-element, Python has to chase 15 different pointers, unwrap 15 different objects, do the arithmetic, then wrap 15 new objects and scatter them back. NumPy's ndarray stores all 15 numbers in a single contiguous block, like books lined up on a shelf. The CPU can sweep through the entire block in one motion using SIMD (Single Instruction, Multiple Data) instructions — processing multiple numbers per clock cycle.
import numpy as np
# Python list: each element is a pointer to an object scattered in memory
ratings_list = [[4, 5, 2], [3, 1, 4], [5, 5, 3], [2, 3, 5], [4, 2, 1]]
# NumPy array: one contiguous block of memory, 15 floats side by side
ratings = np.array(ratings_list, dtype=np.float32)
# This triggers optimized C code operating on that contiguous block
avg_per_movie = ratings.mean(axis=0) # [3.6, 3.2, 3.0]
That single line at the end runs roughly 100x faster than the equivalent Python loop. Not because the math is smarter, but because contiguous memory lets hardware do what hardware does best. PyTorch tensors, TensorFlow tensors, JAX arrays — they all follow the same principle. The contiguous array is the bedrock data structure of ML. Everything else sits on top of it.
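If you want to feel the difference rather than take my word for it, here's a small benchmark sketch. It assumes nothing beyond NumPy; the exact ratio depends on your machine and the array size, but the order of magnitude is consistent.
import time
import numpy as np

# ~1 million rows of the same tiny ratings grid, just to make the timing visible
big_list = [[4, 5, 2], [3, 1, 4], [5, 5, 3], [2, 3, 5], [4, 2, 1]] * 200_000
big_array = np.array(big_list, dtype=np.float32)

t0 = time.perf_counter()
n_rows = len(big_list)
py_means = [sum(row[j] for row in big_list) / n_rows for j in range(3)]  # pointer-chasing loop
t1 = time.perf_counter()
np_means = big_array.mean(axis=0)  # one pass over contiguous memory
t2 = time.perf_counter()
print(f"pure Python: {t1 - t0:.3f}s   NumPy: {t2 - t1:.4f}s")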
But contiguous arrays have a limitation that matters. They're great when you know the shape of your data ahead of time and you're doing bulk numerical operations. They're terrible when you need to look up "what are user_42's features?" from a pool of 10 million users. For that, we need a different kind of structure.
Hash Maps — The Backbone of Serving
Back to our recommendation engine. It's grown. We now have 10 million users and we need to look up any user's features in under a millisecond for real-time serving. If our features were in a list, finding user_42 would mean scanning through potentially all 10 million entries — O(n) time, where n is the number of users. At 10 million users, that's a death sentence for latency.
A hash map (Python's dict) solves this by computing a hash of the key — a mathematical fingerprint — and using it to jump directly to the right memory location. No scanning. No searching. O(1) average time, whether you have 100 users or 100 million.
# Our recommendation engine's feature store
# O(1) lookup regardless of how many users exist
feature_store = {
"user_42": {"avg_rating": 3.8, "genre_prefs": [0.9, 0.2, 0.7], "active_days": 145},
"user_99": {"avg_rating": 2.1, "genre_prefs": [0.3, 0.8, 0.1], "active_days": 12},
# ... imagine 10 million more entries
}
features = feature_store["user_42"] # instant, regardless of store size
This pattern shows up everywhere in ML systems. Vocabulary mappings in NLP models (token_to_index) are hash maps. Prediction caches are hash maps. Config lookups, deduplication checks, embedding tables in recommendation systems where you map a user ID to a dense vector — all hash maps. Every production feature store you'll encounter (Feast, Tecton, Uber's Michelangelo) is fundamentally a distributed hash map with persistence bolted on.
I'll be honest — I took O(1) dictionary lookups for granted for years. It was only when I saw the difference in production between a dict-backed feature store and a naive list-based one that the practical impact became visceral. The dict version served 50,000 requests per second. The list version could barely manage 200.
Hash maps are brilliant for exact lookups. But what happens when the question isn't "give me user_42's data" but "give me the 10 users most similar to user_42"? For that, we need trees.
Trees — Three Different Lives in ML
Trees show up in three completely separate contexts in ML, and conflating them is a reliable source of confusion. Let me untangle them.
Decision trees as models. In Random Forest, XGBoost, and LightGBM, the tree is the model. Each internal node asks a question about a feature ("is the user's average rating above 3.5?"), each branch represents an answer, and each leaf holds a prediction. These models still dominate tabular ML in production. In our recommendation engine, if we're predicting whether a user will click on a movie based on structured features like age, watch history, and time of day — tree-based models are often the first thing to try, and frequently the last thing you need.
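To make that concrete, here's a minimal sketch with scikit-learn's gradient-boosted trees. The feature names and the tiny dataset are made up for illustration; the point is only that structured features go in and a click probability comes out.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical structured features: [avg_rating, watch_hours_per_week, hour_of_day]
X = np.array([[3.8, 12.0, 21], [2.1, 1.5, 9], [4.5, 20.0, 23], [1.9, 0.5, 14]])
y = np.array([1, 0, 1, 0])  # clicked / did not click

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict_proba([[4.0, 8.0, 20]]))  # [P(no click), P(click)]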
B-trees in databases. When our training pipeline queries PostgreSQL for 50 million user-movie interaction rows and it comes back in seconds instead of hours, that's a B-tree index at work. B-trees keep data sorted in a balanced structure that allows O(log n) lookups. Without an index, the database would scan every row — O(n). With a B-tree index on user_id, finding all of user_42's interactions is nearly instantaneous.
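Here's a minimal sketch of the same idea using SQLite, whose indexes are B-trees. The table and column names are illustrative, not a schema from any real pipeline.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (user_id TEXT, movie_id TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [("user_42", "movie_7", 4.5), ("user_99", "movie_3", 2.0)],  # imagine 50 million rows
)
# The index is a B-tree: "all of user_42's rows" becomes an O(log n) descent, not a full scan
conn.execute("CREATE INDEX idx_user ON interactions (user_id)")
rows = conn.execute(
    "SELECT movie_id, rating FROM interactions WHERE user_id = ?", ("user_42",)
).fetchall()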
KD-trees for nearest-neighbor search. This is where the story gets interesting — and where I got burned. A KD-tree (k-dimensional tree) partitions space recursively, splitting data along one dimension at a time. It's what sklearn.neighbors.KNeighborsClassifier uses under the hood. For our recommendation engine, if each movie is represented as a point in, say, 3-dimensional space (genre scores), a KD-tree finds the 10 nearest movies to a query movie in O(log n) time. Beautiful.
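A minimal sketch with scikit-learn's KDTree, using random stand-in vectors for the catalog:
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical catalog: each movie as a point in 3-dimensional genre space
movie_vectors = np.random.rand(100_000, 3)
tree = KDTree(movie_vectors)

# 10 nearest movies to movie 0, found in roughly O(log n) time at low dimension
dist, idx = tree.query(movie_vectors[0:1], k=10)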
But there's a catch that took me an embarrassingly long time to discover.
The Curse of Dimensionality and the Death of KD-Trees
I once confidently used a KD-tree to find nearest neighbors in a set of 768-dimensional sentence embeddings. It worked. It was also slower than brute-force linear scan. I assumed my code was wrong. It wasn't. The KD-tree was doing what it always does in high dimensions — falling apart.
Here's the problem. A KD-tree works by recursively splitting space along one dimension at a time. In 3 dimensions, each split eliminates roughly half the remaining points. But as dimensions increase, each split becomes less and less informative. By the time you're in 100+ dimensions, the splits are essentially random — the tree has to search nearly every branch, and query time degenerates from O(log n) toward O(n). This is the curse of dimensionality: in high-dimensional spaces, all points are roughly equidistant from each other, and spatial partitioning strategies stop working.
The practical threshold is surprisingly low. KD-trees perform well up to around 20-30 dimensions. Modern ML embeddings — word embeddings at 300 dimensions, sentence embeddings at 768, image embeddings at 2048 — blow past that threshold by an order of magnitude.
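You don't have to take that threshold on faith. Here's a small experiment sketch that builds a KD-tree at a few dimensionalities and races it against a brute-force scan; the exact timings depend on your machine, but the trend is the point.
import time
import numpy as np
from sklearn.neighbors import KDTree

for dim in (3, 30, 300):
    X = np.random.rand(50_000, dim)
    q = np.random.rand(1, dim)

    tree = KDTree(X)
    t0 = time.perf_counter()
    tree.query(q, k=10)
    t_tree = time.perf_counter() - t0

    t0 = time.perf_counter()
    nearest = np.argpartition(((X - q) ** 2).sum(axis=1), 10)[:10]  # brute-force scan
    t_brute = time.perf_counter() - t0
    print(f"dim={dim:3d}  kd-tree {t_tree:.4f}s   brute force {t_brute:.4f}s")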
This means the tool that every intro ML course teaches you for nearest-neighbor search is essentially useless for the nearest-neighbor search you'll actually need to do in practice. That's frustrating. But it's also what motivated one of the most important families of algorithms in production ML.
Approximate Nearest Neighbors and HNSW
The insight that unlocks high-dimensional similarity search is this: you don't need the exact nearest neighbor. You need a neighbor that's very close, and you need it fast. Trading a tiny amount of accuracy for a massive speedup is almost always the right deal.
Several algorithms exploit this tradeoff. Locality-Sensitive Hashing (LSH) uses hash functions designed so that similar items are likely to land in the same bucket — the opposite of what cryptographic hashes do. Spotify's Annoy library builds forests of random projection trees. But the algorithm that has become the industry standard in 2024 is HNSW — Hierarchical Navigable Small World graphs.
The analogy that made HNSW click for me is navigating a city. Imagine you're trying to get from one side of a large city to a specific address. You start on the highway — long-range connections, few exits, fast travel. Once you're in the right neighborhood, you exit to a main road. More connections, shorter distances. Then a side street. Then the exact address. You got there fast because you used different scales of connections.
HNSW builds exactly this. It constructs a multi-layered graph where each node is a data point (an embedding vector). The top layers are sparse — few nodes, connected by long-range links, like highways. Each lower layer is denser, with more nodes and shorter-range connections. The bottom layer contains every point.
To search, you enter at the top layer and greedily walk toward nodes that are closer to your query vector. When you can't get closer in the current layer, you drop down to the next one. By the time you reach the bottom layer, you're already in the right neighborhood, so the final search is fast.
I'm still building my full intuition for why this layered structure works as well as it does — the theory involves random graph properties and "small world" network phenomena. But the practical result is staggering: HNSW can find approximate nearest neighbors in a dataset of 100 million 768-dimensional vectors in single-digit milliseconds. This is what powers vector databases like Pinecone and Milvus, and the vector search capabilities in Redis and Elasticsearch. It's also one of the workhorse index types in FAISS (Facebook's similarity search library).
Back to our recommendation engine: once we represent each movie as a 256-dimensional embedding, HNSW is how we find "movies similar to this one" at serving time, across a catalog of millions, in milliseconds. Without it, modern embedding-based recommendation and retrieval systems wouldn't exist at production scale.
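Here's a minimal sketch using hnswlib, one of several HNSW implementations (FAISS exposes an equivalent index type). The embeddings are random stand-ins, and the M / ef values are common starting points rather than tuned settings.
import numpy as np
import hnswlib

dim = 256
movie_embeddings = np.random.rand(100_000, dim).astype(np.float32)  # imagine millions

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(movie_embeddings), M=16, ef_construction=200)
index.add_items(movie_embeddings, np.arange(len(movie_embeddings)))

index.set_ef(50)  # search-time breadth: higher = more accurate, slower
labels, distances = index.knn_query(movie_embeddings[0:1], k=10)
The ef parameter is the knob that trades recall for latency at query time, which is usually the tradeoff you end up tuning in production.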
But HNSW, for all its power, solves one specific problem: finding similar items. Our ML systems have other structural needs that require a different way of thinking about connections.
Graphs — Autograd Is a Graph Algorithm
Here's something that clicked for me once and never un-clicked: when you call loss.backward() in PyTorch, you're running a graph algorithm. Not metaphorically. Literally.
Every operation you perform on PyTorch tensors — every addition, multiplication, activation function — builds up a directed acyclic graph (DAG) called the computation graph. In PyTorch's autograd graph, the operations are the nodes, and tensors flow along the edges between them. When you call .backward(), PyTorch traverses this graph in reverse topological order, applying the chain rule at each node to compute gradients.
A topological sort is an ordering of a directed graph's nodes such that every node comes after all the nodes it depends on. Picture a recipe: you can't frost the cake before you bake it, and you can't bake it before you mix the batter. The order "mix → bake → frost" is a topological sort of the dependency graph. Reverse that, and you get the order for backpropagation: start at the loss (the frosted cake), and trace backward through every operation to figure out how each ingredient (parameter) contributed.
import torch
# Every operation here builds a computation graph
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1 # graph: x → x² → +3x → +1 → y
# .backward() does a reverse topological sort of this graph
# then computes gradients via chain rule
y.backward()
print(x.grad) # tensor([7.]) — dy/dx = 2x + 3 = 7
Graphs show up in other critical places too. Airflow and Kubeflow orchestrate ML pipelines as DAGs — each node is a task (data ingestion, feature engineering, training, evaluation), and edges encode dependencies. Knowledge graphs store entity relationships for use in recommendation and search. And the entire field of Graph Neural Networks (GNNs) operates on graph-structured data directly — social networks, molecular structures, citation networks — where nodes pass messages to their neighbors to learn representations. The message-passing mechanism in GNNs is itself a graph algorithm: each node aggregates information from its neighbors, updates its own state, and repeats for several rounds.
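Here's a toy sketch of one round of message passing with plain NumPy, assuming simple mean aggregation. Real GNN layers add learned weight matrices and nonlinearities, but the neighbor-aggregate-update loop is the core idea.
import numpy as np

# Toy graph: 4 nodes, adjacency matrix (1 = edge)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=np.float32)
H = np.random.rand(4, 8).astype(np.float32)  # 8-dimensional node features

# One round of message passing: each node averages its neighbors' features,
# then mixes the result into its own state
deg = A.sum(axis=1, keepdims=True)
messages = (A @ H) / deg
H = 0.5 * H + 0.5 * messages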
Graphs give us the language to describe relationships and dependencies. But sometimes the question isn't about relationships — it's about selection. "Out of a million candidates, which 10 are best?" That's where heaps come in.
Heaps, Priority Queues, and Beam Search
Back to our recommendation engine. We've scored all 5 million movies in our catalog for user_42. Now we need the top 10. The naive approach: sort all 5 million scores and take the first 10. That's O(n log n) — about 112 million operations for n = 5 million.
A heap (the standard way to implement a priority queue) does it in O(n log k), where k is the number of items we want. For our case, that's about 17 million operations — roughly 6x faster. The difference: a heap maintains a small structure of size k, ordered just enough to always know its current worst item, and as it scans through the data it only does real work when it finds something better than that worst item.
import heapq
# 5 million movie scores (simulated)
import random
scores = [random.random() for _ in range(5_000_000)]
# Top 10 without sorting all 5 million
top_10 = heapq.nlargest(10, scores) # O(n log k), not O(n log n)
Where heaps become essential in ML is beam search. When an LLM generates text, at each step it produces a probability distribution over the entire vocabulary (often 50,000+ tokens). Beam search keeps the top-k most promising partial sequences at each step, expanding each one and keeping the best k again. That "keep the best k" operation at each step is a heap operation. Without heaps, beam search would require sorting the entire vocabulary at every generation step — far more expensive.
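Here's a stripped-down sketch of a single beam-search step using heapq. The token probabilities are made up, and a real decoder would score each hypothesis with the model rather than share one distribution, but the "expand, then keep the best k" shape is the same.
import heapq
import math

beam_width = 3
# (log_prob, partial_sequence) pairs for the current beam
beam = [(-0.1, ["the"]), (-0.4, ["a"]), (-0.9, ["an"])]
# Toy next-token distribution, shared by every hypothesis for brevity
next_tokens = {"cat": 0.5, "dog": 0.3, "movie": 0.2}

candidates = []
for log_p, seq in beam:
    for token, p in next_tokens.items():
        candidates.append((log_p + math.log(p), seq + [token]))

# "keep the best k" is a heap operation: O(n log k), not a full sort
beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])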
NumPy has an even more aggressive optimization for top-K: np.argpartition, which finds the top-k indices in O(n) average time using a selection algorithm (related to quickselect). It doesn't give you the items in sorted order — it partitions the array so the top-k items are on one side — but that's often enough.
Heaps solve the "find the best few" problem. But there's another lookup problem hiding inside every language model you've ever run: matching text against a vocabulary of tens of thousands of tokens, character by character, millions of times over. For that, we need tries.
Tries — What's Inside Your Tokenizer
Every time you run a sentence through a language model, the first thing that happens is tokenization — breaking text into subword tokens. Greedy subword tokenizers like WordPiece need to find the longest matching token from the vocabulary at every position in the input (BPE proper applies learned merge rules instead, but it depends on the same kind of fast vocabulary lookups). With a vocabulary of 50,000 tokens, scanning the entire vocabulary at every position would be ruinously slow.
A trie (pronounced "try," short for retrieval) is a tree where each node represents a character, and paths from root to marked nodes spell out valid entries. All tokens sharing a common prefix share the same initial path. To find the longest matching token starting at any position, you walk down the trie character by character until you can't go further, then take the last valid token you passed through.
Imagine our recommendation engine has grown to include a natural-language search feature: users type "movies like Inception but darker." The tokenizer breaks this into tokens. With a trie-backed vocabulary, finding each token takes O(L) time where L is the token's length in characters — regardless of vocabulary size. A naive list scan would take O(V) where V is the vocabulary size. For a 50,000-token vocabulary, that's a difference you can feel.
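Here's a minimal trie sketch built from plain dicts, enough to show the longest-match walk. Production tokenizers work on bytes in lower-level languages, but the traversal is the same idea.
def build_trie(vocab):
    root = {}
    for token in vocab:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node["#"] = token  # mark the end of a valid token
    return root

def longest_match(trie, text, start):
    node, best = trie, None
    for i in range(start, len(text)):
        ch = text[i]
        if ch not in node:
            break
        node = node[ch]
        if "#" in node:
            best = node["#"]  # remember the last valid token we passed through
    return best

trie = build_trie(["in", "inc", "inception", "dark", "darker"])
print(longest_match(trie, "inception", 0))  # "inception", found in O(L) steps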
HuggingFace's tokenizers library is written in Rust, and fast production tokenizers lean on exactly this kind of prefix-matching structure rather than naive vocabulary scans. It's not the kind of data structure you'll usually build yourself, but knowing it's there helps you reason about why tokenization is fast for some models and slow for others, and why vocabulary size matters beyond model quality.
Dynamic Programming — Viterbi to Beam Search
Dynamic programming (DP) is not a data structure — it's a technique for solving problems by breaking them into overlapping subproblems and caching the results. The reason it belongs in this section is that it powers two of the most important decoding algorithms in ML.
The Viterbi algorithm is classic DP applied to sequence models. In a Hidden Markov Model (the kind used in speech recognition before deep learning took over), you're trying to find the most likely sequence of hidden states given an observed sequence. Trying every possible path is exponential. Viterbi exploits the fact that the best path to any state at time t depends only on the best paths to all states at time t-1. It builds a table of partial solutions, working forward through time, and then traces back to reconstruct the optimal path. The complexity drops from exponential to O(T × S²), where T is the sequence length and S is the number of hidden states.
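Here's a compact Viterbi sketch on a toy two-state HMM. The probabilities are invented, and it works in log space to avoid underflow, as real implementations do.
import numpy as np

# Toy HMM: 2 hidden states, 3 possible observations (all in log space)
log_start = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3],
                    [0.4, 0.6]])       # log_trans[i, j] = log P(state j | state i)
log_emit = np.log([[0.5, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])   # log_emit[state, observation]

obs = [0, 2, 1]                         # observed sequence
T, S = len(obs), 2
dp = np.zeros((T, S))                   # best log-prob of any path ending in each state
back = np.zeros((T, S), dtype=int)      # backpointers for path reconstruction

dp[0] = log_start + log_emit[:, obs[0]]
for t in range(1, T):
    scores = dp[t - 1][:, None] + log_trans  # S x S: previous state -> current state
    back[t] = scores.argmax(axis=0)
    dp[t] = scores.max(axis=0) + log_emit[:, obs[t]]

# Trace back the optimal state sequence
path = [int(dp[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()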
Beam search is what replaced Viterbi when neural networks made the state space too large for exact DP. Instead of tracking the best path to every possible state, beam search tracks the top-k partial sequences at each step and expands from there. It's approximate — no guarantee of finding the globally optimal sequence — but it's practical. Every LLM you've used generates text with some variant of beam search (or its cousins: greedy decoding, nucleus sampling).
The connection: Viterbi is exact DP, beam search is approximate DP with a fixed-width frontier. Understanding this lineage helps you see why beam width is a quality-vs-speed tradeoff, and why increasing beam width eventually stops helping — you're approximating an exponentially large space, and a wider beam doesn't change that fundamental limitation.
Bloom Filters — Deduplicating a Billion Documents
The idea behind a Bloom filter felt wrong to me the first time I encountered it. A data structure that can tell you "this item is definitely NOT in the set" but can only tell you "this item is PROBABLY in the set"? Why would anyone want that?
Here's why. When training a large language model, you're working with billions of web-crawled documents. Duplicates in training data cause the model to memorize specific passages, degrade generalization, and waste compute. You need to check, for each incoming document, whether you've already seen it. Storing hashes of a billion documents in a regular hash set would consume tens of gigabytes of memory. A Bloom filter stores the same membership information in a fraction of that space — at the cost of occasionally saying "yes, this is a duplicate" when it actually isn't (a false positive). But it never says "no, this is new" when the document is actually a duplicate (no false negatives).
Mechanically, a Bloom filter is a bit array. To add an item, you hash it with several different hash functions and set the corresponding bit positions to 1. To check membership, you hash the item again and check if all the corresponding positions are 1. If any position is 0, the item is definitely new. If all are 1, it's probably a duplicate — the positions might have been set by different items.
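Here's a toy sketch of those mechanics. It salts Python's built-in hash to simulate several hash functions, which is fine for illustration; production filters use fast non-cryptographic hashes and carefully sized bit arrays.
class BloomFilter:
    def __init__(self, num_bits=1_000_000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting the hash with an index
        return [hash((i, item)) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

dedup = BloomFilter()
dedup.add("the quick brown fox ...")
print(dedup.might_contain("the quick brown fox ..."))    # True (possibly a false positive)
print(dedup.might_contain("a document we never added"))  # almost certainly False: definitely new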
This is how OpenAI, Google, and others deduplicate training corpora at scale. The false positive rate is tunable: use more bits and more hash functions, and false positives drop. A Bloom filter with 10 bits per element and 7 hash functions has a false positive rate under 1%. For deduplication, that's perfectly acceptable — you might discard a few unique documents, but you'll catch essentially all the duplicates, using a tiny fraction of the memory a hash set would require.
Rest Stop
If you've made it this far, congratulations. You now have a mental map of the key data structures and algorithms that underlie real ML systems: arrays for computation, hash maps for serving, trees for structured search and models, graphs for autograd and dependencies, heaps for top-K selection, tries for tokenization, DP for sequence decoding, and Bloom filters for deduplication. That's a practical toolkit that covers most of what you'll encounter in production.
The mental model is incomplete — we haven't talked about how fast these things are, or what happens when your data gets large enough that the wrong choice turns a 10-second job into a 10-hour one. If you want to stop here, you're in good shape. The short version of what's coming: know your Big-O, beware quadratic operations, and use partial sorts instead of full sorts when you only need the top-K.
But if the nagging question of "why did my pipeline suddenly get 100x slower when the data only got 10x bigger" keeps you up at night, read on.
Algorithmic Complexity — The One Skill That Prevents Production Fires
I'll be direct about something: I once shipped code with an O(n²) pairwise distance computation hidden inside a feature engineering step. On our test dataset of 5,000 items, it ran in about 50 seconds. On the production dataset of 200,000 items (40 times more data, so 1,600 times more pairs), it took over 22 hours. I didn't understand why until I thought about it in terms of algorithmic complexity.
The idea behind Big-O notation is deceptively straightforward. It describes how the time (or memory) an algorithm needs grows as the input size grows, ignoring constant factors. When we say an operation is O(n), we mean doubling the input roughly doubles the time. O(n²) means doubling the input quadruples the time. O(log n) means doubling the input adds one more step.
The practical question Big-O answers is: "If my data grows 10x, how much longer will this take?"
| Complexity | Name | ML Example | 10x Data → |
|---|---|---|---|
| O(1) | Constant | Dict lookup in a feature store | Same time |
| O(log n) | Logarithmic | B-tree index lookup, bisect on sorted data | One more step |
| O(n) | Linear | Single pass over a dataset, vectorized NumPy op | 10x time |
| O(n log n) | Linearithmic | Sorting predictions for ranking, AUC-ROC | ~13x time |
| O(n²) | Quadratic | Pairwise distances, full self-attention | 100x time |
| O(n³) | Cubic | Naive matrix multiplication, matrix inversion | 1000x time |
That last row — O(n³) — is why matrix multiplication is such a hot topic in ML infrastructure. Standard matrix multiplication of two n×n matrices requires n³ multiplications. Strassen's algorithm brings that down to approximately O(n^2.81); in practice, optimized BLAS libraries get most of their speed from cache-aware blocking, SIMD, and parallelism rather than from asymptotics, though Strassen-style schemes are sometimes applied for very large matrices. The scale still matters: for n = 10,000 (a perfectly normal matrix size in deep learning), the gap between n³ and n^2.81 is hundreds of billions of operations.
The practical stance on complexity is different for offline and online workloads. Offline (training, batch jobs): O(n log n) is almost always fine. O(n²) works up to around 50,000 items. Beyond that, you need to rethink your approach. Online (inference, serving): you need O(1) or O(log n) per request. O(n) per request is tolerable only if n is small — hundreds, not millions. If your serving path has an O(n²) operation, you have a scaling fire waiting to ignite.
The Quadratic Wall of Self-Attention
The most consequential O(n²) operation in modern ML is self-attention in transformers. The standard attention mechanism computes a relevance score between every pair of tokens in the input sequence. If your sequence has n tokens, that's an n × n attention matrix — O(n²) in both time and memory.
For a sequence of 512 tokens, that's about 260,000 pairwise computations per attention head — very manageable. For 4,096 tokens, it's nearly 17 million. For 32,000 tokens (the context length of GPT-4-level models), it's over a billion. The quadratic growth is the fundamental reason why longer context windows are so expensive, and why extending context length is an active area of research.
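Here's a minimal sketch that just materializes the score matrix, which is enough to see the problem. The values are random; only the shapes matter.
import numpy as np

def attention_scores(n_tokens, d_model=64):
    Q = np.random.rand(n_tokens, d_model).astype(np.float32)
    K = np.random.rand(n_tokens, d_model).astype(np.float32)
    scores = Q @ K.T / np.sqrt(d_model)  # the n x n matrix: O(n^2) time and memory
    return scores

for n in (512, 4_096):
    s = attention_scores(n)
    print(f"n={n}: {s.shape[0] * s.shape[1]:,} entries, {s.nbytes / 1e6:.1f} MB per head")
At 32,000 tokens the same float32 matrix would need roughly 4 GB per attention head, which is why nobody materializes it naively at that length.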
Several families of "efficient attention" mechanisms have been developed to break this quadratic barrier. Longformer uses sparse attention patterns — each token attends to a local window plus a few global tokens — bringing complexity to O(n). Performer uses kernel approximations to decompose the attention matrix without ever computing it fully, also achieving O(n). Reformer uses locality-sensitive hashing (our friend from the ANN section) to group similar tokens and only compute attention within groups, achieving O(n log n).
None of these fully replace standard attention in practice — there are quality tradeoffs — but understanding why they exist requires understanding the quadratic wall. And understanding the quadratic wall requires understanding Big-O. Which is why we built up to this point the way we did.
Top-K Selection — Why You Should Stop Sorting
One last algorithmic pattern that comes up constantly and is worth making explicit. In our recommendation engine, after scoring 5 million movies, we need the top 10. In beam search, at each step we need the top-k sequences from an expanded candidate set. In evaluation, we compute Recall@K, Precision@K, NDCG@K — all requiring the top-K items.
The instinct is to sort. Sorting gives you the top-K in sorted order, and it works. But sorting is O(n log n). For finding top-K alone, you can do better.
NumPy's np.argpartition uses a selection algorithm (related to quickselect) that runs in O(n) average time. It doesn't sort the array — it rearranges it so the K largest items are at the end, in arbitrary order. If you also need them sorted, you sort only those K items afterward — O(K log K), which is negligible when K is small.
import numpy as np
scores = np.random.rand(5_000_000) # 5 million movie scores
# Full sort: O(n log n) ≈ 112 million operations
top_10_sorted = np.argsort(scores)[-10:][::-1]
# Partial selection: O(n) ≈ 5 million operations, then sort 10 items
top_10_idx = np.argpartition(scores, -10)[-10:]
top_10_idx = top_10_idx[np.argsort(scores[top_10_idx])[::-1]]
PyTorch's torch.topk and TensorFlow's tf.math.top_k provide the same operation on GPU, which is critical for beam search and ranking operations that happen inside the training loop. The pattern is universal: if you only need K items, don't sort n.
I still occasionally catch myself writing sorted(scores)[-10:] out of habit. Every time I do, I think about the hundred-million-odd comparisons I just spent ordering items I'll never look at.
Wrap-Up
If you're still with me, thank you. I hope it was worth the trip.
We started with 15 numbers in a contiguous block of memory and worked our way through hash maps that power feature stores, trees that serve as both models and search structures, the curse of dimensionality that killed KD-trees, HNSW graphs that rescued high-dimensional search, computation graphs that make backpropagation possible, heaps that make beam search practical, tries that make tokenization fast, dynamic programming that connects Viterbi to modern text generation, Bloom filters that deduplicate training data at billion-document scale, and the complexity analysis that tells you which of these choices will survive contact with production data.
My hope is that the next time you see your training pipeline slow to a crawl, or your serving latency spike, or your nearest-neighbor search take minutes instead of milliseconds — instead of guessing and throwing hardware at it, you'll reach for the right mental model, identify the data structure or algorithm that's the bottleneck, and know what to reach for instead. Not because you memorized implementations, but because you understand what's running underneath.