NumPy


I avoided looking under NumPy's hood for years. I used it every day — np.array, .reshape(), broadcasting with a prayer — but whenever something broke, I'd stare at the shape mismatch error, try random transposes until it worked, and move on. I told myself the internals didn't matter. That the API was all I needed. But every time I hit a bug I couldn't explain, or watched someone write a vectorized operation in one line that would have taken me twenty, the discomfort of not really understanding what was happening underneath grew a little more unbearable. Here is that dive.

NumPy — short for Numerical Python — was created by Travis Oliphant in 2005 by unifying two earlier array libraries. It provides a single data structure, the ndarray, that stores numbers in a contiguous block of memory and operates on them in compiled C. Every major ML library — Pandas, scikit-learn, PyTorch, TensorFlow — is built on top of it. When you call model.fit(X), when you compute a loss, when you normalize a feature matrix, you're using NumPy whether you realize it or not.

Before we start, a heads-up. We're going to look at memory layouts, byte-level strides, C code, and BLAS libraries. You don't need to know any of that beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

Why arrays exist at all
The four ingredients of an ndarray
Strides — the trick that makes everything work
Views and copies — the consequence of strides
Broadcasting from scratch
Why NumPy is fast — the assembly line under the hood
Rest stop and an off-ramp
Reshaping and the stride connection
einsum — the universal tensor language
Production patterns and the gotchas that bite
Resources and credits

Why Arrays Exist at All

To see why NumPy exists, we need to feel the pain of not having it. Imagine we're building a tiny movie recommendation engine. We have three users and four movies, and each user has rated each movie on a scale of 1 to 5. In pure Python, we'd store this as a list of lists.

ratings = [
    [5, 3, 1, 4],   # Alice
    [4, 5, 3, 2],   # Bob
    [1, 2, 5, 4],   # Carol
]

Looks fine. Now suppose we want to compute each user's average rating. We'd write a loop, iterate through each inner list, sum the values, divide by the length. Three users? No problem. Three million users with ten thousand movies each? That loop is going to hurt.

The reason it hurts is architectural. A Python list is not a block of numbers. It's a collection of pointers, each pointing to a separate Python object somewhere in memory. Think of it like a filing cabinet full of sticky notes, where each sticky note says "the actual number is in room 4,207 on shelf B." To do math on those numbers, Python has to follow each pointer, find the object, check what type it is, unwrap the actual value, do the arithmetic, wrap the result back into a new Python object, and store a pointer to it. For every. single. element.

NumPy takes a completely different approach. It stores all the numbers side by side in one continuous block of memory — like books arranged in order on a shelf, spine to spine. No pointers. No wrapping. No type-checking per element. When you say "add 1 to every element," a single C function walks through that block of memory and adds 1 to each raw number in place. No Python interpreter involved.

import numpy as np

ratings = np.array([
    [5, 3, 1, 4],
    [4, 5, 3, 2],
    [1, 2, 5, 4],
], dtype=np.float32)

# Average rating per user — one line, no loop
avg = ratings.mean(axis=1)
# array([3.25, 3.5 , 3.  ])

That axis=1 means "compute the mean along axis 1", the column axis, so the columns collapse and we're left with one mean per user. We'll come back to axes in a moment, but the key thing is: no Python loop. The entire computation happened in compiled C, on a contiguous block of memory, in one call. That's the entire promise of NumPy.
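If the axis argument ever feels backwards, here's the check I run: the axis you pass is the one that disappears from the result.

print(ratings.mean(axis=0))   # collapses the rows: one mean per movie, shape (4,)
# ≈ [3.33, 3.33, 3.  , 3.33]
print(ratings.mean(axis=1))   # collapses the columns: one mean per user, shape (3,)
# [3.25, 3.5 , 3.  ]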

The Four Ingredients of an ndarray

Every NumPy array, no matter how complex, is made of exactly four things. Understanding these four things is what separates people who use NumPy from people who understand it.

Let's inspect our ratings array.

print(ratings.shape)    # (3, 4)
print(ratings.dtype)    # float32
print(ratings.strides)  # (16, 4)
print(ratings.data)     # <memory at 0x...>

The data pointer is just a memory address — "the numbers start here." It points to the beginning of that contiguous block of bytes we talked about. Think of it as a street address.

The shape is a tuple that tells NumPy how to interpret that flat block of bytes as a multi-dimensional structure. Our (3, 4) shape means "3 rows, 4 columns." The shape doesn't live in the data. It's metadata — a label on the block that says "read this as a 3×4 grid."

The dtype tells NumPy how many bytes each element occupies and how to interpret them. Our float32 means each number takes 4 bytes. If we'd used float64 (NumPy's default), each number would take 8 bytes — double the memory for precision we almost never need in ML.
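We can check those byte counts directly; every array carries them as attributes.

print(ratings.itemsize)   # 4  (bytes per float32 element)
print(ratings.nbytes)     # 48 (12 elements × 4 bytes)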

And then there are strides. This is the one that unlocks everything.

Strides — The Trick That Makes Everything Work

I'll be honest — when I first saw strides in a NumPy tutorial, I skipped right past them. They looked like an implementation detail. It took me an embarrassingly long time to realize that strides are the central idea that makes views, broadcasting, transpose, and reshape all work without copying data.

Here's what strides mean. Our ratings array is stored as 12 numbers, side by side, in memory:

# In memory (float32 = 4 bytes each):
# [5.0][3.0][1.0][4.0][4.0][5.0][3.0][2.0][1.0][2.0][5.0][4.0]
#  ^--- byte 0          ^--- byte 16         ^--- byte 32

The strides tuple (16, 4) says: to move to the next row, jump 16 bytes forward. To move to the next column, jump 4 bytes forward. That's it. NumPy knows the data starts at some memory address, and it uses strides to compute where any element lives.

Want element [1, 2] — Bob's rating of the third movie? Start at the base address, jump 1 × 16 + 2 × 4 = 24 bytes forward. There it is. No search. No traversal. Direct computation.

print(ratings[1, 2])  # 3.0 — Bob's rating of movie #3
# NumPy computed: base_address + 1*16 + 2*4 = base + 24 bytes

Why 16 bytes per row? Because each row has 4 elements, each element is 4 bytes (float32), and 4 × 4 = 16. The stride is literally "how many bytes until the next element along this axis."

Let me bring back our bookshelf analogy. If the books (numbers) are arranged spine-to-spine on one long shelf, the strides tell you how far to reach to grab the next book in each direction. "Every 4 inches for the next column, every 16 inches for the next row." The books themselves don't move. You're changing the rule for which book to grab next.

This seems like a small thing, but it has a massive consequence: if you can change the strides without changing the data, you can change the shape of the array without copying anything.
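To see how literal that is, NumPy exposes the machinery in np.lib.stride_tricks.as_strided. It's a sharp tool (hand it wrong strides and it will happily read garbage memory), so treat this as a demonstration rather than an everyday technique.

from numpy.lib.stride_tricks import as_strided

# The same 12 floats, reinterpreted as a 4x3 grid purely by choosing new strides:
# each row is now 3 elements (12 bytes) long instead of 4 elements (16 bytes).
view = as_strided(ratings, shape=(4, 3), strides=(12, 4))
print(view)
# [[5. 3. 1.]
#  [4. 4. 5.]
#  [3. 2. 1.]
#  [2. 5. 4.]]
print(np.shares_memory(view, ratings))   # True, not a single byte moved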

Views and Copies — The Consequence of Strides

When you slice an array with basic indexing — ratings[0], ratings[:, 1:3], ratings[::2] — NumPy doesn't copy any data. It creates a new array object that points to the same block of memory, with different shape and strides. This new object is called a view.

alice = ratings[0]        # Alice's ratings
print(alice.shape)         # (4,)
print(alice.strides)       # (4,)
print(np.shares_memory(alice, ratings))  # True — same data!

Alice's ratings live inside the same memory block as the full ratings array. NumPy adjusted the starting pointer (to the beginning of row 0) and the shape (just 4 elements now). Modifying alice[0] will modify ratings[0, 0]. They're the same bytes.

I've been bitten by this more times than I'd like to admit. You slice an array, modify the slice thinking you have your own copy, and then wonder why the original array changed. The fix is straightforward — use .copy() when you need independence.

alice_copy = ratings[0].copy()   # independent data
alice_copy[0] = 999
print(ratings[0, 0])              # still 5.0 — original untouched

The rule for when you get a view versus a copy is important enough to commit to memory. Basic slicing (using colons and integers) gives you a view. Fancy indexing (using a list of indices like arr[[0, 2]]) and boolean masking (like arr[arr > 3]) give you copies. The reason is that basic slicing can always be expressed as a change in strides and offset, while fancy indexing picks arbitrary elements that may not form a regular pattern in memory.

When you're unsure, np.shares_memory(a, b) will tell you the truth.
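For instance, checking the three indexing styles from the previous paragraph:

sliced = ratings[:, 1:3]          # basic slicing: view
fancy  = ratings[[0, 2]]          # fancy indexing: copy
masked = ratings[ratings > 3]     # boolean mask: copy

print(np.shares_memory(sliced, ratings))   # True
print(np.shares_memory(fancy, ratings))    # False
print(np.shares_memory(masked, ratings))   # False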

Now — transpose. When you call ratings.T, NumPy doesn't rearrange any data. It swaps the strides.

print(ratings.strides)     # (16, 4) — rows are 16 bytes apart, columns 4
print(ratings.T.strides)   # (4, 16) — now columns are 4 bytes apart, rows 16

print(np.shares_memory(ratings, ratings.T))  # True — same data, different strides

A transpose is free. Zero cost. It's a metadata change. But here's the subtle consequence: the transposed array is no longer C-contiguous — the rows aren't laid out consecutively in memory anymore. Some operations on non-contiguous arrays are slower because the CPU's cache can't predict which memory addresses you'll access next. We'll come back to this when we talk about why NumPy is fast.
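You don't have to take the contiguity claim on faith; every array carries a flags record you can inspect.

print(ratings.flags['C_CONTIGUOUS'])    # True: rows sit back to back in memory
print(ratings.T.flags['C_CONTIGUOUS'])  # False: same bytes, row-major order broken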

Broadcasting from Scratch

Broadcasting confused me for a long time. Not the rules — the rules fit on an index card. What confused me was how it worked without copying data. Now that we understand strides, we can see the trick.

Back to our recommendation engine. We have our 3×4 ratings matrix, and we want to center each user's ratings — subtract their mean so that a 3.5 becomes 0.0, a 5 becomes +1.5, and so on. This is a common preprocessing step.

user_means = ratings.mean(axis=1, keepdims=True)
print(user_means.shape)  # (3, 1)
print(user_means)
# [[3.25],
#  [3.5 ],
#  [3.  ]]

centered = ratings - user_means
print(centered.shape)  # (3, 4) — it worked. But how?

We subtracted a (3, 1) array from a (3, 4) array and got a (3, 4) result. NumPy "broadcast" the means across the columns. It behaved as if the (3, 1) means were repeated 4 times to become (3, 4), but it never actually copied the data. Internally, it set the stride along axis 1 to zero — meaning "when you move to the next column, don't move in memory." Same number, read four times.
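You can watch the zero-stride trick happen with np.broadcast_to, which builds exactly the stretched view the subtraction uses internally.

stretched = np.broadcast_to(user_means, (3, 4))
print(stretched.strides)                        # (4, 0): moving to the next column costs 0 bytes
print(np.shares_memory(stretched, user_means))  # True: still only three numbers in memory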

The broadcasting rules determine when this is allowed. They're applied from the rightmost dimension, working left.

Compare the shapes dimension by dimension, starting from the right. Two dimensions are compatible if they're equal, or if one of them is 1. If neither condition holds, NumPy raises a ValueError. If one array has fewer dimensions, pad its shape with 1s on the left.

Let's trace through our example to see the rules in action.

# ratings shape:     (3, 4)
# user_means shape:  (3, 1)
#
# Rightmost dim: 4 vs 1 → 1 stretches to 4 ✓
# Next dim:      3 vs 3 → equal ✓
# Result shape:  (3, 4)

Another example — suppose we want to add a "movie popularity bonus" to each column. We have a 1D array of four bonuses, one per movie.

bonus = np.array([0.5, 0.0, 0.2, 0.1])   # shape (4,)
adjusted = ratings + bonus

# ratings shape:  (3, 4)
# bonus shape:    (4,)    → padded to (1, 4)
#
# Rightmost dim: 4 vs 4 → equal ✓
# Next dim:      3 vs 1 → 1 stretches to 3 ✓
# Result shape:  (3, 4)

The bonus array was treated as if it had a leading dimension of 1, then stretched to 3 rows. No loop. No copy. One underlying C operation.

Here's where it gets dangerous. Shapes like (1000, 1) and (1, 5000) are broadcast-compatible: each has a 1 exactly where the other has a big dimension, so both stretch. The result is (1000, 5000) — five million elements materialized from two small arrays. I've seen this silently allocate gigabytes in production code when someone forgot a keepdims=True and the shapes accidentally aligned.
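When I'm nervous about a pair of shapes, np.broadcast_shapes (added in NumPy 1.20) is a cheap preflight check: it computes the result shape without allocating anything.

print(np.broadcast_shapes((3, 1), (3, 4)))        # (3, 4), fine
print(np.broadcast_shapes((1000, 1), (1, 5000)))  # (1000, 5000), five million elements waiting to happen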

I still print shapes compulsively. print(a.shape, b.shape) before any binary operation. It's the single most effective debugging habit I've picked up in years of NumPy work.

Why NumPy Is Fast — The Assembly Line Under the Hood

We've been saying "NumPy runs in C" and "no Python loop." Let's make that concrete, because the speed difference isn't 2× or 5× — it's often 100× or more, and the reason goes deeper than "C is faster than Python."

Think of a Python for loop over a list of numbers as a craftsman hand-building each item. For every number, the Python interpreter has to: look up the variable in a dictionary, check its type, unbox the raw value from the Python object, perform the arithmetic, box the result back into a new Python object, and update the reference count for garbage collection. That's six or seven operations of overhead for one addition.

NumPy is an assembly line. When you write a + b on two arrays, Python makes one call into C. The C function knows the type (no checking needed — every element has the same dtype), knows the memory layout (contiguous, predictable), and blasts through the data in a tight loop. Modern CPUs can even use SIMD instructions — Single Instruction, Multiple Data — to add 4 or 8 float32 values in a single CPU cycle. The assembly line doesn't stop and inspect each item. It processes them in bulk, on a conveyor belt.

For matrix operations — A @ B, np.linalg.solve, np.linalg.svd — NumPy goes even further. It delegates to BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package). These are libraries that have been optimized for decades, with hand-tuned assembly code for specific CPU architectures. OpenBLAS and Intel's MKL use cache-blocking strategies, multi-threaded execution, and instruction-level parallelism that no Python code could ever touch.

My favorite thing about BLAS is that, aside from the high-level algorithmic ideas (like Strassen's algorithm and cache-oblivious blocking), nobody is completely certain why the fastest implementations are quite as fast as they are. The hand-tuned assembly in something like GotoBLAS was written by one person who understood a specific CPU's pipeline deeply enough to squeeze out the last few percent. It's part science, part craft.

Let's make the speed tangible with our movie ratings example. Suppose we scale up to 10,000 users and 5,000 movies. We want each user's average rating.

big_ratings = np.random.randn(10000, 5000).astype(np.float32)

# The Python loop way
def mean_loop(X):
    result = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        total = 0.0
        for j in range(X.shape[1]):
            total += X[i, j]
        result[i] = total / X.shape[1]
    return result

# The NumPy way
means = big_ratings.mean(axis=1)

# The loop takes ~15 seconds. NumPy takes ~5 milliseconds.
# That's a 3000× difference. Not a typo.

The gap comes from all three layers stacking up: no interpreter overhead, cache-friendly memory access on a contiguous block, and SIMD instructions processing multiple values at once. The Python loop hits none of those. The NumPy call hits all three.

This has a practical consequence that becomes a kind of design principle: if you're writing a for loop over array elements in NumPy, you're almost certainly doing it wrong. The game is to express your computation as operations on whole arrays — what's called vectorization. Not because it's elegant (though it is), but because it's the only way to access the fast path.
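If you want to verify the gap on your own hardware, a rough timing harness looks like this; the exact numbers will vary with your CPU and BLAS build.

import timeit

t_loop  = timeit.timeit(lambda: mean_loop(big_ratings), number=1)
t_numpy = timeit.timeit(lambda: big_ratings.mean(axis=1), number=10) / 10
print(f"loop: {t_loop:.2f} s   numpy: {t_numpy * 1000:.2f} ms   ratio: {t_loop / t_numpy:.0f}x")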

Rest Stop and an Off-Ramp

Congratulations on making it this far. You can stop here if you want.

You now have a mental model of what a NumPy array actually is: a raw block of memory, plus four pieces of metadata (pointer, shape, strides, dtype) that tell NumPy how to interpret that block. You understand that views share memory, that broadcasting works by manipulating strides, and that the speed comes from delegating tight loops to compiled C and BLAS. That model is genuinely useful — it will help you debug shape errors, understand why .T is free, and write vectorized code that doesn't fight the library.

It doesn't tell the complete story. We haven't talked about how reshape connects to strides, how einsum can express any tensor operation in one line, or the production gotchas that separate working code from fast code. But the core mental model? You have it.

The short version of everything that follows: reshape changes strides without copying when it can; einsum is a compact language for saying "contract these tensor axes"; and in production, watch your dtypes, avoid appending in loops, and use in-place operations when memory matters. There. You're 80% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Reshaping and the Stride Connection

Now that we understand strides, reshaping becomes transparent. When you call .reshape(), NumPy tries to find new strides that make the same flat block of memory look like a different shape. If it can, it returns a view — zero cost. If it can't (because the data isn't contiguous in the right way), it copies.

# Our ratings: shape (3, 4), strides (16, 4)
flat = ratings.reshape(12)      # shape (12,), strides (4,)
print(np.shares_memory(flat, ratings))  # True — same data, new strides

# Reshape to (4, 3)
reshaped = ratings.reshape(4, 3)  # shape (4, 3), strides (12, 4)
print(np.shares_memory(reshaped, ratings))  # True — still a view

The -1 shorthand lets NumPy compute one dimension for you. ratings.reshape(-1, 2) means "give me 2 columns, figure out the rows yourself." NumPy divides the total number of elements (12) by 2 and gets 6 rows. This is essential when you know one dimension but the other depends on the data size — batching, for instance.
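Concretely, with our 12-element ratings array:

batched = ratings.reshape(-1, 2)
print(batched.shape)   # (6, 2): NumPy solved 12 / 2 = 6 for us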

np.newaxis (which is literally None) adds a dimension of size 1. It's pure metadata: a length-1 axis is never actually stepped along, so no data is touched or copied. This is the tool for making broadcasting work when shapes don't quite line up.

# user_means has shape (3,) — we need (3, 1) for broadcasting
user_means = ratings.mean(axis=1)          # shape (3,)
user_means_col = user_means[:, np.newaxis]  # shape (3, 1)

# Now we can subtract: (3, 4) - (3, 1) → (3, 4)
centered = ratings - user_means_col

The keepdims=True parameter does the same thing inline — ratings.mean(axis=1, keepdims=True) directly gives shape (3, 1) instead of (3,). I use keepdims almost always now. It makes the broadcasting intention explicit and avoids the newaxis dance.

One subtlety: .reshape() returns a view, but .flatten() always returns a copy. .ravel() returns a view when it can, a copy when it must. If you want guaranteed independence, use .flatten(). If you want speed and don't mind shared data, use .ravel().

And .T — transpose — isn't a reshape at all; it just swaps the strides, as we saw earlier. For 2D arrays it's rows-to-columns. For higher dimensions, .transpose(2, 0, 1) lets you specify the axis order. This comes up constantly when converting between image formats: (height, width, channels) to (channels, height, width) is a single .transpose(2, 0, 1) — zero copy, instant.
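Here's a quick sketch with a made-up image array; the 224×224×3 shape is just illustrative.

img = np.zeros((224, 224, 3), dtype=np.float32)   # (height, width, channels)
chw = img.transpose(2, 0, 1)                      # (channels, height, width)
print(chw.shape)                     # (3, 224, 224)
print(np.shares_memory(img, chw))    # True: only the strides were permuted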

einsum — The Universal Tensor Language

For a long time I avoided np.einsum. It looked like hieroglyphics. Then I watched someone rewrite five separate operations — transpose, batch matrix multiply, trace, outer product — each in a single einsum call, and I realized I'd been doing things the hard way.

einsum stands for Einstein summation. The idea is that you label each axis of each input tensor with a letter, and the letters that appear in the output are kept. Letters that appear only in the input get summed over. That's the entire language.

Let's start with our movie ratings. We want the dot product between Alice's ratings and Bob's ratings — a measure of how similar their tastes are.

alice = ratings[0]   # shape (4,)
bob = ratings[1]      # shape (4,)

# The familiar way
similarity = np.dot(alice, bob)   # 46.0

# The einsum way
similarity = np.einsum('i,i->', alice, bob)   # 46.0

Read the subscript string: both inputs have one axis labeled i. The output (after the ->) has no axes — so i gets summed over. Multiply corresponding elements, sum them up. That's a dot product.

Now suppose we want all pairwise similarities — a 3×3 matrix where entry [a, b] is the dot product of user a's ratings with user b's ratings.

# The matmul way
sim_matrix = ratings @ ratings.T   # (3, 4) @ (4, 3) → (3, 3)

# The einsum way
sim_matrix = np.einsum('ik,jk->ij', ratings, ratings)   # (3, 3)

The string says: first array has axes i and k, second array has axes j and k. Output keeps i and j, sums over k. That's matrix multiplication — and the einsum version makes the contraction axis (k) explicit. No need to mentally track which dimension is which in a transpose.

Where einsum becomes truly essential is batch operations. Imagine we have embeddings for a batch of sequences and we need attention scores — the dot product of every query with every key, separately for each item in the batch.

Q = np.random.randn(8, 10, 64).astype(np.float32)   # (batch, seq_len, d_model)
K = np.random.randn(8, 10, 64).astype(np.float32)

# Without einsum: transpose K, then batch matmul
scores = np.matmul(Q, K.transpose(0, 2, 1))   # (8, 10, 10)

# With einsum
scores = np.einsum('bid,bjd->bij', Q, K)   # (8, 10, 10)

The einsum version says exactly what we mean: for each batch b, take the dot product of query position i with key position j, summing over the model dimension d. No transposing, no worrying about which axis to swap. The letters are the documentation.

I'm still developing my intuition for when einsum is faster versus when it's equivalent to a well-written @ call. For simple matrix multiplies, @ often delegates to BLAS more directly. For complex multi-axis contractions — especially with batch dimensions — einsum can be faster because it avoids creating intermediate arrays. The optimize=True flag helps einsum find efficient contraction orderings for large expressions.
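A few more patterns, just to show the range of the notation; these reuse the arrays we already have.

trace   = np.einsum('ii->', sim_matrix)      # sum of the diagonal of the 3×3 similarity matrix
outer   = np.einsum('i,j->ij', alice, bob)   # (4, 4) outer product of two rating vectors
col_sum = np.einsum('ij->j', ratings)        # per-movie rating totals, shape (4,)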

Production Patterns and the Gotchas That Bite

There's a gap between "I can use NumPy" and "I can use NumPy in production without burning money on compute." These are the patterns I've learned the hard way, mostly by watching monitoring dashboards spike at 3 AM.

Cast your dtypes early. NumPy defaults to float64. ML frameworks use float32. A (100000, 768) embedding matrix in float64 is 586 MB; in float32 it's 293 MB. That difference compounds across every operation, every intermediate array, every gradient. Cast on ingest: X = np.array(data, dtype=np.float32).

# Don't:
X = np.array(data)               # float64 by default
X = X.astype(np.float32)          # wasteful — allocated float64 first, then copied

# Do:
X = np.array(data, dtype=np.float32)   # right from the start

Never grow arrays in a loop. Every call to np.append() or np.concatenate() inside a loop allocates a brand new array and copies everything. For n iterations, that's O(n²) time and memory. Collect results in a Python list, then convert once at the end.

# This is O(n²) and will destroy your performance:
result = np.array([])
for x in stream:
    result = np.append(result, process(x))

# This is O(n):
results = []
for x in stream:
    results.append(process(x))
result = np.array(results, dtype=np.float32)

Use in-place operations when you can. a += b modifies a in place. a = a + b creates a new array for the result, then points a at it, leaving the old one for garbage collection. For large arrays in tight loops — like a training step — the memory allocation overhead adds up.
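A minimal sketch of the difference; the shapes here are arbitrary.

a = np.zeros((10_000, 768), dtype=np.float32)
b = np.ones_like(a)

a = a + b            # allocates a brand-new (10000, 768) array, then rebinds a
a += b               # writes into a's existing buffer, no new allocation
np.add(a, b, out=a)  # the explicit spelling: most ufuncs accept an out= argument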

Boolean masks compose. When filtering data, combine masks with & (and), | (or), and ~ (not) before applying them, rather than filtering multiple times.

# Filter to confident positive predictions
mask = (predictions > 0.9) & (labels == 1)
selected = X[mask]   # one copy, one filter

# Not this — two intermediate arrays, and the labels have to be filtered the same way anyway:
step1 = X[predictions > 0.9]                     # copy
step2 = step1[labels[predictions > 0.9] == 1]    # another copy

Understand contiguity. After a transpose or non-trivial slice, your array may not be C-contiguous. BLAS operations expect contiguous data; if your array isn't, NumPy will silently copy it before calling BLAS. If you're doing repeated matrix operations on transposed data, call np.ascontiguousarray() once up front rather than letting NumPy copy it on every operation.
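The pattern looks like this; weights and batches are stand-ins for whatever you're multiplying repeatedly.

W_T = np.ascontiguousarray(weights.T)   # pay the copy once, up front
for batch in batches:
    out = batch @ W_T                   # repeated matmuls now get contiguous data every time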

Random numbers: use the modern API. The legacy np.random.seed(42) sets a global state — any library you import could read from or write to it, making your experiments non-reproducible in subtle ways. The modern approach creates a local generator.

rng = np.random.default_rng(seed=42)
batch = rng.choice(dataset, size=256, replace=False)
noise = rng.normal(0, 0.1, size=(256, 768))

The default_rng generator uses the PCG64 algorithm — statistically better than the old Mersenne Twister, and crucially, its state is local to that object. Pass the rng object through your code, and your randomness is fully controlled and reproducible.

Memory-mapped arrays for datasets that exceed RAM. np.memmap lets you create an array backed by a file on disk. Only the pages you access get loaded into memory. For a 50 GB dataset, you can work with it as if it were a normal array, and the operating system handles the paging.

mmap = np.memmap('embeddings.dat', dtype=np.float32,
                  mode='r', shape=(10_000_000, 768))
batch = mmap[1000:2000]   # only loads these 1000 rows from disk

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started from the question of why Python lists are too slow for numerical work and discovered that the answer is a single contiguous block of memory, interpreted by four pieces of metadata. We saw how strides let NumPy create views, transposes, and broadcasts without copying data. We followed the call from Python down into C and BLAS to understand where the 100× speedup actually comes from. We used einsum to express tensor contractions in a language that reads like a recipe. And we looked at the production patterns that separate code that works from code that works at scale.

My hope is that the next time you see a broadcasting error, or a mysterious shape mismatch, or a training loop that's slower than it should be, instead of trying random transposes until something works, you'll check the strides, print the shapes, and trace the path from your Python call down to the contiguous memory underneath — having a pretty darn good mental model of what's going on under the hood.

Resources and Credits

  • NumPy Internals Documentation — the official guide to array internals. Dry but authoritative. This is where I finally understood the PyArrayObject C struct.
  • NumPy Broadcasting Guide — the official broadcasting rules, with diagrams. Wildly helpful once you have the strides intuition.
  • Travis Oliphant, Guide to NumPy (2006) — the O.G. NumPy book by its creator. Dense, but it's the primary source for understanding design decisions.
  • IPython Cookbook: NumPy Internals — an insightful walkthrough of views, copies, and when data gets silently duplicated.
  • Stride Guide by Alex Riley — the single best visual explanation of strides I've found. Made the concept click for me permanently.
  • NEP 19: Random Number Generator Policy — the design document behind default_rng. If you've ever wondered why the legacy API is bad, this is unforgettable reading.