CNN Building Blocks
I avoided understanding convolutions deeply for longer than I'd care to admit. For months I treated them as magical black boxes: you feed in an image, some "filters" slide around, features come out. I could import a pretrained ResNet, swap the final layer, and get 95% accuracy on whatever classification task I threw at it. Why bother understanding the internals? The answer hit me the first time something went wrong and I had no idea where to look. My model wouldn't detect large objects. I couldn't reason about parameter counts. I didn't know if the architecture was even capable of solving the problem I'd given it. That's when I sat down and rebuilt the whole thing from scratch, piece by piece.
A convolutional neural network (CNN) is a neural network that uses a particular kind of layer — the convolutional layer — designed to exploit the spatial structure of grid-like data, especially images. The core idea was introduced by Yann LeCun in the late 1980s (LeNet for handwritten digit recognition), but CNNs didn't dominate until AlexNet won the ImageNet competition in 2012 by a staggering margin. Since then, they've become the backbone of virtually every computer vision system: image classification, object detection, segmentation, medical imaging, autonomous driving, and more.
Before we get into it — you'll want a working understanding of how basic neural networks function (forward pass, backpropagation, weight updates) and enough Python/PyTorch comfort to read a class definition. If those are shaky, chapters 7 and 8 have you covered. If they're solid, you're ready. Don't worry about needing a math degree for this; the hardest formula we'll encounter is a four-variable division.
This isn't a short journey. We're going to start with a comically tiny 5×5 image, walk through what a convolution actually computes at every single position, and gradually build up to the complete toolkit that modern architectures are made from. By the end, you'll understand every building block well enough to design your own architectures or, more importantly, debug the ones that aren't working. We'll thread a single running example through the whole section: imagine you're building a tiny security camera system that needs to decide whether a person is in the frame. That's our motivating problem.
What We'll Cover
Why Images Break Fully-Connected Networks
The Convolution Operation
Filters and Kernels — What the Network Actually Learns
Stride: How Far the Kernel Jumps
Padding: Keeping Things the Same Size
The Output Size Formula
Dilation: Seeing Wider Without More Weights
1×1 Convolutions — The Surprisingly Powerful Pixel-Wise Mixer
Rest Stop
Pooling: Throwing Away the Right Information
Global Average Pooling — The FC Layer Killer
Feature Hierarchies — The Deep Reason CNNs Work
Receptive Field — What Each Neuron Can "See"
Depthwise Separable Convolutions — The Efficiency Revolution
Transposed Convolutions — Going Back Up
The Conv-BN-ReLU Block — Putting It All Together
Parameter Count: Where the Memory Goes
Wrapping Up and Resources
Why Images Break Fully-Connected Networks
Let's start with the problem that makes convolutional layers necessary. Imagine our security camera captures a tiny grayscale image that's 5 pixels wide and 5 pixels tall. That's 25 numbers. A fully-connected layer with, say, 10 neurons would need 25 × 10 = 250 weights. Manageable. Now think about a real camera image: 224 × 224 pixels, 3 color channels (red, green, blue). That's 224 × 224 × 3 = 150,528 input values. A single FC layer with 1,000 neurons would need 150 million weights. For one layer. The model would be enormous, slow to train, and would overfit on anything smaller than a massive dataset.
But the parameter explosion isn't even the worst part. A fully-connected layer connects every input pixel to every neuron with its own unique weight. Pixel (0, 0) at the top-left corner and pixel (223, 223) at the bottom-right get treated as equally related. The layer has no concept of "nearby." It doesn't know that pixels next to each other tend to be part of the same object. There's a third problem too: if our security camera catches a person standing in the left half of the frame, a fully-connected network activates one set of weights. If the same person walks to the right half, a completely different set of weights gets activated. The network would need to learn separate detectors for every possible position of every object. That's absurdly wasteful.
We need something that exploits three key facts about images. Nearby pixels are more related than distant ones (local connectivity). The same pattern — a horizontal edge, say — looks the same regardless of where it appears (weight sharing). And if a pattern shifts position, the detector's output should shift by the same amount (translation equivariance). Convolutional layers give us all three.
The Convolution Operation
Here's our security camera's first test image — a 5×5 grayscale grid. Each cell holds a brightness value between 0 (black) and 1 (white). There's a rough vertical edge in the middle where the left side is dark and the right side is light, sort of like a person's silhouette against a bright wall.
# Our 5×5 "security camera" image
# Left side dark, right side bright — a vertical edge
image = [
[0.0, 0.0, 1.0, 1.0, 1.0],
[0.0, 0.0, 1.0, 1.0, 1.0],
[0.0, 0.0, 1.0, 1.0, 1.0],
[0.0, 0.0, 1.0, 1.0, 1.0],
[0.0, 0.0, 1.0, 1.0, 1.0],
]
Now here's a 3×3 kernel (also called a filter) — a tiny grid of weights that acts as a pattern detector. This particular kernel is designed to detect vertical edges: it has negative weights on the left and positive weights on the right.
# A vertical edge detection kernel
kernel = [
[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1],
]
The convolution operation works like this: we place the kernel on top of the image, lining up their top-left corners. We multiply each kernel weight by the image pixel it overlaps, then sum all nine products into a single output number. Let's walk through the very first position — the kernel sitting over the top-left 3×3 patch of the image.
# Position (0,0): kernel overlaps image rows 0-2, cols 0-2
#
# image patch: kernel:
# 0.0 0.0 1.0 -1 0 1
# 0.0 0.0 1.0 -1 0 1
# 0.0 0.0 1.0 -1 0 1
#
# element-wise multiply and sum:
# (0.0×-1) + (0.0×0) + (1.0×1) +
# (0.0×-1) + (0.0×0) + (1.0×1) +
# (0.0×-1) + (0.0×0) + (1.0×1)
# = 0 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1
# = 3.0 ← strong vertical edge detected here!
A value of 3.0 — that's the maximum this kernel can produce (three 1s multiplied by three +1 weights). It means "there is a strong vertical edge at this location." Now we slide the kernel one pixel to the right and repeat.
# Position (0,1): kernel overlaps rows 0-2, cols 1-3
#
# image patch: kernel:
# 0.0 1.0 1.0 -1 0 1
# 0.0 1.0 1.0 -1 0 1
# 0.0 1.0 1.0 -1 0 1
#
# (0.0×-1) + (1.0×0) + (1.0×1) +
# (0.0×-1) + (1.0×0) + (1.0×1) +
# (0.0×-1) + (1.0×0) + (1.0×1)
# = 0 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1
# = 3.0 ← still on the edge
# Position (0,2): kernel overlaps rows 0-2, cols 2-4
#
# image patch: kernel:
# 1.0 1.0 1.0 -1 0 1
# 1.0 1.0 1.0 -1 0 1
# 1.0 1.0 1.0 -1 0 1
#
# (1.0×-1) + (1.0×0) + (1.0×1) +
# (1.0×-1) + (1.0×0) + (1.0×1) +
# (1.0×-1) + (1.0×0) + (1.0×1)
# = -1 + 0 + 1 + (-1) + 0 + 1 + (-1) + 0 + 1
# = 0.0 ← uniform region, no edge
We continue sliding the kernel across every valid position. With a 5×5 image and a 3×3 kernel, the kernel fits into 3 horizontal × 3 vertical positions, producing a 3×3 output called a feature map. This feature map tells us where the vertical edge pattern was detected and how strongly.
# Complete feature map (output of convolving our 5×5 image with the 3×3 kernel)
output = [
[3.0, 3.0, 0.0],
[3.0, 3.0, 0.0],
[3.0, 3.0, 0.0],
]
# The left two columns light up (edge present), right column doesn't (uniform area)
Here's the crucial insight that makes this whole thing work: the same 9 kernel weights are used at every position. We didn't use one set of weights for the top-left and a different set for the bottom-right. That's parameter sharing, and it's massive. Our vertical edge detector works identically whether the edge is at the top of the frame or the bottom. If a person walks across the camera's field of view, the edge features follow them automatically.
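To convince yourself the walkthrough is right, here's a minimal pure-Python sketch (no frameworks, just nested loops) that computes the full feature map from the image and kernel defined above; convolve2d is just an illustrative name.
def convolve2d(image, kernel):
    # Slide the kernel over every valid position; sum the element-wise products
    H, W = len(image), len(image[0])
    K = len(kernel)
    out_h, out_w = H - K + 1, W - K + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):            # row of the kernel's top-left corner
        for j in range(out_w):        # column of the kernel's top-left corner
            total = 0.0
            for ki in range(K):
                for kj in range(K):
                    total += image[i + ki][j + kj] * kernel[ki][kj]
            output[i][j] = total
    return output
print(convolve2d(image, kernel))
# [[3.0, 3.0, 0.0], [3.0, 3.0, 0.0], [3.0, 3.0, 0.0]] ← matches the hand calculation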
A Technical Footnote on "Convolution"
If you've taken a signal processing course, you might be bothered by something. True mathematical convolution requires flipping the kernel 180 degrees before sliding it across the input. What deep learning frameworks call "convolution" is actually cross-correlation — no flip. Does this matter? Not at all for neural networks, because the kernel weights are learned. If true convolution would need the kernel [−1, 0, 1], cross-correlation simply learns the flipped version [1, 0, −1] and produces exactly the same output. The distinction matters in signal processing textbooks but not in practice. Every major framework — PyTorch, TensorFlow, JAX — implements cross-correlation and calls it convolution.
Multi-Channel Convolution
Our security camera doesn't shoot in grayscale. It captures color images with three channels: red, green, and blue. So our input isn't a 2D grid — it's a 3D volume with shape (height, width, channels), like (224, 224, 3). How does convolution handle this?
The kernel becomes 3D too. Instead of a 3×3 grid of weights, we have a 3×3×3 block — one 3×3 slice per input channel. At each position, we multiply all 27 values (9 per channel), sum them all up, and get one output number. The kernel sees all three color channels simultaneously at each spatial location.
import torch
import torch.nn as nn
# Single conv layer: 3 input channels (RGB), 16 output filters
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# Each of the 16 filters is a 3×3×3 weight block (3×3 spatial × 3 channels)
# plus one bias term per filter
# Parameters: (3 × 3 × 3 + 1) × 16 = 28 × 16 = 448
print(sum(p.numel() for p in conv.parameters())) # 448
# Feed in a batch of one 224×224 RGB image
x = torch.randn(1, 3, 224, 224)
out = conv(x)
print(out.shape) # torch.Size([1, 16, 224, 224])
# 16 feature maps, each 224×224 — one per filter
And here's a number worth internalizing. Those 16 filters that process the entire 224×224×3 image use 448 parameters. A fully-connected layer doing the same job — 150,528 inputs to 16 outputs — would need 2,408,448 weights. That's a 5,000× difference, and it only gets more dramatic as the network gets deeper.
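To make that comparison concrete, here's a quick sketch that counts both in PyTorch. The nn.Linear layer is purely illustrative; nobody would actually build it.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
fc = nn.Linear(224 * 224 * 3, 16)   # hypothetical FC layer over the flattened image
print(sum(p.numel() for p in conv.parameters()))   # 448
print(sum(p.numel() for p in fc.parameters()))     # 2408464: the 2,408,448 weights plus 16 biases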
Filters and Kernels — What the Network Actually Learns
We hand-crafted that vertical edge kernel, but in a real CNN, the kernel weights are learned through backpropagation. The network starts with random weights and gradually adjusts them to minimize the loss function. Nobody tells it to learn edges.
So what does it learn? Zeiler and Fergus published a landmark paper in 2013 where they visualized the filters at every layer of a trained CNN. The results were striking and have been replicated many times since. The first layer learns filters that look like Gabor filters — oriented edges and color blobs. Think horizontal lines, vertical lines, diagonals at various angles, and patches sensitive to specific colors. These filters are nearly identical across different networks trained on completely different datasets. They're the visual alphabet.
Layers 2 and 3 start combining those edges into corners, T-junctions, simple textures like grids and stripes, and small circular shapes. By layers 4 and 5, the filters respond to recognizable object parts — dog faces, wheels, eyes, text. Nobody is entirely certain why this particular hierarchy of features emerges so consistently. The network is given nothing but pixel-level classification labels, and backpropagation alone drives it to rediscover what feels like a fundamental structure of visual information. It's one of those results that still gives me pause.
For our security camera, this means the first layer would learn to detect the edges of a person's silhouette, the next few layers would pick up body parts like heads and limbs, and the deepest layers would assemble those into "person" vs "not person." All from labeled examples and gradient descent.
Stride: How Far the Kernel Jumps
So far our kernel has moved one pixel at a time — that's stride 1. But we can tell it to skip positions. With stride 2, the kernel jumps two pixels at a time, visiting only every other position in both directions. The immediate effect: the output feature map is roughly half the height and half the width of the input.
Think back to our security camera. The raw image might be 224×224 — way more resolution than we need. A stride-2 convolution in the first layer takes us down to 112×112, cutting the amount of data the subsequent layers have to process by 4×. It's doing feature extraction and downsampling in one operation.
# Stride 1: output stays the same size (with padding)
conv_s1 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 64, 224, 224)
print(conv_s1(x).shape) # torch.Size([1, 128, 224, 224])
# Stride 2: output halved
conv_s2 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
print(conv_s2(x).shape) # torch.Size([1, 128, 112, 112])
ResNet, one of the most influential architectures, uses stride-2 convolutions instead of pooling layers at its downsampling points. The reasoning: a stride-2 convolution has learnable weights, so the network can figure out the best way to compress spatial information. A pooling layer is a fixed operation — max or average — with no learning involved. We'll come back to this comparison when we discuss pooling.
Padding: Keeping Things the Same Size
You probably noticed something in the toy example: a 5×5 image convolved with a 3×3 kernel produced a 3×3 output. We lost 2 pixels of width and 2 pixels of height. That's because the kernel can't be centered on the border pixels without hanging off the edge of the image.
There are two standard approaches. Valid padding (also called "no padding") accepts the shrinkage. You only compute outputs where the kernel fully fits. A 3×3 kernel on a 5×5 input gives a 3×3 output. Lose 1 pixel per side. Same padding adds zeros around the border of the input so the output keeps the same spatial dimensions as the input. For a 3×3 kernel, we add 1 pixel of zeros on every side, turning our 5×5 input into a 7×7 padded input, and the 3×3 kernel fits into exactly 5 positions per dimension — giving us back our 5×5 output.
Why does this matter so much? Without padding, the feature map shrinks by (kernel_size − 1) pixels at every layer. In a 20-layer network with 3×3 kernels, that's a loss of 40 pixels of width. A 224-pixel-wide image would be down to 184 pixels after 20 layers — and you'd lose all the border information progressively. Same padding lets you stack dozens of conv layers without any shrinkage, then downsample intentionally with stride or pooling when you decide to.
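Here's a quick sketch of that erosion: push a tensor through ten unpadded 3×3 convolutions and watch the spatial size shrink by 2 pixels per layer. The channel count of 8 is arbitrary.
x = torch.randn(1, 8, 224, 224)
for i in range(10):
    x = nn.Conv2d(8, 8, kernel_size=3, padding=0)(x)   # valid padding: lose 1 pixel per side
    print(x.shape[-1])
# 222, 220, 218, ..., 204; twenty such layers would leave only 184 pixels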
# Valid padding: output shrinks
conv_valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)
x = torch.randn(1, 1, 32, 32)
print(conv_valid(x).shape) # torch.Size([1, 1, 30, 30]) — lost 2 pixels
# Same padding: output size preserved
conv_same = nn.Conv2d(1, 1, kernel_size=3, padding=1)
print(conv_same(x).shape) # torch.Size([1, 1, 32, 32])
The Output Size Formula
I'll be honest — the first time I saw the output size formula, I memorized it without understanding it. It felt like a piece of trivia. But once you start designing architectures or debugging shape mismatches, you realize this one formula is doing a lot of heavy lifting.
output_size = (W - K + 2P) / S + 1
# W = input width (or height)
# K = kernel size
# P = padding
# S = stride
Let's break it apart instead of memorizing blindly. Start with the input width W. The kernel needs K pixels of room, so the first kernel position starts at pixel 0 and its rightmost element lands at pixel K−1. That means the kernel can start at positions 0, 1, 2, ..., up to W−K (after which it'd fall off the edge). That's W−K+1 possible positions. Padding adds P extra pixels on each side, effectively making the input W+2P wide, so without stride there are W+2P−K+1 positions. Stride S means we only visit every S-th of those starting positions: divide the W−K+2P span by S, take the floor, and add 1 for the starting position itself. That's (W − K + 2P) / S + 1.
# Let's verify with our running example
# Security camera image: W=224, 3×3 kernel, padding=1, stride=1
# (224 - 3 + 2×1) / 1 + 1 = 224 ← same padding preserves size
# First downsample: W=224, 3×3 kernel, padding=1, stride=2
# (224 - 3 + 2) / 2 + 1 = floor(111.5) + 1 = 111 + 1 = 112
# ResNet's opening: W=224, 7×7 kernel, padding=3, stride=2
# (224 - 7 + 6) / 2 + 1 = floor(111.5) + 1 = 112 ← exactly what ResNet does
When the division isn't exact, PyTorch takes the floor. This formula also extends to dilated convolutions — we'll get there shortly — where you replace K with the effective kernel size K + (K−1) × (dilation−1).
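If you'd rather not redo the arithmetic by hand, a small helper covers the standard case and the dilation extension. This is a sketch; conv_out_size is just a name I'm using here, and the last line checks it against PyTorch's actual output shape.
def conv_out_size(w, k, p=0, s=1, d=1):
    # floor((W + 2P - effective_K) / S) + 1, where effective_K = K + (K-1)(d-1)
    effective_k = k + (k - 1) * (d - 1)
    return (w + 2 * p - effective_k) // s + 1
print(conv_out_size(224, k=3, p=1, s=1))       # 224 ← same padding preserves size
print(conv_out_size(224, k=3, p=1, s=2))       # 112 ← first downsample
print(conv_out_size(224, k=7, p=3, s=2))       # 112 ← ResNet's opening layer
print(conv_out_size(32, k=3, p=2, s=1, d=2))   # 32  ← dilated conv with matched padding
layer = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
print(layer(torch.randn(1, 3, 224, 224)).shape[-1])   # 112 ← PyTorch agrees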
Dilation: Seeing Wider Without More Weights
A standard 3×3 kernel looks at a 3×3 patch. What if you need a wider field of view but can't afford a bigger kernel? Dilated convolution (also called atrous convolution, from the French "à trous" meaning "with holes") spaces out the kernel elements with gaps. A 3×3 kernel with dilation=2 doesn't look at 9 adjacent pixels — it skips every other pixel, so its 9 weights are spread across a 5×5 area. Dilation=3 spreads them across a 7×7 area. Same 9 parameters, much bigger field of view.
# Standard 3×3: covers 3×3 area, 9 weights
conv_d1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
# Dilation=2: covers 5×5 area, still 9 weights
conv_d2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
# Dilation=4: covers 9×9 area, still 9 weights
conv_d4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
x = torch.randn(1, 64, 32, 32)
print(conv_d1(x).shape) # [1, 64, 32, 32]
print(conv_d2(x).shape) # [1, 64, 32, 32]
print(conv_d4(x).shape) # [1, 64, 32, 32]
# Same output sizes, but each "sees" a different area of the input
DeepLab, one of the most successful semantic segmentation architectures, relies heavily on dilated convolutions to maintain spatial resolution while capturing large-scale context. WaveNet, an audio generation model, uses exponentially increasing dilation (1, 2, 4, 8, 16, ...) to capture long-range temporal dependencies without an explosion in parameters. I still occasionally get the dilation padding calculation wrong on the first try — the effective kernel size is K + (K−1) × (dilation−1), and the "same" padding has to match half of that. It's worth double-checking with a quick shape test.
Because dilated convolutions skip intermediate positions, they can miss fine-grained patterns in the gaps. Stack several layers with the same dilation rate and you get gridding artifacts — visible grid-like patterns in the output where certain pixels never directly contributed to the computation. The fix is to mix different dilation rates (1, 2, 4) in successive layers so the gaps get filled in. DeepLab v3 uses this mixed-dilation strategy.
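As a small sketch of that fix, here's a stack with dilation rates 1, 2, 4, each layer's padding matched to its effective kernel size so the spatial size never changes. The module is illustrative, not an actual DeepLab block.
mixed_dilation = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1),   # effective 3×3
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),   # effective 5×5
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4),   # effective 9×9
    nn.ReLU(inplace=True),
)
x = torch.randn(1, 64, 32, 32)
print(mixed_dilation(x).shape)   # [1, 64, 32, 32] ← size preserved, gaps covered by the varying rates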
1×1 Convolutions — The Surprisingly Powerful Pixel-Wise Mixer
A 1×1 convolution sounds like it shouldn't do anything useful. It looks at one pixel at a time — no spatial context, no neighboring pixels. But remember that "one pixel" in a feature map is really a vector of values across all channels. If the input has 256 channels, a 1×1 convolution takes those 256 values at each spatial position and computes a weighted sum (plus bias, plus nonlinearity), producing a new channel value. It's a per-pixel fully-connected layer across the channel dimension. Think of it as a tiny MLP applied independently at every spatial location.
The "Network in Network" paper (Lin et al., 2013) introduced this idea, and it turned out to be enormously useful. GoogLeNet uses 1×1 convolutions as bottleneck layers: before an expensive 3×3 or 5×5 convolution, a 1×1 conv reduces the channel count from, say, 256 to 64. The subsequent 3×3 conv operates on 64 channels instead of 256, slashing computation by 4×. ResNet's bottleneck block follows the same pattern: 1×1 to reduce channels → 3×3 to process spatially → 1×1 to expand channels back.
# 1×1 conv as a channel reducer (bottleneck)
reduce = nn.Conv2d(256, 64, kernel_size=1) # 256 channels → 64 channels
expand = nn.Conv2d(64, 256, kernel_size=1) # 64 channels → 256 channels
x = torch.randn(1, 256, 56, 56)
print(reduce(x).shape) # [1, 64, 56, 56] — spatial unchanged, channels reduced
print(expand(reduce(x)).shape) # [1, 256, 56, 56] — restored
# ResNet bottleneck pattern: reduce → spatial conv → expand
bottleneck = nn.Sequential(
nn.Conv2d(256, 64, kernel_size=1, bias=False), # reduce: 256 → 64
nn.BatchNorm2d(64),
nn.ReLU(inplace=True),
nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False), # spatial: 3×3 on 64 channels
nn.BatchNorm2d(64),
nn.ReLU(inplace=True),
nn.Conv2d(64, 256, kernel_size=1, bias=False), # expand: 64 → 256
nn.BatchNorm2d(256),
)
print(sum(p.numel() for p in bottleneck.parameters())) # ~70K vs ~590K for a direct 3×3 on 256 channels
For our security camera system, the 1×1 convolution would be how the network takes, say, 256 different feature detectors and mixes their responses at each pixel — deciding that "strong edge response + strong skin-color response + weak texture response" means "likely a person" at this location, without looking at neighboring pixels yet.
Rest Stop
This is a good place to pause. If you need to stop here and come back later, that's completely fine — you've already covered the core of what a CNN does.
Here's the mental model so far. A convolutional layer slides small learnable filters across an image, producing feature maps that show where patterns were detected. These filters share weights across all positions, making the layer parameter-efficient and translation-equivariant. We control the output size with stride (how far the kernel jumps), padding (whether borders shrink), and dilation (how wide the kernel "sees" without adding parameters). A 1×1 convolution mixes channel information without spatial context, acting as a cheap bottleneck. Everything else we'll cover builds on this foundation.
When you're ready, we'll look at how to throw away information strategically (pooling), why stacking layers creates a hierarchy of features (and why that's the deep reason CNNs work), and the practical blocks that modern architectures are built from.
Pooling: Throwing Away the Right Information
Convolutions detect patterns. Pooling compresses the result. The idea: after you know "there's a vertical edge somewhere in this 2×2 region," you don't need to remember which of the four pixels it was in. Pooling layers slide a small window across the feature map (typically 2×2 with stride 2) and reduce each window to a single value, halving the width and height.
Max Pooling
Max pooling keeps the maximum value in each window. If our edge detector produced activations [0.1, 0.3, 2.8, 0.5] in a 2×2 region, max pooling outputs 2.8 — "the strongest edge response in this area." The exact sub-pixel location is discarded, which actually buys us a useful property: small shifts in the input (the person in our security camera moves slightly) don't change the max. This is a form of translation invariance, a step beyond the translation equivariance that convolution provides.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Simulate a feature map from our security camera
x = torch.randn(1, 64, 112, 112)
print(pool(x).shape) # [1, 64, 56, 56] — halved spatial dims, channels untouched
# Max pooling has zero learnable parameters
print(sum(p.numel() for p in pool.parameters())) # 0
Average Pooling
Average pooling takes the mean of all values in the window instead of the maximum. Where max pooling asks "is this feature present anywhere in this region?", average pooling asks "how much of this feature is present on average?" It's a smoother, less aggressive operation. In practice, max pooling dominates in the middle layers of networks — the "detect the strongest response" behavior is empirically better for classification. Average pooling's main role appears at the very end of the network, as global average pooling.
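Here's a tiny sketch of the difference on a single 2×2 window, reusing the activations from the max pooling example above.
window = torch.tensor([[[[0.1, 0.3],
                         [2.8, 0.5]]]])   # shape (1, 1, 2, 2): one image, one channel
print(nn.MaxPool2d(2)(window))   # tensor([[[[2.8000]]]]) ← strongest response wins
print(nn.AvgPool2d(2)(window))   # tensor([[[[0.9250]]]]) ← smoothed average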
Strided Convolution as an Alternative
There's a third option: skip pooling entirely and use a stride-2 convolution to downsample. The paper "Striving for Simplicity" (Springenberg et al., 2014) showed that replacing pooling layers with strided convolutions works as well or better. The advantage: strided convolutions have learnable parameters, so the network can learn how to downsample rather than being forced into a fixed max or average. ResNet uses this approach. The trade-off is that pooling is parameter-free and computationally cheap, while strided convs add both weights and compute.
| Method | Learnable | Parameters | Best For |
|---|---|---|---|
| Max pool 2×2 | No | 0 | Feature detection, keeping strongest activations |
| Average pool 2×2 | No | 0 | Smooth aggregation, averaging responses |
| Global average pool | No | 0 | Replacing FC layers, classification heads |
| Stride-2 conv 3×3 | Yes | K²·C_in·C_out | Modern architectures, learnable downsampling |
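A quick sketch of the trade-off in the last two rows: the pooling layer halves the feature map for free, while a stride-2 3×3 conv on 64 channels does the same job with its own learnable weights.
pool_down = nn.MaxPool2d(kernel_size=2, stride=2)
conv_down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 64, 56, 56)
print(pool_down(x).shape, sum(p.numel() for p in pool_down.parameters()))   # [1, 64, 28, 28] 0
print(conv_down(x).shape, sum(p.numel() for p in conv_down.parameters()))   # [1, 64, 28, 28] 36928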
Global Average Pooling — The FC Layer Killer
Global average pooling (GAP) takes the idea of average pooling to its extreme: the window size is the entire feature map. If your last convolutional layer outputs a tensor of shape (batch, 512, 7, 7), GAP averages each 7×7 feature map into a single number, producing (batch, 512). One scalar per channel. That's it.
Why is this such a big deal? Before GAP existed, networks like VGG ended with fully-connected layers. VGG's first FC layer took a 7×7×512 feature volume (25,088 values) and connected it to 4,096 neurons. That single layer had 4,096 × 25,088 = 102 million parameters — over 70% of VGG's entire parameter budget sitting in one layer that's not even doing feature extraction. GAP replaces this with a parameter-free averaging operation, then a single small FC layer for classification: 512 × 1,000 = 512,000 parameters. That's a 200× reduction.
GoogLeNet (2014) popularized this approach, and it's been nearly universal since. GAP also acts as a structural regularizer — fewer parameters means less overfitting. And it enables input-size flexibility: because there are no FC layers expecting a fixed input size, a model with GAP can accept images of any resolution. The convolutional layers produce feature maps of varying spatial size, and GAP collapses them to the same channel vector regardless.
# Global Average Pooling in our security camera classifier
gap = nn.AdaptiveAvgPool2d(1) # output 1×1 regardless of input spatial size
features = torch.randn(8, 512, 7, 7) # batch of 8, 512 channels, 7×7 spatial
out = gap(features) # (8, 512, 1, 1)
out = out.flatten(1) # (8, 512) — ready for the classifier
# Complete classification head
classifier = nn.Sequential(
nn.AdaptiveAvgPool2d(1),
nn.Flatten(),
nn.Dropout(0.2),
nn.Linear(512, 2) # person vs no-person for our camera
)
# Total classifier params: 512 × 2 + 2 = 1,026. That's it.
Feature Hierarchies — The Deep Reason CNNs Work
We've been building up the individual pieces. Now let's talk about what happens when you stack them. This is the single most important concept for understanding why CNNs are so effective, and it connects directly to why transfer learning works.
When you stack convolutional layers, each layer builds on the features detected by the layer below. The first layer, working directly on raw pixels, can only learn simple patterns — edges at various orientations, color blobs, brightness gradients. The second layer takes those edge maps as input and learns combinations: a horizontal edge meeting a vertical edge becomes a corner, a curved sequence of edge responses becomes an arc. By the third and fourth layers, the network has assembled corners and textures into object parts — an eye, a wheel, a window frame. By layers five and deeper, it recognizes entire objects and scenes.
This hierarchy was beautifully visualized by Zeiler and Fergus (2013): Layer 1 produces Gabor-like edge filters. Layers 2–3 produce texture and corner detectors. Layers 4–5 produce neurons that fire for dog faces, bird legs, and bicycle wheels. The remarkable thing is that this emerges entirely from backpropagation and classification labels. Nobody tells the network "learn edges first, then corners, then parts." The learning algorithm discovers this hierarchy on its own.
For our security camera, this means early layers detect the outlines and textures of whatever is in the frame. Middle layers assemble those into body parts — head shapes, arm-like limbs, torso regions. Deep layers combine the parts into a holistic "person" detector. This hierarchical decomposition is the foundation of transfer learning: the early-layer features (edges, textures, corners) are useful for almost any visual task. You can take a ResNet trained on ImageNet's 1,000 categories and fine-tune it for security camera person detection, and the early layers transfer perfectly — you only retrain the deeper, task-specific layers.
Receptive Field — What Each Neuron Can "See"
The receptive field of a neuron is the region of the original input image that can possibly influence that neuron's output value. Understanding this concept is essential for architecture design, and it tripped me up more than once before I internalized it.
A single 3×3 conv layer: each output neuron sees a 3×3 patch of the input. Stack a second 3×3 layer on top: each neuron in layer 2 sees a 3×3 region of layer 1's output, but each of those layer-1 neurons already covered a 3×3 patch of the input. The layer-2 neuron indirectly covers a 5×5 patch of the original input. Add a third 3×3 layer and we're at 7×7. The pattern: each 3×3 layer adds 2 to the receptive field.
# Theoretical receptive field after L layers of K×K conv, stride 1:
# RF = L × (K - 1) + 1
#
# K=3, L=1: RF = 3
# K=3, L=2: RF = 5
# K=3, L=3: RF = 7
# K=3, L=5: RF = 11
# K=3, L=10: RF = 21
# This is why VGG chose to stack 3×3 layers:
# Two 3×3 layers = 5×5 receptive field, 2 × (3×3) = 18 params per channel pair
# One 5×5 layer = 5×5 receptive field, 25 params per channel pair
# Same receptive field, fewer parameters, PLUS an extra nonlinearity between layers
That last point is subtle and important. Two 3×3 layers with a ReLU between them are strictly more expressive than a single 5×5 layer with the same receptive field, because the nonlinearity lets the network represent a wider class of functions. This is the VGG insight that influenced nearly every subsequent architecture.
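Here's that arithmetic in code, a sketch comparing the two options at 64 channels with biases omitted to keep the comparison clean.
two_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),                                     # the extra nonlinearity
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)
one_5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2, bias=False)
print(sum(p.numel() for p in two_3x3.parameters()))   # 73728  (2 × 3 × 3 × 64 × 64)
print(sum(p.numel() for p in one_5x5.parameters()))   # 102400 (5 × 5 × 64 × 64)
# Same 5×5 receptive field, ~28% fewer parameters, plus a ReLU in between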
Stride and pooling multiply the growth rate. A stride-2 layer doubles the subsequent receptive field growth, because each pixel in the downsampled map corresponds to a 2×-wider region of the input. This is why architectures progressively downsample — to ensure that deep layers have receptive fields large enough to "see" entire objects.
Effective vs. Theoretical Receptive Field
Here's where things get humbling. I'm still building my intuition for why the effective receptive field differs so much from the theoretical one. The theoretical receptive field is the full region of input pixels that could influence a neuron. The effective receptive field (Luo et al., 2016) is the region that actually does influence it significantly. They found that the actual influence follows a Gaussian distribution — center pixels have an outsized impact, and border pixels contribute almost nothing. The effective receptive field is often a small fraction of the theoretical one, roughly proportional to its square root.
This has direct practical consequences. If our security camera image contains a person who occupies 200×200 pixels of the frame, the detection layer's effective receptive field needs to cover at least 200×200 pixels. The theoretical receptive field might need to be 400×400 or more to achieve that. If your network can't detect large objects, the first thing to check is whether the receptive field is large enough. If it's not, no amount of data or training time will fix the problem — it's a structural limitation of the architecture.
When a model fails on large objects but works on small ones, calculate the effective receptive field at the detection layer. If it's smaller than the objects you're trying to detect, you need architectural changes: more layers, larger strides, dilated convolutions, or a feature pyramid network. This is a structural problem, not a data problem.
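A small sketch for doing that check. It computes the theoretical receptive field of a stack of layers (kernel size and stride per layer, ignoring dilation); per Luo et al., treat the result as an optimistic upper bound on what the network effectively sees. theoretical_rf and the example stack are illustrative.
def theoretical_rf(layers):
    # layers: list of (kernel_size, stride) pairs, from input to output
    rf, jump = 1, 1          # jump = spacing between adjacent outputs, measured in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf
# A VGG-ish stack: two 3×3 convs, 2×2 pool, two 3×3 convs, 2×2 pool, two 3×3 convs
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
print(theoretical_rf(stack))   # 32 ← nowhere near enough to "see" a 200×200-pixel person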
Depthwise Separable Convolutions — The Efficiency Revolution
Standard convolution is doing two things at once: spatial filtering (detecting patterns in the height/width dimensions) and channel mixing (combining information across channels). Depthwise separable convolution splits these into two sequential operations, and the savings are dramatic.
A standard 3×3 convolution with 256 input channels and 256 output channels has 3 × 3 × 256 × 256 = 589,824 parameters. The depthwise separable version works in two steps. First, the depthwise convolution: apply a separate 3×3 filter to each input channel independently. That's 256 independent filters, each with 9 weights: 3 × 3 × 256 = 2,304 parameters. Each channel gets filtered spatially, but the channels don't talk to each other yet. Second, the pointwise convolution: a 1×1 convolution that mixes the 256 filtered channels into 256 new output channels. That's 1 × 1 × 256 × 256 = 65,536 parameters. Total: 2,304 + 65,536 = 67,840. That's about 8.7× fewer parameters than the standard version.
class DepthwiseSeparableConv(nn.Module):
def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
super().__init__()
# Depthwise: one filter per input channel (groups=in_ch)
self.depthwise = nn.Conv2d(
in_ch, in_ch, kernel_size,
stride=stride, padding=padding, groups=in_ch, bias=False
)
# Pointwise: 1×1 conv mixes channels
self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
def forward(self, x):
return self.pointwise(self.depthwise(x))
# Compare
dw_sep = DepthwiseSeparableConv(256, 256)
standard = nn.Conv2d(256, 256, 3, padding=1)
dw_params = sum(p.numel() for p in dw_sep.parameters())
std_params = sum(p.numel() for p in standard.parameters())
print(f"Depthwise separable: {dw_params:,}") # 67,840
print(f"Standard: {std_params:,}") # 590,080
print(f"Reduction: {std_params / dw_params:.1f}×") # ~8.7×
The general cost ratio is 1/C_out + 1/K². For large C_out and K=3, that's approximately 1/9 — a 9× reduction. MobileNet V1 (Howard et al., 2017) was built entirely on this idea, coming within a point of VGG-16's ImageNet accuracy with roughly 1/30th the parameters and computation. MobileNet V2 improved things with inverted residuals: instead of compressing then expanding, it expands the channels with a 1×1 conv, applies the depthwise conv in the expanded space (where the network has more room to learn), then projects back down with another 1×1. EfficientNet later used depthwise separable convolutions as its core building block, scaling width, depth, and resolution together using a compound scaling method.
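As a quick sanity check on that ratio, reusing dw_params and std_params from the comparison above:
c_out, k = 256, 3
print(f"{1 / c_out + 1 / k**2:.4f}")      # 0.1150 ← predicted cost ratio, about 1/9
print(f"{dw_params / std_params:.4f}")    # 0.1150 ← the measured ratio matches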
For our security camera running on a cheap edge device with limited compute, depthwise separable convolutions are the difference between "runs at 2 frames per second" and "runs at 30 frames per second." The accuracy trade-off is real but often negligible for practical applications.
Transposed Convolutions — Going Back Up
Everything so far has been about going down — extracting features while reducing spatial resolution. But some tasks need to go back up. Semantic segmentation needs per-pixel labels, so you have to map a small feature map back to full image resolution. Autoencoders reconstruct the input. GANs generate images from low-dimensional noise. All of these need upsampling.
A transposed convolution (sometimes misleadingly called a "deconvolution") does this. Conceptually, it inserts zeros between the input elements to spread them out, then applies a regular convolution to fill in the gaps. The output is larger than the input.
# Transposed conv: double the spatial dimensions
up = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 256, 14, 14)
print(up(x).shape) # torch.Size([1, 128, 28, 28])
# Output size for transposed conv:
# out = (in - 1) × stride - 2 × padding + kernel_size
# (14 - 1) × 2 - 2 × 1 + 4 = 26 - 2 + 4 = 28
There's a well-known problem though. When the stride doesn't evenly divide the kernel size, the transposed convolution produces uneven overlap regions — some output pixels get contributions from more kernel positions than others. The result is checkerboard artifacts: visible grid-like patterns in the output. Odena et al. documented this thoroughly and proposed a cleaner alternative: bilinear upsampling followed by a regular convolution. The upsampling step handles the "make it bigger" part with a fixed, smooth interpolation, and the convolution handles the "learn features" part with learnable weights. Most modern architectures use this approach.
# Clean alternative: upsample then convolve
class SmoothUpsample(nn.Module):
def __init__(self, in_ch, out_ch):
super().__init__()
self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
def forward(self, x):
return self.conv(self.up(x))
smooth = SmoothUpsample(256, 128)
x = torch.randn(1, 256, 14, 14)
print(smooth(x).shape) # [1, 128, 28, 28] — no checkerboard
If our security camera system needed to output a pixel-level mask (where exactly is the person in the frame?), we'd need some form of upsampling in the decoder path. The bilinear-then-conv approach would be the safer default.
The Conv-BN-ReLU Block — Putting It All Together
In practice, you almost never see a bare Conv2d layer in a modern architecture. The standard building block is Conv → BatchNorm → ReLU, repeated many times. Let's build one and then assemble a complete feature extractor for our security camera.
import torch
import torch.nn as nn
class ConvBlock(nn.Module):
def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
super().__init__()
# bias=False because BatchNorm has its own bias parameter
self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
stride=stride, padding=padding, bias=False)
self.bn = nn.BatchNorm2d(out_ch)
self.relu = nn.ReLU(inplace=True)
def forward(self, x):
return self.relu(self.bn(self.conv(x)))
The bias=False detail matters. BatchNorm subtracts the mean and divides by the standard deviation, then applies its own learnable shift and scale. Any bias the conv layer added gets subtracted out immediately — it's wasted parameters. Setting bias=False saves one parameter per output channel, and more importantly, it removes a redundant degree of freedom that can slow down optimization.
Now let's wire together a complete security camera classifier. Watch the dimension flow — this is the rhythm of every CNN: spatial dims go down, channel count goes up.
# Complete security camera person detector
model = nn.Sequential(
# Stage 1: 224×224×3 → 224×224×64
ConvBlock(3, 64),
ConvBlock(64, 64),
nn.MaxPool2d(2, 2), # → 112×112×64
# Stage 2: 112×112×64 → 112×112×128
ConvBlock(64, 128),
ConvBlock(128, 128),
nn.MaxPool2d(2, 2), # → 56×56×128
# Stage 3: 56×56×128 → 28×28×256 (stride-2 conv instead of pool)
ConvBlock(128, 256, stride=2),
ConvBlock(256, 256),
# Classification head
nn.AdaptiveAvgPool2d(1), # → 1×1×256
nn.Flatten(), # → 256
nn.Linear(256, 2) # → person / no-person
)
x = torch.randn(2, 3, 224, 224)
print(model(x).shape) # torch.Size([2, 2])
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}") # ~957K — fits on almost any device
224 → 112 → 56 → 28 → 1 spatially, and 3 → 64 → 128 → 256 in channels. Every major classification architecture follows this rhythm: trade spatial resolution for feature richness. Early layers know where things are at high resolution. Deep layers know what things are with many feature channels.
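To watch that rhythm directly, here's a sketch that pushes a tensor through the model above one module at a time and prints the shape after each step.
x = torch.randn(1, 3, 224, 224)
for layer in model:          # nn.Sequential iterates over its modules in order
    x = layer(x)
    print(f"{layer.__class__.__name__:18s} → {tuple(x.shape)}")
# ConvBlock          → (1, 64, 224, 224)
# ConvBlock          → (1, 64, 224, 224)
# MaxPool2d          → (1, 64, 112, 112)
# ...
# ConvBlock          → (1, 256, 28, 28)
# AdaptiveAvgPool2d  → (1, 256, 1, 1)
# Flatten            → (1, 256)
# Linear             → (1, 2)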
Parameter Count: Where the Memory Goes
Being able to compute parameter counts in your head is a genuinely practical skill. It tells you where the memory goes, which layers dominate training time, and whether your model will fit on the deployment device. The formula for one conv layer:
# params = (K × K × C_in + bias) × C_out
# with bias=True: bias = 1
# with bias=False: bias = 0
# Conv2d(3, 64, 7) with bias:
# (7 × 7 × 3 + 1) × 64 = 148 × 64 = 9,472
# Conv2d(64, 128, 3) with bias:
# (3 × 3 × 64 + 1) × 128 = 577 × 128 = 73,856
# Conv2d(512, 512, 3) with bias:
# (3 × 3 × 512 + 1) × 512 = 4,609 × 512 = 2,359,808 ← 2.3M in one layer
Notice how the parameter count scales with the square of the channel count (C_in × C_out). The 512→512 layer has 250× more parameters than the 3→64 layer. This is why VGG-16, with its wide 512-channel layers, has 138M parameters, and it's why depthwise separable convolutions exist — they break the C_in × C_out coupling.
# Verify with PyTorch
layers = [
("Conv2d(3, 64, 7)", nn.Conv2d(3, 64, 7)),
("Conv2d(64, 128, 3)", nn.Conv2d(64, 128, 3)),
("Conv2d(512, 512, 3)", nn.Conv2d(512, 512, 3)),
]
for name, layer in layers:
params = sum(p.numel() for p in layer.parameters())
print(f"{name:25s} → {params:>10,} parameters")
# For our security camera model (~1.15M parameters), the vast majority sits in
# the 128→256 and 256→256 stages. The 3→64 layers are tiny.
Wrapping Up
That was a lot of ground. We started with a 5×5 toy image and a handcrafted edge-detection kernel, walked through every multiplication and sum, and built up to a complete person-detection architecture with barely more than a million parameters. Along the way, we encountered the output size formula, the receptive field calculation, the 9× savings from depthwise separable convolutions, the checkerboard problem with transposed convolutions, and the deep reason CNNs work — the feature hierarchy that emerges from nothing but gradient descent.
I'm genuinely grateful for the researchers who built these ideas piece by piece over decades — LeCun for the original architecture, Zeiler and Fergus for showing us what the layers learn, He et al. for residual connections, Howard et al. for making it all run on phones. If you've followed the whole section, you now have a working understanding of every major building block in a convolutional neural network. The next section covers how these blocks were assembled into the landmark architectures that defined modern computer vision — LeNet, AlexNet, VGG, GoogLeNet, ResNet, and beyond.
I hope this felt less like a lecture and more like sitting down with someone who'd gone through the same confusion and came out the other side with a clearer picture. Because that's what it is.
Resources
CS231n, Stanford's Convolutional Neural Networks for Visual Recognition course — the single best resource I've found for building visual intuition about CNNs. The lecture notes are free and the assignments force you to implement everything from scratch. (cs231n.stanford.edu)
Zeiler & Fergus, "Visualizing and Understanding Convolutional Networks" (2013) — the paper that first showed what each layer of a CNN actually learns. Reading it feels like opening the black box for the first time. The visualizations are worth the paper alone.
Odena et al., "Deconvolution and Checkerboard Artifacts" (2016) — a short, readable distill.pub article explaining why transposed convolutions produce those irritating grid patterns and what to do about it. Changed how I think about upsampling permanently. (distill.pub/2016/deconv-checkerboard)
Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (2017) — the paper that introduced depthwise separable convolutions to the mainstream. Beautifully written, with clear parameter count comparisons that make the efficiency gains visceral.
Luo et al., "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks" (2016) — the paper that showed the effective receptive field is much smaller than the theoretical one. Short and impactful. Will make you reconsider your architecture choices.