Architecture Evolution
The history of vision architectures is a chain reaction: each breakthrough solved a specific frustration that made the previous generation hit a wall. LeNet proved convolutions learn visual features. AlexNet proved they scale with GPUs. VGG proved small kernels beat large ones. GoogLeNet proved you can go wide, not deep. ResNet cracked the depth barrier with skip connections — an idea so good it escaped CNNs entirely and now lives in Transformers, diffusion models, and everything else. DenseNet pushed feature reuse further. EfficientNet showed how to scale all dimensions together. And ConvNeXt proved the CNN isn't dead — it was wearing the wrong clothes. Understanding why each jump happened is what separates someone who memorized a timeline from someone who can design architectures.
I avoided going deep on architecture history for a while. I knew the names — LeNet, AlexNet, ResNet — the way you know capital cities. Enough to answer a trivia question, not enough to explain why each one exists. Every time someone asked "why does ResNet use skip connections?" I'd say something about gradient flow and change the subject. Finally the discomfort of pattern-matching without understanding grew too strong. Here is that dive.
The story we're tracing spans 1998 to 2022 — roughly 25 years. In that time, the field went from a network with 60,000 parameters recognizing handwritten digits to systems with hundreds of millions of parameters that can understand photographs, generate images, and serve as the visual backbone for multimodal AI. Each architecture along the way solved one specific frustration that the previous generation hit.
Before we start, a heads-up. We're going to talk about receptive fields, parameter counts, and gradient dynamics, but you don't need to have those memorized beforehand. We'll build up each concept as we need it, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
The Running Thread: Building a Photo Organizer
To ground each architecture in something concrete, imagine we're building a photo organizer — the kind that looks at your camera roll and sorts pictures into categories: landscapes, pets, food, people. We start in 1998 with a system that can barely recognize handwritten digits, and we'll watch the technology evolve until it can understand scenes with the same architecture that reads text. Every architecture we discuss, we'll ask: how would this improve our photo organizer?
LeNet-5 (1998) — Where the Template Was Born
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner built LeNet-5 for a specific job: reading handwritten digits on bank checks. The input was tiny — 32×32 grayscale images. The architecture was two convolutional layers with 5×5 kernels, two subsampling layers (average pooling, not max — max pooling hadn't become standard yet), and three fully connected layers at the end. About 60,000 parameters total.
What made LeNet-5 genuinely novel was what it wasn't doing. Before this, the standard approach to image recognition was hand-engineering features — a human expert would decide what to look for (edges at specific angles, corner patterns, texture descriptors) and then feed those handcrafted numbers into a classifier. LeNet-5 said: let the network learn its own features, end-to-end, through backpropagation. The convolutional layers would figure out which local patterns matter, the pooling layers would build tolerance to small shifts and distortions, and the fully connected layers would combine everything for the final decision.
The template it established is the one we still recognize today: convolution → pooling → convolution → pooling → flatten → fully connected → output. Extract features at increasing levels of abstraction, reduce spatial dimensions as you go, then classify.
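To make the template concrete, here's a minimal sketch of that skeleton in modern PyTorch. The layer sizes follow the description above; the activation placement is simplified relative to the 1998 paper, which used scaled tanh-style nonlinearities and trainable subsampling rather than plain average pooling.

```python
import torch.nn as nn

# A sketch of the LeNet-5 skeleton: conv → pool → conv → pool → flatten → FC.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32×32 → 28×28
    nn.AvgPool2d(2),                             # 28×28 → 14×14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14×14 → 10×10
    nn.AvgPool2d(2),                             # 10×10 → 5×5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                           # ten digit classes
)
```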
For our photo organizer, LeNet-5 would be useless. It was designed for 32×32 grayscale images of digits. Throw a color photograph of a dog at it and it wouldn't know where to begin. The hardware wasn't there — training even this small network required days of computation. The datasets weren't there — ImageNet wouldn't exist for another decade. And the training tricks we now take for granted (ReLU, dropout, batch normalization) hadn't been invented yet.
For fourteen years after LeNet-5, CNNs largely gathered dust. The machine learning community moved on to support vector machines and hand-engineered feature pipelines. The idea that deep networks could learn visual features from scratch was considered interesting but impractical.
LeNet-5 didn't give us a useful model — it gave us a blueprint. The Conv-Pool-FC skeleton survived for nearly two decades. Every subsequent architecture is an answer to the question: "What if we made this deeper, wider, or fundamentally different?" If you understand LeNet's structure, you understand the template everything else modifies.
AlexNet (2012) — The Day Everything Changed
For fourteen years, the pattern recognition community ran annual competitions on increasingly difficult image datasets. By 2012, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) had become the benchmark: 1.2 million training images across 1,000 categories — dogs, cars, mushrooms, saxophones, everything. The best systems used hand-engineered features like SIFT descriptors and HOG features, fed into SVMs or shallow classifiers. The top-5 error rate was stuck around 26%. Progress was measured in fractions of a percent per year.
Then Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a convolutional neural network. AlexNet achieved a top-5 error rate of 15.3%. Not a fraction-of-a-percent improvement. A ten-percentage-point improvement. In a single year. The machine learning world pivoted overnight.
I'll be honest — when I first read about this result years later, it felt like it must have been obvious in hindsight. It wasn't. The community had largely given up on neural networks for vision. The fact that Hinton's group even submitted a CNN was considered eccentric.
What Actually Made It Work
AlexNet's architecture — five convolutional layers, three fully connected layers, about 60 million parameters — wasn't beautiful. It used oversized 11×11 and 5×5 kernels. The network was physically split across two GTX 580 GPUs with 3GB of memory each, with cross-GPU communication at specific layers. That split was an engineering hack born from GPU memory limitations, not a design insight.
What mattered were four ingredients that happened to come together at the same time.
ReLU activation was arguably the most important. Previous networks used sigmoid or tanh activations, which saturate for large inputs — the gradient gets crushed toward zero, and learning grinds to a halt. This is the vanishing gradient problem, and it's the reason deep networks had failed for so long. ReLU — max(0, x) — doesn't saturate for positive values. The gradient is either 0 or 1, and training was roughly 6× faster compared to tanh. Without ReLU, a network this deep would have its gradients die before reaching the early layers.
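You can see the saturation problem in a two-line experiment. This is just a toy gradient check, not anything from the AlexNet paper:

```python
import torch

# At a large input, tanh saturates but ReLU does not.
x = torch.tensor([5.0], requires_grad=True)
torch.tanh(x).backward()
print(x.grad)   # tensor([0.0002]): almost no learning signal survives

x.grad = None
torch.relu(x).backward()
print(x.grad)   # tensor([1.]): the gradient passes straight through
```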
GPU training turned weeks of computation into days. Convolutions are massively parallelizable — each filter application is independent — and GPUs are built for exactly that kind of workload. AlexNet was the first major demonstration that GPUs weren't entertainment hardware; they were the engine deep learning had been waiting for.
Dropout randomly zeroed out neurons during training (with probability 0.5 in the fully connected layers). This was a brand-new technique at the time, and it prevented a network with 60 million parameters from memorizing the training set. Without dropout, the gap between training and test accuracy was enormous.
Data augmentation — random crops, horizontal flips, PCA-based color jittering — artificially expanded the effective training set. These tricks seem routine now, but they were new in 2012, and they made a measurable difference in generalization.
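As a rough sketch of what that pipeline looks like with today's tooling (torchvision's ColorJitter stands in for the paper's PCA-based color perturbation, RandomResizedCrop for its fixed-size random crops, and the 0.4 strengths are illustrative, not from the paper):

```python
from torchvision import transforms

# AlexNet-style augmentation, approximated with modern torchvision transforms.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random crops
    transforms.RandomHorizontalFlip(),       # horizontal flips
    transforms.ColorJitter(0.4, 0.4, 0.4),   # stand-in for PCA color jittering
    transforms.ToTensor(),
])
```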
For our photo organizer, AlexNet would be the first version that actually works on real photos. Not well — the 11×11 kernels are wasteful, and 1,000 categories is a coarse level of understanding — but it would correctly sort dogs from cats more often than not. The real impact was the signal it sent: deep learning on vision was not a dead end. It was the beginning.
AlexNet's lasting contribution wasn't any specific layer design. It was proof of concept: CNNs + GPUs + large datasets = trajectory toward superhuman performance. Every paper, startup, and research lab in deep learning since 2012 traces back to this demonstration. The architecture itself is a historical artifact. The idea it proved is permanent.
VGGNet (2014) — The Discipline of Small Kernels
After AlexNet, the immediate question was: how do we do this better? A group at Oxford (the Visual Geometry Group, hence "VGG") asked the simplest possible version of that question: what if we take everything that AlexNet did, throw away the complicated parts, and stack small convolutions as deep as we can go?
Karen Simonyan and Andrew Zisserman's key insight was elegant. AlexNet had used 11×11 and 5×5 kernels — large filters that look at big chunks of the image at once. VGG replaced all of these with 3×3 convolutions, the smallest kernel that still captures spatial patterns in all directions.
Here's why that's not a downgrade. Two 3×3 convolutions stacked back-to-back give you the same effective receptive field as a single 5×5 convolution — both let a neuron "see" a 5×5 region of the input. But the stacked version uses fewer parameters: 2 × (3 × 3) = 18 weights per channel pair, versus 5 × 5 = 25 for the single large kernel. You save parameters and you get an extra ReLU activation sandwiched in between, making the learned function more expressive. Stack three 3×3 layers and you match a 7×7 receptive field with even bigger savings.
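You can verify the arithmetic directly. A toy comparison with one input and one output channel:

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

stacked = nn.Sequential(nn.Conv2d(1, 1, 3, bias=False), nn.ReLU(),
                        nn.Conv2d(1, 1, 3, bias=False))
single = nn.Conv2d(1, 1, 5, bias=False)

print(count(stacked), count(single))  # 18 vs 25, same 5×5 receptive field
```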
Think of it like climbing stairs versus taking a single leap. The leap covers the same distance, but the stairs give you more control at each step, and each step is less effort. VGG's small kernels are the stairs — they reach the same view of the input through a series of smaller, more controlled transformations.
VGG-16 (16 weight layers) and VGG-19 (19 layers) achieved excellent results on ImageNet. The architecture was beautifully simple — you could describe the entire thing in three lines: 3×3 conv with ReLU, double channels at each pooling stage, end with three fully connected layers. But VGG was enormous: 138 million parameters, the vast majority crammed into those final FC layers. It was slow to train, slow to run, and memory-hungry.
For our photo organizer, VGG-16 would be a solid feature extractor — people still use VGG features for style transfer and perceptual loss to this day. But you wouldn't deploy it on a phone. The lesson from VGG: deeper is better, small kernels beat large ones, but brute-force depth has a ceiling. We need smarter ways to go deep.
GoogLeNet / Inception (2014) — Going Wide Instead of Deep
The same year VGG went deep and simple, Google went wide and clever. Christian Szegedy's team asked a different question: instead of choosing a single filter size for each layer, what if we use all of them and let the network decide which one matters?
The Inception module runs multiple convolution branches in parallel — a 1×1 conv, a 3×3 conv, a 5×5 conv, and a 3×3 max pooling — then concatenates all their outputs along the channel dimension. The network effectively gets to look at each spatial location through multiple "lenses" simultaneously, capturing features at different scales within a single layer.
The catch is obvious: running 1×1, 3×3, and 5×5 convolutions in parallel is expensive. A 5×5 conv on 256 input channels produces an enormous number of multiplications. The solution was a trick that became one of the most important patterns in all of deep learning: the 1×1 convolution as a bottleneck.
A 1×1 convolution looks strange at first — a kernel that's one pixel wide can't capture any spatial pattern. What it does is mix channels. Think of it as a linear transformation applied independently at every spatial position, producing a new set of channels that's a compressed combination of the old ones. By placing a 1×1 conv before the expensive 3×3 and 5×5 operations — reducing channels from, say, 256 down to 64 — you cut the subsequent computation by 4×. The 3×3 conv operates on 64 channels instead of 256, does its spatial work, and the result gets concatenated with the other branches.
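Here's that accounting as a toy parameter count (not the actual Inception module; channel sizes follow the example above):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# A 3×3 conv reading all 256 channels, vs the same conv after a 1×1 reduction.
full = nn.Conv2d(256, 64, 3, padding=1, bias=False)
reduced = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),            # bottleneck: 256 → 64 channels
    nn.Conv2d(64, 64, 3, padding=1, bias=False),  # spatial work on compressed channels
)
print(count(full), count(reduced))  # 147456 vs 53248: cheaper even after paying for the 1×1
```

The 3×3 itself shrinks by exactly 4× (36,864 vs 147,456 weights); the 1×1 that enables it costs a fraction of what it saves.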
This bottleneck pattern shows up everywhere in modern architectures. Every time you see a 1×1 convolution used to reduce or expand channels, you're seeing GoogLeNet's legacy.
GoogLeNet (Inception v1) had only 6.8 million parameters — 20× fewer than VGG — and won the 2014 ImageNet challenge. It also used auxiliary classifiers at intermediate layers to help gradients flow during training, though later work showed these weren't strictly necessary. The Inception family continued evolving through v2, v3, and v4, each refining the module design.
For our photo organizer, GoogLeNet would give us much better accuracy per computation than VGG. But the multi-branch design was complex to implement and hard to modify. What the field really needed was a simpler way to go deep — much deeper.
ResNet (2015) — The Most Important Architecture Idea in Deep Learning
If you remember one architecture from this entire chapter, make it ResNet. The idea introduced here — the skip connection — escaped CNNs entirely and now lives in Transformers, diffusion models, U-Nets, and essentially every deep architecture designed after 2015. It's arguably the single most impactful architectural idea in the history of deep learning.
The Degradation Problem
By 2015, everyone agreed that deeper networks should be more powerful. More layers means more capacity to learn complex functions. But when researchers tried to push past about 20 layers, something bizarre happened: the network got worse.
And not in the way you'd expect. When a model overfits, training accuracy is high but test accuracy is low — the model memorized the training data but can't generalize. This was different. Training accuracy itself decreased with more layers. A 56-layer network had higher training error than a 20-layer network. Adding more capacity was actively hurting the model's ability to learn anything at all.
Sit with that for a moment, because it's genuinely strange. A 56-layer network contains a 20-layer network as a subset. In the worst case, the extra 36 layers could learn to do nothing — pass the input straight through unchanged, an identity mapping. The 56-layer network should be at least as good as the 20-layer one. So why was it performing worse?
Because learning the identity function turns out to be hard. When you initialize a stack of convolutional layers with random weights and train them with SGD, the optimization landscape makes it easier to learn some transformation than to learn no transformation. Pushing a stack of nonlinear layers toward the precise weight configuration that passes the input through unchanged requires the optimizer to navigate to a very specific point in parameter space. It's like asking someone to balance a pencil on its tip — technically possible, but the natural dynamics push you away from that equilibrium. The deeper network couldn't figure out that "do nothing" was the right answer for its extra layers.
The Residual Reframing
Kaiming He's insight was a change of perspective. Instead of asking a block of layers to learn the desired output H(x) directly, ask it to learn just the difference between the desired output and the input: F(x) = H(x) − x. Then the block's output becomes:
```python
output = F(x) + x   # the skip connection
```
That + x is the skip connection — the input bypasses the layers entirely and gets added to whatever the layers produce. If the optimal transformation for those layers is close to the identity (which it often is in deep networks), then F(x) needs to be close to zero. And pushing weights toward zero is the easy part — weight decay does it by default, and random initialization already starts near zero.
Think back to our pencil-balancing analogy. Without the skip connection, each block is trying to balance the pencil — learn the exact right transformation from scratch. With the skip connection, the pencil is already standing. The block only needs to learn small nudges. If the nudge should be zero, the weights stay near zero. No balancing act required.
This reframing changes the optimization problem fundamentally. Each block learns modifications to an already-reasonable signal rather than constructing transformations from nothing. If a block has nothing useful to contribute, its weights go to zero, and the input passes through unchanged. No harm done. ResNet-152 — 152 layers deep — trained smoothly and achieved state-of-the-art accuracy. The depth barrier was broken.
A common oversimplification is that skip connections "help gradients flow." They do — the gradient has a direct path backward through the addition operation. But that's a side benefit, not the core insight. The conceptual shift is what matters: learning residuals (how to modify the input) is fundamentally easier than learning full mappings (how to transform the input from scratch). This is why ResNet works — it changes what the network is asked to learn, not just how signals propagate.
Two Flavors of Residual Block
ResNet uses two block designs depending on network depth.
The Basic Block (ResNet-18 and ResNet-34) stacks two 3×3 convolutions, each followed by batch normalization. The input skips past both convolutions and gets added to the output before the final ReLU.
The Bottleneck Block (ResNet-50, 101, 152) uses the 1×1 → 3×3 → 1×1 pattern borrowed from GoogLeNet's bottleneck idea. The first 1×1 conv reduces the channel dimension (say from 256 to 64), the 3×3 conv does the spatial work at the reduced dimension, and the final 1×1 conv expands back to 256. The computational savings are dramatic: instead of a 3×3 conv on 256 channels (256 × 256 × 3 × 3 ≈ 590K parameters), you get three smaller convolutions totaling about 70K parameters. That's roughly an 8× reduction.
```python
import torch.nn as nn


class BasicBlock(nn.Module):
    """Two 3×3 convs with a skip connection (ResNet-18/34)."""
    expansion = 1

    def __init__(self, in_ch, out_ch, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity  # the skip connection
        return self.relu(out)


class BottleneckBlock(nn.Module):
    """1×1 → 3×3 → 1×1 with a skip connection (ResNet-50/101/152)."""
    expansion = 4

    def __init__(self, in_ch, mid_ch, stride=1, downsample=None):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)  # reduce
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)  # expand
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        return self.relu(out)
```
When the skip connection adds F(x) + x, both tensors need the same shape. But when you change spatial resolution (stride > 1) or channel count between stages, the identity shortcut needs adjustment. ResNet handles this with a projection shortcut — a 1×1 convolution with the appropriate stride and output channels. That's the downsample parameter in the code above.
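For example, a stage transition that halves resolution and doubles channels would pair the block with a matching projection. The channel counts here are illustrative:

```python
downsample = nn.Sequential(
    nn.Conv2d(64, 128, 1, stride=2, bias=False),  # match channels and stride
    nn.BatchNorm2d(128),
)
block = BasicBlock(64, 128, stride=2, downsample=downsample)  # F(x) and identity now agree in shape
```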
Pre-Activation: Cleaning Up the Shortcut
The original ResNet (v1) places the batch normalization and ReLU after the convolution, and the skip connection gets added before the final ReLU. This means an activation function sits on the shortcut path — the identity isn't truly identity. He et al. later proposed Pre-Activation ResNet (v2), which reorders to: BN → ReLU → Conv. Now the skip connection is a pure, unobstructed identity mapping. The result: even better gradient flow and networks that trained smoothly at 1,001 layers.
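In code, the v2 reordering is a small change. A sketch against the BasicBlock attributes from above:

```python
import torch

def preact_forward(self, x):
    # v2 ordering: BN → ReLU → Conv, twice, then add.
    out = self.conv1(torch.relu(self.bn1(x)))
    out = self.conv2(torch.relu(self.bn2(out)))
    return out + x  # nothing touches the shortcut after the addition
```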
I'm still developing my intuition for exactly why the ordering matters so much. The theoretical argument is that a "clean" identity path lets gradients propagate without any nonlinear distortion. But the practical improvement from v1 to v2 is modest on standard-depth networks and mostly matters when you push to extreme depths.
ResNet's skip connection escaped the CNN world entirely. U-Net uses skip connections between encoder and decoder for segmentation. Transformers use residual connections around every self-attention and feed-forward sub-layer. Diffusion models build their noise prediction networks on residual blocks. When someone says "residual connection" in any architecture — vision, language, audio, anything — they're using He et al.'s 2015 idea.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a solid mental model of the first era of CNN architecture evolution: LeNet established the template, AlexNet proved it scales, VGG proved small kernels beat large ones, GoogLeNet proved you can go wide with bottlenecks, and ResNet broke the depth barrier with skip connections. That arc — from 60K parameters in 1998 to 60 million in 2015 — covers the core ideas that underpin almost everything in modern computer vision.
If someone asks you in an interview to trace the evolution of CNN architectures, you can tell a coherent story. If someone asks why skip connections exist, you can explain the degradation problem and why learning residuals is easier than learning full mappings. That's the 80% version.
But there's more. After ResNet, the field didn't stop — it branched. Some researchers pushed feature reuse further (DenseNet). Others figured out how to scale networks intelligently (EfficientNet). And eventually, the Transformer people walked in and said, "What if we don't use convolutions at all?" — which prompted the CNN community to fire back with ConvNeXt, proving the architecture wars were a false dichotomy all along.
If the discomfort of not knowing what happened after ResNet is nagging at you, read on.
DenseNet (2017) — What If Every Layer Talks to Every Other Layer?
ResNet showed that connecting a layer's input to its output via a skip connection was powerful. DenseNet asked: what if we take that idea to its logical extreme?
In a DenseNet block, every layer receives the feature maps from all preceding layers, and passes its own features to all subsequent layers. If a block has 6 layers, layer 6 receives the concatenated features from layers 1 through 5, plus the original block input. That's not addition (like ResNet) — it's concatenation along the channel dimension. Every feature ever computed within the block is preserved and accessible.
The growth rate k controls how many new feature maps each layer produces. With k=32, each layer adds 32 channels to the running total. By the end of a 6-layer block starting from k₀ initial channels, the last layer sees k₀ + 32×5 = k₀ + 160 input channels. The growth rate is deliberately kept small — the point is that you don't need each layer to produce hundreds of channels when it has direct access to everything computed before it.
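A single dense layer is easy to sketch. This simplified version omits the 1×1 bottleneck that real DenseNet-121 layers place before the 3×3 conv:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: BN → ReLU → 3×3 conv, then concatenate."""
    def __init__(self, in_ch, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, growth_rate, 3, padding=1, bias=False)

    def forward(self, x):
        new_features = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, new_features], dim=1)  # channels grow by k every layer
```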
Think about the difference from ResNet using our stair-climbing analogy. In ResNet, each step passes a message to the next step: "here's where I am, plus my adjustment." In DenseNet, every step passes its message to every future step. If step 3 learned something useful, step 6 doesn't have to rediscover it — it's right there in the concatenated input. This is extreme feature reuse.
The advantages are real: strong gradient flow (every layer has a direct path to the loss), surprising parameter efficiency (DenseNet-121 matches ResNet accuracy with fewer parameters), and rich feature diversity. The disadvantage is equally real: memory. All those concatenated feature maps live in GPU memory simultaneously. DenseNet is memory-hungry during training in a way that ResNet isn't. Various checkpoint-based memory-efficient implementations exist, but the memory overhead remains a practical limitation that kept DenseNet from becoming the default backbone.
For our photo organizer, DenseNet would be worth considering in a low-parameter regime — when you need decent accuracy but have a tight parameter budget. In practice, though, ResNet's simplicity and DenseNet's memory cost meant ResNet-50 became the default, not DenseNet-121.
MobileNet (2017) — Deep Learning on Your Phone
Every architecture we've discussed so far was designed for beefy GPUs. MobileNet confronted a different frustration entirely: what if the model needs to run on a phone, in real time, with a battery to worry about?
The key innovation is the depthwise separable convolution, and it's worth understanding the math because the savings are dramatic. A standard 3×3 convolution on 256 input channels producing 256 output channels requires 256 × 256 × 3 × 3 ≈ 590,000 multiplications per spatial position. MobileNet splits this into two steps. First, a depthwise convolution: one 3×3 filter per input channel, applied independently (256 × 3 × 3 = 2,304 operations). Then a pointwise convolution: a 1×1 conv that mixes the channels (256 × 256 = 65,536 operations). Total: roughly 68,000 versus 590,000. An 8–9× reduction in computation for comparable representational power.
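In PyTorch, groups=in_channels gives you the depthwise step and a 1×1 conv gives you the pointwise step. A toy count to confirm the numbers above:

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

depthwise_separable = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, groups=256, bias=False),  # depthwise: one filter per channel
    nn.Conv2d(256, 256, 1, bias=False),                         # pointwise: mix the channels
)
print(count(depthwise_separable))  # 67840, vs 589824 for a standard 3×3 conv
```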
MobileNet-v2 refined this further with inverted residual blocks: expand channels with a 1×1 conv (from narrow to wide), apply a depthwise 3×3 conv at the wide dimension, then project back down with another 1×1 conv. The "inverted" part is that the bottleneck is the input/output, and the expansion happens in the middle — the opposite of ResNet's bottleneck block. These designs power the real-time vision in every smartphone camera today.
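A sketch of that narrow → wide → narrow shape, with BatchNorm, ReLU6, and the residual add omitted (the expansion factor of 6 is the v2 default; the channel count is illustrative):

```python
import torch.nn as nn

def inverted_residual(in_ch=64, expand=6):
    mid = in_ch * expand
    return nn.Sequential(
        nn.Conv2d(in_ch, mid, 1, bias=False),                       # expand: narrow → wide
        nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise at the wide dim
        nn.Conv2d(mid, in_ch, 1, bias=False),                       # project: wide → narrow
    )
```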
EfficientNet (2019) — Scaling All Dimensions Together
Before EfficientNet, making a network better meant picking one dimension to scale up: add more layers (depth), add more channels per layer (width), or feed in higher-resolution images. Practitioners would scale whichever dimension felt right, and the results were often disappointing — you'd double the computation and get marginal accuracy gains.
Mingxing Tan and Quoc Le's insight was that these three dimensions aren't independent. If you make a network deeper, it can learn more complex feature hierarchies — but it needs wider layers to store those features, and higher-resolution inputs to provide the fine-grained spatial detail that those features describe. Scale depth alone and the extra layers have nothing new to work with. Scale resolution alone and the network doesn't have the capacity to exploit the extra detail. You need all three to grow together.
Compound Scaling
They formalized this as compound scaling, using a single coefficient φ to scale all three dimensions simultaneously:
```python
# depth:      d = α^φ   (more layers)
# width:      w = β^φ   (more channels)
# resolution: r = γ^φ   (bigger images)
# constraint: α · β² · γ² ≈ 2  (roughly doubles FLOPS per φ step)

# Values found via grid search on the baseline:
α = 1.2   # depth multiplier
β = 1.1   # width multiplier
γ = 1.15  # resolution multiplier

# EfficientNet-B0 (baseline): 224×224 input, 5.3M params, 77.1% top-1
# EfficientNet-B7 (φ=6):      600×600 input, 66M params,  84.3% top-1
```
The constraint α · β² · γ² ≈ 2 means each step of φ roughly doubles the compute budget. Resolution is squared because it scales both the height and width of every feature map; width is squared because a convolution's cost scales with both its input and output channel counts.
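A two-line sanity check of that constraint with the published coefficients:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15
print(alpha * beta**2 * gamma**2)  # ≈ 1.92, close to the target of 2
```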
Starting From a Strong Baseline
Compound scaling alone isn't the whole story. EfficientNet-B0 — the baseline network that gets scaled — was itself found through neural architecture search (NAS), using MobileNet-v2's inverted residual blocks plus squeeze-and-excitation modules for channel attention. The baseline was already better than most hand-designed architectures at comparable compute. Compound scaling amplified that advantage.
EfficientNet-B7 achieved 84.3% top-1 accuracy on ImageNet with 8.4× fewer parameters than the previous state-of-the-art. The message: how you scale matters as much as what you scale.
For our photo organizer, EfficientNet is the practical sweet spot. EfficientNet-B0 to B3 hit an excellent accuracy-per-compute ratio for transfer learning. You pick the variant that fits your latency or memory budget. It remains one of the best defaults when you need a CNN backbone and can't afford to experiment endlessly.
Vision Transformer — ViT (2020) — What If We Don't Use Convolutions at All?
And then the Transformer people showed up.
The Vision Transformer paper (Dosovitskiy et al., 2020) asked a question that would have sounded absurd two years earlier: what if we take a standard Transformer — the same architecture powering BERT and GPT for language — and apply it directly to images? No convolutions. No pooling. No inductive biases about spatial locality. The whole machine.
The Core Idea
Take a 224×224 image. Split it into a grid of 16×16 patches. You get 14 × 14 = 196 patches. Flatten each patch into a vector (16 × 16 × 3 = 768 dimensions for RGB). Project it linearly to get a patch embedding. Add a learnable position embedding so the model knows where each patch came from. Prepend a learnable [CLS] token (borrowed from BERT). Feed the whole sequence — 197 tokens, each 768-dimensional — into a standard Transformer encoder. Read the classification from the [CLS] token at the end.
That's it. An image becomes a sequence of patch tokens, and a Transformer processes the sequence. The same architecture that translates English to French now classifies photographs.
```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A single Conv2d with kernel=stride=patch_size splits + projects in one op
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, 3, 224, 224) → (B, 768, 14, 14) → (B, 768, 196) → (B, 196, 768)
        return self.proj(x).flatten(2).transpose(1, 2)


class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_ch, embed_dim)
        n = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * 4, activation='gelu',
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        tokens = self.patch_embed(x)              # (B, 196, 768)
        cls = self.cls_token.expand(B, -1, -1)    # (B, 1, 768)
        tokens = torch.cat([cls, tokens], dim=1)  # (B, 197, 768)
        tokens = tokens + self.pos_embed          # add position info
        tokens = self.norm(self.encoder(tokens))
        return self.head(tokens[:, 0])            # classify from [CLS]
```
Why Self-Attention Works for Vision
In a CNN, the receptive field starts small — a neuron in layer 1 sees a 3×3 region — and grows slowly as you stack layers. You need many layers before any neuron can "see" the full image. Self-attention flips this entirely. Every patch attends to every other patch from layer 1. A patch in the top-left corner immediately relates to a patch in the bottom-right corner. The receptive field is global from the very first layer.
This is incredibly powerful when long-range relationships matter — recognizing that a steering wheel and tires belong to the same car, even though they're spatially distant in the image. CNNs need many layers to propagate that information; Transformers get it for free.
The cost is quadratic: self-attention is O(n²) in sequence length. For 196 patches, that's manageable. For dense prediction tasks with thousands of tokens, it gets expensive — a limitation that spawned an entire sub-field of efficient attention mechanisms.
The Data Hunger Problem
Here's what makes ViT genuinely interesting from a theoretical perspective. Trained on ImageNet alone (1.2 million images), ViT performed worse than ResNet. CNNs have built-in assumptions about how images work — local connectivity means neighboring pixels are processed together, and weight sharing across spatial positions gives translation equivariance. These inductive biases are like a head start: the network doesn't have to learn that nearby pixels are related, because the architecture already enforces it.
Transformers have almost none of these priors. The architecture treats patches as an unordered set (position embeddings provide some spatial information, but it's learned, not structural). So the network has to learn from scratch what CNNs get for free. This requires vastly more data.
ViT only surpassed CNNs when pre-trained on massive datasets: JFT-300M (300 million images, Google's internal dataset) or ImageNet-21k (14 million images). With enough data, the lack of inductive bias becomes an advantage — the model isn't constrained by assumptions about locality and can learn whatever patterns actually exist.
Facebook's DeiT (Data-efficient Image Transformers, 2021) later showed that the problem wasn't the architecture — it was the training recipe. With heavy augmentation (RandAugment, random erasing), regularization (stochastic depth), and a distillation token that learns from a CNN teacher, ViT could match CNNs using only ImageNet. No JFT-300M required. Transformers need different regularization than CNNs, and once you figure out the right bag of tricks, they train efficiently on standard datasets.
Why ViT Changed the Entire Field
The deepest significance of ViT isn't "Transformers work for vision." It's that the same architecture now works for vision and language. A patch embedding goes in one end, a token embedding goes in the other, and the same Transformer processes both. This unification is what made multimodal AI possible. CLIP, DALL-E, GPT-4V, LLaVA — all of these systems exist because vision and language now share the same architectural backbone. That convergence may be the most consequential development in deep learning since ResNet.
For our photo organizer, ViT (especially via DeiT or a pre-trained variant) would give us the most powerful visual features available — and if we later wanted to add text-based search ("find photos of dogs on beaches"), the same Transformer backbone could handle both modalities.
ViT-B/16 means ViT-Base (12 layers, 768-dim, ~86M params) with 16×16 patches. ViT-L/14 means ViT-Large with 14×14 patches. Smaller patches = more tokens = more compute but finer-grained features. Sizes range from Tiny (Ti) through Small (S), Base (B), Large (L), to Huge (H).
Swin Transformer (2021) — Bringing Hierarchy Back
ViT processes all patches at the same resolution through every layer, which works for classification but creates a mismatch for dense prediction tasks — object detection and segmentation need multi-scale feature maps, not a single-resolution sequence.
Swin Transformer (Liu et al., 2021) introduced shifted window attention. Instead of global self-attention across all patches, each layer computes attention within local windows (typically 7×7 patches). Windows shift by half their size between consecutive layers, which lets information flow across window boundaries without ever computing global attention. This gives you linear complexity in image size instead of quadratic.
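The window mechanics are mostly tensor reshaping. A sketch of the partition step (the shift itself is typically a torch.roll on the feature map between layers; the 56×56 map with 96 channels is an illustrative shape):

```python
import torch

def window_partition(x, window_size=7):
    """Split (B, H, W, C) feature maps into non-overlapping windows of tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Attention then runs within each 49-token window instead of across all tokens.
tokens = window_partition(torch.randn(1, 56, 56, 96))  # → (64, 49, 96)
```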
The architecture also reintroduces the hierarchical structure that CNNs naturally have. Through patch merging layers (analogous to pooling), Swin creates feature maps at 4×, 8×, 16×, and 32× downsampling — exactly the multi-scale feature pyramid that detection heads like FPN expect. It became the default Transformer backbone for object detection (COCO) and segmentation (ADE20K), filling the role that ResNet had held for years.
ConvNeXt (2022) — The CNN Strikes Back
Here's my favorite part of this whole story, because it challenges assumptions the field had made for two years.
By 2022, the narrative was clear: Transformers are better than CNNs for vision. Attention mechanisms capture global relationships that convolutions miss. The future belongs to ViT and Swin. But ConvNeXt (Liu et al., 2022) asked an uncomfortable question: what if the Transformer's success wasn't really about attention?
The authors started with a plain ResNet-50 and systematically applied every design modernization that Transformers had introduced — but kept the architecture purely convolutional. They adjusted stage ratios to match Swin's distribution (3:3:9:3 instead of 3:4:6:3). They replaced the ResNet stem with a 4×4 stride-4 "patchify" layer. They used depthwise convolutions, with the channel mixing handled by pointwise layers. They inverted the bottleneck design. They increased the kernel size to 7×7. They swapped ReLU for GELU. They replaced BatchNorm with LayerNorm. They used fewer normalization and activation layers per block.
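Putting those pieces together, a ConvNeXt-style block looks roughly like this. The sketch omits LayerScale and stochastic depth, which the paper also uses:

```python
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """7×7 depthwise conv → LayerNorm → inverted MLP, with a skip connection."""
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # large-kernel depthwise
        self.norm = nn.LayerNorm(dim)            # LayerNorm instead of BatchNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand 4×
        self.act = nn.GELU()                     # GELU instead of ReLU
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x).permute(0, 2, 3, 1)   # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x.permute(0, 3, 1, 2)  # skip connection, back to channels-first
```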
Each change, individually, was small. Together, they turned a ResNet into something that matched or beat Swin Transformer at every scale — tiny to huge — while remaining a pure CNN. ConvNeXt-T (28M parameters) hit 82.1% top-1 accuracy on ImageNet, compared to Swin-T's 81.3% at similar parameter count.
The field had attributed Transformer superiority to self-attention. ConvNeXt showed that most of the gains came from better training practices and design modernizations that apply equally well to convolutions. The "architecture wars" were, to a significant extent, a false dichotomy.
I'll be honest — I still occasionally get tripped up when someone asks whether to use a Transformer or CNN backbone for a new vision project. The answer, post-ConvNeXt, is genuinely "it depends on your deployment constraints more than any fundamental capability difference." The principles — residual connections, careful scaling, modern normalization, strong augmentation — transfer across architecture families.
The Full Timeline
| Architecture | Year | Key Innovation | Top-1 ImageNet | Params | Practical Niche |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | Conv-Pool-FC template, end-to-end learning | N/A (MNIST) | 60K | Historical blueprint |
| AlexNet | 2012 | ReLU, dropout, GPU training at scale | 63.3% | 60M | Historical — proved deep learning works |
| VGG-16 | 2014 | Stacked 3×3 convolutions, depth through simplicity | 71.5% | 138M | Feature extraction, perceptual loss, style transfer |
| GoogLeNet | 2014 | Inception modules, 1×1 bottleneck convolutions | 69.8% | 6.8M | Parameter-efficient classification |
| ResNet-50 | 2015 | Skip connections, residual learning | 76.1% | 25.6M | Default CNN backbone, universal building block |
| DenseNet-121 | 2017 | Dense connectivity, feature concatenation | 74.4% | 8M | Low-parameter regimes |
| MobileNet-v2 | 2018 | Depthwise separable convs, inverted residuals | 72.0% | 3.4M | Mobile and edge deployment |
| EfficientNet-B0 | 2019 | Compound scaling (depth + width + resolution) | 77.1% | 5.3M | Best accuracy-per-FLOP CNN |
| ViT-B/16 | 2020 | Pure Transformer on image patches | 77.9%* | 86M | Large-data pre-training, multimodal backbone |
| DeiT-B | 2021 | Data-efficient ViT training recipe | 81.8% | 86M | ViT without massive pre-training data |
| Swin-B | 2021 | Shifted window attention, hierarchical features | 83.5% | 88M | Detection, segmentation backbone |
| ConvNeXt-B | 2022 | Modernized CNN matching Transformer accuracy | 83.8% | 89M | CNN simplicity + modern accuracy |
*ViT-B/16 accuracy shown is ImageNet-only training. With JFT-300M pre-training, ViT-L/16 reaches 87.8%.
Wrapping Up
If you're still with me, thank you. I hope it was worth the journey.
We started with a 60,000-parameter network that could read handwritten digits on bank checks, watched the field spend fourteen years in the wilderness, then witnessed AlexNet blast the door open with GPUs and ReLU. VGG showed that small kernels beat large ones. GoogLeNet showed that going wide with bottlenecks is as valid as going deep. ResNet introduced the skip connection — a single idea so powerful it now lives in every deep architecture regardless of domain. DenseNet explored maximal feature reuse. MobileNet brought deep learning to phones. EfficientNet showed how to scale intelligently. ViT proved you don't need convolutions at all, unifying vision and language under one architecture. Swin brought hierarchy back to Transformers. And ConvNeXt proved the CNN was never dead — it was wearing 2015 clothes in a 2022 world.
My hope is that the next time you encounter an unfamiliar architecture — in a paper, a codebase, a system design interview — instead of treating it as a black box of layer counts and acronyms, you'll trace the lineage. You'll see the skip connection and think "ResNet, because identity mappings are hard." You'll see the 1×1 bottleneck and think "GoogLeNet, because you compress before you compute." You'll see patch embeddings and think "ViT, because an image is a sequence of patches." Every modern architecture is a conversation with the ones that came before it, and now you can follow that conversation.
If this chapter did its job, you should now be able to:

- Explain the degradation problem and why skip connections solve it (ResNet's core contribution)
- Trace the full evolution from LeNet to ConvNeXt, explaining the frustration that motivated each jump
- Explain why two 3×3 convs beat one 5×5: same receptive field, fewer parameters, more nonlinearity (VGG's insight)
- Describe 1×1 convolutions as channel mixing / bottleneck operations (GoogLeNet's contribution)
- Explain compound scaling: why depth, width, and resolution must grow together (EfficientNet)
- Articulate why ViT is data-hungry (weak inductive bias) and how DeiT fixed it (training recipe, not architecture)
- Explain why ViT's unification of vision and language architectures enabled multimodal AI
- Describe what ConvNeXt proved: most of the "Transformer advantage" came from training modernizations, not attention
- Given a compute budget and task, recommend an appropriate backbone and justify your choice