Modern Vision

Chapter 9: CNNs & Computer Vision
ViT · Swin · CLIP · DINOv2 · SAM · Foundation Models
TL;DR

The Vision Transformer (ViT) proved that pure self-attention — no convolutions at all — can match or beat CNNs on images, provided you have enough data. Swin Transformer fixed ViT's scalability problem with windowed attention, making transformers practical for detection and segmentation. CLIP bridged vision and language by training image and text encoders in a shared embedding space, enabling zero-shot classification. DINOv2 showed you can learn powerful visual features from images alone through self-distillation. Hybrid architectures like ConvNeXt blurred the CNN-transformer line entirely. SAM made segmentation promptable and universal. The era of training a model per task is ending — modern vision is about foundation models you adapt, not architectures you build from scratch.

The Moment Everything Changed

I avoided looking at Vision Transformers for longer than I'd like to admit. Every time a new ViT variant hit the arXiv feed, I'd skim the abstract and think, "I know CNNs, I'll learn this later." Convolutions were working fine for me. The inductive biases felt right. Locality, translation equivariance — these made physical sense for images. Why would I throw them away and let a model attend to every pixel pair in a 224×224 image?

Then the results started piling up. ViT beating CNNs on ImageNet. Swin Transformer dominating every detection and segmentation benchmark. CLIP doing zero-shot classification that competed with fully supervised models. DINOv2 learning features from raw images that transferred to practically anything. At some point the discomfort of not understanding what was happening under the hood became greater than the comfort of my CNN knowledge. Here is that dive.

Before we start, a heads-up. We'll be talking about self-attention, patch embeddings, contrastive learning, and self-distillation. You don't need to know any of it beforehand. We'll build each idea from the ground up, one piece at a time. If you understand convolutions from earlier in this chapter, you already have the right frame of reference — modern vision is largely a story of what happens when you start removing convolutions and replacing them with attention.

This isn't a short journey, but I hope you'll be glad you came.

Vision Transformer: An Image Is Worth 16×16 Words

For decades, the assumption was: images need convolutions. The local connectivity, the weight sharing, the hierarchical receptive fields — these are how you process spatial data. Dosovitskiy and colleagues at Google Brain asked a heretical question in 2020: what if we didn't use convolutions at all? What if we took the transformer architecture that was crushing NLP benchmarks and applied it directly to images?

The problem is that transformers eat sequences of tokens, and images aren't sequences. An image is a 2D grid of pixels. A 224×224 RGB image has 150,528 values. Feeding each pixel as a token would mean an attention matrix of 50,176 × 50,176 — roughly 2.5 billion entries. That's not going to work.

Patch Embedding — Turning Images Into Sequences

The fix is elegant. Instead of treating each pixel as a token, chop the image into non-overlapping patches and treat each patch as a token. Think of it like cutting a photograph into a grid of small squares and then reading them left-to-right, top-to-bottom, like words in a sentence.

Take a 224×224 image and divide it into 16×16 patches. That gives us a 14×14 grid — 196 patches total. Each patch is 16 × 16 × 3 = 768 values (for RGB). Flatten that into a vector, run it through a linear projection to get a D-dimensional embedding, and now you have 196 tokens, each representing one patch of the image.

  ViT: Image → Sequence of Patch Tokens
  ────────────────────────────────────────

  224×224 image, 16×16 patches:

  ┌────┬────┬────┬────┬─ ─ ─┬────┐
  │ P₁ │ P₂ │ P₃ │ P₄ │     │P₁₄ │   Each patch: 16×16×3 = 768 values
  ├────┼────┼────┼────┤     ├────┤
  │P₁₅ │P₁₆ │P₁₇ │P₁₈ │     │P₂₈ │   → flatten → linear projection
  ├────┼────┼────┼────┤     ├────┤
  │    │    │    │    │     │    │   → 196 tokens of dimension D
  ┆    ┆    ┆    ┆    ┆     ┆    ┆
  ├────┼────┼────┼────┤     ├────┤
  │    │    │    │    │     │P₁₉₆│
  └────┴────┴────┴────┴─ ─ ─┴────┘

  Sequence: [CLS, P₁, P₂, P₃, ... P₁₉₆]  +  positional embeddings
                ↓
         Standard Transformer Encoder
         (multi-head self-attention + MLP)
                ↓
         [CLS] output → classification head

Two more pieces are essential. First, positional embeddings — learned vectors added to each patch token to tell the model where in the image that patch lives. Without them, the transformer would have no idea that patch 1 is top-left and patch 196 is bottom-right; it would treat the image as an unordered bag of patches. ViT uses learned 1D positional embeddings rather than the sinusoidal encodings from NLP transformers, and they work surprisingly well — the model figures out the 2D spatial structure from the data.

Second, a special [CLS] token is prepended to the sequence, borrowed directly from BERT. This token participates in self-attention with all 196 patches, aggregating information from the entire image. After the final transformer layer, the [CLS] token's output gets fed to a classification head. It's the model's way of saying "here's my summary of what this image contains."
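To make this concrete, here's the input pipeline as a minimal PyTorch sketch. The names are mine, and the Conv2d whose kernel size equals its stride is the standard implementation trick for "flatten each patch, then apply a shared linear projection."

# A minimal sketch of ViT's input pipeline; class and variable names are mine
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2         # 196
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # learned [CLS]
        self.pos_embed = nn.Parameter(                           # learned 1D positions
            torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                   # x: (B, 3, 224, 224)
        x = self.proj(x)                    # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # (B, 196, D): one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)      # (B, 197, D)
        return x + self.pos_embed           # ready for the transformer encoder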

Why It Works (When It Has Enough Data)

Here's the critical insight about ViT, and I'll be honest — it took me a while to internalize this. CNNs have strong inductive biases baked in. Locality: a pixel is most related to its neighbors. Translation equivariance: a cat in the top-left should be processed the same way as a cat in the bottom-right. These assumptions give CNNs a head start. They don't need to learn that nearby pixels are correlated — that knowledge is wired into the architecture.

ViT has almost none of these priors. It starts from a blank slate. Every patch can attend to every other patch from layer one. The model has to learn locality, spatial relationships, and feature hierarchies entirely from data. That means on small datasets, ViT gets destroyed by CNNs. The original paper is candid about this: trained on ImageNet-1k alone (1.28 million images), ViT underperforms ResNets.

But give it enough data — ImageNet-21k (14M images) or JFT-300M (300M images) — and something remarkable happens. The lack of inductive bias becomes a strength. The model isn't constrained by architectural assumptions about what relationships matter. It learns to attend to distant patches when that's useful, to focus locally when that's useful, and to build whatever feature hierarchy the data demands. With enough examples, ViT surpasses every CNN on the same benchmarks.

The Quadratic Problem

Self-attention computes an N×N attention matrix, where N is the number of tokens. For ViT with 196 patches, that's a 196×196 matrix — manageable. But if you wanted to process a 1024×1024 image with 16×16 patches, you'd have 4,096 tokens and a 4096×4096 attention matrix. Double the image resolution and you quadruple the token count, which multiplies the attention cost by sixteen. This quadratic scaling is ViT's fundamental limitation, and it's what motivated the Swin Transformer.

Swin Transformer: Making Vision Transformers Practical

ViT proved that transformers can work for vision. But it had two serious shortcomings for real-world use. First, the quadratic attention cost made it impractical for high-resolution images. Second, it produced a single-scale feature map — all tokens at the same resolution. CNNs naturally produce multi-scale features (think of a ResNet with its /4, /8, /16, /32 feature maps), which is essential for detection and segmentation where you need to find both tiny and large objects.

The Swin Transformer (Liu et al., 2021) solved both problems with two ideas that, in hindsight, feel inevitable: windowed attention and hierarchical feature maps.

Windowed Attention — From Global to Local

Instead of every patch attending to every other patch across the entire image, Swin restricts attention to local windows. Divide the feature map into non-overlapping windows of, say, 7×7 patches. Self-attention happens within each window independently. A window of 49 tokens produces a 49×49 attention matrix — the same size regardless of how large the full image is. That transforms the complexity from quadratic in image size to linear.

But now you've lost global context. Patches in one window can't see patches in the neighboring window. The fix is shifted windows. In alternating layers, the window grid is shifted by half a window width. This means patches that were in separate windows in one layer end up sharing a window in the next layer, creating cross-window information flow.

  Swin Transformer: Shifted Window Attention
  ────────────────────────────────────────────

  Layer L: Regular Windows         Layer L+1: Shifted Windows
  ┌─────────┬─────────┐           ┌────┬──────────┬────┐
  │         │         │           │    │          │    │
  │  Win 1  │  Win 2  │           │    │  Win A   │    │
  │         │         │           │    │          │    │
  ├─────────┼─────────┤    →      ├────┼──────────┼────┤
  │         │         │           │    │          │    │
  │  Win 3  │  Win 4  │           │    │  Win B   │    │
  │         │         │           │    │          │    │
  └─────────┴─────────┘           └────┴──────────┴────┘

  Patches at window boundaries in Layer L
  become window-interior in Layer L+1.
  Information flows across the entire image
  over successive layers.

Think of it like this: imagine you're in a building with rooms. In regular windowed attention, people can talk within their room but not to the next room. Shifted windows are like moving the walls between rounds of conversation — suddenly people who were separated end up in the same room, and they carry information from their previous conversations with them.
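Here's a minimal sketch of the mechanics, assuming a (B, H, W, C) feature map whose sides divide evenly by the window size. The real Swin also masks attention across the wrap-around seam that the shift creates; I've left that detail out.

# A minimal sketch of window partitioning and the cyclic shift
import torch

def window_partition(x, ws):
    # (B, H, W, C) → (num_windows·B, ws·ws, C): each window becomes one
    # independent attention sequence
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 56, 56, 96)             # a Stage-1-sized feature map
windows = window_partition(x, ws=7)        # (64, 49, 96): 8×8 windows, 49 tokens each

# Shifted layer: roll the map by half a window so former neighbors across a
# window boundary now share a window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, ws=7)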

Hierarchical Feature Maps

The second innovation is that Swin builds feature maps at multiple scales, like a CNN does. It starts with 4×4 patches (small, high-resolution tokens) and progressively merges patches between stages. Adjacent 2×2 groups of patches get concatenated and projected down. This halves the spatial resolution and doubles the channel depth at each stage, producing a pyramid of features:

  Stage 1:  56 × 56 tokens, C channels       (1/4 resolution)
  Stage 2:  28 × 28 tokens, 2C channels      (1/8 resolution)
  Stage 3:  14 × 14 tokens, 4C channels      (1/16 resolution)
  Stage 4:   7 ×  7 tokens, 8C channels      (1/32 resolution)

This is the same multi-scale structure you'd get from a ResNet. That means Swin can drop directly into existing detection and segmentation frameworks — Feature Pyramid Networks, Mask R-CNN, UPerNet — anywhere CNNs were used as backbones. And it beat the CNNs on every one of those benchmarks.
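The merge step is short enough to show in full. A sketch, with my own naming; the slicing pattern follows the paper's description:

# A minimal sketch of Swin's patch merging on a (B, H, W, C) feature map:
# concatenate each 2×2 neighborhood (4C channels), normalize, project to 2C
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2×2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)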

Swin became the de facto vision backbone for dense prediction tasks. Swin v2 later scaled the design to 3 billion parameters and 1536×1536 resolution, proving the approach could grow.

DeiT: Making ViT Practical Without Massive Data

ViT's dependence on hundreds of millions of training images was a real barrier. Most teams don't have a JFT-300M lying around. Facebook AI Research's DeiT (Data-efficient Image Transformer, Touvron et al., 2021) showed that ViT could be trained effectively on ImageNet-1k alone — if you were clever about it.

The key idea: knowledge distillation. Train a strong CNN teacher (like a RegNet), then train the ViT student to match the teacher's outputs. DeiT adds a special distillation token alongside the [CLS] token. The [CLS] token trains against ground-truth labels in the standard way. The distillation token trains against the teacher's soft predictions — learning not what the answer is, but how the teacher reasons about uncertainty across classes.

The dual supervision — hard labels from the dataset plus soft knowledge from the CNN — gives the ViT student enough signal to learn effectively without needing 300 million images. Strong data augmentation (RandAugment, random erasing, Mixup, CutMix) filled in the rest. DeiT matched ViT-on-JFT performance using only ImageNet-1k, making vision transformers accessible to everyone.

There's something deeply satisfying about this. CNNs, with their strong inductive biases, learn efficiently from limited data. Transformers, with their minimal biases, can eventually surpass CNNs given enough data. DeiT essentially lets the CNN's inductive biases flow into the transformer through distillation — the best of both worlds.
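In code, the dual objective looks roughly like this. I'm showing the soft-distillation form that matches the description above; the paper's strongest configuration actually used hard distillation (training against the teacher's argmax), and the loss weight and temperature here are illustrative.

# A sketch of DeiT's dual objective, soft-distillation variant
import torch
import torch.nn.functional as F

def deit_loss(cls_logits, dist_logits, teacher_logits, labels, T=3.0, lam=0.5):
    # [CLS] head: ordinary cross-entropy against ground-truth labels
    loss_cls = F.cross_entropy(cls_logits, labels)
    # distillation head: match the teacher's temperature-softened distribution
    loss_dist = F.kl_div(
        F.log_softmax(dist_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return (1 - lam) * loss_cls + lam * loss_dist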

The Hybrid Question: CNN or Transformer — Or Both?

Once ViT showed that pure attention could work for vision, and Swin showed it could work at scale, an obvious question emerged: do we actually need to choose between convolutions and attention? Maybe the best architecture uses both.

ConvNeXt — A CNN That Learned From Transformers

ConvNeXt (Liu et al., 2022) is one of my favorite papers in recent years, because it completely reframes the CNN-vs-transformer debate. The authors started with a standard ResNet-50 and systematically modernized it by adopting design principles from Swin Transformer — one change at a time, measuring the impact of each:

Larger kernels (7×7 instead of 3×3). Inverted bottleneck blocks (wide → narrow → wide, like transformers' MLP). LayerNorm instead of BatchNorm. GELU activation instead of ReLU. Fewer activation functions and normalization layers. A "patchify" stem instead of the traditional conv-pool-conv stem.

The result: a pure CNN — no attention whatsoever — that matches Swin Transformer's performance across classification, detection, and segmentation. ConvNeXt's message is profound: the gap between CNNs and transformers was never really about attention versus convolution. It was about training recipes, architectural details, and scaling strategies. Modernize a CNN with the lessons from the transformer era, and the performance gap vanishes.
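A sketch of the resulting block puts all of those changes in one place (I've omitted the paper's layer scale and stochastic depth for brevity):

# A minimal sketch of one ConvNeXt block
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # 7×7 depthwise
        self.norm = nn.LayerNorm(dim)             # LayerNorm, not BatchNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # inverted bottleneck: expand 4×
        self.act = nn.GELU()                      # GELU, not ReLU
        self.pwconv2 = nn.Linear(4 * dim, dim)    # project back down

    def forward(self, x):                         # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x.permute(0, 3, 1, 2)   # back to (B, C, H, W)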

CoAtNet and True Hybrids

CoAtNet (Dai et al., 2021) takes a more explicit hybrid approach: use convolution layers in the early stages where local features matter most, then switch to transformer layers in the later stages where global context becomes important. The intuition is that early layers need to extract edges, textures, and local patterns — convolutions are efficient at this. Later layers need to reason about relationships between distant parts of the image — attention excels here.

This staged approach gets the CNN's data efficiency in the early layers (where the model needs to learn local patterns regardless) and the transformer's flexibility in the later layers (where global reasoning adds the most value). CoAtNet achieved state-of-the-art results on ImageNet with less data than pure ViT needed.

The Real Takeaway

The CNN-versus-transformer debate has largely been resolved, and the answer is: it depends on your constraints. Need the best accuracy and have massive data? Pure transformer. Need data efficiency and fast inference? CNN or hybrid. Need a drop-in backbone for detection or segmentation? Swin or ConvNeXt. The architecture matters less than the training recipe, the data, and the scale.

Rest Stop

If you've made it this far, you've covered a lot of ground. You now have a solid mental model of how transformers invaded computer vision: ViT showed it was possible, Swin made it practical, DeiT made it data-efficient, and ConvNeXt showed that a modernized CNN could keep up. That's a real understanding of the architectural evolution, and it's genuinely useful.

If you want to stop here, you can. You understand the backbones. That alone puts you ahead of most people who use these models as black boxes.

But the backbone is only half the story. The other half is what you build on top of it — or more precisely, how you train it. The models that are reshaping production computer vision right now aren't distinguished by their architecture (they mostly use some variant of ViT). They're distinguished by their training paradigm: contrastive vision-language learning (CLIP), self-supervised distillation (DINOv2), and promptable task-agnostic pretraining (SAM). These are the foundation models, and they're the reason most teams no longer train vision models from scratch.

If that discomfort of not knowing what's underneath is nagging at you, read on.

CLIP: When Vision Learned to Read

For the entire history of supervised computer vision, the workflow was: collect images, label them (expensive), train a model on those labels, and hope it generalizes. If your task changed — from classifying animals to classifying vehicles — you needed new labels and a new training run. Every task was its own silo.

CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) broke that pattern entirely. Instead of training with class labels, CLIP trains with image-caption pairs — 400 million of them, scraped from the internet. An image of a golden retriever on a beach paired with the caption "a golden retriever playing on a sandy beach." No class taxonomy. No label ontology. Just images and their natural language descriptions.

The Training Loop

CLIP has two encoders. An image encoder (a ViT or ResNet) that maps images to vectors. A text encoder (a Transformer) that maps text to vectors of the same dimension. Training is contrastive: take a batch of N image-text pairs, compute the cosine similarity between every image and every text in the batch, and push matching pairs (the diagonal of the N×N matrix) together while pushing non-matching pairs apart.

  CLIP Training
  ─────────────

  Batch of N image-text pairs:

  Image Encoder        Text Encoder
  (ViT-L/14)           (Transformer)
       │                     │
       ▼                     ▼
  [img₁]  [img₂]  [img₃]   [txt₁]  [txt₂]  [txt₃]

  Cosine Similarity Matrix (N × N):

              txt₁    txt₂    txt₃
   img₁  [  0.92    0.11    0.05  ]   ← match
   img₂  [  0.08    0.89    0.13  ]   ← match
   img₃  [  0.04    0.09    0.94  ]   ← match
           ────────────────────────
   Goal: maximize diagonal, minimize off-diagonal

The loss is symmetric cross-entropy applied both row-wise (for each image, which text matches?) and column-wise (for each text, which image matches?). After training, the two encoders live in a shared embedding space: images and text that describe the same thing land near each other.
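The loss fits in a few lines. A minimal sketch, assuming the embeddings are already L2-normalized, with a fixed temperature standing in for CLIP's learned logit scale:

# A minimal sketch of CLIP's symmetric contrastive loss for one batch
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    logits = img_emb @ txt_emb.T / temperature                 # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = matching pairs
    loss_img = F.cross_entropy(logits, targets)                # row-wise: image → which text?
    loss_txt = F.cross_entropy(logits.T, targets)              # column-wise: text → which image?
    return (loss_img + loss_txt) / 2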

Zero-Shot Classification

This is where the paradigm shift happens. Suppose you want to classify images into categories you've never trained on — say, identifying bird species. With traditional supervised learning, you'd need a labeled dataset of birds and a training run. With CLIP, you write text descriptions of your classes: "a photo of a cardinal," "a photo of a blue jay," "a photo of a robin." Encode them with the text encoder. Encode your image with the image encoder. Compute cosine similarities. The highest similarity wins.

# Zero-shot classification with CLIP — no training data needed
import clip, torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cuda")

image = preprocess(Image.open("mystery_bird.jpg")).unsqueeze(0).to("cuda")
texts = clip.tokenize([
    "a photo of a cardinal",
    "a photo of a blue jay",
    "a photo of a robin"
]).to("cuda")

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    # scale by CLIP's learned logit temperature (about 100x) before the
    # softmax; raw cosine similarities would give a nearly uniform result
    similarity = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

# similarity might be [0.03, 0.91, 0.06] → blue jay

CLIP's zero-shot accuracy on ImageNet competes with a fully supervised ResNet-50 trained on 1.28 million labeled ImageNet images. That's a model that has never seen an ImageNet label matching a model trained specifically on ImageNet — and they're in the same ballpark. I'll be honest: when I first read that result, I didn't believe it.

Why CLIP Became the Foundation of Everything

CLIP's influence extends far beyond classification. Its text encoder became the language understanding backbone of Stable Diffusion. Its image encoder (or the improved SigLIP variant from Google, which replaces the softmax loss with a more scalable sigmoid loss) serves as the vision backbone for open multimodal LLMs like LLaVA. Its contrastive framework inspired open-vocabulary object detection (detecting objects described by any text, not a fixed class set) and visual search (finding images by text queries in a vector database). If you understand CLIP, you understand the foundation on which most of modern vision-language AI is built.

Prompt Engineering for Vision

CLIP is sensitive to how you phrase class descriptions. "A photo of a dog" works dramatically better than "dog" alone, because the training data was image-caption pairs, not single-word labels. For better accuracy, average embeddings from multiple prompt templates: "a photo of a dog," "a picture of a dog," "a bright photo of a dog." This prompt ensemble technique is used in most serious CLIP deployments.
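Reusing the model from the earlier snippet, a prompt ensemble looks roughly like this (the template wording is illustrative):

# A sketch of prompt ensembling, reusing model and clip from the snippet above
templates = ["a photo of a {}", "a picture of a {}", "a bright photo of a {}"]
classnames = ["cardinal", "blue jay", "robin"]

with torch.no_grad():
    class_embs = []
    for name in classnames:
        toks = clip.tokenize([t.format(name) for t in templates]).to("cuda")
        emb = model.encode_text(toks)
        emb /= emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))          # average over templates
    txt_emb = torch.stack(class_embs)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)   # re-normalize the averages
# compare against img_emb exactly as before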

DINOv2: Vision Without Words

CLIP needs image-text pairs. 400 million of them. That's a lot of internet scraping and a lot of assumptions about caption quality. What if you could learn equally powerful visual features from images alone — no captions, no labels, nothing but raw pixels?

That's DINOv2 (Oquab et al., Meta, 2023). It learns visual representations through self-distillation — a teacher-student setup where the teacher is a slowly-updating copy of the student itself.

How Self-Distillation Works

Take an image and create multiple augmented views of it — different crops, scales, color distortions. Feed some views to the student network and other (often larger, less-distorted) views to the teacher network. The student's job: produce embeddings that match the teacher's embeddings for the same image, despite seeing different views. The teacher's weights are updated as an exponential moving average (EMA) of the student's weights — it changes slowly, providing a stable target.

  DINOv2: Self-Distillation
  ──────────────────────────

  Same image, different augmented views:

  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │ Global   │  │ Global   │  │  Local   │  │  Local   │
  │ crop 1   │  │ crop 2   │  │  crop 1  │  │  crop 2  │
  │ (large)  │  │ (large)  │  │ (small)  │  │ (small)  │
  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
       │              │              │              │
       ▼              ▼              ▼              ▼
  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
  │ Teacher │   │ Teacher │   │ Student │   │ Student │
  │ (EMA)   │   │ (EMA)   │   │         │   │         │
  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
       │              │              │              │
       └──────┬───────┘              └──────┬───────┘
              │                             │
              ▼                             ▼
        Teacher outputs              Student outputs
              │                             │
              └──── consistency loss ────────┘

  Teacher weights = slow EMA of student weights
  No labels anywhere in this process

The model converges because the teacher's slow updates create a stable learning signal — the student is always chasing a smoothed version of its own improving representations. Over time, the features settle into rich, semantically meaningful embeddings. DINOv2 was trained on a curated dataset of 142 million images (LVD-142M), with no labels at all.
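Stripped to its skeleton, the update looks like this. This is a heavily simplified sketch: real DINOv2 adds output centering, temperature schedules, an iBOT-style masked-token loss, and the KoLeo regularizer on top.

# A heavily simplified sketch of DINO-style self-distillation
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # teacher weights drift slowly toward the student's: a stable target
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def distill_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    # the student's distribution chases the teacher's sharper (lower-temperature) one
    teacher_probs = F.softmax(teacher_logits.detach() / t_teacher, dim=-1)
    student_logp = F.log_softmax(student_logits / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()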

Why DINOv2 Matters

DINOv2's features are remarkably general. Freeze the backbone, add a linear layer on top, and you get strong performance on image classification, semantic segmentation, depth estimation, and instance retrieval — all without fine-tuning the backbone. It's the closest thing we have to a universal vision feature extractor.

Where DINOv2 particularly shines is on dense prediction tasks — segmentation, depth, surface normal estimation. CLIP, trained on image-level alignment with text, produces features tuned for understanding what's in an image. DINOv2, trained on view consistency across spatial crops, preserves fine-grained spatial information that tells you where things are. They're complementary. When you need features for a task and don't have labeled data, CLIP is your pick for anything text-related, and DINOv2 for anything spatial.

SAM: Segment Anything

SAM (Segment Anything Model, Kirillov et al., Meta, 2023) is what happens when you apply the foundation model philosophy to segmentation specifically. The goal: a single model that can segment any object in any image, prompted by a point click, a bounding box, or even a text description.

The architecture has three pieces. A heavyweight ViT image encoder processes the image once, producing a dense feature map. A lightweight prompt encoder converts user prompts (points, boxes, masks) into embeddings. A lightweight mask decoder combines image features with prompt embeddings to predict segmentation masks. The image encoder runs once per image; the prompt encoder and mask decoder are fast enough to run interactively.

  SAM Architecture
  ────────────────

  Input Image ──→ ViT Image Encoder ──→ Image Features (computed once)
                                               │
  User Prompt ──→ Prompt Encoder ──────────────┤
  (point/box/mask)                             │
                                               ▼
                                         Mask Decoder
                                               │
                                               ▼
                                      Segmentation Mask(s)
                                      + confidence scores

  Image encoder: expensive, runs once
  Prompt + decoder: cheap, runs per interaction

SAM was trained on the SA-1B dataset — 11 million images with over 1.1 billion masks, the largest segmentation dataset ever created (also generated semi-automatically using SAM itself in a data engine loop). The result is zero-shot segmentation that works across domains: medical images, satellite photos, microscopy, everyday scenes — without any domain-specific training.
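Here's what that interaction pattern looks like with Meta's segment-anything package. The checkpoint filename is the released ViT-H weights; the click coordinates are illustrative, and image_rgb is assumed to be an (H, W, 3) uint8 RGB array.

# A usage sketch with the segment-anything package
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

predictor.set_image(image_rgb)            # heavyweight ViT encoder runs once here

masks, scores, _ = predictor.predict(     # lightweight decoder runs per prompt
    point_coords=np.array([[500, 375]]),  # one foreground click at (x, y)
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return 3 candidate masks + confidences
)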

SAM represents a philosophical shift. Traditional segmentation models learn to segment specific categories (person, car, tree). SAM learns to segment objects as a concept — anything that looks like a distinct thing in an image. You don't specify what to segment by naming a class; you specify it by pointing at it. That's a fundamentally different interface.

The Foundation Model Paradigm

If you step back from the individual models, a pattern becomes clear. Modern vision has converged on a two-phase paradigm that would have seemed bizarre a decade ago:

Phase 1: Pre-train broadly. Train a massive model on a massive dataset, using a general-purpose objective. CLIP uses contrastive image-text matching. DINOv2 uses self-distillation. SAM uses promptable segmentation. The model learns rich, transferable representations.

Phase 2: Adapt cheaply. Take the pre-trained backbone and adapt it to your specific task. This might mean adding a linear head (linear probing), fine-tuning a few layers, or using the model zero-shot. The key: you never need to train a vision backbone from scratch again. The heavy lifting is done once, shared by the entire community.
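Here's what Phase 2 can look like in practice: a linear probe on frozen DINOv2 features, loaded through torch.hub. The training loop around it is standard and omitted, and the class count is a placeholder.

# A minimal Phase-2 sketch: linear probe on a frozen DINOv2 backbone
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                # the foundation stays frozen

num_classes = 10                           # placeholder for your task
probe = nn.Linear(384, num_classes)        # ViT-S/14 emits 384-dim features

def predict(images):                       # images: (B, 3, 224, 224), sides divisible by 14
    with torch.no_grad():
        feats = backbone(images)           # (B, 384) global image features
    return probe(feats)                    # only the probe receives gradients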

  The Foundation Model Paradigm
  ──────────────────────────────

  Before (2012-2020):
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ Task A   │    │ Task B   │    │ Task C   │
  │ Dataset  │    │ Dataset  │    │ Dataset  │
  │ Model    │    │ Model    │    │ Model    │
  │ Training │    │ Training │    │ Training │
  └──────────┘    └──────────┘    └──────────┘
  Each task: train from scratch or fine-tune ImageNet model

  Now (2021+):
                ┌─────────────────────┐
                │  Foundation Model   │
                │  (CLIP / DINOv2 /   │
                │   SAM / SigLIP)     │
                │  Pre-trained once   │
                └──────┬──────────────┘
                       │
            ┌──────────┼──────────┐
            ▼          ▼          ▼
       ┌─────────┐ ┌─────────┐ ┌─────────┐
       │ Task A  │ │ Task B  │ │ Task C  │
       │ Linear  │ │ Fine-   │ │ Zero-   │
       │ probe   │ │ tune    │ │ shot    │
       └─────────┘ └─────────┘ └─────────┘
  Adapt the foundation — don't rebuild it

I'm still developing my intuition for why this works as well as it does. The going theory is that large-scale pre-training on diverse data forces the model to learn general visual concepts — edges, textures, shapes, objects, scenes, spatial relationships — that are universally useful, rather than task-specific shortcuts. But the honest answer is that no one fully understands why representations learned through self-distillation or contrastive learning transfer so effectively to such wildly different downstream tasks.

When to Reach for What

The practical question for an engineer isn't "which architecture is theoretically best?" — it's "which model do I grab for this specific problem?" Here's how the landscape maps to real decisions:

  Scenario                                    Reach For                               Why
  ──────────────────────────────────────────  ──────────────────────────────────────  ──────────────────────────────────────────────
  Classify images, no labeled data            CLIP zero-shot                          Describe classes in text, compare embeddings
  Classify images, some labeled data          DINOv2 or CLIP backbone + linear probe  Frozen foundation features + thin task head
  Object detection or segmentation backbone   Swin Transformer or ConvNeXt            Multi-scale features, plug into FPN/Mask R-CNN
  Interactive or zero-shot segmentation       SAM                                     Point/box prompts, works across domains
  Visual search by text query                 CLIP/SigLIP embeddings + vector DB      Images and text in the same embedding space
  Dense prediction (depth, normals)           DINOv2 backbone + task head             Best spatial features without labels
  Fast inference on edge/mobile               EfficientViT or MobileNetV3             Designed for low latency and small memory
  Maximum accuracy, large dataset             ViT-Large or ViT-Giant, fine-tuned      Pure transformers scale best with data
  Limited data, need quick results            ConvNeXt or DeiT                        CNN inductive bias or distillation helps

The Interview Lens

Modern vision comes up constantly in senior ML interviews. The questions that separate candidates aren't about memorizing architectures — they're about understanding trade-offs. Here are the threads interviewers pull on:

ViT vs. CNN inductive bias. Why does ViT underperform CNNs on small datasets? Because CNNs have locality and translation equivariance baked in — they don't need to learn that nearby pixels are related. ViT has to learn this from scratch, which costs data. But when data is abundant, those same inductive biases can become constraints that limit what the model can learn.

Attention complexity. ViT's self-attention is O(N²) in the number of patches. Swin's windowed attention is O(N) with respect to image size (O(M²) per window, where M is fixed). This is why Swin can handle high-resolution images where vanilla ViT cannot. Interviewers want to see that you understand this isn't an abstract concern — it directly determines what resolution you can process in production.

Supervised vs. self-supervised vs. contrastive. Traditional ImageNet training is supervised — you need 1.28 million labeled images. CLIP is contrastive — you need image-text pairs, but no class labels. DINOv2 is self-supervised — you need images and nothing else. Each training paradigm produces features with different strengths. Supervised features are optimized for the specific task you labeled. CLIP features understand language-aligned semantics. DINOv2 features preserve spatial structure. Knowing when each is appropriate shows depth.

The foundation model trade-off. Why not fine-tune the whole backbone for every task? Because catastrophic forgetting destroys general features. Linear probing preserves them but limits adaptation. The practical answer is usually somewhere in between: fine-tune the last few layers, or use adapters (small trainable modules inserted into frozen layers), or use LoRA. Understanding this spectrum from linear probing to full fine-tuning — and when each makes sense — is what interviewers are really after.
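To make that last option concrete, here's a minimal sketch of the LoRA idea applied to a single linear layer: freeze the pre-trained weight, learn a low-rank additive update. The rank and scaling are illustrative.

# A minimal sketch of a LoRA-style adapter around one linear layer
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # frozen base output plus the trainable low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale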

What You Should Now Be Able To Do

Checklist
  • Explain how ViT converts images to patch token sequences and processes them with a standard transformer encoder
  • Articulate why ViT underperforms CNNs on small datasets (minimal inductive bias) and outperforms them on massive datasets
  • Describe Swin Transformer's windowed attention and shifted windows — and why they make attention complexity linear in image size
  • Explain how DeiT uses knowledge distillation from a CNN teacher to make ViT data-efficient
  • Contrast ConvNeXt (modernized CNN matching transformers) with CoAtNet (explicit CNN-transformer hybrid)
  • Walk through CLIP's contrastive training: image encoder + text encoder + N×N similarity matrix + symmetric loss
  • Perform zero-shot classification with CLIP and explain why prompt engineering matters
  • Describe DINOv2's self-distillation: teacher-student, EMA updates, multiple augmented views, no labels
  • Explain SAM's architecture: heavyweight image encoder + lightweight prompt encoder + mask decoder
  • Distinguish when to use CLIP (text-aligned tasks), DINOv2 (spatial/dense tasks), Swin (detection backbone), and SAM (interactive segmentation)
  • Articulate the foundation model paradigm: pre-train broadly, adapt cheaply