Detection & Segmentation

Chapter 9: CNNs & Computer Vision
Object Detection · Semantic Segmentation · Instance Segmentation

I avoided the detection and segmentation literature for longer than I should have. Every time a colleague mentioned "anchor boxes" or "RoI Align," I'd nod along and quietly hope no one asked follow-up questions. I could use a detection API, sure. I could call model.predict() and get back bounding boxes. But I had no idea what was actually happening underneath — why some models were fast and others accurate, why "NMS" existed, or what made DETR feel like magic. Finally the discomfort of not knowing grew too great. Here is that dive.

Object detection — finding what objects are in an image and where they are — has been one of the central problems in computer vision since the early 2010s. Segmentation takes it further: instead of drawing rough rectangles, it classifies every single pixel. Together, they're the backbone of autonomous driving, medical imaging, robotics, and any system that needs to actually understand a visual scene, not just label it.

Before we start, a heads-up. We're going to cover a lot of ground — from the earliest region-based detectors through transformer architectures and foundation models. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

Why detection is harder than classification
The shared vocabulary: bounding boxes, IoU, anchors, NMS
The R-CNN family: from 47-second inference to end-to-end training
YOLO and one-stage detection: speed as a design philosophy
Focal loss: the idea that closed the accuracy gap
DETR: when transformers replaced hand-crafted pipelines
The three flavors of segmentation
U-Net: the architecture that conquered medical imaging
DeepLab: seeing context without losing detail
Mask R-CNN: detection meets pixel-perfect masks
SAM: segment anything, zero-shot
Loss functions and evaluation metrics that actually matter

Object Detection

Think about the difference between these two questions. "Is there a cat in this photo?" That's classification — one image, one label. Now consider: "Where are all the cats, dogs, and people in this photo, and how big is each one?" That's detection. And the hard part isn't the "what" — it's the "how many."

A classifier always produces exactly one output. A detector might produce zero predictions, or fifty. This variable-length output is what makes detection architecturally fascinating. You can't design a fixed-size output layer when you don't know how many objects you'll find. Every major detection architecture is, at its core, an answer to this question: how do we handle a variable, unknown number of outputs?

Let's make this concrete with a running example. Imagine we're building a security camera system for a small warehouse. The camera sees a loading dock — forklifts move in and out, workers walk around, pallets get stacked. Our job: detect and locate every person, forklift, and pallet in each frame. Some frames have two objects. Some have twenty. The system needs to handle both.

The Shared Vocabulary

Every detection model, from the oldest R-CNN to the newest DETR variant, shares four fundamental concepts. Think of these as the tools every builder in this field carries around.

Bounding Boxes

A bounding box is an axis-aligned rectangle that says "the object is somewhere inside here." It's a rough approximation — a person standing at an angle gets a rectangle that includes a lot of empty background — but it's surprisingly useful and cheap to predict.

You'll encounter two main formats, and getting them confused is one of the most common detection bugs:

# Center format: where is the center, how big is it?
(cx, cy, w, h)    # YOLO uses this

# Corner format: where are the top-left and bottom-right corners?
(x1, y1, x2, y2)  # Pascal VOC uses this

# COCO uses a hybrid: top-left corner + size
(x, y, w, h)      # x,y is top-left, NOT center

Converting between formats is trivial arithmetic, but mixing them up silently produces boxes that are shifted or scaled wrong. I've seen production bugs from exactly this. Always verify which format your data and model expect.
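To make that arithmetic concrete, here is a minimal sketch of the three conversions. The function names are mine, and a real pipeline would typically operate on whole tensors of boxes at once rather than one box at a time.

# Minimal conversion helpers (plain Python floats; vectorize for real pipelines)

def cxcywh_to_xyxy(cx, cy, w, h):
    # center format -> corner format
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def xyxy_to_cxcywh(x1, y1, x2, y2):
    # corner format -> center format
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

def coco_to_xyxy(x, y, w, h):
    # COCO top-left + size -> corner format
    return x, y, x + w, y + h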

IoU — Intersection over Union

IoU measures how well a predicted box matches a ground-truth box. Compute the area where both boxes overlap, divide by the total area they cover together. That's it.

IoU = Area of Overlap / Area of Union

Ranges from 0 (no overlap) to 1 (perfect match)

In our warehouse example, suppose the model predicts a box around a forklift, but the box is a bit too big on the right side. If the predicted box and the real forklift box share 70% of their combined area, the IoU is 0.7. The standard threshold for "that's a correct detection" is IoU > 0.5, but COCO's more rigorous evaluation averages across thresholds from 0.5 to 0.95 in steps of 0.05. IoU is beautifully scale-invariant — a tiny box on a tiny object gets the same IoU as a huge box on a huge object, as long as the relative overlap matches.
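As code, the definition is a few lines. A minimal sketch for corner-format boxes, written in plain Python for readability (real pipelines vectorize this across thousands of boxes):

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) in corner format
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle; width/height clamp to zero if the boxes don't overlap
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0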

One subtlety worth knowing: when two boxes don't overlap at all but are close, IoU is zero — it can't tell "close miss" from "completely wrong." That limitation led to improved variants like GIoU (Generalized IoU), DIoU (Distance IoU), and CIoU (Complete IoU) that also consider the distance and aspect ratio between boxes. These are used as regression losses during training, making box predictions converge faster.

Anchor Boxes

Here's the problem our warehouse detector faces: boxes can be any size and any aspect ratio. A person standing upright is tall and narrow. A pallet lying flat is wide and short. A forklift is roughly square. Predicting raw coordinates from scratch turns out to be surprisingly hard for neural networks — the search space is enormous.

Anchor boxes collapse that search space. They're predefined box templates — a set of shapes and sizes — placed at every spatial location in the feature map. Instead of predicting "there's a box at pixel (142, 87) with size 200×150 from nothing," the network predicts "the box at this grid location is 1.2× wider and 0.8× taller than anchor template #4, and shifted 5 pixels to the right."

Predicting small offsets from a reasonable starting point is drastically easier than predicting absolute coordinates. A typical anchor configuration uses 3 aspect ratios (1:1, 1:2, 2:1) at 3 scales, giving 9 anchors per spatial location. In practice, teams often run k-means clustering on their dataset's ground-truth boxes to find the anchor shapes that best match their actual objects. That's a neat trick — let the data tell you what templates to use.
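To see how little machinery this takes, here is a sketch that generates the nine (width, height) templates for one grid location. The scales are placeholders, not values from any particular paper.

# Nine anchor templates: 3 scales × 3 aspect ratios (illustrative numbers)
scales = [64, 128, 256]            # square root of anchor area, in pixels
aspect_ratios = [0.5, 1.0, 2.0]    # height / width

anchors = []
for s in scales:
    for r in aspect_ratios:
        w = s / (r ** 0.5)         # keeps the area at s*s while changing the shape
        h = s * (r ** 0.5)
        anchors.append((w, h))

print(len(anchors))                # 9 templates, tiled at every feature map location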

Anchors dominated detection from 2015 to about 2022. More recently, anchor-free methods have shown they can match or beat anchor-based ones while being simpler. We'll get to that shift later.

Non-Maximum Suppression (NMS)

Our warehouse detector evaluates thousands of candidate locations per image. Many of them will fire for the same forklift, producing a cluster of overlapping boxes around it. We need to collapse that cluster into a single detection. That's what NMS — Non-Maximum Suppression — does.

The idea is wonderfully straightforward. Sort all predicted boxes by confidence score. Take the most confident one — that's a keeper. Remove every other box that overlaps the keeper by more than some IoU threshold (typically 0.5). Now take the next most confident surviving box, and repeat. When no boxes remain, you're done.
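As a sketch, that loop looks like this, reusing the iou() helper from earlier. Production code uses a vectorized version (torchvision ships one as torchvision.ops.nms), but the logic is identical.

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: list of (x1, y1, x2, y2); scores: parallel list of confidences
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                    # most confident surviving box
        keep.append(best)
        # discard everything that overlaps the keeper too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep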

NMS works well in most scenes, but it has a known weakness: when objects genuinely overlap. Two workers standing close together might share a high IoU between their correct boxes, causing NMS to suppress one valid detection. Soft-NMS helps by decaying confidence scores of overlapping boxes instead of hard-deleting them. And as we'll see, DETR eliminates NMS entirely through a fundamentally different approach.

The Detection Recipe

Generate candidate locations → predict box offsets (from anchors or directly) → classify each box → apply NMS to remove duplicates. Every classical detector follows some version of this pipeline. The differences — and they're substantial — lie in how candidates are generated and when classification happens.

Two-Stage Detectors — The R-CNN Family

The story of modern detection starts with a family of models that solved the problem in two explicit steps: first propose regions that might contain objects, then classify each proposal and refine its box. This two-step approach dominated from 2014 to 2016, and its descendants remain competitive today.

I'll trace the evolution because each step solves a specific, painful bottleneck — and understanding those bottlenecks makes the architecture choices feel inevitable rather than arbitrary.

R-CNN (2014) — The Proof of Concept

R-CNN (Regions with CNN features) was the work that proved deep learning could dominate detection. The idea was brute-force but effective: use a classical algorithm called Selective Search to generate about 2,000 region proposals per image — rectangular crops that "might contain something." Warp each crop to a fixed size. Feed each one through a CNN independently. Classify each with SVMs. Refine each box with regression.

Accurate? Yes — it smashed everything that came before on the Pascal VOC benchmark. Practical? Not remotely. Each image required 2,000 separate CNN forward passes. Inference took 47 seconds per image. Training took 84 hours on a GPU. For our warehouse camera producing 30 frames per second, that's roughly 23 minutes of compute for every second of video. Not ideal.

But the insight was powerful: CNNs extract features so rich that even a simple classifier on top of them crushes hand-engineered detectors. The question became: how do we keep that power while eliminating the absurd redundancy?

Fast R-CNN (2015) — Share the Computation

The waste in R-CNN was obvious. Those 2,000 proposals overlap massively. Many of them share the same image pixels. Why compute CNN features from scratch for each one?

Fast R-CNN fixed this with one crucial change: run the CNN once on the entire image to produce a single shared feature map. Then, for each proposal, extract features from the relevant region of that shared map using RoI Pooling (Region of Interest Pooling). RoI Pooling takes a variable-sized region of the feature map and produces a fixed-size output by dividing the region into a grid and max-pooling within each cell.

This was a massive speedup. The expensive CNN computation happens once, not 2,000 times. Classification and box regression now happen in a single network, trained end-to-end. But one bottleneck remained: Selective Search still ran on CPU and took about 2 seconds per image. The neural network was fast. The proposal generation wasn't.

Faster R-CNN (2015) — Proposals Go Neural

Faster R-CNN replaced the last hand-crafted component. Instead of Selective Search, it introduced the Region Proposal Network (RPN) — a small convolutional network that slides over the shared feature map and, at each spatial position, predicts two things for each anchor: "is there an object here?" (objectness score) and "how should I adjust this anchor to better fit the object?" (box offsets).

# Faster R-CNN pipeline
Image → Backbone CNN → Shared Feature Map
                              ↓
                   Region Proposal Network (RPN)
                     "Is there something here? Adjust this anchor."
                              ↓
                   ~300 highest-scoring proposals
                              ↓
                   RoI Align → Fixed-size features per proposal
                              ↓
                   Classification head: "What is it?"
                   Box regression head: "Where exactly?"

The RPN is lightweight — a few convolutional layers — but it shares the backbone features, so proposals now cost almost nothing. What took 2 seconds with Selective Search takes about 10 milliseconds with the RPN. The entire pipeline — feature extraction, proposal generation, classification, box regression — is one unified network, trained end-to-end with a multi-task loss.

Think about what happened in the span of 18 months. R-CNN: 47 seconds per image, four separate stages, SVMs and external proposals. Faster R-CNN: 200 milliseconds per image, fully differentiable, one network. Same core idea — propose then classify — but the engineering went from "research prototype you can't deploy" to "the foundation of production detection systems."

RoI Align — A Small Fix That Mattered Enormously

I need to mention RoI Align because it's the kind of detail that separates someone who understands detection from someone who memorized architecture names. The original RoI Pooling snapped proposal coordinates to the nearest integer grid cell on the feature map. This quantization — rounding 3.7 to 4, for instance — introduces small spatial misalignments. For classification, this doesn't matter much. For anything pixel-level, it's destructive.

RoI Align (introduced with Mask R-CNN in 2017) replaces rounding with bilinear interpolation. Instead of snapping to grid cells, it samples at the exact floating-point coordinates. A small change in the code, but it improved mask quality dramatically. Every modern two-stage detector uses RoI Align by default.
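torchvision exposes the op directly if you want to experiment with it. A hedged example with made-up shapes: one stride-16 feature map and one proposal given in image coordinates.

import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)                       # backbone output, stride 16
proposal = torch.tensor([[0.0, 120.0, 80.0, 360.0, 240.0]])  # (batch_index, x1, y1, x2, y2)

pooled = roi_align(
    features,
    proposal,
    output_size=(7, 7),    # every proposal becomes a fixed 7×7 feature grid
    spatial_scale=1 / 16,  # maps image coordinates onto the stride-16 feature map
    sampling_ratio=2,      # bilinear samples per output cell
)
print(pooled.shape)        # torch.Size([1, 256, 7, 7])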

Why Two Stages Still Make Sense

The RPN acts as a coarse attention mechanism — it eliminates 99% of the image area that's background, so the second stage only classifies a few hundred promising regions. This divide-and-conquer approach is why two-stage detectors remain the accuracy leaders on benchmarks. The cost is speed.

One-Stage Detectors — The Speed Path

Two-stage detectors are accurate, but there's an obvious question: do we really need two steps? What if we predict class labels and bounding boxes directly from the feature map in a single pass?

That's what one-stage detectors do. They sacrifice some of the careful region-by-region attention for raw speed. And it turned out that with the right training tricks, they could close the accuracy gap too.

YOLO — You Only Look Once

I'll be honest — when I first saw YOLO's approach, I thought it was too simple to work well. YOLO (2016) divides the input image into an S×S grid. Each grid cell is responsible for detecting objects whose centers fall within it. Each cell predicts B bounding boxes (with confidence scores) and C class probabilities. Everything happens in a single forward pass.

# YOLO's output tensor
Input image → CNN backbone → S × S × (B × 5 + C) tensor

# Each grid cell predicts:
#   B boxes: (cx, cy, w, h, confidence) for each
#   C class probabilities shared across all boxes in that cell

# Example: S=7, B=2, C=20 (Pascal VOC)
# Output: 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30

Back to our warehouse. With a 7×7 grid, the image is divided into 49 cells. A forklift whose center falls in cell (3, 4) gets detected by that cell. The cell predicts two boxes with confidence scores, and the class distribution says "98% forklift." That's the entire detection in one shot, no proposals, no second stage.
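A toy decoding of that cell, assuming the layout from the tensor above (two boxes of five numbers, then twenty class probabilities; real implementations differ in details like sigmoid activations):

import torch

output = torch.randn(7, 7, 30)        # stand-in for the network's output tensor
cell = output[3, 4]                   # the cell whose region contains the forklift center

box1, box2, class_probs = cell[:5], cell[5:10], cell[10:]
best = box1 if box1[4] > box2[4] else box2      # keep the more confident of the two boxes
cx, cy, w, h, confidence = best
class_id = int(class_probs.argmax())            # e.g. the index of "forklift"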

YOLOv1 ran at 45 FPS — real-time on a single GPU. But it had clear limitations. Each cell predicts only one set of class probabilities, so it can effectively detect only one object; when two objects' centers fell in the same cell, one got missed. Small objects were particularly problematic because they occupied tiny portions of grid cells. But the core insight — treat detection as a single regression problem — opened a whole new design space.

The YOLO Evolution

The YOLO family evolved dramatically across versions, and the trajectory tells a story about what matters in practical detection.

YOLOv2 (2017) added anchor boxes, so instead of predicting raw coordinates, the network predicted offsets from templates. It added batch normalization everywhere and trained at multiple input resolutions. Accuracy improved substantially.

YOLOv3 (2018) introduced multi-scale predictions — detecting objects at three different feature map resolutions using a Feature Pyramid Network. This was the fix for small objects. The backbone moved to Darknet-53, much deeper than before.

YOLOv4 (2020) was a masterclass in engineering. It introduced "bag of freebies" — training tricks like mosaic augmentation, CIoU loss, and self-adversarial training that improve accuracy without slowing inference — and "bag of specials" — architecture tweaks like CSPDarknet and PANet that slightly increase inference cost but substantially improve accuracy.

YOLOv5 (2020) had no academic paper and the naming was controversial, but Ultralytics' PyTorch implementation became the most-deployed YOLO variant because the engineering was superb — easy to train, easy to export, well-documented.

YOLOv8 (2023) made a fundamental shift: anchor-free detection. Instead of predicting offsets from predefined templates, it directly predicts object centers and dimensions. A decoupled head separates classification from box regression, letting each task use features optimized for its own objective. YOLOv8 handles detection, instance segmentation, and pose estimation in one architecture.

YOLO11 (2024) is the latest, with improved backbone efficiency and accuracy across all tasks.

The trend across versions is striking: better backbones, better feature fusion, fewer hand-crafted components, and more task generality. YOLO started as a clever hack for speed. It's now a production-grade family that rivals two-stage detectors on accuracy while maintaining real-time performance.

SSD — Multi-Scale Was the Key

SSD (Single Shot MultiBox Detector, 2016) tackled YOLO's weakness with small objects by making predictions from multiple feature map layers simultaneously. Early layers with high spatial resolution handle small objects. Later layers with large receptive fields handle big ones. SSD attaches a prediction head to each of these layers, each with its own set of anchors scaled to the appropriate resolution.

This multi-scale detection idea proved so fundamental that it became standard. Feature Pyramid Networks (FPN), YOLOv3's three-scale predictions, and every modern detector adopted some version of it.

RetinaNet and Focal Loss — The Training Problem

For years, one-stage detectors were faster but less accurate than two-stage ones. The explanation seemed architectural — two stages give you more refined features. But RetinaNet (2017) proved otherwise. The gap wasn't architecture. It was training.

Here's the problem. A one-stage detector evaluates roughly 100,000 candidate locations per image. Of those, maybe 10 contain actual objects. The rest are background. Standard cross-entropy loss treats every example equally, so the model spends 99.99% of its training compute learning to classify trivially easy background regions. The hard examples — the actual objects, the ambiguous boundaries — get drowned out.

Two-stage detectors sidestep this because the RPN filters out most background before classification. One-stage detectors see everything at once, and the imbalance is crushing.

Focal loss fixed this with a deceptively simple modification:

# Standard cross-entropy
CE(p) = -log(p)

# Focal loss
FL(p) = -(1 - p)^γ × log(p)

# When p is high (easy example, model already confident):
#   (1 - 0.95)^2 = 0.0025 → loss multiplied by 0.0025 → nearly zero
#
# When p is low (hard example, model struggling):
#   (1 - 0.3)^2 = 0.49 → loss barely reduced → full learning signal
#
# γ = 2 is typical. Easy examples get ~100× less weight.

The (1 - p)^γ factor acts as a dynamic difficulty weight. The easier an example is (the more confident the model already is), the more aggressively the loss gets crushed. The model's training compute naturally concentrates on the hard, informative examples.
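A minimal PyTorch sketch of the binary case (RetinaNet also adds an alpha class-balancing factor, omitted here for brevity):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # targets are 0/1 tensors the same shape as logits
    prob = torch.sigmoid(logits)
    p_t = prob * targets + (1 - prob) * (1 - targets)   # probability of the true class
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return ((1 - p_t) ** gamma * ce).mean()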

With focal loss, RetinaNet became the first one-stage detector to match Faster R-CNN's accuracy while running at one-stage speeds. It proved the performance gap was a training problem — specifically a loss function problem — not a fundamental architectural limitation. That insight changed the field's trajectory.

Focal Loss Beyond Detection

Focal loss matters anywhere you have extreme class imbalance with dense predictions — segmentation, dense retrieval, multi-label classification with many rare labels. The principle is general: stop wasting training capacity on examples the model already gets right.

DETR — When Transformers Replaced the Pipeline

Everything we've discussed so far shares a pattern: anchors, proposals, NMS, hand-tuned thresholds. These components work, but they're engineering artifacts — they exist because we couldn't think of a cleaner way. DETR (DEtection TRansformer, 2020) asked: what if we threw all of it away?

DETR reimagines detection as a direct set prediction problem. Feed the image through a CNN backbone, flatten the feature map into a sequence, process it through a Transformer encoder, then use a Transformer decoder with N learned object queries — each one a learnable embedding that learns to "ask" about a specific type of object location. The decoder outputs N predictions. That's it. No anchors. No NMS. No proposal stage.

# DETR pipeline
Image → CNN backbone → Feature map → Flatten to sequence
    → Transformer Encoder (global reasoning across image)
    → Transformer Decoder (N learned object queries attend to image)
    → N predictions: (class, bounding box) pairs

# N is fixed (e.g., 100). Most slots predict "no object."
# Each query specializes during training to handle different
# spatial regions or object types.

The key innovation that makes this work is bipartite matching. During training, DETR uses the Hungarian algorithm to find the optimal one-to-one assignment between its N predictions and the ground-truth objects. Each ground-truth object matches exactly one prediction slot. Unmatched slots are trained to predict "no object." This one-to-one matching eliminates duplicate detections by design — which is why NMS is unnecessary.
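The matching itself is one library call once you have a cost matrix. A sketch with a random cost (DETR's real cost combines class probability, L1 box distance, and generalized IoU):

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.random.rand(100, 4)                 # 100 queries vs. 4 ground-truth objects
query_idx, gt_idx = linear_sum_assignment(cost)
# query_idx[i] is the prediction slot assigned to ground-truth object gt_idx[i];
# the remaining 96 slots are supervised to predict "no object".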

I'll be honest — when I first read the DETR paper, the elegance was intoxicating. No hand-crafted components. End-to-end differentiable. Conceptually clean. But elegance had a price: DETR needed 500 training epochs (vs. ~36 for Faster R-CNN), and it was notably bad at small objects. The Transformer's global attention treats every spatial location equally, so it lacked the multi-scale bias that FPN-based detectors get for free.

Follow-up work fixed the practical problems. Deformable DETR replaced global attention with deformable attention — attending to a learned sparse set of sampling points instead of all locations — cutting convergence to ~50 epochs and dramatically improving small object performance. DINO introduced denoising training and better query initialization, pushing accuracy to match or exceed the best traditional detectors. RT-DETR optimized for real-time inference, making transformer-based detection practical for deployment.

The DETR family shows something important: transformer architectures can work for detection, but they needed specific engineering — deformable attention, multi-scale features, better training recipes — before they became practical. The initial elegance was necessary but not sufficient.

Anchor-Free Detectors — Simplifying the Pipeline

Parallel to the DETR line, another movement was simplifying detection from a different angle: eliminating anchors.

Anchor design is tedious. You need to choose aspect ratios, scales, and counts. Get them wrong for your dataset and you miss objects. Teams commonly run k-means clustering on ground-truth boxes to determine anchor shapes — which works, but it's a hand-tuned step in an otherwise learned pipeline.

CornerNet (2018) predicted top-left and bottom-right corners as heatmaps, then grouped them into boxes. CenterNet (2019) took the simpler path: predict object centers as heatmaps and regress width/height from each center point. FCOS (2019) went fully convolutional — each pixel in the feature map predicts its distances to the four edges of the nearest object's bounding box, plus a "centerness" score to down-weight low-quality predictions from pixels far from the object center.

These methods are simpler to implement, have fewer hyperparameters, and avoid the anchor-object matching complexity. Modern YOLOv8+ adopted anchor-free designs. The field has largely moved on from explicit anchors — they were a useful intermediate step, but direct regression from points or centers has proved to be enough.

Rest Stop

If you've made it this far, congratulations. You now have a solid mental model of how detection works: the shared vocabulary of boxes, IoU, and NMS; the two-stage approach that proposes then classifies; the one-stage approach that predicts everything at once; focal loss as the fix for class imbalance; and DETR as the elegant transformer alternative. That's a strong foundation — enough to discuss detection architectures with confidence in any interview or design review.

What we haven't covered yet is the world beyond bounding boxes. Boxes are rough. They include background, they can't represent shape, and they can't tell you which pixel belongs to which object. That's where segmentation comes in. If the discomfort of not knowing how pixel-level understanding works is nagging at you, read on.

Detection Evaluation — Metrics That Matter

Detection evaluation is trickier than classification because you need to simultaneously measure "did we find the right objects?" and "did we put the boxes in the right places?"

For each class, rank all detections by confidence. Walk down the list, computing precision and recall at each step. The area under that precision-recall curve is Average Precision (AP) for that class — it captures how well the model ranks true detections above false ones. Average AP across all classes, and you get mAP (mean Average Precision), the single number everyone uses to compare detectors.
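A simplified sketch of that computation: assume each detection has already been matched to a ground-truth box (IoU above threshold, first match wins), so we only need its confidence and whether it was a true positive. The matching logic and the interpolated precision envelope used by official evaluators are omitted.

import numpy as np

def average_precision(confidences, is_true_positive, num_ground_truth):
    order = np.argsort(-np.asarray(confidences))            # most confident first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / num_ground_truth
    return float(np.trapz(precision, recall))               # area under the PR curve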

But mAP depends on the IoU threshold: how tight must a box be to count as correct?

mAP@0.5 (Pascal VOC convention) counts any box with IoU above 0.5 as correct. Lenient. mAP@0.75 demands tighter boxes. mAP@[0.5:0.95] (COCO convention) averages across ten thresholds — this is the gold standard, because it rewards detectors that produce precisely localized boxes, not just roughly correct ones.

COCO also breaks down performance by object size: APS (small, area < 32²), APM (medium), APL (large). This matters because most detectors perform very differently across sizes. A model with great overall mAP might be terrible on small objects. In our warehouse, pallets are large and easy. Loose bolts on the floor are tiny and critical. Always check the size breakdown.

Detector           Type                     Speed         Accuracy    Best For
Faster R-CNN       Two-stage                ~5–15 FPS     High        Offline processing, accuracy-critical tasks
YOLOv8 / YOLO11    One-stage, anchor-free   ~80–160 FPS   High        Real-time video, edge deployment
RetinaNet          One-stage                ~10–20 FPS    High        One-stage speed, two-stage accuracy
DETR / DINO        Transformer              ~10–20 FPS    Very high   NMS-free pipelines, clean architectures
RT-DETR            Transformer              ~30–70 FPS    High        Real-time transformer detection
FCOS / CenterNet   One-stage, anchor-free   ~30–50 FPS    High        Simple pipeline, fewer hyperparameters

These speeds are rough guides — actual performance depends heavily on backbone, input resolution, hardware, and batch size. The key insight is the tradeoff: more computation buys accuracy; simplicity buys speed. And that tradeoff has been narrowing every year.

Segmentation

Detection draws rectangles around things. That's useful, but a rectangle is a blunt instrument. It includes background, ignores shape, and can't tell you which pixel belongs to which object. In our warehouse, a bounding box around a pallet stacked with boxes includes all the empty space between them. If a robotic arm needs to pick up a specific box, it needs to know the exact boundary — not "somewhere inside this rectangle."

Segmentation is pixel-level classification. Instead of one label per image (classification) or one label per box (detection), it assigns a label to every single pixel. The output is a map the same size as the input, where each pixel carries a prediction. And there are three distinct flavors, each answering a different question.

Three Flavors of Segmentation

Semantic Segmentation — "What class is each pixel?"

Every pixel gets a class label. All forklift pixels are "forklift." All floor pixels are "floor." All worker pixels are "worker." But here's the limitation that tripped me up at first: semantic segmentation can't distinguish between individual objects of the same class. Two workers standing next to each other? They're one blob of "worker" pixels. You know there are worker pixels in that region, but not how many workers or where one ends and another begins.

This is fine for tasks where instance identity doesn't matter — autonomous driving (road vs. sidewalk vs. building), satellite land-use mapping, scene parsing. It's not fine when you need to count things or track individuals.

Instance Segmentation — "Which object does each pixel belong to?"

Each individual object gets its own pixel-perfect mask. Worker 1, Worker 2, Worker 3 — each with a separate boundary. This is harder because the model must both classify pixels and group them into distinct instances. It combines detection (finding individual objects) with segmentation (labeling pixels).

In our warehouse, instance segmentation lets the system count exactly how many pallets are stacked and trace the precise outline of each one — critical for robotic manipulation or inventory management.

Panoptic Segmentation — "Everything, all at once"

Panoptic segmentation unifies both flavors. It divides the scene into stuff (amorphous regions like floor, walls, sky) and things (countable objects like forklifts, workers, pallets). Stuff gets semantic labels. Things get individual instance masks. Every pixel is accounted for — no gaps, no overlaps.

Task       What It Tells You                    Warehouse Example                                          Limitation
Semantic   Class of every pixel                 All "floor" pixels labeled, all "pallet" pixels labeled    Can't separate Pallet 1 from Pallet 2
Instance   Which object each pixel belongs to   Pallet 1 mask, Pallet 2 mask, Worker 1 mask                Doesn't label amorphous stuff (floor, walls)
Panoptic   Both, for everything                 Floor (stuff) + Pallet 1, Pallet 2, Worker 1 (things)      More complex training and evaluation

The recent Mask2Former architecture deserves a mention here. It treats all three segmentation tasks as the same underlying problem — predict a set of binary masks with associated class labels — using a Transformer decoder with learned mask embeddings. One architecture, three tasks, state-of-the-art results on COCO, ADE20K, and Cityscapes. It's the clearest sign that the field is converging toward unified models.

FCN — The Starting Point

Before 2015, pixel-level segmentation meant sliding a patch classifier across the image, one pixel at a time. Thousands of overlapping forward passes for one image. Horrendously slow.

The Fully Convolutional Network (Long et al., 2015) made one elegant observation: if you replace the fully connected layers in a classification CNN (like VGG) with 1×1 convolutions, the entire network becomes fully convolutional. It can accept inputs of any size and produce a spatial output map in a single forward pass. Instead of "this image is a cat," it outputs "here is a probability map of cat-ness across the image."

The catch: after repeated pooling, the output map is tiny — perhaps 7×7 for a 224×224 input. FCN recovers spatial resolution using transposed convolutions (learned upsampling operations). The paper showed that combining upsampled predictions with features from earlier layers (FCN-32s → FCN-16s → FCN-8s) — a form of skip connections — produced sharper boundaries.

FCN was the proof of concept that end-to-end pixel prediction was possible. But the outputs were still coarse, and the architecture lacked a principled way to combine low-level spatial detail with high-level semantic understanding. That's where U-Net stepped in.

U-Net — The Architecture That Conquered Medical Imaging

If there's one segmentation architecture that deserves to be understood deeply rather than memorized as a name, it's U-Net (Ronneberger et al., 2015). Originally designed for biomedical image segmentation where labeled data is painfully scarce, U-Net has become the default segmentation architecture across domains — cells, satellites, industrial inspection, even as the backbone inside diffusion models like Stable Diffusion.

The Shape That Gives It a Name

U-Net has a symmetric structure that, when drawn, looks like the letter U. The left side is the encoder (contracting path): stacks of convolutions followed by max-pooling, progressively shrinking the spatial resolution while deepening the feature channels. Each pooling step halves the resolution and doubles the channels. This path captures what is in the image — the semantic content — but loses where things are.

The right side is the decoder (expanding path): transposed convolutions that progressively increase spatial resolution while reducing channels. This path tries to recover the spatial detail that pooling destroyed.

U-Net Architecture (Simplified)

Input (572×572)
    │
    ▼
┌─────────────┐                    ┌─────────────┐
│ Conv 3×3 ×2 │─── skip concat ──▶│ Conv 3×3 ×2 │──▶ Output (C classes)
│  64 ch      │                    │  64 ch      │
└──────┬──────┘                    └──────▲──────┘
  pool ↓                             up   │
┌─────────────┐                    ┌─────────────┐
│ Conv 3×3 ×2 │─── skip concat ──▶│ Conv 3×3 ×2 │
│  128 ch     │                    │  128 ch     │
└──────┬──────┘                    └──────▲──────┘
  pool ↓                             up   │
┌─────────────┐                    ┌─────────────┐
│ Conv 3×3 ×2 │─── skip concat ──▶│ Conv 3×3 ×2 │
│  256 ch     │                    │  256 ch     │
└──────┬──────┘                    └──────▲──────┘
  pool ↓                             up   │
┌─────────────┐                    ┌─────────────┐
│ Conv 3×3 ×2 │─── skip concat ──▶│ Conv 3×3 ×2 │
│  512 ch     │                    │  512 ch     │
└──────┬──────┘                    └──────▲──────┘
  pool ↓                             up   │
   ┌──────────────┐                       │
   │  Conv 3×3 ×2 │───────────────────────┘
   │  1024 ch     │  (bottleneck — deepest features)
   └──────────────┘

Skip Connections — Why U-Net Actually Works

The encoder-decoder structure alone isn't special — FCN had that. The magic of U-Net is the skip connections. At each level, the encoder's feature maps are concatenated — not added, concatenated — with the decoder's upsampled feature maps before further convolution.

Think about what this does. The bottleneck features (1024 channels, tiny spatial resolution) contain rich semantic information — "this region is a tumor" — but they've lost almost all spatial precision. The encoder features at the same resolution level contain precise spatial information — "there's an edge at these exact coordinates" — but limited semantic understanding. Concatenation gives the decoder both. The high-level "what" from the upsampled deep features, and the fine-grained "where" from the encoder's skip. The subsequent convolution layers learn to fuse them.

Without skip connections, the decoder has to reconstruct spatial detail from a heavily compressed bottleneck — it's trying to paint a detailed picture from a blurry sketch. With skip connections, the fine details take a shortcut directly from the encoder. The spatial information never needs to survive the bottleneck. That's why U-Net produces sharp, precise boundaries even with limited training data.
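One decoder step as a sketch, with padded convolutions so the shapes line up (the original U-Net used unpadded convolutions and cropped the skip features instead):

import torch
import torch.nn as nn

up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)     # learned upsampling
fuse = nn.Sequential(
    nn.Conv2d(1024, 512, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
)

bottleneck = torch.randn(1, 1024, 32, 32)      # deep "what" features
encoder_feat = torch.randn(1, 512, 64, 64)     # shallow "where" features from the skip

decoder_feat = up(bottleneck)                               # (1, 512, 64, 64)
merged = torch.cat([encoder_feat, decoder_feat], dim=1)     # (1, 1024, 64, 64)
out = fuse(merged)                                          # (1, 512, 64, 64)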

Why Medical Imaging Fell in Love

Medical imaging has two properties that make U-Net's design almost perfectly suited:

Data scarcity. Getting an expert radiologist to annotate pixel-level masks is expensive and slow. U-Net's skip connections provide strong inductive biases — the architecture "knows" that spatial details from the encoder should inform the decoder — so it can learn effective segmentation from hundreds of images rather than millions.

Boundary precision matters clinically. A few pixels of error in a tumor boundary can mean the difference between "operable" and "has invaded surrounding tissue." U-Net's direct spatial information transfer preserves the sub-pixel precision needed for clinical decisions.

Modern variants have extended the idea: U-Net++ uses nested, dense skip connections that capture features at multiple semantic levels. Attention U-Net adds learnable gates to skip connections, letting the model decide which spatial details are worth passing through. nnU-Net (no-new-Net) auto-configures all U-Net hyperparameters for any medical segmentation task — preprocessing, architecture depth, patch size, augmentation — and routinely wins medical segmentation challenges without any manual tuning. But the core idea — symmetric encoder-decoder with skip concatenation — remains the beating heart of all of them.

DeepLab — Seeing More Without Losing Detail

U-Net's strategy is: lose spatial detail in the encoder, then recover it in the decoder with skip connections. DeepLab asks a different question: what if we never lose spatial detail in the first place?

Atrous (Dilated) Convolutions

Standard convolutions with pooling or stride reduce spatial resolution to increase the receptive field. You need a big receptive field to understand context — is this pixel part of a car or a building? — but pooling destroys spatial precision. It's a fundamental tension.

Atrous convolutions (also called dilated convolutions) resolve it by inserting gaps between kernel elements. The name comes from the French "à trous" — "with holes." The kernel samples the input at spaced-out positions instead of contiguous ones:

Standard 3×3 conv (receptive field = 3×3):
  X X X        9 parameters, sees 3×3 area
  X X X
  X X X

Dilated 3×3 conv, rate=2 (receptive field = 5×5):
  X . X . X    still 9 parameters, but sees 5×5 area
  . . . . .
  X . X . X
  . . . . .
  X . X . X

Dilated 3×3 conv, rate=4 (receptive field = 9×9):
  X . . . X . . . X    9 parameters, sees 9×9 area
  . . . . . . . . .
  . . . . . . . . .
  . . . . . . . . .
  X . . . X . . . X
  . . . . . . . . .
  . . . . . . . . .
  . . . . . . . . .
  X . . . X . . . X

Same number of parameters. Same computational cost. Much larger receptive field. No resolution loss. The network "sees" more context without any pooling. The tradeoff is memory — you're maintaining higher-resolution feature maps throughout — but for segmentation, where spatial precision is the whole point, that tradeoff is worth it.
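In PyTorch this is just the dilation argument; padding equal to the dilation rate keeps the spatial size unchanged for a 3×3 kernel:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)              # sees a 3×3 area
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)   # same 9 weights, sees 5×5

print(standard(x).shape, dilated(x).shape)   # both torch.Size([1, 256, 64, 64])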

ASPP — Multi-Scale Context in Parallel

In our warehouse, a forklift nearby occupies thousands of pixels. The same forklift across the room occupies a hundred. The model needs to recognize both. ASPP (Atrous Spatial Pyramid Pooling) handles this by applying atrous convolutions at multiple dilation rates in parallel and concatenating the results:

ASPP Module (DeepLabv3):

                  ┌── Atrous Conv, rate=6  ──┐
                  ├── Atrous Conv, rate=12 ──┤
Feature map ──────┼── Atrous Conv, rate=18 ──┼──▶ Concat ──▶ 1×1 Conv ──▶ Output
                  ├── 1×1 Conv ──────────────┤
                  └── Global Avg Pool ───────┘

Each branch captures context at a different spatial scale.
The 1×1 conv captures point-level features.
Global avg pool captures image-level context.
Concatenation fuses everything.

DeepLabv3+ combines ASPP with a lightweight decoder and a modified backbone (Xception or ResNet with atrous convolutions in later stages). The result: multi-scale context awareness with sharp boundary recovery. It remains a top choice for semantic segmentation on driving scenes and outdoor environments.

Mask R-CNN — Detection Meets Pixel Masks

Here's where detection and segmentation converge. Mask R-CNN (He et al., 2017) takes Faster R-CNN — the two-stage detector we built up earlier — and adds one elegant branch: for each detected region, predict a binary segmentation mask in addition to the class label and bounding box.

Mask R-CNN:

Image → Backbone (ResNet + FPN) → Region Proposal Network
                                         ↓
                                   RoI Proposals
                                         ↓
                                    RoI Align
                                  ╱      │      ╲
                            Class Head  Box Head  Mask Head
                            (softmax)  (regress)  (28×28 binary
                                                   mask per class)

Two design choices make this work well. First, RoI Align replaces RoI Pooling. The quantization errors from RoI Pooling that didn't matter for classification become devastating for pixel-level masks. Bilinear interpolation preserves sub-pixel precision. Second, the mask head predicts K separate binary masks — one for each class — and uses the mask corresponding to the predicted class. This decouples classification from segmentation. The mask branch doesn't need to figure out what class the object is; it focuses entirely on where the object's pixels are. The classification head handles "what." The mask head handles "where, exactly." Each task gets features optimized for its own job.

Mask R-CNN dominated instance segmentation benchmarks for years. It's a two-stage method, so not the fastest, but it remains the go-to starting point for instance segmentation projects — and it's the foundation of many production systems.

SAM — Segment Anything, Zero-Shot

Every architecture we've discussed so far is trained for a specific set of classes. A Cityscapes model segments roads and cars. A medical model segments tumors. Train on one domain, deploy on that domain. SAM — the Segment Anything Model (Meta AI, 2023) — broke that paradigm.

SAM is a foundation model for segmentation. Trained on over 1 billion masks across 11 million diverse images, it can segment objects it has never been explicitly trained on. No fine-tuning. No domain-specific labels. Give it any image and a prompt — a click, a bounding box, a set of points — and it returns a segmentation mask.

SAM Architecture:

┌─────────────────────────────────────────────────────┐
│ Image Encoder (ViT-H)                               │
│ Input image → Patch embeddings → Transformer layers  │
│ → Dense image embeddings                             │  ← Heavy. Runs once.
└────────────────────────┬────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────┐
│ Prompt Encoder                                       │
│ Points/boxes → positional encodings                  │
│ Masks → lightweight CNN                              │  ← Lightweight.
│                        ↓                             │     Runs per prompt.
│ Mask Decoder (lightweight transformer)               │
│ Cross-attention between prompt and image embeddings  │
│ → Predicted masks + confidence scores                │
└─────────────────────────────────────────────────────┘

The architecture is deliberately split. The image encoder — a massive Vision Transformer (ViT-Huge) — is computationally expensive but runs once per image, producing reusable embeddings. The prompt encoder and mask decoder are lightweight, so you can interactively try dozens of prompts in real time. Click a point on a worker's helmet, get the helmet mask instantly. Click the worker's torso, get the full-body mask. Draw a box around a pallet stack, get pixel-perfect boundaries for the whole stack.
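A sketch of the click-to-mask workflow, following the interface of Meta's public segment-anything repository; the checkpoint name, image, and click coordinates are placeholders.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((720, 1280, 3), dtype=np.uint8)      # stand-in for an RGB camera frame
predictor.set_image(image)                            # heavy ViT encoder runs once here

masks, scores, logits = predictor.predict(
    point_coords=np.array([[450, 320]]),              # one click, in pixel coordinates
    point_labels=np.array([1]),                       # 1 = foreground, 0 = background
    multimask_output=True,                            # return a few candidate masks
)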

This design makes SAM transformative for data annotation. Manually drawing pixel-level masks takes minutes per object. With SAM, a single click produces a usable mask in milliseconds. Teams building segmentation datasets can annotate 10× to 50× faster. SAM's masks can also feed into downstream pipelines — classifiers, trackers, 3D reconstruction systems — where it handles "where" while other models handle "what."

SAM 2 (2024) extended the paradigm to video, maintaining consistent segmentation masks across frames using a memory mechanism. Prompt an object in one frame, and SAM 2 tracks its mask through the entire video. The move from "segment anything in an image" to "segment anything in a video" is a natural progression that opens up video annotation, video editing, and temporal analysis at scale.

I'm still developing my intuition for where SAM's zero-shot ability breaks down. Highly specialized domains — certain medical imaging modalities, unusual industrial inspection scenarios — sometimes need fine-tuning. But the general pattern is clear: foundation models for segmentation work astonishingly well out of the box, and they're reshaping how the field thinks about task-specific training.

Loss Functions for Segmentation

Segmentation losses need to handle a problem that classification losses don't face: extreme class imbalance at the pixel level. In a driving scene, 60% of pixels might be "road," 20% "sky," and the remaining 20% split among a dozen classes. A model that predicts "road" everywhere gets 60% pixel accuracy while being completely useless.

Cross-entropy, applied per-pixel, is the baseline. It works, but it's dominated by the majority class. You can add class weights to compensate — assign higher loss weight to rare classes — but choosing those weights is another hyperparameter to tune.

# Weighted cross-entropy
import torch
import torch.nn as nn

weights = torch.tensor([0.5, 2.0, 3.0, 5.0])  # one weight per class; rare classes get higher weight
loss_fn = nn.CrossEntropyLoss(weight=weights)

Dice loss directly optimizes the overlap between predicted and ground-truth masks. It measures what fraction of the predicted and actual regions coincide, naturally handling class imbalance because it's based on the ratio of overlap to total area — not raw pixel counts:

Dice coefficient = 2 × |Prediction ∩ Ground Truth|
                   ─────────────────────────────────
                     |Prediction| + |Ground Truth|

Dice loss = 1 − Dice coefficient

Perfect overlap → Dice = 1.0 → Loss = 0.0
No overlap     → Dice = 0.0 → Loss = 1.0

Dice loss is particularly critical in medical imaging, where the region of interest — a lesion, a tumor — might occupy 1–2% of the image. Cross-entropy would happily let the model ignore those pixels. Dice loss forces the model to get the small region right, because the denominator normalizes by region size.

In practice, most production segmentation pipelines combine both: Loss = CE + λ × Dice. Cross-entropy provides stable per-pixel gradients for smooth optimization. Dice loss ensures small regions aren't ignored. Together, they reliably outperform either alone.
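A minimal sketch of a soft multi-class Dice loss and the usual combination, assuming logits of shape (N, C, H, W) and integer class targets of shape (N, H, W):

import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1e-6):
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))            # per-class overlap
    cardinality = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()

def combined_loss(logits, targets, lam=1.0):
    return F.cross_entropy(logits, targets) + lam * dice_loss(logits, targets)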

Segmentation Evaluation

Metric           Formula                    What It Measures                          When to Use
IoU (Jaccard)    |P ∩ G| / |P ∪ G|          Overlap per class                         Per-class analysis
mIoU             Mean of per-class IoU      Average quality across classes            Standard for semantic segmentation (Cityscapes, ADE20K)
Dice Score       2|P ∩ G| / (|P| + |G|)     Overlap (equivalent to pixel-level F1)    Standard in medical imaging
Pixel Accuracy   Correct / Total pixels     Raw accuracy                              Avoid as primary metric — misleading with imbalanced classes

IoU and Dice are related: Dice = 2 × IoU / (1 + IoU). Dice is always ≥ IoU for the same prediction. For benchmarking, mIoU is the standard on PASCAL VOC, Cityscapes, and ADE20K. For medical imaging, Dice score is the convention. Know both, and know when each is expected.

The Full Landscape

Task         Key Architectures            Output                                 Primary Use Cases
Semantic     DeepLabv3+, U-Net, FCN       Class label per pixel                  Driving, satellite, scene parsing
Instance     Mask R-CNN, Mask2Former      Per-object binary mask + class         Counting, robotics, manipulation
Panoptic     Mask2Former, Panoptic FPN    Stuff labels + thing instance masks    Full scene understanding
Promptable   SAM, SAM 2                   Binary mask for prompted region        Annotation, zero-shot, interactive

Wrapping Up

If you're still with me, thank you. That was a long road.

We started with the fundamental question of detection — how do you output a variable number of predictions? — and traced the answer from R-CNN's brute-force 2,000 proposals through Faster R-CNN's learned RPN, YOLO's single-pass grid, focal loss's fix for class imbalance, and DETR's clean transformer formulation. Then we moved to segmentation: from FCN's proof-of-concept pixel prediction, through U-Net's skip connections that made medical imaging practical, DeepLab's dilated convolutions that preserved spatial detail, Mask R-CNN's elegant fusion of detection and masks, and SAM's paradigm-breaking zero-shot foundation model.

My hope is that the next time someone mentions anchor boxes, NMS, or skip connections, instead of nodding along and hoping no one asks a follow-up, you'll have a clear mental model of what's happening under the hood — why those ideas exist, what problems they solve, and where the field is heading next.

Resources Worth Your Time

R-CNN → Faster R-CNN papers — Read the three papers back-to-back. Each one directly addresses the previous paper's bottleneck. A masterclass in iterative engineering.
"Focal Loss for Dense Object Detection" (Lin et al., 2017) — The RetinaNet paper. One of the most insightful loss function papers in all of deep learning.
"End-to-End Object Detection with Transformers" (Carion et al., 2020) — The original DETR paper. Elegant and readable, even if you skim the math.
"U-Net: Convolutional Networks for Biomedical Image Segmentation" (Ronneberger et al., 2015) — Still the most-cited segmentation paper. The original figures are iconic.
"Segment Anything" (Kirillov et al., 2023) — The SAM paper. Wildly ambitious in scope and remarkably clear in presentation.
Ultralytics YOLOv8 Documentation — Not a paper, but the most practical guide to deploying modern detection. Excellent engineering docs.