Self-Supervised and Contrastive Learning

Chapter 16, Section 4

I avoided self-supervised learning for longer than I'd like to admit. Every time someone mentioned SimCLR or BYOL in conversation, I'd nod along, mentally filing it as "unsupervised learning with extra steps." I'd glance at a paper, see terms like "InfoNCE" and "representation collapse," and quietly close the tab. Finally the discomfort of not understanding how every major foundation model actually learns — BERT, GPT, CLIP, DINOv2, all of them — grew too great for me. Here is that dive.

Self-supervised learning (SSL) is a family of techniques where a model learns representations from data without human-provided labels. Instead, the data provides its own supervision: hide part of an input, ask the model to reconstruct it. The model develops useful internal representations as a side effect of solving that puzzle. The concept traces back decades, but it exploded between 2018 and 2023 — every foundation model you use today was pretrained this way.

Before we start, a heads-up. We're going to walk through contrastive losses, momentum encoders, augmentation pipelines, and masked image modeling. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

  • The labeling bottleneck and why the world has more photos than patience
  • Pretext tasks: teaching by hiding
  • From puzzles to pull-and-push: contrastive learning
  • The InfoNCE loss: picking the suspect from a lineup
  • SimCLR: the clean baseline
  • MoCo: decoupling negatives from batch size
  • Rest stop and an off ramp
  • BYOL: dropping the negatives entirely
  • DINO and DINOv2: self-distillation and emergent attention
  • Masked image modeling: MAE and the vision equivalent of BERT
  • Why text and vision learned differently
  • CLIP: contrastive learning across modalities
  • Evaluation protocols: how we know it worked
  • When SSL beats supervised pretraining
  • Resources and credits

The Labeling Bottleneck and Why the World Has More Photos Than Patience

Imagine you're building a wildlife classifier for trail cameras. Your national park has 200 cameras, each snapping a photo every 30 seconds. After one month, you have 50,000 images. Deer, foxes, owls, raccoons, the occasional hiker, a lot of empty forest.

Now imagine labeling those photos. A volunteer sits down one Saturday and tags 47 images before losing the will to live. Forty-seven. Out of fifty thousand. That's 0.094% of your data. If you try to train a supervised classifier on those 47 labels, the result is going to be somewhere between "useless" and "creative fiction."

This is the labeling bottleneck, and it's not specific to trail cameras. ImageNet's million labels took years of human effort. Medical imaging labels require a specialist at hundreds of dollars per hour. Satellite imagery datasets need domain experts who can distinguish crop types from orbit. The pattern is always the same: the world is drowning in unlabeled data and starving for labels.

Self-supervised learning exists because someone asked a beautiful question: what if we could learn something useful from all 50,000 photos, without a single label? What if the data itself could teach the model what "deer" and "fox" look like, even if nobody ever wrote those words down?

That question has an answer. Several answers, actually. And they start with a surprisingly low-tech idea.

Pretext Tasks: Teaching by Hiding

Here's the core insight of self-supervised learning: you can create a training signal from the data itself by damaging the input and asking the model to repair it. The repair task is a means to an end — we don't care whether the model becomes good at repair. We care that, in learning to repair, it develops internal representations that understand the structure of the data.

The technical term for this manufactured task is a pretext task. The task you actually care about later — classifying animals, detecting objects, whatever — is called the downstream task. You train on the pretext task, throw away its prediction head, and keep the learned backbone for the downstream task. The pretext task is the training exercise. The representation is the prize.

Think of it like training a detective. You don't care whether the detective gets good at solving practice puzzles. You care that solving those puzzles develops pattern recognition, attention to detail, and spatial reasoning — skills that transfer to real cases.

Let's look at three pretext tasks from 2016–2018 that pioneered this idea in computer vision. We'll use our trail camera photos as the running example.

Rotation prediction

Take one of our trail camera photos — say, a fox standing in a clearing. Rotate it by 0°, 90°, 180°, or 270°. Show the model the rotated image and ask: "which rotation was applied?" That's it. Four-way classification. The label is free — you know the rotation because you applied it.

Why does this work? To predict that a photo was rotated 90°, the model has to know what "upright" means for foxes, trees, and ground. It has to understand that foxes stand on legs, legs point down, sky is up. It develops a sense of object structure and spatial arrangement — not because we asked for that, but because the rotation task demands it.

Rotation prediction was introduced by Gidaris, Singh, and Komodakis in 2018, and it was shockingly effective for how crude it seems. A model trained to predict rotations learned features that transferred well to object classification — better than most hand-designed features of the era.
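To make "the label is free" concrete, here's a minimal sketch of how you might build a rotation-prediction batch. The helper name is mine, not from the original paper; training is then ordinary four-way cross-entropy on the returned views and labels.

import torch

def make_rotation_batch(images):
    """Turn unlabeled images into a free 4-way classification problem.

    images: (B, C, H, W) tensor. Returns every image at each of the four
    rotations, plus the rotation index (0-3) as the label we get for free.
    """
    views, labels = [], []
    for k in range(4):  # number of 90-degree turns: 0°, 90°, 180°, 270°
        views.append(torch.rot90(images, k, dims=(-2, -1)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)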

Jigsaw puzzles

Cut the fox photo into a 3×3 grid of patches. Shuffle the nine patches. Ask the model to predict the original arrangement — or at least which of a few hundred predefined permutations was used.

This forces the model to learn spatial relationships between parts of an image. Where does the head go relative to the body? What does "next to a tree trunk" look like? The model can't solve the jigsaw without understanding how visual elements relate spatially. This was introduced by Noroozi and Favaro in 2016.

Colorization

Convert the fox photo to grayscale. Ask the model to predict the original colors. Since foxes are reddish-orange and forests are green, the model has to learn that "this shape is a fox" and "this texture is foliage" in order to assign the right colors. Colorization as a pretext task was explored by Zhang, Isola, and Efros in 2016.

I'll be honest — when I first read about jigsaw puzzles as a training method, I thought it was a gimmick. It felt like the kind of idea you'd pitch at a research retreat after one too many coffees. But the intuition underneath is solid: any task that requires understanding the content of an image to solve — rather than exploiting low-level pixel statistics — will produce representations that carry semantic knowledge.

And yet, all three of these approaches share a frustrating limitation.

They're brittle. Each one bakes in specific assumptions about what kind of damage is informative. Rotation prediction fails when the images don't have a natural "up" (satellite imagery, microscopy). Jigsaw puzzles are sensitive to the grid size and permutation set. Colorization assumes color carries semantic information, which isn't always true. Every pretext task is a hand-designed inductive bias, and designing good ones turned out to be almost as much work as designing good features by hand.

Researchers spent 2016–2019 inventing increasingly clever pretext tasks, each a little better than the last. And then contrastive learning arrived and made most of them obsolete overnight.

From Puzzles to Pull-and-Push: Contrastive Learning

Here's the idea that changed everything. Instead of designing a specific puzzle for the model to solve, give it a much more general objective: learn to tell whether two things are the same or different.

Take one of our fox photos. Apply two different random transformations — crop it differently, flip one horizontally, change the brightness, add some blur. Now you have two versions of the same fox photo that look different but depict the same scene. These two views form a positive pair.

Every other image in the training batch — a deer photo, an owl photo, an empty forest — those are negatives.

The objective: make the model produce similar representations for the two fox views, and dissimilar representations for the fox and everything else. Pull the positives together in embedding space. Push the negatives apart.

That's contrastive learning. No jigsaw grids, no rotation angles, no colorization. The supervision comes entirely from the data augmentation pipeline: "these two things came from the same photo" versus "these came from different photos." The augmentation strategy is the pretext task. It defines what the model should be invariant to (crop position, color shifts) and what it should distinguish (different animals, different scenes).

This is a profound shift. With hand-designed pretext tasks, you're telling the model what kind of understanding to develop. With contrastive learning, you're telling the model what to ignore (augmentation details) and letting it figure out the rest. Our detective analogy evolves: instead of giving the detective specific practice puzzles, we show them pairs of crime scene photos and say "these two are from the same crime, those are from different crimes. Learn to tell the difference."

The InfoNCE Loss: Picking the Suspect from a Lineup

We need a loss function that implements "pull positives together, push negatives apart." The standard choice is InfoNCE (Noise-Contrastive Estimation with an information-theoretic flavor), introduced by van den Oord et al. in 2018.

Here's the intuition. Imagine a police lineup. You have one suspect (the positive match) standing in a line with N-1 innocent people (the negatives). The witness (our model) has to pick out the suspect. InfoNCE turns this into a softmax classification problem: out of all N candidates, which one is the true match?

Let's walk through a tiny concrete example. Suppose we have three trail camera photos in our batch: a fox, a deer, and an owl. We augment each one twice, producing six views total. Focus on the fox. Its two augmented views — call them fox₁ and fox₂ — are the positive pair. The four other views (deer₁, deer₂, owl₁, owl₂) are negatives.

The model encodes all six views into vectors. We compute the cosine similarity between fox₁ and every other view. The result is a set of similarity scores:

fox₁ · fox₂ = 0.85 (positive — the same fox)
fox₁ · deer₁ = 0.21 (negative)
fox₁ · deer₂ = 0.18 (negative)
fox₁ · owl₁ = 0.12 (negative)
fox₁ · owl₂ = 0.15 (negative)

InfoNCE says: run these through a softmax. The loss is the negative log probability assigned to the positive pair. If the model gives fox₂ a high probability relative to the negatives, loss is low. If it can't distinguish its fox from the deer, loss is high.

There's one critical ingredient: a temperature parameter τ that scales all the similarity scores before the softmax. Dividing by a small τ (say, 0.07) sharpens the distribution — the model has to be very precise about which is the positive. Dividing by a larger τ (say, 0.5) softens it — the model gets more credit for "in the right neighborhood."

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.07):
    # query:    (B, D) — the anchor embedding (e.g., fox view 1)
    # positive: (B, D) — the matching embedding (e.g., fox view 2)
    # negatives:(B, N, D) — embeddings from other images
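    # All embeddings are assumed L2-normalized, so the dot products below are cosine similarities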

    # Similarity between anchor and its positive match
    pos_sim = (query * positive).sum(dim=-1, keepdim=True)

    # Similarity between anchor and each negative
    neg_sim = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)

    # Combine: positive is always index 0, then all negatives
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature

    # Target: index 0 is the correct answer (the positive)
    labels = torch.zeros(query.size(0), dtype=torch.long,
                         device=query.device)
    return F.cross_entropy(logits, labels)

The loss function above computes a dot product between the anchor and its positive (one number), then between the anchor and each negative (N numbers). It stacks them into a single vector, divides by temperature, and applies cross-entropy where the correct answer is always index 0 — the positive pair. That's the entire InfoNCE computation.

I still get tripped up by the temperature parameter. My instinct says "low temperature = more confident = better," but it's more nuanced than that. Too low and the loss is dominated by the hardest negatives, often images that genuinely resemble the anchor, so the model obsesses over splitting near-duplicates instead of learning useful structure. Too high and it never learns to make fine distinctions. Most methods land between 0.05 and 0.2. It's worth sweeping early in a project — it's one of those knobs that can swing accuracy by several points.

Our lineup analogy extends naturally. A low temperature is like asking the witness to pick from a lineup of near-identical twins. A high temperature is like asking them to pick a fox from a group that includes a lamp, a car, and a houseplant. You want the difficulty in a sweet spot where the model has to work for it but doesn't resort to memorizing irrelevant details.
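To see what the temperature does numerically, here's a quick sanity check using the illustrative similarity scores from the fox lineup above (positive first, then the four negatives):

import torch
import torch.nn.functional as F

sims = torch.tensor([0.85, 0.21, 0.18, 0.12, 0.15])   # fox2 first, then the deer and owl views

for tau in (0.07, 0.5):
    probs = F.softmax(sims / tau, dim=0)
    loss = -torch.log(probs[0])                        # InfoNCE loss for this one anchor
    print(f"tau={tau}: P(positive)={probs[0]:.2f}, loss={loss:.2f}")

# tau=0.07 -> P(positive)≈1.00, loss≈0.00  (sharp: the positive dominates)
# tau=0.5  -> P(positive)≈0.50, loss≈0.70  (soft: negatives still carry weight)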

The specter of collapse

Before we look at specific methods, we need to confront the central failure mode of self-supervised learning: representation collapse.

Here's the problem. The easiest way to make every positive pair perfectly similar? Map every single input to the same point. Every image produces the exact same vector. Positive pairs are identical — loss is zero. The model has "solved" the task by learning nothing.

This is like our detective deciding that every crime scene photo looks the same. Technically, they'd never misidentify a match — but they'd also be completely useless.

Contrastive methods prevent collapse with negatives. If you map everything to one point, your anchor becomes exactly as similar to every negative as it is to its positive, the softmax can't favor the positive, and the loss gets stuck at its chance level instead of going anywhere near zero. The negatives act as a repulsive force, keeping the embedding space from collapsing.

Methods that don't use negatives — and we'll see several — need other tricks. Asymmetric architectures, stop-gradient operations, exponential moving averages. Understanding how each method prevents collapse is, in many ways, understanding the method itself. We'll come back to this with every approach we cover.

A practical diagnostic: monitor the standard deviation of your embedding dimensions during training. If it trends toward zero, your representations are collapsing. Catch it early.
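A minimal version of that check might look like this; the function name is mine, and it assumes embeddings are L2-normalized the way the contrastive methods below normalize them.

import torch
import torch.nn.functional as F

def embedding_std(z):
    """Collapse check: average per-dimension std over a batch of embeddings.

    z: (B, D) embeddings. After L2-normalization, well-spread embeddings give
    a value around 1/sqrt(D); a value drifting toward zero means the batch is
    converging to a single point.
    """
    z = F.normalize(z, dim=-1)
    return z.std(dim=0).mean().item()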

SimCLR: The Clean Baseline

SimCLR, published by Google Brain in 2020, stripped contrastive learning to its essentials and became the reference that everyone benchmarks against. The full name — "A Simple Framework for Contrastive Learning of Visual Representations" — is accurate. It is, genuinely, the simplest version of the idea that works well.

Let's trace through it with our trail camera photos. Say our batch has 256 images: some foxes, some deer, some owls, some empty frames. Each image gets augmented twice, producing 512 views. Every image and its second augmentation form a positive pair. Everything else in the batch is a negative. For any given fox photo, that's one positive and 510 negatives.

Each view passes through a shared encoder — a ResNet-50 in the original paper — to produce a feature vector. Then through a small 2-layer MLP called the projection head, which maps the features to a 128-dimensional space. InfoNCE is applied in that 128-dimensional space.

After training, something counterintuitive happens: we throw the projection head away. Downstream tasks use the encoder's output, not the projector's output. Why? The projection head absorbs augmentation-specific shortcuts — it learns which features are useful for the contrastive task specifically, including features that are about "was this cropped from the left?" rather than "is this a fox?" The encoder, sitting behind the projector, develops cleaner, more transferable representations.

import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T

class SimCLR(nn.Module):
    def __init__(self, backbone, proj_dim=128):
        super().__init__()
        self.backbone = backbone
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()   # strip the classification head
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, 2048),
            nn.ReLU(),
            nn.Linear(2048, proj_dim),
        )

    def forward(self, x):
        h = self.backbone(x)          # encoder features
        z = self.projector(h)          # projected for contrastive loss
        return F.normalize(z, dim=-1)  # unit-norm for cosine similarity

The model takes an image, runs it through the backbone to get features h, then projects through a two-layer MLP to get z. The normalize call at the end ensures all vectors live on the unit sphere, so dot products become cosine similarities. After training, we keep the backbone and discard the projector.

The augmentation stack that launched a thousand papers

SimCLR's most cited finding isn't about architectures or loss functions. It's about augmentations. The authors ran an exhaustive ablation study, trying every combination of crop, flip, color jitter, grayscale, and blur. The conclusion: random resized crop + color jitter is the minimum viable augmentation pair.

Why these two specifically? Without color jitter, the model takes a shortcut: it matches images by their color histogram. Two crops from the same photo tend to have similar color distributions, so the model learns "similar colors = same image" and never develops actual semantic understanding. Without random cropping, the model matches global layout — "there's a dark blob in the top-left" — without learning what objects are.

You need both to force the model past shortcuts and toward genuine content understanding. This finding reshaped every subsequent SSL paper — the augmentation stack became a first-class design decision, not an afterthought.

simclr_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

Each augmented view passes through this pipeline: a random crop covering between 20% and 100% of the image, a coin-flip horizontal mirror, an 80% chance of color distortion, a 20% chance of full grayscale, and a Gaussian blur. The Normalize at the end standardizes pixel values with the usual ImageNet channel statistics, the same convention the backbone will use when it's reused downstream.
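Putting the pieces together, here's a sketch of one SimCLR training step using the module and augmentation pipeline above. The helper name is mine; it uses the common trick of treating each view's counterpart as the target in a 2B-way softmax, which is the same InfoNCE setup we walked through earlier, just with in-batch negatives.

import torch
import torch.nn.functional as F

def simclr_step(model, images, optimizer, temperature=0.1):
    """One training step. `images` is a list of PIL images; `model` is the SimCLR module above."""
    v1 = torch.stack([simclr_augment(img) for img in images])   # first view of every image
    v2 = torch.stack([simclr_augment(img) for img in images])   # second, independently augmented view
    z = model(torch.cat([v1, v2]))                               # (2B, D), unit-normalized embeddings

    sim = z @ z.T / temperature                                  # (2B, 2B) cosine similarities
    sim.fill_diagonal_(float("-inf"))                            # a view is never its own positive

    B = len(images)
    # Row i's positive is the other view of the same image: i+B for the first half, i-B for the second
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(sim.device)

    loss = F.cross_entropy(sim, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()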

SimCLR has one significant limitation, and it's practical rather than conceptual. More negatives per batch means better representations. The original paper used batch sizes of 4,096 to 8,192, spread across 32 to 128 TPU cores. For a researcher with a single GPU, that's not a training recipe — it's a fantasy.

That limitation motivated the next major development.

MoCo: Decoupling Negatives from Batch Size

MoCo — Momentum Contrast — was published by Kaiming He and colleagues at Facebook AI in 2020, and it solved SimCLR's batch size problem with two elegant ideas.

The first idea: a queue of negatives. Instead of using the current batch as the source of negatives, MoCo maintains a FIFO (first-in, first-out) queue of 65,536 past embeddings. Each training step, the current batch's embeddings get enqueued and the oldest embeddings get dequeued. The result: massive negative diversity even with a batch size of 256. Our trail camera analogy makes this concrete — instead of comparing each fox photo against only the 255 other photos in the current batch, we compare it against a rolling memory of the last 65,536 embeddings we've seen.
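Here's a minimal sketch of that queue; the class and method names are mine, not from the MoCo codebase.

import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO bank of past key embeddings used as negatives."""
    def __init__(self, dim=128, size=65536):
        self.bank = F.normalize(torch.randn(size, dim), dim=-1)   # random init, overwritten as training runs
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, keys):
        """keys: (B, dim) embeddings from the key encoder for the current batch."""
        b = keys.size(0)
        idx = (self.ptr + torch.arange(b)) % self.bank.size(0)    # wrap around the fixed-size buffer
        self.bank[idx] = keys                                      # newest entries replace the oldest
        self.ptr = (self.ptr + b) % self.bank.size(0)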

The second idea solves a problem the first one creates. The queue contains embeddings produced at different training steps. If the encoder's weights change rapidly between steps, the old embeddings in the queue are stale — they were computed by a different version of the model. Comparing fresh query embeddings against stale key embeddings is like asking our detective to compare today's crime scene photos against witness sketches drawn by a different artist last month.

MoCo's solution is a momentum encoder. Instead of one encoder, there are two: a query encoder that's updated normally by gradient descent, and a key encoder that's updated as a slow-moving average of the query encoder. The key encoder produces the embeddings that go into the queue.

import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder = slow exponential moving average of query encoder."""
    for q_param, k_param in zip(encoder_q.parameters(),
                                 encoder_k.parameters()):
        k_param.data = m * k_param.data + (1 - m) * q_param.data

The update rule is one line: each parameter of the key encoder becomes 99.9% of its old value plus 0.1% of the query encoder's current value. This means the key encoder changes very slowly — slowly enough that embeddings from 100 training steps ago are still reasonably consistent with embeddings from the current step. The queue stays coherent.

The momentum coefficient m=0.999 was found to work well across settings. Values too low (0.9) make the key encoder change too fast, degrading queue consistency. Values too high (0.9999) make it change too slowly, preventing it from tracking the query encoder's improving representations.

MoCo v2 borrowed SimCLR's augmentation pipeline and MLP projection head for a free accuracy boost. MoCo v3 adapted the framework to Vision Transformers. The momentum encoder trick survived every version — it consistently stabilizes training. The approach shows that the key engineering challenge in contrastive learning wasn't the loss function or the architecture, but the logistics of managing negative examples.

Rest Stop and an Off Ramp

Congratulations on making it this far. You can stop if you want.

You now have a working mental model of contrastive self-supervised learning: take an image, augment it twice to get a positive pair, use other images as negatives, train a model to distinguish positives from negatives using InfoNCE loss. SimCLR does this with large batches. MoCo does it with a momentum-updated queue, letting you use small batches. Both produce learned representations that transfer to tasks like classification and detection — without a single human-provided label.

That mental model is genuinely useful. It covers the core mechanism behind a large fraction of modern SSL research, and it's enough to understand why foundation models like CLIP work the way they do.

It doesn't tell the complete story, though. There are methods that work without any negatives at all — a fact that surprised the research community and challenged fundamental assumptions about why contrastive learning works. There are masked approaches that reconstruct images instead of comparing them. And there's the question of how we actually evaluate whether self-supervised representations are any good.

The short version: BYOL and DINO skip negatives by using an asymmetric student-teacher setup. MAE masks 75% of image patches and reconstructs them. Linear probing freezes the backbone and trains a simple classifier on top. With that, you're 80% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

BYOL: Dropping the Negatives Entirely

In 2020, DeepMind published BYOL — Bootstrap Your Own Latent — and the result was genuinely confusing. It matched the performance of SimCLR and MoCo without using any negative pairs at all. Positives only. No repulsive force.

My immediate reaction was: that shouldn't work. We established that without negatives, nothing stops the model from mapping everything to a single point. The entire motivation for negatives is collapse prevention. How can you remove them and still get useful representations?

The answer is architectural asymmetry, and it took me a while to internalize why it works.

BYOL has two networks. The online network consists of an encoder, a projector MLP, and a predictor MLP. The target network has the same encoder and projector, but no predictor. The target network's weights are updated as an exponential moving average of the online network — the same momentum trick from MoCo, repurposed.

Here's the training loop. Take a trail camera photo of a fox. Augment it two ways. Pass one augmentation through the online network (encoder → projector → predictor) to get a prediction. Pass the other augmentation through the target network (encoder → projector, no predictor) to get a target. Train the online network to make the prediction match the target. Crucially: gradients only flow through the online network. The target network has stop_gradient applied — it doesn't get updated by backpropagation, only by the slow EMA.
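Here's a sketch of that loop as a loss computation, symmetrized over the two views the way BYOL does it. The names are mine: `online` and `target` stand for the encoder-plus-projector stacks, and the target's weights would be refreshed with the same EMA trick as the momentum_update function from the MoCo section.

import torch
import torch.nn.functional as F

def byol_loss(online, target, predictor, view1, view2):
    """BYOL regression loss: predict the target branch's projection from the online branch."""
    def one_direction(v_online, v_target):
        p = F.normalize(predictor(online(v_online)), dim=-1)   # online branch: encoder -> projector -> predictor
        with torch.no_grad():                                   # stop-gradient: the target branch gets no gradients
            z = F.normalize(target(v_target), dim=-1)           # target branch: encoder -> projector only
        return 2 - 2 * (p * z).sum(dim=-1)                      # MSE between unit vectors

    return (one_direction(view1, view2) + one_direction(view2, view1)).mean()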

This asymmetry is the collapse-prevention mechanism. Think of it as an apprenticeship. The online network (the apprentice) is trying to predict what the target network (the master) will say. The master changes slowly. If the apprentice tries the lazy strategy of always predicting the same constant, the master is still producing different representations for different images (because the master was initialized with useful randomness and changes slowly). The apprentice's loss would be high. The only way to get low loss is to actually learn something about the input.

BatchNorm in the projector adds another layer of collapse resistance. Because batch normalization computes statistics across the entire batch, it implicitly creates information sharing between examples. If all representations tried to converge to one point, BatchNorm would re-center and re-scale them, breaking the collapse. It's a subtle effect, but early replication studies found that removing BatchNorm caused BYOL to collapse (later work showed collapse can be avoided without it, given careful initialization and alternative normalization).

My favorite thing about BYOL is that, aside from the high-level explanation I described, no one is completely certain why it works as well as it does. There have been several theoretical analyses — one arguing it's an implicit form of contrastive learning, another relating it to spectral methods — and none of them fully explain the observed stability. It's one of those results where practice ran ahead of theory, and theory is still catching up.

DINO and DINOv2: Self-Distillation and Emergent Attention

DINO (Meta, 2021) applied a similar student-teacher framework to Vision Transformers, with additions that changed the game: centering and sharpening of the teacher's outputs. Instead of using BatchNorm to prevent collapse, DINO maintains a running mean of the teacher's outputs and subtracts it (centering), then applies a low softmax temperature to the result (sharpening). Centering keeps any one dimension from dominating and dragging everything toward a single point; sharpening keeps the teacher's distribution from flattening into a uniform one.

The student network is trained to match the teacher's softmax output (hence "self-distillation" — the teacher is a slowly-evolving version of the student, and the student distills knowledge from itself). DINO also uses multi-crop augmentation with different views for each network: the teacher sees only large "global" crops covering roughly half the image or more, while the student additionally sees small "local" crops covering as little as 5%. Matching the teacher's output from those local patches forces the student to infer global structure from local evidence.
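A sketch of the resulting loss, with the centering step made explicit; the function name and default temperatures are illustrative, not lifted from the DINO code.

import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04, m=0.9):
    """Self-distillation loss for one (student view, teacher view) pair.

    student_out, teacher_out: (B, K) output logits. `center` is the running
    mean of teacher logits; the updated center is returned alongside the loss.
    """
    t = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()  # centered, sharpened teacher target
    s = F.log_softmax(student_out / t_student, dim=-1)
    loss = -(t * s).sum(dim=-1).mean()                                  # cross-entropy against the teacher
    center = m * center + (1 - m) * teacher_out.mean(dim=0)             # slow-moving update of the center
    return loss, center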

But DINO's most striking result wasn't about accuracy numbers. It was about what the Vision Transformer learned to see. When researchers visualized the self-attention maps from the last layer, they found something remarkable: the model had discovered object boundaries. Without any segmentation labels, without any object detection supervision, DINO's ViT learned to attend to semantically meaningful regions — the outline of a dog, the wings of a bird, the contour of a car. Purely from self-supervised training on unlabeled images.

I'll be honest — when I first saw those attention maps, I didn't believe they were real. They looked too clean, too purposeful, for a model that had never been told what an "object" is. But they've been replicated extensively, and the effect is genuine. Something about the combination of self-distillation, Vision Transformers, and multi-crop augmentation causes semantic structure to emerge.

DINOv2 (Meta, 2023) scaled this approach to 142 million curated images with a ViT-giant backbone. The result is, as of this writing, the closest thing we have to a universal visual feature extractor. DINOv2 features rival or beat CLIP features on dense prediction tasks — segmentation, depth estimation, surface normal prediction — without any task-specific fine-tuning. You take the frozen DINOv2 backbone, attach a lightweight head, and it works across a startling range of vision tasks. If you're building a computer vision system in 2024 and beyond, DINOv2 is likely your first stop for general-purpose features.

Masked Image Modeling: MAE and the Vision Equivalent of BERT

Contrastive and self-distillation methods aren't the only path. There's a parallel strand of SSL research that works by a different principle entirely: mask part of the input and reconstruct it.

If you've encountered BERT, you've seen this idea in text: mask 15% of the tokens in a sentence, train the model to predict the missing words. The model can't memorize the answers — they change each time — so it has to learn grammar, semantics, and world knowledge to fill in the blanks.

MAE — Masked Autoencoders — applies this to images. Published by Kaiming He (the same researcher behind MoCo) in 2021, MAE works as follows: divide an image into non-overlapping patches (16×16 pixels each, for a typical ViT). Randomly mask 75% of them. Feed only the visible 25% of patches into a Vision Transformer encoder. Then use a lightweight decoder to reconstruct the missing patches' pixel values.

The 75% masking ratio sounds extreme, and it is. It's much higher than BERT's 15% token masking. Why? Because images are far more redundant than text. If you mask one word in a sentence, recovering it often requires understanding the whole sentence. If you mask one small patch in an image, you can often recover it from neighboring patches using texture interpolation — no semantic understanding needed. You have to mask a lot of an image before the task requires the model to understand what it's looking at rather than filling in textures.
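The masking step itself is simple. Here's a sketch roughly following the shuffle-and-keep trick the MAE paper describes; the function name is mine.

import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patch embeddings per image; only these enter the encoder.

    patches: (B, N, D) patch embeddings. Returns the visible subset and the
    indices of the kept patches (needed later to place predictions back).
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)    # one random score per patch
    order = noise.argsort(dim=1)                       # a random permutation of patches per image
    keep = order[:, :n_keep]                           # the first n_keep indices survive
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep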

I'm still developing my intuition for why 75% is the sweet spot rather than 70% or 80%. The empirical ablation in the MAE paper shows a clear peak around 75%, with accuracy dropping on both sides. Too low (say 40%) and the task is too easy — the model fills in textures without learning semantics. Too high (say 90%) and there's not enough context left to make meaningful predictions. But the transition is surprisingly sharp, and I don't have a first-principles argument for why it lands exactly where it does.

MAE has a practical advantage over contrastive methods: it doesn't need large batches or momentum encoders. The training signal comes from pixel reconstruction, not from comparing pairs. This makes it simpler to train and more stable. It also means MAE scales gracefully to custom domains where batch size may be limited.

The cost is that pixel-level reconstruction teaches the model more about textures and low-level statistics than about high-level semantics. Contrastive methods like SimCLR and DINO tend to produce features that are more immediately useful for classification. MAE features shine more after fine-tuning, where the model can adapt its texture-heavy representations toward the downstream task.

A related approach, BEiT (BERT Pre-Training of Image Transformers), which actually preceded MAE, addresses this by tokenizing image patches into a discrete vocabulary first (using a pretrained tokenizer) and then predicting the token IDs of masked patches rather than raw pixels. This pushes the reconstruction target toward more abstract, semantic representations — closer to how BERT works with discrete words.

Why Text and Vision Learned Differently

There's a pattern worth noticing. In NLP, masked prediction (BERT) and next-token prediction (GPT) dominated from the start. In vision, contrastive methods (SimCLR, MoCo, DINO) held the lead for years before masked modeling (MAE) became competitive. Why?

Language is discrete. Words are tokens from a finite vocabulary. When BERT predicts a masked token, the target is a specific word ID — a clean, unambiguous signal. The loss function is cross-entropy over a vocabulary, and it works beautifully.

Images are continuous. Pixels are floating-point numbers. When MAE predicts a masked patch, the target is a grid of RGB values. A loss on raw pixels measures appearance, not meaning: a reconstruction that captures the right semantics but misjudges the exact shade of green is still penalized, so the objective rewards texture matching more than semantic understanding. That's why contrastive methods, which operate on learned representations rather than raw pixels, had the early advantage in vision.

The convergence is happening now. Methods like I-JEPA (Image-based Joint-Embedding Predictive Architecture, from Yann LeCun's group at Meta, 2023) predict in representation space rather than pixel space. Instead of reconstructing the pixels of masked patches, I-JEPA predicts the encoded representation of the masked region. This combines the simplicity of masking with the semantic richness of learned embeddings. It's the direction the field is moving, and the recipe of predicting in representation space carries over to both text and vision, without the strong augmentation pipelines that contrastive methods require.

CLIP: Contrastive Learning Across Modalities

Everything we've discussed so far operates within a single modality — images compared to images, or text compared to text. CLIP (Contrastive Language-Image Pretraining, OpenAI, 2021) extends contrastive learning across modalities, and the result is arguably the most versatile model to come out of the SSL era.

The setup: two encoders. An image encoder (a ViT or ResNet) and a text encoder (a Transformer). Given a batch of N image-text pairs — think "photo of a fox in the forest" next to the corresponding trail camera image — CLIP computes the cosine similarity between every image and every text in the batch. That's an N×N matrix of similarities. The diagonal (matching image-text pairs) should be high. Everything off-diagonal (mismatched pairs) should be low.

The loss is InfoNCE applied symmetrically: for each image, classify which text matches it (out of N candidates). For each text, classify which image matches it. Average the two.
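Here's what that symmetric objective looks like as code, a minimal sketch assuming both encoders already produce L2-normalized embeddings (CLIP itself learns the temperature rather than fixing it).

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched image-text pairs.

    image_emb, text_emb: (N, D), unit-normalized; row i of each is a true pair.
    """
    logits = image_emb @ text_emb.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_images = F.cross_entropy(logits, targets)                # each image must pick its own caption
    loss_texts = F.cross_entropy(logits.T, targets)               # each caption must pick its own image
    return (loss_images + loss_texts) / 2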

What makes CLIP remarkable is the downstream capability this produces. Because images and text live in the same embedding space, you can do zero-shot classification: encode the candidate class names as text ("a photo of a fox," "a photo of a deer," "a photo of an owl"), encode your test image, and pick the class whose text embedding is closest. No fine-tuning. No task-specific training. For our trail camera project, we could classify animals by describing them in English — a capability that would have seemed like science fiction five years ago.
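A sketch of that zero-shot recipe, with `encode_image` and `encode_text` standing in for CLIP's two encoders; they're hypothetical helpers here, assumed to return L2-normalized torch tensors.

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Pick the class whose text embedding is closest to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]    # "a photo of a fox", ...
    text_emb = encode_text(prompts)                               # (C, D)
    image_emb = encode_image(image)                               # (1, D)
    best = (image_emb @ text_emb.T).argmax(dim=-1).item()         # index of the most similar prompt
    return class_names[best]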

CLIP was trained on 400 million image-text pairs scraped from the internet. The scale of the data — not the cleverness of the architecture — is what gives CLIP its generalization power. The contrastive objective is the same InfoNCE we covered with SimCLR. The magic is in the pairing: natural language supervision turns out to be a richer training signal than any pretext task humans could design.

Evaluation Protocols: How We Know It Worked

Self-supervised learning produces a backbone — a feature extractor. But how do we measure whether those features are any good? There are two standard evaluation protocols, and understanding the distinction matters.

Linear probing

Freeze the entire pretrained backbone. Attach a single linear layer on top. Train that linear layer — and only that layer — on a labeled downstream dataset (like ImageNet classification). Report the accuracy.

This is the strict test. It measures how linearly separable the learned features are. If a linear classifier can distinguish cats from dogs using the frozen backbone's features, those features must already contain meaningful semantic information. The backbone did the hard work during self-supervised pretraining; the linear layer is doing minimal computation on top.

Back to our trail camera: we'd freeze the SimCLR backbone (trained on 50,000 unlabeled photos), add one linear layer, and train it on our 47 labeled examples. If linear probing gives 78% accuracy, that tells us the self-supervised features are rich enough that species information is accessible with a trivial classifier.
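Here's a sketch of that protocol; the names are mine, and the loop is deliberately bare-bones (no scheduler, no validation).

import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(backbone, feat_dim, num_classes, train_loader, epochs=10, lr=1e-3):
    """Freeze the backbone, train a single linear layer on top."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False                       # no gradients reach the pretrained encoder

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)              # frozen features
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head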

Fine-tuning

Initialize the backbone with the pretrained weights, but unfreeze everything. Train the whole model end-to-end on the labeled downstream dataset. Report the accuracy.

This is the practical test. It measures how good the self-supervised backbone is as an initialization — a starting point for supervised training. Fine-tuning almost always gives higher accuracy than linear probing, because the backbone can adapt its features to the specific downstream task. The gap between linear probing accuracy and fine-tuning accuracy tells you how much task-specific adaptation the features need.

Most SSL papers report both. Linear probing is the apples-to-apples comparison of feature quality. Fine-tuning is the "what can I actually get in practice" number. When choosing a pretrained backbone for a real project, the fine-tuning number is what matters. When comparing SSL methods against each other, linear probing is the cleaner signal.

When SSL Beats Supervised Pretraining

Not every project needs self-supervised learning. Here's the honest assessment of when it helps and when it doesn't, based on what the empirical literature consistently shows.

SSL pretraining wins when there's a domain shift between the available labeled data and your target domain. ImageNet features are great if your images look like internet photos. They're mediocre for pathology slides, X-rays, satellite imagery, underwater footage, or trail camera photos taken at night with infrared. If your domain is visually different from the web images that supervised models were trained on, SSL on your own unlabeled domain data produces better features.

SSL also wins when labels are scarce. If you have 50 labels and 50,000 unlabeled images, a supervised model trains on 50 examples. An SSL backbone trains on all 50,000, then fine-tunes on the 50 labels. The SSL backbone has seen vastly more of the data distribution. Empirically, this advantage is most pronounced below about 1,000 labels and largely disappears above 10,000.

And SSL wins when you need one backbone for many tasks. If the trail camera project eventually needs a species classifier, a behavior detector, a population counter, and an anomaly detector, training a separate supervised model for each is wasteful. A self-supervised backbone, trained once on all the unlabeled data, provides features that transfer to all four tasks. This is the foundation model philosophy distilled to its essence.

The honest counterpoint: if you have 10,000+ labeled examples and your domain is close to ImageNet, a supervised pretrained backbone (like a torchvision ResNet or a pretrained ViT) fine-tuned on your labels will match or beat SSL in practice, with less engineering effort. Supervised pretrained backbones have been refined and battle-tested for over a decade. They're well-understood, well-documented, and fast to train. Don't use SSL for the novelty — use it because your data situation demands it.

And here's the shortcut most practitioners take: they never train SimCLR or BYOL themselves. They grab DINOv2 or CLIP features off the shelf, freeze the backbone, and train a lightweight head on their labels. That's still leveraging self-supervised learning — someone else's self-supervised learning. Don't reinvent the pretraining wheel unless your domain is genuinely unusual.
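If you go that route, the DINOv2 repository publishes torch.hub entry points. A minimal sketch follows; the model name is taken from the repo's README at the time of writing, so check it for the current options.

import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")  # downloads weights on first call
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))   # one global feature vector per image
# From here, the linear_probe sketch above (or any lightweight head) goes on top of `feats`.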

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a labeling bottleneck — 50,000 trail camera photos and only 47 labels. We saw how pretext tasks like rotation prediction, jigsaw puzzles, and colorization taught models to understand images by repairing damage. We watched contrastive learning replace those hand-designed tasks with a more general principle: pull similar things together, push different things apart. We traced the evolution from SimCLR's big-batch approach through MoCo's momentum queue, to BYOL and DINO's startling demonstration that you can drop negatives entirely. We saw MAE bring masked prediction — BERT's trick — to vision, and CLIP extend contrastive learning across images and text. And we built the vocabulary to evaluate all of it: linear probing for feature quality, fine-tuning for practical performance.

My hope is that the next time you see "self-supervised pretraining" in a model card or paper abstract, instead of mentally filing it as "unsupervised with extra steps," you'll have a pretty darn good mental model of what happened during those pretraining hours — which pairs were pulled together, which were pushed apart, and why the resulting features transfer to tasks the model never saw during training.

Resources and Credits