Self-Supervised & Representation Learning

Chapter 6: Unsupervised Learning
How modern AI learns without labels

The Uncomfortable Truth About Labels

I avoided self-supervised learning for longer than I'd like to admit. Every time someone mentioned contrastive losses or momentum encoders, I'd nod along and change the subject. I'd been comfortable in the supervised world — hand me labeled data, train a model, check the metrics. But then I started noticing that every major breakthrough in the past five years — BERT, GPT, DINO, the foundation models everyone keeps talking about — all started with self-supervised pretraining. The discomfort of not understanding what was happening under the hood finally grew too great. Here is that dive.

Self-supervised learning is a family of techniques where the model generates its own training signal from raw, unlabeled data. Instead of asking a human "is this a cat?", you hide part of the data and ask the model to reconstruct it. The idea has roots going back to autoencoders in the 1980s, but it became the dominant paradigm in AI starting around 2018 with BERT, and has since spread from language to vision, audio, and beyond.

Before we start, a heads-up. We're going to cover contrastive losses, momentum encoders, cross-correlation matrices, and a whole zoo of acronyms. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The Map We'll Follow

Why labels are the bottleneck
What a representation actually is
Pretext tasks — making data supervise itself
The similarity game — contrastive learning from scratch
SimCLR — elegant but expensive
MoCo — the queue that changed everything
Rest stop
Dropping negatives — BYOL and the collapse problem
Barlow Twins — redundancy reduction
DINO — when a model learns to see without being told what to look at
MAE — the BERT of images
Using representations — linear probes and fine-tuning
The full picture

Why Labels Are the Bottleneck

Imagine you're building a photo organizing app. You want it to group similar photos together — all the beach sunsets here, all the birthday parties there, all the dogs over in that corner. The supervised approach says: hire people to tag a million photos, then train a classifier. ImageNet went that route. It took years, cost millions of dollars, and required tens of thousands of annotators to label 14 million images across more than 20,000 categories.

Medical imaging is worse. You need board-certified radiologists who charge hundreds of dollars per hour to label chest X-rays. Legal documents need lawyers. Satellite imagery needs geospatial experts. Meanwhile, the internet generates roughly 2.5 quintillion bytes of unlabeled data every single day. There's a staggering mismatch between how much data exists and how much of it has been labeled by a human.

Self-supervised learning grew out of a practical frustration: we're drowning in data but starving for labels. The insight was this — what if the data could label itself?

What a Representation Actually Is

Before we go further, we need to be precise about one word that gets thrown around a lot: representation. When a neural network processes an image or a sentence, each layer transforms the raw input into a new set of numbers — a vector. That vector is a representation. It's a compressed summary of the input, as seen through the lens of whatever the network has learned to care about.
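
To make that concrete, here's a minimal sketch of pulling a representation out of a standard vision backbone. It assumes torchvision is available and uses an untrained ResNet-50 purely for illustration; the point is only that "representation" means the vector the encoder produces.

import torch
from torchvision import models

# strip off the classification layer; what remains maps an image
# to a 2048-dimensional vector, and that vector is the representation
backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()

image_batch = torch.randn(4, 3, 224, 224)   # stand-in for four real photos
representations = backbone(image_batch)     # shape: (4, 2048)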

Think of it like a cartographer drawing a map. The physical terrain — mountains, rivers, roads — is the raw data. The map is the representation. A hiking map emphasizes trails and elevation. A road map emphasizes highways and cities. Same terrain, different maps, because different tasks need different features highlighted. A good representation is a map that happens to be useful for many tasks at once — one where the important features are easy to read off and the irrelevant details have been stripped away.

Our photo organizing app, for instance, doesn't need a representation that preserves every pixel. It needs one where photos of the same kind of scene end up with similar vectors, and photos of different scenes end up far apart. The entire field of self-supervised learning is ultimately about building the best possible map without anyone telling the cartographer what the important landmarks are.

We'll come back to this cartography analogy. It keeps paying off.

Pretext Tasks — Making Data Supervise Itself

Here's the trick. You take unlabeled data, deliberately damage it in some known way, and then train a model to undo the damage. The model doesn't care about your vandalism. What matters is that in order to fix the damage, it has to understand the structure of the data — and that understanding is exactly the representation we want.

A pretext task is the name for this kind of artificial puzzle. Let's walk through a tiny example to see why it works.

Take three photos from our organizing app: a dog on a beach, a cat on a couch, and a bird in a tree. Now rotate each photo by a random amount — 0°, 90°, 180°, or 270°. Show the model the rotated photo and ask: how much was this rotated? That's a pretext task called rotation prediction. To get the answer right, the model has to understand what "upright" means for dogs, couches, trees, and beaches. It has to learn about gravity, object orientation, and scene structure. None of that was in the labels — it emerged from the task.
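
A sketch of how the training data for rotation prediction gets manufactured, assuming images arrive as (batch, channels, height, width) tensors. Notice that the labels come from the corruption we applied, not from a human.

import torch

def make_rotation_batch(images):
    # images: (batch, 3, H, W). Rotate each image by a random multiple of 90°
    # and return the rotation class (0, 1, 2, 3) as the self-generated label.
    labels = torch.randint(0, 4, (images.shape[0],))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels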

The same pattern appears everywhere. In NLP, BERT hides 15% of the words in a sentence and asks the model to fill in the blanks — that's masked language modeling. To predict that "The cat sat on the ___" should end with "mat," the model has to learn grammar, semantics, even some world knowledge. GPT does something even more direct: given all the words so far, predict the next one. That's next token prediction, and it turned out to be so powerful that it bootstrapped the entire large language model revolution.
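
The masking step itself is almost trivially simple. Here's a simplified sketch (real BERT also sometimes keeps the original token or swaps in a random one); mask_token_id stands for whatever id the tokenizer reserves for its mask symbol.

import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    # token_ids: (batch, seq_len) integer ids from a tokenizer
    mask = torch.rand(token_ids.shape) < mask_prob
    targets = token_ids.clone()
    targets[~mask] = -100              # ignored by PyTorch's cross-entropy loss
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id    # the model sees this and predicts targets
    return corrupted, targets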

In vision, you can shuffle the patches of an image like a jigsaw puzzle and ask the model to figure out the original arrangement. You can convert a color photo to grayscale and ask the model to predict the original colors. Each of these tasks forces the model to internalize something meaningful about the data.

The pretext task is the excuse. The representation — the map the model builds internally — is the actual prize.

There's a catch, though. These tasks are hand-designed, and the features that are optimal for "guess the rotation angle" don't always transfer well to "detect tumors in a chest X-ray." The rotation-predicting model might learn a lot about which way gravity points but not much about texture differences between healthy and cancerous tissue. This limitation is what pushed the field toward a more general approach.

The Similarity Game — Contrastive Learning from Scratch

Instead of designing a clever puzzle, what if we asked the model to answer an even more fundamental question: which things are similar, and which things are different?

Let's make this concrete with our photo app. Take the dog-on-beach photo. Create two augmented versions of it — one randomly cropped to show mostly the dog, the other color-jittered to make the beach look sunset-orange. These two versions still show the same content, the same scene. They form what's called a positive pair. Now take the cat photo and the bird photo. These are different content entirely — they form negative pairs with our dog views.

The idea behind contrastive learning is to train the model so that the representations of the two dog views end up close together in the vector space, while the representations of the dog and the cat (and the dog and the bird) end up far apart. If the model can do this reliably across thousands of images, the representations it builds will capture what makes each image unique — and that's a map useful for almost any downstream task.

The loss function that drives this is called InfoNCE, introduced by van den Oord and colleagues in 2018. It works like a softmax over cosine similarities. For each positive pair, we want their similarity to be high relative to all the negative pairs in the batch. The math looks dense, but the intuition is clean: maximize the similarity score of the positive pair while minimizing the scores of all the negatives.

import torch
import torch.nn.functional as F

def info_nce_loss(z_i, z_j, temperature=0.07):
    # z_i, z_j: (batch_size, dim) embeddings of the two augmented views
    batch_size = z_i.shape[0]
    z = torch.cat([z_i, z_j], dim=0)
    z = F.normalize(z, dim=1)

    # cosine similarity between all pairs, scaled by temperature
    sim = z @ z.T / temperature

    # don't let an image match with itself
    mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, -1e9)

    # each image's positive is its augmented partner
    labels = torch.cat([torch.arange(batch_size, 2 * batch_size),
                        torch.arange(0, batch_size)]).to(z.device)

    return F.cross_entropy(sim, labels)

That temperature parameter is worth pausing on. It controls how sharply the model distinguishes between positives and negatives. A low temperature (like 0.07) produces a very peaked distribution — small differences in similarity get amplified, so the loss is dominated by the hardest cases, the negatives that happen to look most similar to the positive. A higher temperature softens the distribution, making the model more forgiving. Most contrastive methods use a temperature between 0.05 and 0.2.
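
A quick way to see the effect is to push the same similarity scores through a softmax at two different temperatures. The numbers here are made up; only the shape of the output matters.

import torch
import torch.nn.functional as F

# similarities of one query against its positive and three negatives
sims = torch.tensor([0.9, 0.7, 0.5, 0.3])

print(F.softmax(sims / 0.07, dim=0))  # roughly [0.94, 0.05, 0.003, 0.0002]: very peaked
print(F.softmax(sims / 0.5, dim=0))   # roughly [0.41, 0.28, 0.19, 0.12]: much softer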

Wang and Isola formalized why contrastive learning works in a 2020 paper. They showed that a good contrastive loss does two things simultaneously: alignment (pulling positive pairs together) and uniformity (spreading all representations evenly across the surface of a hypersphere). Alignment alone would collapse everything to a single point. Uniformity alone would give you random noise. The tension between the two is what produces useful structure.
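
Both quantities can be measured directly on a batch of L2-normalized embeddings. This sketch follows the definitions in the Wang and Isola paper; z_a and z_b hold the embeddings of the two views of each image.

import torch

def alignment(z_a, z_b, alpha=2):
    # average distance between positive pairs: lower means better aligned
    return (z_a - z_b).norm(dim=1).pow(alpha).mean()

def uniformity(z, t=2):
    # how evenly the embeddings spread over the hypersphere: lower is more uniform
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()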

That's the core idea. But making it work in practice required solving some engineering problems that turned into research problems. Two groups solved them in very different ways.

SimCLR — Elegant but Expensive

SimCLR, published by Ting Chen and colleagues at Google in 2020, stripped contrastive learning down to its bare essentials: augment the image, encode it, project it, contrast it. That's the entire method. No memory banks, no special architecture, no tricks. The name stands for "Simple Framework for Contrastive Learning of Visual Representations," and for once, the name is honest.

But SimCLR's elegance came with three hard-won insights that weren't obvious until they ran the experiments.

The first was that augmentation quality matters more than architecture. Random cropping combined with aggressive color jittering turned out to be the single most important design choice — more important than the choice of encoder, the depth of the network, or the dimensionality of the embeddings. The reason is intuitive once you think about it: if the two views of the same image look too similar, the task is too easy and the model learns nothing. It needs views that are genuinely different in appearance but still contain the same semantic content.
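
For reference, a SimCLR-style augmentation pipeline looks roughly like this. The specific strengths and probabilities follow the flavor of the paper, but treat them as illustrative defaults; image is assumed to be a PIL image.

from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# two independent draws produce the two views of a positive pair
view_1, view_2 = simclr_augment(image), simclr_augment(image)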

The second insight was the projection head. SimCLR adds a small two-layer MLP on top of the encoder. The contrastive loss is computed on the output of this projection head, but when it's time to use the features for downstream tasks, you throw the projection head away and use the encoder's output directly. I'll be honest — when I first read that, I didn't believe it could matter that much. But it turned out that the projection head absorbs information about the augmentations (it "learns" what color jitter is), freeing the encoder to focus on semantic content. The difference in downstream accuracy is substantial — sometimes 10 percentage points.
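
The projection head itself is tiny. A minimal sketch, assuming a ResNet-50 backbone with 2048-dimensional output; encoder and view are placeholders for whatever backbone and augmented image you're using.

import torch.nn as nn

projection_head = nn.Sequential(
    nn.Linear(2048, 2048),   # 2048 = ResNet-50 feature dimension
    nn.ReLU(),
    nn.Linear(2048, 128),    # small space where the contrastive loss lives
)

h = encoder(view)            # the representation you keep for downstream tasks
z = projection_head(h)       # used only inside the contrastive loss, then discarded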

The third insight was also SimCLR's biggest limitation: you need a very large batch size. In contrastive learning, every other image in the batch is a negative. More negatives means a harder, more informative task. SimCLR's best results used batches of 4,096 images (8,192 augmented views), which required 128 TPU v3 cores to train. That's not a lab experiment — that's a Google-sized budget.

This was the pain point. SimCLR proved that contrastive learning works beautifully in principle, but made it impractical for anyone who didn't have a warehouse full of TPUs. And that pain motivated the next breakthrough.

MoCo — The Queue That Changed Everything

MoCo, short for Momentum Contrast, was published by Kaiming He and colleagues at Facebook AI around the same time as SimCLR. It asked a pointed question: why should the number of negatives be shackled to the batch size?

The answer was a queue. MoCo maintains a rolling buffer of recent embeddings — 65,536 of them in the original paper. Each training step, the new batch's embeddings get pushed onto the queue, and the oldest ones get popped off. For the contrastive loss, each image is compared not against the other images in its batch, but against every entry in this enormous queue. Suddenly you have tens of thousands of negatives without needing to fit them all in GPU memory at once.
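
One way to implement the queue update, assuming the queue is a preallocated (queue_size, dim) tensor and that queue_size is a multiple of the batch size, as in the paper:

import torch

QUEUE_SIZE = 65536

def dequeue_and_enqueue(queue, queue_ptr, new_keys):
    # overwrite the oldest entries with the newest batch of key embeddings
    batch_size = new_keys.shape[0]
    queue[queue_ptr:queue_ptr + batch_size] = new_keys.detach()
    return (queue_ptr + batch_size) % QUEUE_SIZE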

But there's a subtlety that took me a while to appreciate. If the encoder changes rapidly during training (as it does with gradient descent), the embeddings in the queue come from different versions of the model. The oldest entries might have been computed by a model that looked very different from the current one. Comparing against stale, inconsistent embeddings degrades the quality of the learning signal.

MoCo's solution was the momentum encoder. Instead of using the same encoder for both the query (current image) and the keys (queue entries), MoCo uses two encoders. The query encoder is updated normally by gradient descent. The key encoder is updated slowly, as an exponential moving average of the query encoder.

# the key encoder drifts slowly, keeping queue entries consistent
momentum = 0.999
for param_q, param_k in zip(query_encoder.parameters(),
                             key_encoder.parameters()):
    param_k.data = momentum * param_k.data + (1 - momentum) * param_q.data

That momentum of 0.999 means the key encoder changes very slowly — each update is 99.9% the old weights and only 0.1% the new ones. This ensures that the embeddings sitting in the queue, even the ones computed many steps ago, are all from approximately the same model. Consistency preserved.

MoCo was the first self-supervised method whose pretrained features matched, and on several detection and segmentation benchmarks beat, supervised ImageNet pretraining — and it did it on a single machine with 8 GPUs. No TPU warehouse required. MoCo v2 later borrowed SimCLR's augmentation strategy and projection head, combining the best ideas from both camps. The field was converging.

Going back to our cartography analogy: SimCLR said "to draw a good map, compare every landmark against every other landmark in a huge survey." MoCo said "keep a reference library of past surveys, and compare new landmarks against the library." Both produce good maps. MoCo's approach is much cheaper.

Rest Stop

If you've made it this far, congratulations. You can stop here if you want.

You now have a mental model of the core self-supervised learning pipeline: take unlabeled data, create an artificial task (predicting masks, contrasting augmented views), train a model on that task, and harvest the learned representations for downstream use. You understand that contrastive learning works by pulling similar things together and pushing different things apart, and you know two major implementations — SimCLR (simple but expensive) and MoCo (efficient via a momentum-updated queue).

That's enough to follow most conversations about self-supervised learning and to understand why pretrained models work the way they do. You're about 60% of the way through the full picture.

What this mental model doesn't yet explain is why some of the best methods don't use negative pairs at all, how a model can learn to segment objects without ever being shown a segmentation label, and why masking 75% of an image produces better representations than masking 25%. Those are the questions that drove the next wave of research — and they're genuinely surprising.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Dropping Negatives — BYOL and the Collapse Problem

Everything we've built so far relies on negative pairs. We pull positive pairs together, and we need negatives to push against — otherwise, why wouldn't the model collapse everything to a single point? If the loss only rewards similarity, the trivial solution is to make everything identical. Zero loss, nothing learned. This failure mode is called representation collapse, and it haunted the early self-supervised learning literature.

So when BYOL — Bootstrap Your Own Latent — was published by DeepMind in 2020, and the authors claimed it worked without any negative pairs and outperformed both SimCLR and MoCo, the reaction from the community was something between fascination and suspicion.

BYOL uses two networks. The online network has an encoder, a projector, and a predictor. The target network has the same encoder and projector, but no predictor. The target network's weights are updated slowly via exponential moving average of the online network — exactly like MoCo's momentum encoder. During training, both networks see different augmented views of the same image. The online network's job is to predict what the target network will output for the other view.

view_a, view_b = augment(image), augment(image)

# online network has the extra predictor head
prediction_a = online_net(view_a)
prediction_b = online_net(view_b)

# target network: no predictor, no gradients
with torch.no_grad():
    target_a = target_net(view_a)
    target_b = target_net(view_b)

# predict each view from the other
loss = cosine_distance(prediction_a, target_b) + cosine_distance(prediction_b, target_a)
loss.backward()

# target drifts slowly behind the online network
ema_update(online_net, target_net, momentum=0.996)
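
The two helpers in that snippet are doing something specific, so here's one way they could be filled in. The names cosine_distance and ema_update are placeholders from the sketch above, not functions from any library.

import torch.nn.functional as F

def cosine_distance(p, z):
    # BYOL's loss term: 2 - 2 * cos(p, z), averaged over the batch
    return 2 - 2 * F.cosine_similarity(p, z, dim=-1).mean()

def ema_update(online_net, target_net, momentum=0.996):
    # target weights drift slowly toward the online weights
    for p_online, p_target in zip(online_net.parameters(),
                                  target_net.parameters()):
        p_target.data = momentum * p_target.data + (1 - momentum) * p_online.data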

Why doesn't this collapse? I'll be honest — I'm still developing my full intuition for it, and I'm not alone. The research community has debated this extensively. But the consensus points to three things working together.

First, the predictor head in the online network breaks the symmetry. It means the two networks aren't doing the same computation, so the trivial solution where both output constants isn't actually a stable minimum of the loss landscape — it's more like balancing a ball on top of a hill. Technically reachable, practically unreachable.

Second, the stop-gradient on the target network means gradients only flow through the online network. The target can't "cooperate" with the online network to find a shortcut. It's a moving reference point that the online network chases, but it moves slowly enough to provide a stable signal.

Third, the EMA update creates a kind of temporal ensembling. The target network is an average over many recent snapshots of the online network, which smooths out noise and provides a richer signal than any single snapshot would.

Think of it this way, using our cartography analogy. The online network is an apprentice cartographer, and the target network is the mentor. The apprentice draws a map, shows it to the mentor, and the mentor says "that's close to what I see, but here's where you're off." The mentor's own map changes slowly over time as they learn from the same terrain. The apprentice can't cheat by drawing a blank map because the mentor's map isn't blank — and the mentor can't be fooled because they have their own independent view.

Chen and He later published SimSiam (2021), which stripped BYOL even further — removing the momentum encoder entirely and relying solely on the stop-gradient and predictor asymmetry. It still worked. That was a strong signal that the architectural asymmetry, not the momentum, was doing the heavy lifting against collapse.

Barlow Twins — Redundancy Reduction

While BYOL solved the collapse problem through architectural tricks, Barlow Twins (Zbontar et al., 2021) took an entirely different road. It was inspired by a 60-year-old idea from neuroscience.

In 1961, neuroscientist Horace Barlow proposed the redundancy reduction hypothesis: that the goal of sensory processing in the brain is to produce a neural code where different neurons carry different, non-redundant information about the world. Your visual cortex doesn't have a thousand neurons all encoding "is it bright?" — that would be wasteful. Instead, each neuron captures a distinct aspect of the scene.

Barlow Twins translates this directly into a loss function. Take two augmented views of the same image, encode them both, and compute the cross-correlation matrix between the two sets of embeddings across the batch. This matrix has shape D × D, where D is the embedding dimension. Each entry tells you how correlated one feature in view A is with one feature in view B.

The objective is to push this matrix toward the identity matrix. Diagonal entries should be 1 — meaning each feature captures the same invariant content regardless of which augmentation was applied. Off-diagonal entries should be 0 — meaning different features capture different, non-redundant information.

def barlow_twins_loss(z_a, z_b, trade_off=0.005):
    # normalize each feature across the batch
    z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + 1e-6)
    z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + 1e-6)

    # cross-correlation matrix (D × D)
    correlation = (z_a.T @ z_b) / z_a.shape[0]

    # diagonal: each feature should agree across views
    invariance = ((1 - correlation.diagonal()) ** 2).sum()

    # off-diagonal: different features should be independent
    off_diag_mask = ~torch.eye(correlation.shape[0], dtype=torch.bool,
                               device=correlation.device)
    redundancy = (correlation[off_diag_mask] ** 2).sum()

    return invariance + trade_off * redundancy

No negatives. No momentum encoder. No asymmetric architecture. No stop-gradient. The collapse prevention is explicit and built into the loss itself — if features become redundant (correlated), the off-diagonal penalty catches it. If features become invariant to the actual content (all zeros), the diagonal penalty catches it.

VICReg (Bardes et al., 2022) later formalized this even further with three explicit terms: Variance (prevent collapse by ensuring each feature has sufficient spread), Invariance (pull positive pairs together), and Covariance (decorrelate features). Same family of ideas, made even more transparent.
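
Here's a sketch of the VICReg loss with the three terms spelled out. It roughly follows the paper's formulation and default weights, but treat it as illustrative rather than a reference implementation.

import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    n, d = z_a.shape

    # invariance: pull the two views of each image together
    invariance = F.mse_loss(z_a, z_b)

    # variance: keep each feature's standard deviation above 1 (prevents collapse)
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    variance = torch.relu(1 - std_a).mean() + torch.relu(1 - std_b).mean()

    # covariance: decorrelate different features within each view
    z_a_c, z_b_c = z_a - z_a.mean(dim=0), z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = ~torch.eye(d, dtype=torch.bool, device=z_a.device)
    covariance = (cov_a[off_diag] ** 2).sum() / d + (cov_b[off_diag] ** 2).sum() / d

    return sim_w * invariance + var_w * variance + cov_w * covariance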

I find Barlow Twins satisfying in a way the other methods aren't. There's no mystery about why it doesn't collapse — the loss function explicitly penalizes the two ways collapse could happen. It's the kind of method where reading the paper once is enough to fully understand it, which is rare in this field.

DINO — When a Model Learns to See Without Being Told What to Look At

DINO — Self-Distillation with No Labels — was published by Caron and colleagues at Facebook AI in 2021, and it produced what I consider the most visually striking result in self-supervised learning.

The architecture is conceptually close to BYOL: a student network and a teacher network (the teacher is an EMA of the student), both processing different augmented views. DINO adds a twist — it uses a Vision Transformer (ViT) as the backbone instead of a CNN, and it uses a centering and sharpening operation on the teacher's output to prevent collapse (an alternative to the predictor head used in BYOL).
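
The collapse-prevention machinery fits in a few lines. A simplified sketch of the DINO loss, where center is a running mean of recent teacher outputs and the teacher's outputs are computed under torch.no_grad() upstream:

import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    # centering (subtract a running mean) and sharpening (low temperature)
    # push the teacher away from the two collapse modes: a uniform output
    # and a single dominant dimension
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    # cross-entropy: the student matches the teacher's sharpened distribution
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# the center is itself an exponential moving average of teacher outputs, e.g.
# center = 0.9 * center + 0.1 * teacher_out.mean(dim=0)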

But what made DINO famous wasn't the architecture. It was what emerged from the trained model. When you visualize the self-attention maps of a DINO-trained ViT, you see something remarkable: the attention heads learn to segment objects from their backgrounds. A dog stands out sharply from the grass. A car separates cleanly from the street. No segmentation labels were ever provided during training. The model discovered object boundaries purely from the statistical structure of images and the self-supervised objective.

My favorite thing about this result is that, aside from high-level explanations about how contrastive-style objectives encourage the model to focus on "the thing that's consistent across augmentations," no one is completely certain why it works so well. The emergent segmentation wasn't designed into the method. It appeared.

DINOv2 (2023, Meta AI) scaled this up dramatically — larger models, more diverse training data, better augmentation — and produced what many consider the best general-purpose visual feature extractor available today. DINOv2 features transfer well to classification, detection, segmentation, and depth estimation, often matching or beating supervised models trained specifically for those tasks. The cartographer has drawn a genuinely universal map.

MAE — The BERT of Images

While contrastive and non-contrastive methods were battling over how to use pairs of augmented views, Kaiming He (who also developed MoCo) took a step back and asked: what if we borrowed BERT's approach directly? In language, masking words and predicting them produced extraordinary representations. Could the same idea work for images?

MAE — Masked Autoencoders Are Scalable Vision Learners — was published in 2022, and the answer was a resounding yes, but with a twist. Where BERT masks 15% of tokens, MAE masks 75% of image patches. Three quarters of the image is hidden, and the model has to reconstruct the missing patches from the visible 25%.

The architecture has two parts. A Vision Transformer encoder processes only the visible patches — the masked patches aren't fed to it at all, which makes encoding very efficient (you're processing a quarter of the usual input). Then a lightweight decoder takes the encoded patches plus learnable mask tokens for the missing positions and reconstructs the full image. The loss is mean squared error on the pixel values of the masked patches only.
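
The masking itself is just a random shuffle per image. A simplified sketch of the idea (the real implementation also tracks the indices needed to un-shuffle patches for the decoder):

import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, dim), the image already split into patch embeddings
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    # a random permutation per image; keep only the first num_keep patches
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_shuffle   # the encoder only ever sees visible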

Why 75%? I'll be honest — the exact optimal masking ratio is still somewhat empirical. But the intuition is that images have much more spatial redundancy than text. If you mask 15% of an image, a model can fill in the gaps through local interpolation without understanding the global structure. Hide a small patch of sky, and you can guess it's blue from the neighboring patches. But mask 75%? Now the model needs to understand what a dog looks like, what a beach looks like, how they relate spatially — it has to build an actual understanding of the scene to hallucinate three quarters of the image from a few scattered fragments.

MAE is remarkably efficient to train — processing only 25% of patches means the encoder runs four times faster than it would on the full image. It also produces excellent representations that transfer well to downstream tasks, though its strength relative to DINO-style methods varies by task and evaluation protocol.

Using Representations — Linear Probes and Fine-Tuning

All the methods we've covered share the same endgame: you get a pretrained encoder that turns raw inputs into useful feature vectors. The question is how to use it. There are two main approaches, and the choice between them reveals something important about the quality of the representations.

A linear probe freezes the encoder entirely — no gradient updates — and trains only a single linear layer on top. If this works well, it means the representations are already organized in a way that makes classes linearly separable. Think of our cartography analogy: the map is so well drawn that you can separate neighborhoods with straight lines. Linear probes are the gold standard for evaluating representation quality in research papers.

Fine-tuning unfreezes the encoder and trains everything end-to-end with a small learning rate. This allows the representations to adapt to the specific task. It almost always gives better downstream performance, but it's more expensive and requires more labeled data. The gap between linear probe accuracy and fine-tuning accuracy tells you how much the representation still needs to be reshaped for the task at hand. A small gap means the self-supervised method learned something close to universally useful.
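
In code, the difference between the two comes down to what gets frozen and how hard you push. A minimal sketch, where encoder, feature_dim, and num_classes are placeholders for a pretrained backbone, its output dimension, and the downstream label count:

import torch

# linear probe: freeze the encoder, train only a linear classifier on top
for p in encoder.parameters():
    p.requires_grad = False
probe = torch.nn.Linear(feature_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# fine-tuning: unfreeze everything and train end-to-end with a small learning rate
for p in encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(probe.parameters()), lr=1e-5)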

Yoshua Bengio and colleagues formalized the properties of good representations in a 2013 survey paper. The key properties are: invariance to things that don't matter (lighting changes, camera angle, background clutter), sensitivity to things that do (object identity, scene type, semantic content), and disentanglement — different features capturing different independent factors of variation. Self-supervised methods learn these properties implicitly. Contrastive learning enforces invariance through augmentation. Masking enforces sensitivity to global structure. The cross-correlation objective in Barlow Twins explicitly encourages disentanglement.

The Full Picture

If you're still with me, thank you. I hope it was worth it.

We started with a simple problem: labels are expensive, and we wanted to learn useful representations from raw data. We saw how pretext tasks — hiding words, rotating images — were the first approach, clever but limited by hand design. Contrastive learning generalized the idea: learn by knowing what's similar and what's different, driven by the InfoNCE loss. SimCLR made it work elegantly but demanded enormous compute. MoCo decoupled the number of negatives from the batch size with a queue and a momentum encoder, making contrastive learning practical. Then BYOL shocked the field by dropping negatives entirely, relying on architectural asymmetry to prevent collapse. Barlow Twins simplified further, turning collapse prevention into an explicit redundancy-reduction objective. DINO showed that self-supervised ViTs develop emergent segmentation abilities that nobody designed in. And MAE proved that masking 75% of an image and reconstructing it — the simplest possible self-supervised task — produces representations that rival the best contrastive methods.

The progression from SimCLR to MAE looks inevitable in hindsight, but I can assure you it didn't feel that way in real time. Each step was a genuine surprise. That's worth remembering the next time a new method comes along that seems like it shouldn't work.

My hope is that the next time you see someone mention self-supervised pretraining, or representation learning, or contrastive loss, or any of these acronyms — instead of nodding along and changing the subject like I used to — you'll have a pretty solid mental model of what's happening under the hood, why it works, and what the tradeoffs are.

The self-supervised progression in one breath: Pretext tasks (hand-designed puzzles) → Contrastive learning via SimCLR (needs massive batches for negatives) → MoCo (queue + momentum decouples negatives from batch size) → BYOL (no negatives needed, asymmetry prevents collapse) → Barlow Twins (explicit redundancy reduction, no tricks) → DINO (self-distillation, emergent segmentation in ViTs) → MAE (mask and reconstruct, the BERT of images). Each method solved a concrete limitation of its predecessor.

Resources

A few things that helped me build the understanding I tried to share above:

What You Should Now Be Able To Do