Multimodal Models

Chapter 16: Advanced Deep Learning, Section 3: Flagship Deep Dive

I avoided multimodal AI for longer than I'm proud of. Every time a paper mentioned "vision-language fusion" or "cross-modal alignment," I'd skim the abstract, nod thoughtfully, and move on to something I actually understood. The diagrams always showed two boxes, an arrow between them, and the word "shared space" — as if that explained anything. Finally, the discomfort of not knowing what happens inside those boxes grew too great, and I dove in. This chapter is that dive.

Multimodal models are systems that process more than one type of input — text and images, text and audio, or all of these at once. The idea isn't new (humans are multimodal by default), but the practical ability to build these systems exploded starting around 2021 with CLIP, and has since reshaped everything from search engines to chatbots to document processing.

Before we start, a heads-up. We're going to be talking about embedding spaces, contrastive losses, cross-attention, and a handful of model architectures. You don't need to know any of them beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

What we'll build up:

The single-modality ceiling
A shared space for images and text
How contrastive learning actually trains CLIP
Zero-shot classification from a shared space
The fusion problem — late, early, and cross-attention
Teaching a language model to see (Flamingo, LLaVA)
The embarrassingly simple approach that worked
Answering questions about images
Image captioning and the text-to-image connection
Rest stop and off ramp
Binding six modalities through one anchor (ImageBind)
Reading documents with spatial layout (LayoutLM)
The modality gap — why shared spaces aren't really shared
Unified architectures: GPT-4o and Gemini
Multimodal embeddings in practice
The challenges that remain

The Single-Modality Ceiling

Let's start with a concrete problem. Imagine we're building a smart museum guide. A visitor snaps a photo of a painting and wants to know about it. Our system needs to identify the artwork and return a description.

If we only have text, we could search a database of artwork descriptions by keyword. The visitor types "oil painting, woman, blue background." Maybe we get lucky. Probably we don't — the visitor doesn't know the right words, or the painting is abstract, or a hundred other descriptions could match.

If we only have images, we could do reverse image search — match the photo against a database of known paintings. This works better, but the visitor can't ask follow-up questions. "Who painted this?" requires language. "Is there anything similar in the next gallery?" requires understanding both the visual style and the museum's layout.

The ceiling is clear: a system locked into one modality can only answer questions that modality can express. The real world doesn't work that way. A doctor reads an X-ray while reviewing patient notes. A driver watches the road while listening for sirens. A child learns the word "dog" by hearing it while seeing one. The senses don't work in isolation, and neither should our models.

That observation sounds obvious. But building a system that actually combines modalities requires answering a question that turns out to be surprisingly deep: how do you get a vector of numbers that represents an image and a vector of numbers that represents a sentence to mean the same thing?

A Shared Space for Images and Text

Here's the core idea, and it's worth sitting with for a moment. Imagine you have a 512-dimensional space — a coordinate system with 512 axes. Now imagine you train two separate neural networks: one takes in images and outputs a point in that 512-D space, and one takes in text and outputs a point in the same space. If you train them so that a photo of a golden retriever and the sentence "a photo of a golden retriever" end up near each other, while a photo of a car and the sentence "a photo of a golden retriever" end up far apart — then you've created something powerful.

You've created a translation layer. Not between English and French, but between seeing and reading.

Think of it like the translation desk at the United Nations. A speech in Mandarin and a speech in Arabic might express the same idea. The interpreters' job is to convert both into a shared representation — some internal understanding — that preserves the meaning. Our two neural networks are doing the same thing. The image encoder is the visual interpreter. The text encoder is the language interpreter. The shared 512-D space is the internal understanding they both map into.

This is the foundation of CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, and it's the Rosetta Stone of modern multimodal AI. Almost everything that followed builds on this idea. Let's see how it actually gets trained.

How Contrastive Learning Actually Trains CLIP

Let's go back to our museum guide. Suppose we have three photos of artworks, each with a caption:

Photo 1: a landscape painting → "Rolling hills under a cloudy sky"
Photo 2: an abstract sculpture → "Twisted metal reaching upward"
Photo 3: a portrait → "Woman with a pearl earring"

We feed all three images through the image encoder to get three vectors. We feed all three captions through the text encoder to get three more vectors. Now we compute the similarity — the cosine similarity, which is a fancy way of saying "how much do these two arrows point in the same direction" — between every image vector and every text vector. That gives us a 3×3 grid.

The diagonal entries are the correct pairs: Photo 1 with Caption 1, Photo 2 with Caption 2, Photo 3 with Caption 3. Everything off the diagonal is a wrong pair. Training pushes the diagonal similarities up and the off-diagonal similarities down. That's it. That's the entire training signal.

The real version of this uses batches of 32,768 pairs, not 3. That number isn't a typo. With contrastive learning, every other sample in the batch serves as a negative example — a wrong pairing that teaches the model what doesn't match. A batch of 3 gives you 2 negatives per sample. A batch of 32,768 gives you 32,767. More negatives means a harder task, which means the model has to learn more nuanced representations to succeed. If you're ever training a contrastive model and it's not working, insufficient batch size is the first thing to check.

There's one more detail that matters: the temperature parameter. The similarity scores get divided by a learned temperature τ before being passed through a softmax. A lower temperature makes the distribution more peaked — the model has to be more confident about which pairs match. A higher temperature makes it more permissive. CLIP learns this temperature during training rather than fixing it, which lets the model calibrate its own confidence.

The loss function is symmetric: we compute cross-entropy from image→text (which image matches which caption?) and from text→image (which caption matches which image?), then average them. Both directions matter because the relationship is bidirectional — understanding goes both ways at the UN translation desk.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPContrastiveLoss(nn.Module):
    """The actual training objective behind CLIP.
    This is the mechanism that creates a shared image-text space."""

    def __init__(self, initial_temperature=0.07):
        super().__init__()
        self.log_temp = nn.Parameter(
            torch.log(torch.tensor(1.0 / initial_temperature))
        )

    def forward(self, image_embeddings, text_embeddings):
        # Both encoders output vectors. We normalize them to unit length
        # so the dot product becomes cosine similarity.
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        temperature = self.log_temp.exp().clamp(max=100)

        # The 3×3 grid (or 32768×32768 in real training).
        # Each row: one image compared against ALL texts.
        # Each column: one text compared against ALL images.
        similarity = (image_embeddings @ text_embeddings.T) * temperature

        # The correct answer: pair 0 matches pair 0, pair 1 matches pair 1...
        targets = torch.arange(similarity.shape[0], device=similarity.device)

        # Image→text: for each image, which text is the right one?
        loss_i2t = F.cross_entropy(similarity, targets)
        # Text→image: for each text, which image is the right one?
        loss_t2i = F.cross_entropy(similarity.T, targets)

        return (loss_i2t + loss_t2i) / 2

That code block is the engine that makes the entire shared space work. The similarity matrix is the 3×3 grid (or N×N grid). The targets diagonal says "pair i should match pair i." The cross-entropy loss does the pushing — diagonal up, off-diagonal down. And the temperature controls how sharp the model's opinions have to be.
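
If you want to sanity-check this loss, here's a quick smoke test with random vectors standing in for encoder outputs. Nothing here is trained; the point is just that the numbers come out where the math says they should.

import torch

loss_fn = CLIPContrastiveLoss()
image_vecs = torch.randn(8, 512)  # stand-ins for a batch of 8 image embeddings
text_vecs = torch.randn(8, 512)   # the 8 matching text embeddings, same order
# for random, unrelated inputs the loss sits near log(8) ≈ 2.1:
# no better than guessing among 8 candidates
print(loss_fn(image_vecs, text_vecs).item())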

Zero-Shot Classification from a Shared Space

Here's where the payoff of a shared space becomes visceral. After CLIP finishes training, you have two encoders that map images and text into the same coordinates, and you never need to fine-tune them again.

Back at our museum. A visitor photographs a painting we've never catalogued. We don't have a classifier trained for this painting. But we can write text descriptions of possible categories — "a landscape painting," "a portrait," "an abstract sculpture," "a still life" — and encode each one. Then we encode the visitor's photo and check which text description lands closest in the shared space.

That's zero-shot classification. No training data for the specific task. No fine-tuning. The model has never seen these categories as classification labels — it learned from natural language captions, and that's enough. CLIP matched or beat models that were fully supervised on ImageNet, on benchmarks it had never trained for. That result shocked the field.

One subtle detail that took me a while to appreciate: prompt engineering matters here too. "dog" as a text input works worse than "a photo of a dog" because CLIP was trained on natural captions, not single words. The original paper tested 80 different prompt templates and found that ensembling them improved accuracy by 3-5%. The lesson: the text encoder learned the distribution of how people describe things, not a dictionary of labels.
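
Putting these pieces together, here's a minimal zero-shot classifier sketch built on the Hugging Face CLIP wrappers (the checkpoint name and the prompt template are common defaults, not requirements):

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def classify(image, labels):
    # "a photo of ..." matches the caption distribution CLIP trained on
    prompts = [f"a photo of {label}" for label in labels]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    # the temperature CLIP learned during training sharpens the similarities
    logits = model.logit_scale.exp() * img @ txt.T
    return dict(zip(labels, logits.softmax(dim=-1).squeeze(0).tolist()))

# classify(photo, ["a landscape painting", "a portrait", "an abstract sculpture"])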

But zero-shot classification is where CLIP's limitations also become apparent. "Three cats on a sofa" and "one cat on a sofa" produce nearly identical embeddings. "A cup to the left of a plate" and "a plate to the left of a cup" are hard to distinguish. CLIP doesn't count, and it doesn't really understand spatial relationships. This makes sense when you think about it — the model was trained on whole-image, whole-caption pairs. It learns what's in the image, not where things are or how many there are.

That limitation creates a pull toward something more capable. We can match images and text. But we can't answer questions. We can't say why something is in the image, or generate new text that describes what we see. For that, we need to wire vision into a language model.

The Fusion Problem: How Do Modalities Combine?

Here's the question every multimodal system has to confront: at what point do you combine the information from different senses? The answer to this shapes everything about what the model can do, how expensive it is, and how hard it is to build.

Let's use our museum guide to think about this concretely, with three scenarios.

Scenario 1: Late fusion. The visitor's photo goes through an image model that produces a summary: "19th century oil painting, impressionist style, water lilies." Separately, the visitor's spoken question goes through a speech model that produces: "Who painted this?" At the end, some simple logic matches these two summaries. This is late fusion — each modality gets its own specialist, and they only talk at the very end. It's like two experts writing independent reports and then a manager combining them.

Late fusion is the simplest approach. You can plug in any pre-trained image model and any pre-trained text model. Each one is independently debuggable. CLIP is a late fusion model — two separate encoders, combined only at the final similarity computation. But the limitation is that the text never gets to look at specific image patches, and the image never gets to attend to specific words. The conversation between modalities is limited to one exchange at the very end.

Scenario 2: Early fusion. We convert the photo into a sequence of visual tokens (small patches of the image, each represented as a vector) and convert the question into text tokens. Then we concatenate them into one long sequence and process the whole thing through a single transformer. Every visual token can attend to every text token from the very first layer. This is early fusion — throw everything into one pot from the start.

Early fusion gives the richest possible cross-modal interaction. The model can learn that the word "who" should look at the painter's signature in the corner of the image. But it's expensive — you can't reuse pre-trained unimodal models easily, you need aligned tokenization across modalities, and training from scratch at this level requires enormous compute. Google's Gemini and OpenAI's GPT-4o use early fusion. Most teams can't afford to.
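
To make "one pot" concrete, here's a minimal early-fusion sketch. Every dimension and layer count below is illustrative, not taken from any real model:

import torch
import torch.nn as nn

class EarlyFusionSketch(nn.Module):
    """Patch-project the image, embed the text, concatenate into one
    sequence, and let a single transformer attend across both."""

    def __init__(self, vocab_size=32000, d_model=512, patch_dim=768, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)  # visual tokens -> model width
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_ids, image_patches):
        # one mixed sequence: every token sees every other from layer 1
        tokens = torch.cat(
            [self.patch_proj(image_patches), self.text_embed(text_ids)], dim=1
        )
        return self.encoder(tokens)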

Scenario 3: Cross-attention fusion. Keep the separate encoders, but at intermediate layers, let the text model peek into the image model's internal representations. Specifically, the text model generates queries (what information do I need?), and the image model provides keys and values (here's what I have). This is cross-attention — the mechanism we saw in the original transformer for translation, now repurposed to let one modality interrogate another.

Cross-attention sits in the sweet spot. You preserve pre-trained models (the image encoder stays frozen, the language model stays frozen) and add learnable cross-attention layers in between. This is how Flamingo works, how BLIP-2 works, and how most production vision-language models are built.

I'll be honest — I still sometimes second-guess which fusion strategy to pick for a new task. The heuristic I've settled on: start with late fusion for prototyping, move to cross-attention when you need fine-grained reasoning ("what color is the object left of the cat?"), and reach for early fusion only if you have Google-scale compute and Google-scale ambitions.

Teaching a Language Model to See: Cross-Attention in Practice

Let's get concrete about how cross-attention works when text reads from an image. This is the mechanism inside Flamingo, BLIP-2, and most production vision-language models, so it's worth understanding at the mechanical level.

Say we have 20 text tokens (the visitor's question: "What art movement does this painting belong to?") and 196 image patches (a 14×14 grid from a Vision Transformer processing the painting). Each text token becomes a query: "I'm the word 'movement' — what visual information is relevant to me?" Each image patch becomes a key (my address) and a value (my content). The text query computes similarity against all 196 image keys, gets a distribution of attention weights, and reads a weighted combination of image values.

The result: each text token gets enriched with visual information from the specific image patches that are relevant to it. The word "movement" might attend strongly to the brushstroke textures. The word "painting" might attend to the frame and canvas. The word "this" might attend broadly to the whole image.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Text tokens query image patches — the core of Flamingo, BLIP-2, etc.
    This is where 'the language model learns to see.'"""

    def __init__(self, text_dim=768, vision_dim=1024, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = text_dim // num_heads

        # Queries come from text. Keys and values come from the image.
        self.q_proj = nn.Linear(text_dim, text_dim)
        self.k_proj = nn.Linear(vision_dim, text_dim)
        self.v_proj = nn.Linear(vision_dim, text_dim)
        self.out_proj = nn.Linear(text_dim, text_dim)

        # The gate: starts at zero, meaning the model initially ignores
        # visual information entirely. During training, it gradually learns
        # how much vision to mix in. This preserves the LLM's existing
        # language abilities while adding sight.
        self.gate = nn.Parameter(torch.zeros(1))
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens, image_patches):
        residual = text_tokens
        B, T, _ = text_tokens.shape
        P = image_patches.shape[1]

        Q = self.q_proj(text_tokens).view(
            B, T, self.num_heads, self.head_dim
        ).transpose(1, 2)
        K = self.k_proj(image_patches).view(
            B, P, self.num_heads, self.head_dim
        ).transpose(1, 2)
        V = self.v_proj(image_patches).view(
            B, P, self.num_heads, self.head_dim
        ).transpose(1, 2)

        # Each text token attends to all image patches
        attn_weights = (Q @ K.transpose(-2, -1)) * (self.head_dim ** -0.5)
        attn_weights = F.softmax(attn_weights, dim=-1)

        visual_context = (attn_weights @ V).transpose(1, 2).reshape(B, T, -1)
        visual_context = self.out_proj(visual_context)

        # Gated residual: tanh(gate) starts near 0, grows during training
        return self.norm(residual + torch.tanh(self.gate) * visual_context)

The gate in that code is worth pausing on. It's initialized at zero, which means tanh(0) = 0, which means the visual context gets multiplied by zero and thrown away. The model starts as a pure language model. During training, the gate gradually opens, allowing visual information to flow in. This is DeepMind's trick from Flamingo, and it's elegant because it means you never have to worry about the visual encoder's random initial outputs corrupting a perfectly good language model. The model decides for itself how much to trust its new sense of sight.
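
To make the shapes concrete, a quick smoke test with dummy tensors matching the example above (20 text tokens, a 196-patch grid):

import torch

layer = CrossModalAttention()
text_tokens = torch.randn(2, 20, 768)      # the visitor's 20-token question
image_patches = torch.randn(2, 196, 1024)  # 14×14 ViT patch grid
out = layer(text_tokens, image_patches)
print(out.shape)  # torch.Size([2, 20, 768]): same text shape, now vision-aware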

Flamingo: The Careful Architecture

DeepMind's Flamingo (2022) was the first model that could handle interleaved sequences of images and text — like showing it two paintings and asking "How are these different?" The architecture has three key pieces.

First, a frozen vision encoder extracts features from each image. The output is a grid of 196+ tokens per image — one for each patch the Vision Transformer divides the image into.

Second, a Perceiver Resampler. This is where it gets interesting. 196 tokens per image is a lot to feed into a language model at every layer. If the visitor shows us three paintings, that's nearly 600 visual tokens competing for attention with the text tokens. The Perceiver Resampler compresses those down. It works like this: start with a small set of learned query tokens (typically 64). These queries cross-attend to all 196 image patches, distilling the image into a compact representation. Think of it as asking 64 targeted questions about the image and getting 64 focused answers, instead of dumping the entire raw visual field into the language model's lap.

Third, the gated cross-attention layers we saw in the code above, inserted at regular intervals into a frozen LLM. The LLM stays frozen — its language abilities are preserved. Only the cross-attention layers and the Perceiver Resampler get trained.
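
Of the three pieces, the Perceiver Resampler is the least familiar, so here's a single-block sketch of the idea. The dimensions are assumed, and the real Flamingo module stacks several such blocks with feed-forward layers in between:

import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Compress a variable number of patch tokens into a fixed set of
    64 latents: 64 targeted questions, 64 focused answers."""

    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_patches):  # (B, num_patches, dim)
        B = image_patches.shape[0]
        queries = self.latents.unsqueeze(0).expand(B, -1, -1)
        answers, _ = self.attn(queries, image_patches, image_patches)
        return self.norm(queries + answers)  # always (B, 64, dim)

However many patches come in, 64 tokens come out, which is what keeps multiple images from overwhelming the language model's context.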

Flamingo proved that you could add vision to a powerful language model without destroying the language model. But its architecture is complex. There's the resampler, the gated cross-attention, the careful freezing strategy. Someone looked at this complexity and had a radical thought.

LLaVA: The Embarrassingly Simple Approach That Worked

LLaVA (Large Language and Vision Assistant) essentially asked: what if we skip the Perceiver Resampler, skip the gated cross-attention, and do the dumbest possible thing? Take the image features from CLIP's Vision Transformer. Pass them through a small MLP — two linear layers with a GELU activation in between — to project them into the LLM's embedding dimension. Then concatenate those projected visual tokens with the text tokens and feed the whole thing into the LLM as if it were all text.

That's it. No special attention mechanisms. No gates. No frozen components in the second training stage. The visual tokens sit alongside the text tokens, and the LLM's existing self-attention treats them all the same.
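
Here's a sketch of that connector. The dimensions are illustrative; CLIP ViT-L features are 1024-wide and a 7B LLM's embeddings are 4096-wide:

import torch
import torch.nn as nn

class LLaVAStyleProjector(nn.Module):
    """A two-layer MLP that maps vision features into the LLM's embedding
    width, followed by concatenation. That's the whole trick."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_patches, text_embeddings):
        visual_tokens = self.proj(image_patches)  # (B, P, llm_dim)
        # visual tokens sit alongside text tokens; self-attention does the rest
        return torch.cat([visual_tokens, text_embeddings], dim=1)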

The training happens in two stages. First, freeze the LLM and train only the MLP projection on image-caption pairs — teaching the projection layer to map visual features into the right neighborhood of the LLM's embedding space. Second, unfreeze everything and fine-tune on visual instruction-following data — conversations where a user asks questions about an image and gets detailed answers.

My favorite thing about LLaVA is that, aside from high-level explanations about why combining simple components should work, no one is completely certain why it works this well. Flamingo's careful design — the resampler, the gating, the freezing — all seemed necessary. LLaVA showed that a two-layer MLP and brute-force concatenation gets you surprisingly far. The authors used GPT-4 to generate 150,000 visual instruction-following examples, and that training data turned out to matter more than the architecture.

The open-source community seized on this. LLaVA variants now rival commercial systems, built by researchers who don't have Flamingo-scale compute budgets. The lesson: when in doubt, try the simple thing first.

Visual Question Answering: Where Fusion Gets Tested

Visual Question Answering (VQA) is the task that most directly tests whether a model truly understands both modalities. Show it an image. Ask a question. Get an answer.

Back to our museum guide. The visitor photographs a painting and asks: "What art movement is this?" A useful answer requires understanding the visual content (brushstroke style, color palette, composition) and language (the question itself, plus knowledge of art movements). Neither modality alone is sufficient.

Early VQA systems (2015-2019) used a pipeline approach: extract image features with a CNN, encode the question with an LSTM, combine them with element-wise multiplication or a small fusion network, and predict an answer from a fixed vocabulary. These worked for simple questions ("What color is the car?") but failed on anything requiring reasoning ("Is there enough food on the table for everyone?").

Modern VQA is handled by the vision-language models we've been building toward. LLaVA, GPT-4V, and Gemini don't have a fixed answer vocabulary — they generate answers as free-form text. This means they can say "This painting shows characteristics of Post-Impressionism, particularly in its use of bold, non-naturalistic color and visible brushstrokes, reminiscent of Van Gogh's later work." That's a qualitative leap from picking "post-impressionism" out of a dropdown.

But VQA also exposes the limits. A model might correctly answer "What sport is being played?" by recognizing the uniform colors, not by understanding the action. It might answer "How many people are in the photo?" wrong because — as we saw — counting is hard for these models. Getting the right answer for the wrong reasons is one of the persistent challenges in evaluating multimodal systems, and it's one reason human evaluation remains the gold standard for anything you'd deploy to real users.

Image Captioning and the Text-to-Image Connection

Image captioning — generating a natural language description of an image — is VQA's cousin. Instead of answering a specific question, the model describes what it sees. The same vision-language architectures handle both tasks: encode the image, feed the visual features into a language model, and let it generate text.

What makes captioning interesting for our story is its relationship to text-to-image generation. Captioning goes from image → text. Text-to-image goes from text → image. They're inverse problems, and systems like Stable Diffusion use the same CLIP text encoder that powers captioning and VQA to condition the image generation process. The text encoder converts a prompt into embedding vectors, and those vectors guide a diffusion model through cross-attention — the same cross-attention mechanism we saw for vision-language models, now used inside a U-Net denoiser. Each denoising step "reads" the prompt to decide how to refine each region of the image.

I won't go deeper into diffusion here (that's its own chapter), but the connection is worth noting: the same shared embedding space that enables understanding also enables generation. The UN translation desk works in both directions.

Rest Stop and an Off Ramp

Congratulations on making it this far. You can stop if you want.

At this point, you have a solid mental model of multimodal AI. You understand that shared embedding spaces (CLIP) let different modalities talk to each other. You know the three fusion strategies — late, early, and cross-attention — and when to reach for each one. You've seen how Flamingo and LLaVA wire vision into language models, and why the simpler approach sometimes wins. You understand VQA and captioning as tests of genuine cross-modal understanding.

That's genuinely useful. If someone asks you in an interview "how do multimodal models work?", you can give a clear, grounded answer.

It doesn't tell the whole story, though. We haven't talked about what happens when modalities aren't images and text — what about audio, documents with spatial layout, or video? We haven't confronted the uncomfortable fact that "shared embedding spaces" aren't as shared as they look. And we haven't traced the path from adapter-based models to the natively multimodal architectures (GPT-4o, Gemini) that are reshaping the field right now.

If the discomfort of not knowing what's underneath is nagging at you, read on.

Audio-Visual Learning: Beyond Text and Images

So far we've focused on the text-image pair because that's where the field matured first. But humans live in a multimodal world that includes sound, touch, and motion. A model that can watch a video of someone speaking and lip-read when the audio is noisy — that's a genuinely useful multimodal capability that requires combining vision and audio.

The approach for audio follows the same pattern we've already seen. Convert the audio signal into a spectrogram (a visual representation of sound frequencies over time), treat it as an image, and encode it with a vision-like model. Or learn audio tokens directly, similar to how BPE tokenizes text. Either way, you end up with a sequence of vectors that can participate in the same fusion strategies — late fusion, cross-attention, or early fusion — as any other modality.
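
As a sketch of the spectrogram route, using torchaudio (the file path is a placeholder):

import torch
import torchaudio

waveform, sample_rate = torchaudio.load("siren.wav")  # placeholder path
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_mels=128
)(waveform)
log_mel = torch.log(mel + 1e-6)  # (channels, 128, time): an image-like tensor
# from here, patchify and encode it exactly as a ViT would an image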

The more interesting development is ImageBind from Meta (2023), which demonstrated something surprising. You can create a shared embedding space across six modalities — images, text, audio, depth, thermal, and IMU (motion sensor data) — by anchoring everything to a single modality: images.

Here's the trick. You don't need paired data for every combination of modalities. You don't need audio-thermal pairs or depth-text pairs. You need image-text pairs (CLIP already has those), image-audio pairs (videos give you those for free), image-depth pairs (RGB-D cameras), and so on. Images serve as the hub modality. Because every other modality is trained to align with images, they end up aligned with each other — even without explicit pairings. Audio-to-text retrieval works, despite the model never seeing audio-text pairs during training.

This emergent alignment genuinely surprised me when I first read about it. It feels like it shouldn't work — aligning A with C, and B with C, doesn't guarantee A and B align well with each other. But in high-dimensional spaces with enough training data, the transitive property holds well enough to be useful. The UN translation desk analogy extends: if every interpreter translates into the same internal representation, then speakers of any two languages can communicate, even if those two interpreters never worked together directly.

Reading Documents with Layout: LayoutLM

There's a type of multimodal understanding that most people overlook, and it turns out to be enormously valuable in production: document understanding. A form, an invoice, a receipt — these are visual documents where the spatial layout carries meaning. The number "1,234.56" means something very different depending on whether it appears next to "Total Due" or next to "Item Count."

Microsoft's LayoutLM family tackled this by extending BERT in a deceptively simple way. The insight: OCR (optical character recognition) gives you not only the text on a page, but also the bounding box coordinates — where each word physically sits. LayoutLM takes each word's text embedding (same as regular BERT) and adds a 2D position embedding encoding the word's spatial location: the x and y coordinates of the top-left corner and the bottom-right corner of each word's bounding box.

Let's walk through what this means concretely. On an invoice, the word "Total" appears at coordinates (100, 500) and the number "$4,259.00" appears at (350, 500) — same y-coordinate, meaning they're on the same line. The word "Date" appears at (100, 50) with "March 15, 2024" at (350, 50) — also same line, but at the top of the page. Without spatial information, a model would see a jumbled bag of words: "Total", "$4,259.00", "Date", "March 15, 2024". With spatial embeddings, the model understands which values belong together — because things on the same line, near each other, are related.
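
A sketch of the embedding arithmetic, assuming coordinates normalized to a 0-1000 grid as in the paper and BERT-base sizes:

import torch
import torch.nn as nn

class LayoutEmbeddingsSketch(nn.Module):
    """Word embedding plus 2D position embeddings for the word's
    bounding box, LayoutLM-style."""

    def __init__(self, vocab_size=30522, dim=768, max_coord=1001):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.x = nn.Embedding(max_coord, dim)  # shared table for x0 and x1
        self.y = nn.Embedding(max_coord, dim)  # shared table for y0 and y1

    def forward(self, token_ids, boxes):
        # boxes: (B, T, 4) integer coords (x0, y0, x1, y1) in the 0-1000 grid
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.word(token_ids)
                + self.x(x0) + self.y(y0)
                + self.x(x1) + self.y(y1))

Now "Total" at y=500 and "$4,259.00" at y=500 literally share a y-embedding, and attention can learn that shared-y means same-line.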

The later versions (LayoutLMv2, LayoutLMv3) added the actual image pixels as a third modality — so the model sees the text, knows where the text sits, and also sees the visual appearance of the document (lines, boxes, logos, signatures). This turned out to be critical for handling things like checkboxes (is it checked or not?) and tables (where cell boundaries matter more than text proximity).

I'm still developing my intuition for why adding the visual pixels on top of text + layout helps as much as it does — you'd think the bounding boxes capture most of the layout information. My best guess is that visual cues like separator lines, bold headers, and table borders provide structural information that bounding boxes alone miss. The pixels tell the model "there's a horizontal line here separating the items from the total" in a way that coordinates can't.

The Modality Gap: Shared Spaces Aren't Really Shared

Here's something that troubled me when I first learned about it, and that I think more people should know about. Remember how CLIP creates a "shared" embedding space where images and text live together? If you actually look at where the embeddings land, they don't overlap.

Image embeddings cluster in one region of the space. Text embeddings cluster in a different region. There's a persistent gap between them — a systematic offset. A photo of a dog and the caption "a photo of a dog" are closer to each other than a random image-text pair, which is why CLIP works. But they're not in the same neighborhood. The images form one cone in the high-dimensional space. The text forms another cone. The cones point in roughly similar directions, but they don't intermingle.

This is called the modality gap, and research on it (starting with Liang et al., NeurIPS 2022) has traced it to two sources. First, neural networks have inductive biases that cause embeddings for each modality to collapse into narrow clusters — the "cone effect." Second, the contrastive loss itself preserves and even reinforces this separation. The loss only cares about relative ordering — is the correct pair more similar than the wrong pairs? — not about absolute proximity. So the model can satisfy the training objective while keeping the two modalities in separate neighborhoods.
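
You can measure the gap yourself. Liang et al. quantify it as the distance between the two modality centroids; a minimal sketch:

import numpy as np

def modality_gap(image_embs, text_embs):
    # normalize onto the unit sphere, then compare the two centroids
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# zero would mean truly shared; for CLIP the measured value is well above zero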

What does this mean in practice? It means that direct comparison between an image embedding and a text embedding always has a baseline offset. It means that retrieval systems that mix both modalities in the same index need to be aware of this gap — a text query might be systematically closer to all text items than to any image item, regardless of semantic relevance. And it means that methods like ImageBind's emergent cross-modal alignment have a more nuanced story than "everything lives in the same space."

Some researchers argue the gap should be minimized (AlignCLIP uses parameter sharing and specialized losses to close it). Others argue it should be accepted — forcing the modalities to overlap can actually destroy useful semantic structure within each modality. I find the "accept it" camp more convincing, but this is an active debate. The honest answer is that we're still figuring out the geometry of multimodal embedding spaces.

Unified Architectures: GPT-4o and Gemini

Everything we've discussed so far involves some form of stitching: take a pre-trained vision model, take a pre-trained language model, wire them together with projections or cross-attention. The adapter approach. It works, and it's practical, but there's a ceiling on how deeply the modalities can interact.

The alternative is to build a model that is natively multimodal from scratch. Instead of training separate encoders and then figuring out how to connect them, you design a single architecture that processes text, images, audio, and video as different types of tokens in one unified sequence. Every token can attend to every other token from the very first layer. This is early fusion taken to its logical conclusion.

Google's Gemini (2023) was designed this way. Text, images, audio, and video are all tokenized and interleaved in a single sequence. The model applies standard transformer attention across the entire mixed-modality context. This means an audio token can attend to an image patch can attend to a text token — the orchestra conductor is coordinating all instruments simultaneously from the first measure, not waiting until the finale to bring them together.

GPT-4o (the "o" stands for "omni") from OpenAI took this further in 2024, adding real-time audio input and output. It's a single model that can see images, read text, hear speech, and respond with speech — all processed natively within one architecture. The practical implication: no more pipeline of speech-to-text → text processing → text-to-speech. The model processes the audio directly, which means it picks up on tone, emphasis, and emotion that would be lost in transcription.

The conductor analogy comes back here. Adapter-based models are like an orchestra where the string section rehearses separately, the brass section rehearses separately, and they come together for the performance. Unified architectures are like an orchestra that rehearses together from day one — every section learns to listen to and complement the others from the start. The music is more integrated, but it requires more rehearsal time (read: training compute) and a skilled conductor (read: careful architecture design).

For practitioners, the question is whether you need that level of integration. If your task involves tight cross-modal reasoning — understanding a diagram where text labels, arrows, and visual elements all interact — natively multimodal models have a clear edge. If you need to match images to descriptions or answer basic visual questions, an adapter-based approach like LLaVA will get you 90% of the way there at a fraction of the cost.

Multimodal Embeddings in Practice

With all this theory in hand, let's build something. The most immediately useful multimodal capability for most engineers is using CLIP embeddings as infrastructure for search and retrieval.

Return to our museum guide one last time. We have a catalog of 10,000 paintings with descriptions. We want the visitor to snap a photo and find the closest match, or type "impressionist landscapes with water" and get relevant results. We also want to support reverse queries: given a painting, find related descriptions or exhibition notes.

The pattern: pre-compute CLIP embeddings for all images and all text descriptions. Store them in a vector database. At query time, encode the input (whether it's an image or text) and find the nearest neighbors. Because images and text live in the same space (modality gap notwithstanding), a text query can retrieve images and vice versa.

import torch
import numpy as np
from transformers import CLIPModel, CLIPProcessor

class MuseumSearchEngine:
    """A stripped-down multimodal search index.
    In production, you'd replace the numpy search with FAISS or a
    vector database like Pinecone, Weaviate, or Qdrant."""

    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.embeddings = []
        self.metadata = []

    @torch.no_grad()
    def index_image(self, image, info):
        inputs = self.processor(images=image, return_tensors="pt")
        emb = self.model.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        self.embeddings.append(emb.squeeze().numpy())
        self.metadata.append(info)

    @torch.no_grad()
    def search_by_text(self, query, top_k=5):
        inputs = self.processor(
            text=[query], return_tensors="pt", padding=True
        )
        query_emb = self.model.get_text_features(**inputs)
        query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

        corpus = np.stack(self.embeddings)
        scores = (query_emb.numpy() @ corpus.T).squeeze()
        top_idx = np.argsort(scores)[::-1][:top_k]
        return [(self.metadata[i], float(scores[i])) for i in top_idx]

That's the entire search engine, in about thirty-five lines. The index_image method encodes paintings into vectors. The search_by_text method encodes a text query into the same space and finds the nearest painting vectors. The L2 normalization (dividing by the norm) ensures we're comparing directions, not magnitudes — cosine similarity in disguise.
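
Usage looks like this, where catalog stands in for however you load your image paths and metadata:

from PIL import Image

engine = MuseumSearchEngine()
for path, info in catalog:  # catalog: your (image path, metadata) pairs
    engine.index_image(Image.open(path), info)

for info, score in engine.search_by_text("impressionist landscapes with water"):
    print(f"{score:.3f}  {info}")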

In production, you'd swap the numpy search for FAISS (for speed), add a vector database (for persistence and scaling), and potentially add a re-ranking step with a more powerful VLM. But the core pattern — embed everything into a shared space, retrieve by nearest neighbor — is the foundation of every multimodal search system running today.

The Challenges That Remain

We've come a long way, but I want to be honest about what's still hard.

The modality gap means cross-modal retrieval always has a systematic bias. Solutions exist but add complexity. Compositional understanding — "a red cube on top of a blue sphere" versus "a blue cube on top of a red sphere" — remains fragile across all current models, whether they're matching, generating, or answering questions. Counting is unreliable. Spatial reasoning is inconsistent.

Evaluation is its own problem. A generated image can be beautiful and completely miss the prompt. A VQA model can get the right answer for the wrong reason. CLIPScore (cosine similarity between CLIP embeddings of a generated image and the prompt) has become the standard automated metric, but it inherits all of CLIP's limitations. Human evaluation remains the gold standard, and it doesn't scale.

In production, messy inputs are the norm. The camera drops a frame. The user uploads an image with no description. The audio cuts out. Robust multimodal systems use modality dropout during training — randomly zeroing out entire modality inputs 20-30% of the time — so the model learns to make predictions from whatever subset is available. Some systems use learned null embeddings: when a modality is missing, substitute a special "no information" token rather than a zero vector, so the model can distinguish "nothing here" from "I see something that encodes to zero."
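
A sketch of both ideas together; the 25% rate and the learned null token are illustrative choices, not a standard:

import torch
import torch.nn as nn

class RobustModalityInput(nn.Module):
    """During training, occasionally replace one modality's tokens with a
    learned 'null' embedding, so the model can both cope with missing
    inputs and tell 'nothing here' apart from an all-zero encoding."""

    def __init__(self, dim=768, drop_prob=0.25):
        super().__init__()
        self.null_image = nn.Parameter(torch.zeros(1, 1, dim))
        self.drop_prob = drop_prob

    def forward(self, image_tokens, text_tokens):
        if self.training and torch.rand(()).item() < self.drop_prob:
            B, P, _ = image_tokens.shape
            image_tokens = self.null_image.expand(B, P, -1)
        return image_tokens, text_tokens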

And the field is moving fast enough that anything I write about specific architectures may be outdated by the time you read it. The trajectory, though, is clear: separate unimodal models are being replaced by unified architectures. The "any-to-any" model — one system that takes any combination of inputs and produces any combination of outputs — is no longer theoretical. The remaining challenges are efficient tokenization for each modality, managing the enormous context lengths that video and audio require, and training at sufficient scale.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a simple frustration: a single-modality system can only answer questions that one sense can express. We built a shared embedding space (CLIP) that lets images and text live in the same coordinates, trained by a contrastive loss that pushes matching pairs together and wrong pairs apart. We explored three fusion strategies — late, cross-attention, and early — and watched the field evolve from CLIP's late fusion through Flamingo's careful cross-attention to LLaVA's audacious simplicity. We saw how VQA and captioning test genuine cross-modal understanding, how LayoutLM extends multimodal thinking to spatial document layout, and how ImageBind anchors six modalities through a single hub. We confronted the modality gap — the uncomfortable reality that "shared spaces" have seams — and traced the arc toward unified architectures like GPT-4o and Gemini that process all modalities natively.

My hope is that the next time you encounter a problem that involves more than one type of data — images and text, audio and vision, documents with layout — instead of treating each modality as a separate pipeline to be stitched together at the end, you'll reach for a shared embedding space, or a cross-attention layer, or a unified model, having a pretty good mental model of what's going on under the hood.

Resources and Credits

CLIP — Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (2021). The paper that started it all. Remarkably clear writing for a paper of its impact. arxiv.org/abs/2103.00020

Flamingo — Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning" (2022). The architecture that proved you could add vision to an LLM without breaking it. Sections on the Perceiver Resampler are especially insightful. arxiv.org/abs/2204.14198

LLaVA — Liu et al., "Visual Instruction Tuning" (2023). A masterclass in doing more with less. The supplementary materials on data generation are wildly practical. arxiv.org/abs/2304.08485

The Modality Gap — Liang et al., "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning" (NeurIPS 2022). The paper that made everyone realize shared spaces have seams. Unforgettable visualizations. arxiv.org/abs/2203.02053

ImageBind — Girdhar et al., "ImageBind: One Embedding Space To Bind Them All" (2023). The paper that showed you can align six modalities by anchoring through images alone. Elegant and surprising. arxiv.org/abs/2305.05665

LayoutLM — Xu et al., "LayoutLM: Pre-training of Text and Layout for Document Image Understanding" (2020). Not as flashy as the vision-language models, but incredibly valuable in production. The O.G. of document multimodal understanding. arxiv.org/abs/1912.13318