Modern Breakthroughs in Generative AI

Chapter 15: Generative Models

I avoided doing a proper deep dive into generative AI for longer than I'd like to admit. Every week there was a new model — DALL·E something, Midjourney v-something, another Stable Diffusion release — and I'd scroll past the demo reels thinking "cool, but I know the gist." Except I didn't, not really. I couldn't explain why one model produced photorealistic faces while another excelled at oil paintings. I couldn't tell you how a text prompt becomes a video, or how a flat image becomes a 3D object, or how an AI generates a piano solo that sounds eerily human. The discomfort of not knowing what's happening under all these hoods finally grew too great. Here is that dive.

Generative AI — the family of models that create new images, videos, 3D objects, and audio from learned data distributions — went from a research curiosity to the backbone of creative tools in roughly three years (2022–2025). The field spans text-to-image systems like DALL·E 3 and Stable Diffusion, video generators like Sora and Runway, 3D creation tools, and audio synthesizers. The underlying techniques are all variations on a handful of core ideas, and that's what makes this dive worthwhile.

Before we start, a heads-up. We'll be touching on diffusion models, transformers, neural radiance fields, and audio tokenization — but you don't need to be an expert on any of them beforehand. If you followed the earlier chapters on diffusion and transformers, you have more than enough. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

Contents

The Text-to-Image Revolution

Image Editing — Inpainting, Outpainting, and Img2Img

From Still to Motion — Video Generation

Rest Stop

Escaping Flatland — 3D Generation

The Sound of AI — Audio and Music Generation

Multimodal Generation — Putting It All Together

The Creative AI Ecosystem and What It Means

Resources and Credits

The Text-to-Image Revolution

Let's start with a thought experiment. Imagine you're a one-person creative studio. You've been hired to produce a short animated film — characters, environments, soundtrack, the whole thing. Five years ago, you'd need a team of dozens and a budget measured in hundreds of thousands. Today, the tools to do each piece of this exist, and they're powered by generative AI. We'll build our understanding of the entire generative landscape through the lens of this studio — one capability at a time.

The first thing our studio needs is still images. Concept art, backgrounds, character designs. And the text-to-image revolution is where this all began in earnest.

The Three Families

By late 2022, three systems had emerged that could generate photorealistic images from text descriptions: DALL·E (OpenAI), Midjourney (an independent research lab), and Stable Diffusion (Stability AI / open source community). They look similar from the outside — type words, get pictures — but the architectures underneath are meaningfully different, and understanding those differences is what separates "I use AI tools" from "I understand AI tools."

DALL·E's lineage tells the story of the field in miniature. DALL·E 1 (January 2021) used a discrete VAE with an autoregressive transformer — it generated tokens representing image patches, one at a time, conditioned on text. Slow, limited resolution, but genuinely novel. DALL·E 2 (April 2022) switched to a diffusion model guided by CLIP embeddings — a technique called unCLIP. You encode the text with CLIP's text encoder, map that embedding into CLIP's image embedding space, then use a diffusion model to generate an image conditioned on that image embedding. The intermediate CLIP embedding acted as a bridge between language and pixels. DALL·E 3 (September 2023) went further: it put a language model (similar to GPT-4) directly in the loop to rewrite and expand user prompts before feeding them to the diffusion backbone, which by then was a transformer-based architecture. The result was dramatically better prompt adherence — the model actually generates what you asked for, including spatial relationships and quantities that earlier systems botched.

I'll be honest — when I first learned that DALL·E 3's biggest improvement was essentially having a chatbot rewrite your prompt before generating, it felt almost anticlimactic. But then I tried prompting "a blue cube on top of a red sphere, with a green cylinder behind both" in DALL·E 2 versus DALL·E 3, and the difference was stark. The language model doesn't add creativity — it adds comprehension.

Midjourney remains a black box. The architecture is proprietary and undisclosed. What we can infer: it's diffusion-based, optimized aggressively for aesthetic quality, and evidently relies on custom data curation and training objectives that favor visual appeal over strict prompt fidelity. Midjourney images have a distinctive "look" — slightly dreamy, with cinematic lighting and rich color palettes. That's not an accident; it's a design choice baked into the training pipeline. For our creative studio, Midjourney is like having an art director with strong opinions: the output is beautiful, but getting it to match your exact mental image requires learning its language.

Stable Diffusion is the open-source family, and its evolution tells us the most about the technical trajectory of the field because we can actually see what changed. Stable Diffusion 1.5 (2022) used a U-Net backbone operating in latent space (the "latent diffusion" breakthrough from Rombach et al., 2022), with a CLIP text encoder and a VAE for compressing images into a smaller latent representation. It worked surprisingly well for its size, but the text encoder was limited — CLIP wasn't designed for generation, and complex prompts often produced incoherent results.

SDXL (2023) scaled the U-Net up and added a second, larger text encoder (OpenCLIP ViT-G) alongside the original CLIP model. Bigger model, better text understanding, higher resolution. But the fundamental architecture was still a U-Net. Then came Stable Diffusion 3 (2024), which replaced the U-Net entirely with a Multimodal Diffusion Transformer (MMDiT). This was the pivotal shift. Instead of processing text and image tokens through separate pathways joined by cross-attention, MMDiT feeds both into the same transformer — text tokens and image-latent tokens attend to each other in every layer. It also switched from the traditional DDPM noise schedule to flow matching, a framework where the noise-to-data path is a straight line rather than a curved stochastic trajectory. Straighter paths mean fewer steps to generate an image, and the math is cleaner.
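If the flow-matching idea feels abstract, here is a minimal toy training loop under the rectified-flow assumption that the model predicts a constant velocity along the straight line from noise to data. A two-dimensional MLP stands in for the MMDiT; none of this is SD3's actual code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))  # toy stand-in for the backbone
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, 2) * 0.5 + 2.0       # stand-in for "data" samples
    x0 = torch.randn(256, 2)                    # pure noise
    t = torch.rand(256, 1)                      # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                  # straight-line path from noise to data
    target_v = x1 - x0                          # the path's constant velocity
    pred_v = model(torch.cat([xt, t], dim=1))   # predict the velocity at (xt, t)
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Sampling then amounts to integrating the learned velocity from t = 0 to t = 1, which can be done in far fewer steps than walking back down a curved DDPM trajectory.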

FLUX (by Black Forest Labs, founded by key Stable Diffusion researchers) pushed this further in late 2024. Same core ideas — flow matching, DiT backbone — but with refined architecture and training at larger scale. FLUX.1-dev became the go-to open model for many practitioners.

The pattern across all three families: early systems used patchwork architectures (VAE + CLIP + U-Net + separate text encoders), and the field converged toward unified transformer-based systems that process text and images together. Think of it like the film industry moving from separate departments that barely talk to each other — writing, cinematography, editing — toward a single integrated production pipeline. The unified approach doesn't mean each part is identical; it means information flows freely between all of them, and that makes the final output more coherent.

The Limitation

Text-to-image is remarkable, but for our creative studio, it has a frustrating constraint: you get what the model gives you. If the character's left hand looks wrong, or the background is perfect except for one distracting element, you can't reach in and fix it. You can re-roll the dice (generate again), but you can't edit. That's what we need next.

Image Editing — Inpainting, Outpainting, and Img2Img

Our creative studio has generated a beautiful concept art piece for our film. But the protagonist's eyes look slightly off, and we want to extend the background to the left for a wider composition. We don't want to start over. We want to edit.

It turns out that the denoising process at the heart of diffusion models is naturally suited for editing. The core insight is beautiful: if generation means starting from pure noise and denoising to a clean image, then editing means starting from a mostly clean image — with specific regions replaced by noise — and letting the model fill in only those regions.

Inpainting: Filling What's Missing

Inpainting is the task of filling in masked (erased) regions of an image. You give the model an image, a binary mask indicating which region to regenerate, and optionally a text prompt describing what should go there.

The simplest approach works during inference: at each denoising step, keep the unmasked pixels at their original values and only update the masked region. The model "sees" the surrounding context and generates content that blends naturally with it. It's like tearing a hole in a painting and asking an artist to fill it in — the surrounding brushstrokes guide what goes in the gap.
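A rough sketch of that inference-time trick, assuming a toy linear noise schedule and treating the denoiser as a black box passed in by the caller (q_sample and denoise_step are illustrative names, not any real library's API):

```python
import torch

def q_sample(x0, t, T):
    """Noise a clean image to level t of T (toy linear schedule, not a real one)."""
    alpha = torch.tensor(1.0 - t / T)
    return alpha.sqrt() * x0 + (1 - alpha).sqrt() * torch.randn_like(x0)

def inpaint(x0, mask, denoise_step, T=50):
    """mask == 1 where we regenerate, 0 where the original pixels must be kept."""
    x = torch.randn_like(x0)                        # the masked region starts as pure noise
    for t in reversed(range(1, T + 1)):
        x = denoise_step(x, t)                      # model proposes a slightly cleaner image
        known = q_sample(x0, t - 1, T)              # original pixels, noised to the same level
        x = mask * x + (1 - mask) * known           # keep generated content only in the hole
    return x

# toy usage with a do-nothing "denoiser", just to show the plumbing
x0 = torch.zeros(1, 3, 64, 64)
mask = torch.zeros_like(x0); mask[..., 16:48, 16:48] = 1.0
result = inpaint(x0, mask, denoise_step=lambda x, t: 0.98 * x)
```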

The more sophisticated approach — used by Stable Diffusion's dedicated inpainting model — trains the denoiser with extra input channels. The standard model takes a 4-channel latent as input (the VAE's latent space). The inpainting model takes 9 channels: 4 for the noisy latent, 4 for the masked image's latent, and 1 for the mask itself. The model learns the concept of "fill this specific region" as part of its training, not as an inference-time hack. The results are noticeably more coherent.

Outpainting: Expanding the Canvas

Outpainting is inpainting's sibling. Instead of filling a hole in the middle, you expand the image beyond its original borders. You place the original image on a larger canvas, mark the new blank areas as "masked," and run inpainting. The model generates scenery, patterns, or whatever makes sense given the existing content at the edges.

For our studio, this is incredibly practical. We generated a portrait-oriented character design but now need a landscape version for a background plate. Outpainting extends the scene without regenerating the parts we already like.

Img2Img: Controlled Transformation

Image-to-image (img2img) is a different beast. Instead of masking specific regions, you take the entire source image, encode it into latent space, add a controlled amount of noise (the strength parameter), and then denoise it — but conditioned on a new text prompt. Low strength (say, 0.3) preserves most of the original structure — colors, composition, rough shapes — and makes subtle changes. High strength (0.8+) basically uses the original image as a rough sketch and regenerates aggressively.

Think of the strength parameter as a dial between "touch up" and "reimagine." At 0.0, you get the original image back unchanged. At 1.0, you get a completely new generation that ignores the input entirely. Somewhere in between, you get creative alchemy: a pencil sketch becomes a photorealistic scene, a daytime photo becomes a moody night shot, a rough 3D render becomes concept art.
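Here is a sketch of what the dial does mechanically, assuming a sampler with a fixed number of discrete steps. The denoiser is a stand-in; real pipelines hide this same logic behind a single strength argument.

```python
import torch

def img2img(latent, new_prompt_cond, denoise_step, strength=0.5, num_steps=50):
    """strength 0.0 returns the input untouched; 1.0 ignores it entirely."""
    start = int(strength * num_steps)                 # how far back into the noise we go
    if start == 0:
        return latent
    t = torch.tensor(start / num_steps)
    noisy = (1 - t).sqrt() * latent + t.sqrt() * torch.randn_like(latent)
    x = noisy
    for step in reversed(range(start)):               # only the remaining steps are denoised,
        x = denoise_step(x, step, new_prompt_cond)    # now conditioned on the *new* prompt
    return x

# toy usage: low strength keeps most of the source latent's structure
source = torch.randn(1, 4, 64, 64)                    # a 4-channel VAE latent, as in Stable Diffusion
out = img2img(source, new_prompt_cond=None, denoise_step=lambda x, s, c: 0.98 * x, strength=0.3)
```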

I still occasionally get tripped up by the strength parameter — it's more art than science, and the "right" value depends on the specific image, model, and prompt in ways that resist clean formulas. But that's the territory we're in with generative AI: powerful tools with knobs that require intuition to turn well.

ControlNet: Precise Spatial Control

Inpainting, outpainting, and img2img give us broad control. But what if we need something more precise? What if we want the generated image to follow a specific edge map, or match a particular human pose, or respect a depth map from a 3D scene?

ControlNet (Zhang et al., 2023) solved this elegantly. The idea: take a pre-trained diffusion model, clone its encoder, and feed your control signal — a Canny edge map, a depth map, an OpenPose skeleton, a segmentation mask — into the clone. The clone's outputs are injected back into the original model through zero convolution layers, which are convolutional layers initialized with all weights at zero.

That initialization trick is key. At the start of training, the zero convolutions output nothing, so ControlNet has zero effect on the base model. Gradually, during training, it learns to inject exactly the right amount of spatial guidance. The base model's weights stay frozen throughout — you never risk degrading its generation quality.
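The zero-convolution trick fits in a few lines. This is a toy sketch, not the actual ControlNet code: a 1x1 convolution initialized to all zeros gates the control branch, so at initialization the frozen base block behaves exactly as it did before.

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, base_block, control_block, channels):
        super().__init__()
        self.base = base_block                         # frozen block from the pre-trained model
        self.control = control_block                   # trainable clone, fed the control signal
        self.gate = zero_conv(channels)                # outputs all zeros at initialization
        for p in self.base.parameters():
            p.requires_grad_(False)

    def forward(self, x, control_features):
        return self.base(x) + self.gate(self.control(control_features))

# toy usage: identical conv blocks standing in for U-Net encoder blocks
block = ControlledBlock(nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1), channels=8)
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(block(x, torch.randn_like(x)), block.base(x))  # zero effect before training
```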

For our creative studio, ControlNet is transformative. We can pose a character using a stick-figure skeleton and have the model fill in the details. We can use a rough 3D blockout as a depth map and get a fully rendered environment. The filmmaker's visual intent flows directly into the generation process.

The Limitation

We can now generate and edit still images with remarkable control. But our studio is making a film, and a film needs motion. A sequence of independently generated frames won't cut it — each one would look slightly different from the last, creating an uncanny flickering effect. We need models that understand time.

From Still to Motion — Video Generation

The jump from generating a single image to generating a video is like the jump from photography to cinema. It's not about generating 30 images per second — it's about making those images tell a coherent story across time. A ball thrown in frame 1 has to arc correctly through frame 30. A person turning their head must maintain consistent facial features throughout the rotation. Lighting can't randomly shift between frames.

This is temporal coherence, and it's brutally hard. An image model needs to learn the statistics of what natural scenes look like. A video model needs to learn what natural scenes look like, how they change over time, and — to some degree — the physics that governs that change. You're no longer sampling from a distribution of images; you're sampling from a distribution of physically plausible spacetime volumes.

I'm still developing my intuition for why temporal coherence is so much harder than spatial coherence. My best mental model is this: in a single image, nearby pixels are strongly correlated (a blue sky pixel predicts that neighboring pixels are also blue sky). In video, the correlations span both space and time, but the temporal correlations are much longer-range and more complex — a hand that starts reaching for a cup in frame 1 might not grasp it until frame 60, and the model needs to maintain a plausible trajectory for the entire motion. That's a dependency spanning seconds, not pixels.

How Video Diffusion Works

Most video generation models extend image diffusion architectures in one of three ways. The first is temporal attention: take a standard image diffusion model and insert additional attention layers that attend across frames at the same spatial position. You interleave spatial attention (within a frame) with temporal attention (across frames). Each frame stays sharp, and the temporal attention layers learn to keep things consistent across time.
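The reshape trick behind that interleaving is worth seeing once. In this toy sketch (made-up dimensions, standard PyTorch attention), spatial attention treats each frame as its own sequence, and temporal attention treats each spatial position's history across frames as a sequence:

```python
import torch
import torch.nn as nn

B, T, N, C = 2, 8, 64, 128            # batch, frames, tokens per frame, channels
spatial_attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)
temporal_attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)

x = torch.randn(B, T, N, C)

# Spatial attention: every frame attends within itself.
s = x.reshape(B * T, N, C)
s, _ = spatial_attn(s, s, s)
x = s.reshape(B, T, N, C)

# Temporal attention: every spatial position attends across all frames.
t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
t, _ = temporal_attn(t, t, t)
x = t.reshape(B, N, T, C).permute(0, 2, 1, 3)
```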

The second approach treats video as a 3D volume. Instead of 2D patches of an image, you extract spacetime patches — little cubes that span width, height, and time. A transformer processes the sequence of these spacetime tokens. This is the approach that Sora (OpenAI, 2024) popularized. A 10-second video becomes a long sequence of 3D patches, and the DiT (Diffusion Transformer) learns the relationships between all of them — spatial, temporal, and cross-frame.
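Mechanically, "spacetime patches" just means dicing the video tensor into little cubes and flattening each cube into one token. The patch sizes below are arbitrary illustration values, not Sora's (which aren't public):

```python
import torch

video = torch.randn(16, 3, 256, 256)         # (frames, channels, height, width)
pt, ph, pw = 4, 16, 16                        # patch size in time, height, width

T, C, H, W = video.shape
patches = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
patches = patches.permute(0, 3, 5, 1, 2, 4, 6)        # group by position in the patch grid
tokens = patches.reshape(-1, pt * C * ph * pw)        # one flat token per spacetime cube

print(tokens.shape)   # (4 * 16 * 16, 4 * 3 * 16 * 16) = (1024, 3072): 1,024 tokens for this clip
```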

The third approach is a hybrid: generate a few keyframes with image diffusion, then interpolate between them using a separate model. This is more modular — you can use an excellent image model for the keyframes and a lighter model for interpolation — but the interpolation model needs to invent plausible motion between keyframes, which is its own hard problem.

The Major Players

Sora demonstrated that scaling a spacetime DiT produces remarkably coherent videos — up to a minute long — with what appears to be emergent physics understanding. Water flows, reflections behave, objects have weight. The model processes variable-length, variable-resolution videos natively, compressing the raw video with a visual encoder into spacetime patches, applying a massive DiT, and decoding back. OpenAI has been deliberately slow to release it widely, citing safety concerns — which tells you something about the realism it achieves.

Runway Gen-3 (and its successor Gen-4) is perhaps the most production-ready tool. Cinematic quality, excellent camera control, smooth motion. Clips run 10–16 seconds, and Gen-4 introduced "world consistency" — characters, lighting, and objects stay coherent across multiple generated shots. For our creative studio, this is the workhorse: you describe a scene, maybe provide a reference image, and get usable footage.

Kling (by Kuaishou, a Chinese tech company) pushes the duration frontier. It can generate clips up to 1–2 minutes with impressive temporal coherence, maintaining character identity and scene consistency over much longer durations than competitors. It also offers "virtual director" tools — storyboard-based interfaces where you specify camera angles, pacing, and shot transitions.

Stable Video Diffusion and CogVideoX represent the open-source frontier. Quality trails the best closed systems, but they're accessible to researchers and the community continues to improve them rapidly. For our studio, they're the R&D sandbox — less polished, but we can look inside and modify them.

What's Still Hard

The honest truth: video generation in 2025 is roughly where image generation was in mid-2022. The results are often stunning in short clips, but break down in predictable ways. Hands gain or lose fingers across frames. Objects occasionally teleport. Physics deteriorates after about 10 seconds. The compute cost is enormous — generating a one-minute video with Sora-level quality takes orders of magnitude more compute than a single image.

And there's a deep evaluation problem. There's no consensus metric for video quality. Fréchet Video Distance (FVD) captures statistical similarity to real video distributions but misses the temporal artifacts that humans spot instantly — a briefly-appearing sixth finger, a shadow that jumps between frames. We know good video when we see it, but we can't reliably measure it automatically.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a solid mental model of how generative AI creates and edits visual content: text-to-image systems that evolved from U-Net + CLIP patchworks to unified transformer architectures, image editing techniques (inpainting, outpainting, img2img, ControlNet) that leverage the denoising process for precise control, and video generation that extends these ideas into the temporal dimension with spacetime patches and temporal attention.

That mental model is genuinely useful. You can have informed conversations about these tools, evaluate which is right for a given task, and understand the technical announcements that will keep coming.

But the story doesn't stop at flat screens. Our creative studio needs 3D objects, a soundtrack, and eventually the ability to weave all these modalities together. The techniques for generating 3D content from 2D priors are some of the most clever ideas in the whole field — and audio generation introduces a completely different way of thinking about the problem, one that involves treating sound the way we treat language.

The short version: 3D generation uses 2D diffusion models as "critics" to sculpt 3D shapes. Audio generation tokenizes sound into discrete symbols and predicts the next one, like a language model. There. You're 80% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Escaping Flatland — 3D Generation

Our creative studio has a problem. We've generated beautiful concept art and short video clips, but the film needs 3D assets — characters that can be viewed from any angle, environments that a virtual camera can fly through. Traditionally, 3D modeling is one of the most labor-intensive parts of production. A single detailed character might take an artist weeks. Can generative AI help?

The answer involves some of the most ingenious thinking in recent AI research. The core challenge is this: we have diffusion models that understand what objects look like from essentially every angle — because they were trained on billions of images from billions of viewpoints. But they generate flat 2D images. Can we somehow use that 2D understanding to create 3D objects, even though the model has never seen 3D data?

Neural Radiance Fields — The 3D Foundation

Before we can generate 3D content, we need a way to represent 3D content that neural networks can work with. Neural Radiance Fields (NeRF), introduced by Mildenhall et al. in 2020, provided exactly that. A NeRF is a neural network that takes a 3D position (x, y, z) and a viewing direction (two angles) as input, and outputs a color and a density at that point. To render an image from any camera angle, you trace rays through the scene, querying the NeRF at many points along each ray, and composite the results.

Think of it like this: instead of storing a 3D model as a mesh of triangles (the traditional approach), you store it as a function — "given a point in space and a direction you're looking from, here's what color you see." The advantage is that NeRFs can represent incredibly complex scenes — reflections, translucency, fine detail — because the neural network can learn arbitrary functions. The disadvantage is speed: rendering requires hundreds of network evaluations along the ray behind every single pixel. A single frame might take seconds to render. That's far too slow for real-time applications.
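Here is a stripped-down sketch of that render loop, with a tiny MLP standing in for the real positionally-encoded network: sample points along a ray, query for color and density, and composite front to back.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 4))  # (x, y, z, θ, φ) -> (r, g, b, σ)

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    ts = torch.linspace(near, far, n_samples)
    points = origin + ts[:, None] * direction               # sample points along the ray
    dirs = torch.tensor([0.3, 1.2]).expand(n_samples, 2)    # toy viewing-direction angles
    out = mlp(torch.cat([points, dirs], dim=1))
    rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3])

    delta = ts[1] - ts[0]
    alpha = 1 - torch.exp(-sigma * delta)                    # opacity of each ray segment
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    weights = trans * alpha                                   # how much each sample contributes
    return (weights[:, None] * rgb).sum(dim=0)                # composited color for this pixel

pixel = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))  # 64 MLP calls for one pixel
```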

3D Gaussian Splatting — The Speed Revolution

In mid-2023, 3D Gaussian Splatting (Kerbl et al.) arrived and upended the field. The approach is fundamentally different from NeRF. Instead of representing a scene as a continuous function, it represents it as a cloud of millions of tiny 3D Gaussians — ellipsoidal blobs, each with a position, size, orientation, color, and opacity. To render a view, you project each Gaussian onto the 2D image plane and blend them together. No ray marching. No per-pixel network evaluations.

The result: real-time rendering at 60+ frames per second on consumer GPUs. Same visual quality as high-end NeRF methods, but hundreds of times faster at rendering time.
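To see why rendering gets so cheap, here is a sketch of the per-pixel compositing step, assuming the Gaussians covering the pixel have already been projected and their opacities computed (toy values below). Notice there is no neural network anywhere in the loop.

```python
import torch

def composite(colors, alphas, depths):
    """colors: (N, 3), alphas: (N,), depths: (N,) for the splats covering one pixel."""
    order = torch.argsort(depths)                      # sort front to back
    colors, alphas = colors[order], alphas[order]
    pixel = torch.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c                 # each splat contributes what light is left
        transmittance *= (1 - a)
        if transmittance < 1e-3:                       # early stop once the pixel is opaque
            break
    return pixel

pixel = composite(torch.rand(5, 3), torch.rand(5) * 0.8, torch.rand(5))
```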

I'll admit — when I first read about Gaussian splatting, the idea felt almost too crude to work. Millions of colored blobs? How could that produce photorealistic scenes? But then I saw the demos, and it clicked. The secret is that with enough Gaussians (millions of them), you can approximate any visual appearance. Each Gaussian is like a tiny, soft brushstroke in 3D space. The "painting" emerges from the aggregate. And because splatting (projecting Gaussians onto a 2D plane) is a well-understood, GPU-friendly operation, the rendering pipeline is blazingly fast.

The analogy to our film studio: NeRF is like oil painting — slow, meticulous, beautiful results if you're patient. Gaussian splatting is like a mosaic — thousands of discrete pieces that, from a slight distance, form a cohesive, stunning image, and you can rearrange them in real time.

Score Distillation — The Bridge from 2D to 3D

Now the clever part. DreamFusion (Poole et al., 2022) introduced Score Distillation Sampling (SDS), which lets a 2D diffusion model guide the creation of 3D content without ever training on 3D data.

Here's how it works. You start with a randomly initialized 3D representation — a NeRF, or a set of Gaussians. You render it from a random camera angle to get a 2D image. Then you ask a pre-trained diffusion model: "Does this 2D rendering look like [the text prompt]?" The diffusion model's noise prediction tells you how the rendered image should change to better match the prompt. You backpropagate that signal through the renderer into the 3D representation, nudging it to look better from that particular viewpoint. Then you pick a new random camera angle and repeat.

Over thousands of iterations from hundreds of viewpoints, the 3D representation converges toward something that looks like the text prompt from every angle. The diffusion model is never fine-tuned — it acts as a frozen critic, like an art director who can't paint but has incredibly discerning taste. The 3D representation learns to satisfy the critic from all directions simultaneously.
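Here is the shape of that loop as a sketch. The renderer, camera sampling, frozen diffusion model, and loss weighting are all stand-ins for the real thing; the point is where the gradient comes from and where it goes.

```python
import torch

def sds_step(params_3d, render, diffusion_eps, text_emb, optimizer, T=1000):
    camera = torch.randn(3)                                    # random viewpoint
    image = render(params_3d, camera)                          # differentiable render of the 3D scene
    t = torch.randint(20, T, (1,)).item()                      # random noise level
    alpha = 1.0 - t / T
    noise = torch.randn_like(image)
    noisy = (alpha ** 0.5) * image + ((1 - alpha) ** 0.5) * noise
    with torch.no_grad():
        eps_hat = diffusion_eps(noisy, t, text_emb)            # the frozen critic's noise estimate
    grad = eps_hat - noise                                     # SDS gradient (weighting term omitted)
    optimizer.zero_grad()
    image.backward(gradient=grad)                              # push the signal into the 3D parameters
    optimizer.step()
```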

It's a genuinely brilliant idea. But the early results had a telltale problem: the Janus problem. Since the diffusion model is independently asked "does this look good?" from each viewpoint, it tends to create objects with recognizable features on every side — a dog with a face on both its front and back, for instance. The model finds it easier to stick a face everywhere than to create a consistent 3D structure that looks like a dog's face from the front and a dog's back from behind.

Beyond SDS

Zero-1-to-3 took a different approach entirely. Instead of using SDS optimization, it fine-tuned a diffusion model to generate novel views of an object given a single input image. You show it a photo of a chair from the front, and it generates plausible views from the side, back, and top. Then you reconstruct 3D geometry from these generated multi-view images using standard 3D reconstruction techniques. No iterative optimization loop. No Janus problem (because the model learns view consistency during training).

More recently, direct 3D generation models — trained on actual 3D datasets — are emerging. Point-E and Shap-E (both from OpenAI) generate 3D point clouds and implicit 3D representations directly from text prompts. They're faster than SDS-based approaches and avoid the 2D-to-3D optimization entirely, but they require large 3D training datasets, which are much harder to curate than 2D image datasets.

For our creative studio, the practical choice depends on the workflow. SDS is slow but works with any text prompt and any pre-trained image model. Zero-1-to-3 is fast if you have a reference image. Direct 3D models are fastest but currently limited in the diversity and detail of what they can generate. The field is converging toward large 3D-native models, but the 2D-to-3D distillation trick remains important — both as a practical technique and as a conceptual bridge that showed the AI community what's possible.

The Sound of AI — Audio and Music Generation

Our creative studio now has images, video clips, and 3D assets. What it lacks is a soundtrack — music, sound effects, maybe even narration. And here is where the generative AI landscape takes a surprising turn, because the most successful approach to audio generation doesn't look like diffusion at all. It looks like language modeling.

That surprised me when I first encountered it. Images and video are continuous — smooth gradients of color and motion — and diffusion (which operates on continuous values) is a natural fit. But audio is different. Raw audio is a sequence of 16,000 to 48,000 samples per second. Training a diffusion model on raw waveforms is possible but extraordinarily expensive — the sequences are enormous.

The Tokenization Insight

The breakthrough came from an unexpected direction: neural audio codecs. Models like Google's SoundStream (2021) and Meta's EnCodec (2022) learned to compress raw audio into discrete tokens — a vocabulary of sound fragments, analogous to how a tokenizer breaks text into subword units. A 10-second audio clip at 24kHz sampling rate contains 240,000 raw samples. After compression through EnCodec, it becomes a sequence of roughly 750 token positions (75 per second), each drawn from a vocabulary of roughly 1,024 — in practice a few parallel codebook streams per position, each adding progressively finer acoustic detail.

This reframes the entire audio generation problem. Instead of generating 240,000 continuous values (intractable for most architectures), you generate 750 discrete tokens from a fixed vocabulary. And generating sequences of discrete tokens from a vocabulary is exactly what language models do.
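The arithmetic, assuming an EnCodec-like codec that emits 75 token frames per second (exact rates vary by configuration):

```python
sample_rate = 24_000          # raw audio samples per second
clip_seconds = 10
frame_rate = 75               # codec token frames per second

raw_samples = sample_rate * clip_seconds     # 240,000 continuous values to model directly
token_frames = frame_rate * clip_seconds     # 750 discrete positions to predict instead
vocab_size = 1024                            # ~10 bits of choice per token

print(raw_samples, token_frames, raw_samples // token_frames)  # 240000 750 320
```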

My favorite thing about this approach is how it collapses two seemingly different problems — language generation and audio generation — into the same framework. The codec acts as a translator: continuous audio in, discrete tokens out. A transformer generates new token sequences. The codec's decoder converts tokens back into continuous audio. The only part that "knows" about audio is the codec. The generator is a generic sequence model. That's elegant.

AudioLM — Language Modeling for Sound

AudioLM (Google, 2022) was the first model to show this approach could work at high quality. It generates audio tokens autoregressively — predict the next token given all previous tokens, exactly like GPT predicts the next word token. The model operates in a hierarchical fashion: first it generates "semantic tokens" (drawn from a separate self-supervised speech model) that capture the high-level content (what words are being said, what melody is being played), then it generates SoundStream's "acoustic tokens," which capture the fine details (the timbre of the voice, the texture of the instrument).

The results were startling. Given a few seconds of a person talking, AudioLM could continue the speech in the same voice, with coherent words and natural prosody. Given a piano piece, it could continue the melody in a musically plausible way. No one explicitly taught it grammar or music theory. The statistical patterns in the audio tokens were enough.

MusicGen — Text-Conditioned Music

MusicGen (Meta, 2023) took this framework and added text conditioning. The architecture: encode music with EnCodec into discrete tokens, train a transformer to generate these token sequences conditioned on a text description ("upbeat electronic dance music with a heavy bass line"). The transformer uses a clever delay pattern for the multiple parallel streams of codec tokens — instead of generating all codebook levels simultaneously or one after another, it offsets them by one timestep each, allowing a single transformer to handle multi-stream generation efficiently.
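The delay pattern itself is simple to sketch: each codebook stream is shifted right by its index, so at any position the transformer emits the coarse token for the current timestep alongside finer tokens for slightly earlier timesteps (pad_id below is a placeholder, not MusicGen's actual special token):

```python
import torch

def apply_delay_pattern(tokens, pad_id=-1):
    """tokens: (K, T) grid of codec tokens, K codebook streams, T timesteps."""
    K, T = tokens.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=tokens.dtype)
    for k in range(K):
        out[k, k:k + T] = tokens[k]               # stream k is delayed by k steps
    return out

tokens = torch.arange(12).reshape(4, 3)           # 4 codebook streams, 3 timesteps
print(apply_delay_pattern(tokens))                # a (4, 6) staggered grid the model reads column by column
```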

The output quality is good enough that short MusicGen clips regularly fool casual listeners into thinking they're hearing human-composed music. For our creative studio, this is the background music generator — describe the mood, get a score.

Bark and Beyond

Bark (by Suno, open-sourced in 2023) pushed text-to-audio in a more general direction. It generates not only speech in multiple languages but also laughter, sighs, music, and sound effects — all from text prompts, including annotations like [laughs] or [music]. The architecture uses a cascade of three transformer models: text to semantic tokens, semantic tokens to coarse acoustic tokens, coarse to fine acoustic tokens. Each stage adds more detail, like a painter blocking in shapes, then adding color, then adding texture.

The broader audio generation landscape now includes diffusion-based approaches too — Stable Audio (Stability AI) uses latent diffusion for music generation, operating on compressed audio latents much the way Stable Diffusion operates on compressed image latents. Udio and Suno (the company behind Bark) offer commercial music generation services that combine multiple techniques. The field hasn't fully converged on a single approach the way image generation converged on diffusion, which makes it an active and somewhat unpredictable area of research.

I'm still developing my intuition for why the codec-plus-language-model approach works as well as it does for audio. My best guess is that the compression performed by the neural codec isn't "lossy" in the naive sense — it doesn't discard information so much as reorganize it into a form that makes sequence-level patterns easier to learn. The codec doesn't throw away the emotion in a voice or the groove in a beat; it distills them into a sequence that a transformer can model.

Multimodal Generation — Putting It All Together

Our creative studio now has access to tools that generate images, edit them precisely, produce video, create 3D assets, and compose audio. Each of these started as a separate research thread. The frontier of generative AI is weaving them together.

Multimodal generation means models that can understand and produce content across multiple modalities — text, images, audio, video, 3D — within a single system. The idea is that a model trained jointly on all these modalities develops richer internal representations than one trained on any single modality. A model that has seen millions of videos with audio develops an understanding of the relationship between visual events and their sounds — a ball bouncing and the "thud" that accompanies it — that a vision-only model never could.

The architectural trend is converging toward a pattern: tokenize everything into a shared vocabulary. Text is already tokens. Images become patches (visual tokens). Audio becomes codec tokens. Video becomes spacetime patch tokens. 3D objects become multi-view image tokens. Feed them all into a single large transformer. The model learns cross-modal relationships from the data — which sounds go with which images, which motions go with which camera angles, which 3D structures produce which 2D appearances.

This unification isn't complete yet. As of 2025, most production systems are still single-modality specialists. But the research prototypes are increasingly multimodal, and the trajectory is clear. For our creative studio, the endgame is a single model that takes a script as input and produces a film — visuals, motion, sound, and dialogue — as output. We're not there yet. We're probably not close. But the pieces are all on the table.

The Creative AI Ecosystem and What It Means

We've spent this entire journey building technical understanding. Now we need to look up from the engineering and reckon with what we've built.

Our creative studio — the one-person operation producing a short film with AI tools — is no longer a thought experiment. People are doing this right now. The tools exist, they work, and they're getting better at a rate that makes even practitioners uncomfortable. That discomfort deserves examination.

The Copyright Question

Every model we've discussed was trained on data created by humans. Stable Diffusion was trained on billions of images scraped from the internet — photographs, illustrations, paintings, designs. MusicGen was trained on licensed music. The legal question of whether training on copyrighted material constitutes fair use is being litigated in courts worldwide as of 2025, with no clear resolution.

The practical reality is thornier still. These models can generate images "in the style of" specific living artists. They can generate music that sounds like specific genres or performers. They can mimic voices with a few seconds of sample audio. Whether this constitutes copying, transformative use, or something entirely new that our legal frameworks weren't designed for — I genuinely don't know, and I'm skeptical of anyone who claims to.

Deepfakes and Trust

Video generation that can produce realistic footage of events that never happened poses obvious risks. The detection-versus-generation arms race is already underway. The EU AI Act (2024) requires labeling of AI-generated content. The C2PA standard for content provenance is being adopted by major platforms. Whether these measures will be sufficient is an open question — the history of digital watermarking does not inspire unbounded optimism.

Creative Disruption

The film industry is already being reshaped. The 2023 Hollywood writers' and actors' strikes were partly about AI's role in creative production. Concept artists, illustrators, voice actors, and musicians are all watching these developments with a mixture of anxiety and adaptation. Some are integrating AI tools into their workflows and finding that it amplifies their productivity. Others are watching their freelance markets contract.

I don't have a neat conclusion for this section, because the situation doesn't have one. What I can say is this: understanding the technology — how these models actually work, what they can and can't do, where their outputs come from — is the prerequisite for having informed opinions about any of these questions. And that's what we've been building.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with the text-to-image revolution — how DALL·E, Midjourney, and Stable Diffusion evolved from patchwork architectures to unified transformer-based systems. We learned how to edit images using inpainting, outpainting, img2img, and ControlNet. We watched the jump from still images to temporally coherent video, powered by spacetime patches and temporal attention. We escaped flatland into 3D generation — NeRF, Gaussian splatting, the brilliant SDS trick that uses 2D diffusion models as 3D critics. We discovered that audio generation looks like language modeling thanks to neural audio codecs. And we surveyed the emerging multimodal landscape where all these threads are being woven together.

My hope is that the next time you see a headline about a new AI model generating stunning images or music or video, instead of scrolling past thinking "cool, but I know the gist," you'll pause and ask the specific questions — what's the backbone architecture? How does it handle conditioning? What are the failure modes? — armed with a pretty darn good mental model of what's going on under the hood.

Resources and Credits

The following resources shaped my understanding and are worth your time: