Emerging Patterns — The Frontier Is Shifting Under Our Feet

Chapter 12: Large Language Models

I avoided writing about "emerging patterns" for a long time. The field moves so fast that anything I write today risks being obsolete next Tuesday. Every week a new model drops, a new technique goes viral on Twitter, and the benchmarks reshuffle. I kept telling myself I’d wait until things settled. They didn’t settle. So here I am, writing about a moving target. Let’s dive in.

The patterns covered here are the tectonic shifts reshaping how we build, train, and deploy language models. These aren’t speculative research directions—they’re showing up in production systems right now. Multimodal models that see and hear. Tiny models that punch above their weight. Techniques for stitching separate models together like Frankenstein, except the result is better than any individual piece. The open-source ecosystem accelerating at a pace that makes closed-model providers nervous.

Before we start, a heads-up. We’ll touch on model architectures, training techniques, and some light math, but you don’t need specialized knowledge of any of it. We’ll add what we need, one piece at a time.

This isn’t a short journey, but I hope you’ll be glad you came.

Multimodal Models — Teaching Machines to See, Hear, and Read
World Models — Imagining Before Acting
Synthetic Data — When Models Train Models
Distillation at Scale — Compressing Giants into Pocket-Sized Models
Small Language Models — The Data Quality Revolution
On-Device AI — Intelligence Without the Cloud
Rest Stop
Federated Learning for LLMs — Training Without Sharing Secrets
Continual Learning — The Catastrophic Forgetting Problem
Model Merging — Frankenstein, but It Works
The Open-Source Ecosystem — An Acceleration Nobody Predicted
Wrap-Up

To make all of this concrete, imagine we’re building a personal health assistant called MediPal. It needs to read text (patient notes), look at images (photos of skin conditions), listen to audio (spoken symptom descriptions), run on a phone without an internet connection when needed, and learn new medical guidelines without forgetting old ones. Over the course of this section, MediPal will force us to confront every single emerging pattern listed above.

Multimodal Models — Teaching Machines to See, Hear, and Read

Our MediPal can’t survive on text alone. A patient says “this rash appeared yesterday” and holds up their arm. MediPal needs to process both the words and the image together, in context, as one unified thought. That’s the multimodal problem.

For years, the approach was what I think of as the duct-tape method: train a vision model over here, train a language model over there, then glue them together with a small projection layer. A vision encoder (say, a Vision Transformer) converts the image into a sequence of embedding vectors. A projection layer maps those vectors into the same dimensional space as the language model’s token embeddings. Then you feed them all—image embeddings and text tokens—into the language model as one long sequence. The language model doesn’t know the difference. It processes image tokens and word tokens with the same attention mechanism.

This is how models like LLaVA (Large Language and Vision Assistant) work. It’s modular, it’s practical, and you can swap out either piece. But there’s a ceiling. The vision encoder was trained separately with its own objectives. The language model was trained separately with its own objectives. The projection layer has to bridge two worlds that were built independently. It’s like hiring a translator between two people who each think in fundamentally different ways. Communication happens, but nuance gets lost.

The alternative is what Google did with Gemini: train a single model on text, images, and audio from the ground up. Every modality gets tokenized into a shared embedding space. There’s no projection layer because there’s no gap to project across. The model learns, from the very beginning, that the visual pattern of a rash and the word “rash” and the spoken sound /ræʃ/ are all facets of the same concept.

I’ll be honest—when I first heard "natively multimodal," I assumed it was marketing language for the same duct-tape method with a bigger training budget. It isn’t. The architectural difference is real. In a natively multimodal model, attention flows between modalities at every layer, not only at the input boundary. A patch of an image can attend directly to a word in a question, and vice versa, throughout the full depth of the network.

Here’s a toy example to make this concrete. Say MediPal receives three inputs: a text prompt (“Is this melanoma?”), a photo of a mole, and an audio clip of a patient describing when it appeared.

# Modular approach (duct-tape)
image_tokens = vision_encoder(photo)           # 256 embedding vectors
text_tokens  = tokenizer("Is this melanoma?")  # 5 token embeddings
audio_tokens = whisper_encoder(audio_clip)      # 80 embedding vectors

# Project image and audio into LLM's space
image_projected = projection_layer_img(image_tokens)  # 256 vectors, now in LLM-dim
audio_projected = projection_layer_aud(audio_tokens)   # 80 vectors, now in LLM-dim

# Concatenate everything into one sequence
combined = concat(image_projected, audio_projected, text_tokens)
# Shape: (341 tokens, d_model)

answer = language_model(combined)

# Unified approach (Gemini-style)
# All inputs go through one shared tokenizer/encoder
all_tokens = unified_tokenizer(photo, audio_clip, "Is this melanoma?")
# Shape: (341 tokens, d_model) — but ALL learned jointly from scratch

answer = unified_model(all_tokens)

The code looks deceptively similar. The profound difference is in what was learned together versus apart. In the modular version, the vision encoder has never “heard” language. In the unified version, every parameter has been shaped by all three modalities simultaneously.

For MediPal, the practical choice depends on our budget. Training a natively multimodal model requires enormous compute—we’re talking thousands of GPUs for months. The modular approach lets us take an existing strong LLM, attach a pre-trained vision encoder, and fine-tune the projection layer. That’s something a team of five engineers can do in a week. Most production systems today use the modular approach. The unified approach is, for now, the province of frontier labs with frontier budgets.
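
To make the modular recipe concrete, here’s a minimal sketch in PyTorch of the only piece we’d actually train: the projection layer that maps vision-encoder outputs into the language model’s embedding space. The dimensions (1024 for the vision encoder, 4096 for the LLM) are hypothetical placeholders, not the values of any particular model.

import torch
import torch.nn as nn

# Hypothetical dimensions — substitute your actual encoder and LLM sizes
VISION_DIM = 1024   # output width of the frozen vision encoder
LLM_DIM    = 4096   # hidden size of the frozen language model

# The only new, trainable component in the duct-tape approach:
# a small MLP that maps image embeddings into the LLM's embedding space
projection = nn.Sequential(
    nn.Linear(VISION_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

# 256 patch embeddings for one image, straight from the frozen vision encoder
image_embeddings = torch.randn(1, 256, VISION_DIM)

# Project them so they "look like" token embeddings to the LLM
image_as_tokens = projection(image_embeddings)   # shape: (1, 256, LLM_DIM)

# These get concatenated with the text token embeddings and fed through the
# frozen LLM; only projection.parameters() receive gradients during training.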

But here’s the limitation that sets up everything that follows: multimodal models are big. GPT-4V is estimated at over a trillion parameters. Even a modest modular setup is tens of billions. Our MediPal can’t run something that large on a phone. We need to make things smaller. A lot smaller.

World Models — Imagining Before Acting

Before we shrink anything, there’s an ambitious idea that’s been simmering in the background, one that might change what “understanding” even means for an AI system.

Think about what happens when a doctor examines a rash. They don’t pattern-match the visual input against a lookup table of rashes they’ve memorized. They have an internal model of how skin works—how infections spread, how allergic reactions progress over time, what happens if treatment is delayed. They can simulate the future in their head. “If this is contact dermatitis and the patient keeps exposing themselves to the allergen, it’ll spread to adjacent areas over the next few days.” That internal simulation is what AI researchers call a world model.

Current LLMs don’t have this. They predict the next token. They’re spectacularly good at it, but they have no internal representation of how the world actually works. They can write convincingly about gravity without having any concept of objects falling.

Yann LeCun at Meta has been the loudest advocate for fixing this. His proposed architecture, called JEPA (Joint Embedding Predictive Architecture), works differently from standard language models. Instead of predicting raw pixels or raw tokens, JEPA predicts future states in an abstract representation space. Think of it as learning to predict the gist of what happens next, rather than the exact pixels.

Here’s the intuition with a tiny example. Imagine MediPal watches a 3-second video of a wound being cleaned.

# Standard next-frame prediction (what most video models do):
# Input: frame_1, frame_2, frame_3
# Predict: every pixel of frame_4
# Problem: there are millions of pixels, most are irrelevant background

# JEPA-style prediction:
# Input: encode frame_1, frame_2, frame_3 into compact representations
# z1 = encoder(frame_1)  → 512-dimensional vector
# z2 = encoder(frame_2)  → 512-dimensional vector
# z3 = encoder(frame_3)  → 512-dimensional vector
#
# Predict: z4_predicted = predictor(z1, z2, z3)
# This is a 512-dimensional vector capturing the "essence"
# of what happens next — not the exact pixels.
#
# Train by: making z4_predicted close to encoder(frame_4)
# in representation space, NOT in pixel space.

The key difference is the level of abstraction. Predicting exact pixels forces the model to waste capacity on irrelevant details—the exact texture of the background wall, the shadow of the nurse’s hand. Predicting in representation space lets the model focus on what matters: the wound is being cleaned, the gauze is absorbing fluid, the bleeding is slowing.

I’m still developing my intuition for why this approach leads to better planning capabilities. The theoretical argument is compelling: if your model has learned to predict abstract future states, it can “mentally rehearse” different actions and pick the one with the best predicted outcome. That’s closer to how humans actually think than next-token prediction is.
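
To make the training signal concrete, here’s a minimal, hypothetical sketch of a JEPA-style objective in PyTorch. The encoder and predictor are stand-in MLPs over flattened frames; real systems use much larger vision backbones and additional machinery (EMA target encoders, variance regularization) to keep the representations from collapsing, none of which is shown here.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in networks with made-up sizes: tiny 64x64 RGB frames, 512-dim representations
encoder   = nn.Sequential(nn.Linear(3 * 64 * 64, 1024), nn.ReLU(), nn.Linear(1024, 512))
predictor = nn.Sequential(nn.Linear(3 * 512, 512), nn.ReLU(), nn.Linear(512, 512))

# Four consecutive frames from the wound-cleaning video, flattened
frames = [torch.randn(1, 3 * 64 * 64) for _ in range(4)]

# Encode the three context frames into compact representations
z1, z2, z3 = (encoder(f) for f in frames[:3])

# Predict the REPRESENTATION of the next frame — never its pixels
z4_predicted = predictor(torch.cat([z1, z2, z3], dim=-1))

# Target: the encoding of the real fourth frame. detach() keeps gradients
# from flowing into the target branch in this toy example.
z4_target = encoder(frames[3]).detach()

# The loss lives entirely in representation space
loss = F.mse_loss(z4_predicted, z4_target)
loss.backward()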

OpenAI’s Sora, the video generation model, hints at what world models might look like in practice. Sora generates physically consistent videos—objects fall, reflections behave correctly, physics mostly works. Whether Sora truly “understands” physics or has memorized spectacular patterns is genuinely debated. My favorite thing about this line of research is that nobody is entirely certain where memorization ends and understanding begins.

World models are still early. For MediPal, we can’t use them in production yet. But the reason I bring them up here is that they represent where the frontier is heading—away from “predict the next token” and toward “build an internal simulation of reality.” And that shift, if it works, would change everything about how we build AI systems.

For now, though, we have a more pressing problem: we need training data. Lots of it. And the real-world medical data we need for MediPal is expensive, scarce, and privacy-restricted. What if we could generate our own?

Synthetic Data — When Models Train Models

Here’s the uncomfortable truth about training data: the internet is running out of high-quality text. Seriously. Researchers at Epoch AI estimated that we might exhaust publicly available, high-quality English text data by 2026. Meanwhile, model appetite for data keeps growing. Something has to give.

The answer that the industry has converged on is synthetic data—training data generated by AI models themselves. And it works far better than it has any right to.

Let me walk through how this would work for MediPal. We need tens of thousands of doctor-patient dialogues, but we can’t use real patient conversations due to privacy regulations. Instead, we can use a powerful teacher model (say GPT-4) to generate synthetic dialogues.

# Step 1: Create diverse patient personas
personas = [
    "45-year-old construction worker, lower back pain, skeptical of doctors",
    "22-year-old college student, recurring headaches, anxious",
    "68-year-old retired teacher, joint stiffness, hard of hearing",
]

# Step 2: Generate synthetic dialogues with a teacher model
generated_dialogues = []
for persona in personas:
    prompt = f"""You are a patient: {persona}
    Generate a realistic dialogue between this patient and a primary
    care physician during a 10-minute appointment.
    Include medical terminology where a doctor naturally would.
    Include realistic moments of confusion or miscommunication."""

    dialogue = teacher_model.generate(prompt)
    generated_dialogues.append(dialogue)

# Step 3: Quality filtering — this is where the magic lives
# Not every generated dialogue is good.
# Filter by: medical accuracy, dialogue realism, diversity
training_data = []
for dialogue in generated_dialogues:
    accuracy_score = medical_verifier(dialogue)
    diversity_score = compare_to_existing(dialogue, training_data)
    if accuracy_score > 0.85 and diversity_score > 0.3:
        training_data.append(dialogue)

That filtering step in Step 3 deserves emphasis. The difference between synthetic data that works and synthetic data that produces garbage models is almost entirely about filtering. Unfiltered synthetic data contains repetitive patterns, factual errors, and a kind of bland uniformity that erodes model capabilities over time.

That erosion has a name: model collapse. When models are trained recursively on their own outputs—model A generates data, model B trains on it, model B generates data, model C trains on that—the distribution of the training data narrows with each generation. Rare but important patterns disappear. The analogy I keep coming back to is photocopying a photocopy. Each generation loses some fidelity. Do it enough times, and you can’t read the text anymore.

Researchers at Oxford and Cambridge demonstrated this rigorously in 2023: after several generations of recursive training, models lose the tails of their distributions. The unusual, the rare, the nuanced—all of it fades. What remains is a bland, confident, homogeneous output. The model doesn’t crash. It keeps generating fluent text. It’s worse than crashing, because you might not notice the degradation until it matters.
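
A toy, hypothetical simulation makes the photocopy-of-a-photocopy effect visible. Start with a wide distribution, fit a model to a small sample from it, sample from that model, fit again, and watch the spread shrink. (The paper works with language models; this one-dimensional Gaussian version is only for intuition.)

import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "real" data — a wide distribution with rich tails
mean, std = 0.0, 1.0

for generation in range(1, 31):
    # Each generation trains only on 20 samples drawn from the previous
    # generation's model, then becomes the new model
    samples = rng.normal(mean, std, size=20)
    mean, std = samples.mean(), samples.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: std = {std:.3f}")

# On most runs the standard deviation drifts steadily downward:
# the tails — the rare, unusual examples — are the first thing to vanish.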

The mitigation strategies are pragmatic. Mix synthetic data with real data—even a small fraction of genuine human-written text anchors the distribution. Enforce diversity during generation—vary prompts, personas, temperature settings. And filter aggressively against the existing dataset to reject generations that are too similar to what you already have.

For MediPal, synthetic data solves our privacy problem. We can generate hundreds of thousands of realistic medical conversations without touching a single real patient record. But the model generating that data is enormous. We’re paying API costs for every synthetic example. What if we could compress all that knowledge into a smaller model?

Distillation at Scale — Compressing Giants into Pocket-Sized Models

Knowledge distillation is one of those ideas that sounds too good to be true: take a large, expensive teacher model, and train a small, cheap student model to mimic it. The student doesn’t learn from the raw data directly. It learns from the teacher’s behavior—the teacher’s output probabilities, its intermediate representations, its “thought process.”

The core idea is elegant. When a teacher model predicts the next token, it doesn’t output a single answer. It outputs a probability distribution over the entire vocabulary. The word “fever” might get 0.62, “temperature” might get 0.20, “heat” might get 0.05, and so on. Those probabilities contain far richer information than a hard label (“the answer is fever”). The student model learns not only what the right answer is, but how confident the teacher was and which alternatives were plausible. This is sometimes called dark knowledge—the knowledge hidden in the shape of the probability distribution, not visible in the top-1 prediction.

Here’s how distillation at scale works in practice for MediPal:

# Teacher: a large 70B parameter model
# Student: our target 3B parameter model for on-device deployment

# Phase 1: Generate teacher outputs with full probability distributions
distillation_dataset = []
for prompt in medical_prompts:
    teacher_logits = teacher_model(prompt)  # raw scores over vocabulary
    # Shape: (seq_len, vocab_size) — e.g., (128, 32000)

    # We keep the FULL distribution, not only the top answer
    teacher_probs = softmax(teacher_logits / temperature)
    # temperature > 1 "softens" the distribution,
    # making the dark knowledge more visible
    distillation_dataset.append((prompt, teacher_probs))

# Phase 2: Train student to match teacher's distributions
for prompt, teacher_probs in distillation_dataset:
    student_logits = student_model(prompt)
    student_probs = softmax(student_logits / temperature)

    # KL divergence: how different are student and teacher distributions?
    distill_loss = kl_divergence(teacher_probs, student_probs)

    # Also train on hard labels for grounding
    hard_loss = cross_entropy(student_logits, correct_token)

    # Combined objective
    total_loss = 0.7 * distill_loss + 0.3 * hard_loss
    total_loss.backward()

The temperature parameter is worth pausing on. When temperature equals 1, you get the model’s normal output distribution. Crank it up to 2 or 3, and the distribution flattens out—low-probability alternatives get boosted, making the dark knowledge more visible. The student sees more of the teacher’s uncertainty. It’s like asking the teacher to think out loud instead of confidently stating the answer.
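
A quick numeric example, with made-up logits for three candidate words, shows exactly what the temperature does to the distribution.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical teacher logits for "fever", "temperature", "heat"
logits = np.array([4.0, 2.5, 0.5])

print(softmax(logits / 1.0))   # T = 1: ≈ [0.80, 0.18, 0.02] — near-certain
print(softmax(logits / 3.0))   # T = 3: ≈ [0.52, 0.32, 0.16] — "heat" is now
                               # roughly 7x more visible to the student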

This is how models like Phi-3 and Gemma were trained. Microsoft didn’t train Phi-3 on raw web text. They used larger models to generate “textbook-quality” training data and distilled capabilities from more powerful systems. The result: a 3.8 billion parameter model that competes with models ten times its size.

But distillation has limits. You cannot compress infinite knowledge into a finite small model. The student will always sacrifice something. In MediPal’s case, a distilled 3B model might nail common conditions but struggle with rare diseases. The question becomes: how small can we go while remaining useful?

Small Language Models — The Data Quality Revolution

For years, the LLM narrative was a straight line: more parameters equals better performance. GPT-2 had 1.5 billion. GPT-3 jumped to 175 billion. The scaling laws said this was the way. Then Microsoft released Phi-2 at 2.7 billion parameters, and it matched models that were 10 to 25 times larger on reasoning benchmarks. That result broke something in the collective assumption about what size means.

I’ll be honest—when I first saw those benchmarks, I was skeptical. Models that “match GPT-3.5” often turn out to match it on cherry-picked tasks and fall apart on others. But as more SLMs appeared—Phi-3, Gemma, Qwen-2, SmolLM—the pattern held across increasingly diverse evaluations. Something real was happening.

What was happening was a shift from “scale the model” to “scale the data quality.” The Phi team at Microsoft described their approach as training on “textbook-quality” data. Instead of feeding the model billions of tokens of raw internet text (Reddit comments, abandoned forums, SEO spam), they curated and generated training data that teaches reasoning step by step, like a well-written textbook. This is sometimes called curriculum learning—presenting training examples in a deliberate order, from foundational concepts to advanced applications.

Here’s a toy example to illustrate why data quality matters more than quantity. Imagine training MediPal to diagnose common colds versus flu.

# Low-quality web-scraped training example:
# "ugh flu season is the worst lol my throat hurts 🤧
#  anyone else feel like death?? taking dayquil rn"
#
# What does the model learn from this?
# Maybe that "flu" correlates with informal language.
# Maybe that emoji appear near illness words. Not useful.

# High-quality textbook-style training example:
# "Patient presents with sudden onset of high fever (>101°F),
#  body aches, and fatigue. Key differentiator from common cold:
#  cold symptoms develop gradually over 1-3 days with predominant
#  nasal congestion, while influenza onset is abrupt with systemic
#  symptoms preceding respiratory ones."
#
# What does the model learn from this?
# Diagnostic reasoning. Differential features. Clinical logic.
# One example like this is worth thousands of Reddit posts.

The implication is startling. A 3B parameter model trained on 3 trillion tokens of carefully curated data can outperform a 30B model trained on 3 trillion tokens of unfiltered web crawl. The parameters are the same hardware. The data is the software that determines what the hardware can do.

For MediPal, this is directly relevant. We can’t afford to run a 70B model on a phone. But a 3B model trained with medical-domain textbook-quality data, distilled from a frontier model? That’s not only feasible—it might outperform a general-purpose 13B model on our specific task. The key insight of the SLM revolution: when you know your domain, data curation is worth more than parameter count.

But a 3B model still weighs in at 6 GB even in half precision, and 12 GB in full precision. A phone has maybe 6-8 GB of RAM total, shared with the operating system and other apps. We need to go further.

On-Device AI — Intelligence Without the Cloud

Getting MediPal onto a phone means solving three simultaneous constraints: the model has to fit in memory, it has to run fast enough for real-time conversation, and the quality has to remain high enough to be medically useful. These three constraints fight each other constantly.

The weapon of choice is quantization: representing model weights with fewer bits. In full precision (FP32), each weight uses 32 bits—4 bytes. In half precision (FP16), each weight uses 2 bytes. In 4-bit quantization, each weight uses half a byte. A 3B parameter model goes from 12 GB (FP32) to 6 GB (FP16) to about 1.5 GB (4-bit). That last number fits comfortably on a modern phone.

# Memory math for MediPal's on-device model
#
# Model: 3 billion parameters
#
# FP32: 3B × 4 bytes = 12 GB   ← won't fit on any phone
# FP16: 3B × 2 bytes = 6 GB    ← tight, leaves no room for OS
# INT4: 3B × 0.5 bytes = 1.5 GB ← fits with room to spare
#
# Quality comparison (roughly):
# FP32 → FP16: negligible quality loss (~0.1% on benchmarks)
# FP16 → INT4: noticeable but manageable (~1-3% quality drop)
#
# At INT4, our 3B MediPal takes ~1.5 GB RAM
# A typical modern phone has 8 GB
# Leaving ~6.5 GB for the OS, apps, and KV-cache during inference
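
Here’s a toy sketch of the core idea behind 4-bit quantization: map each small group of weights to 16 integer levels plus one floating-point scale. Real schemes (GPTQ, AWQ, the GGUF k-quants) are considerably smarter about where the rounding error goes; this is only the round-to-nearest baseline, with a hypothetical group size.

import numpy as np

def quantize_int4(weights, group_size=32):
    """Symmetric round-to-nearest 4-bit quantization, one scale per group."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7   # int4 range is -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale   # q stored here as int8; real kernels pack 2 values per byte

def dequantize_int4(q, scale):
    return (q * scale).reshape(-1)

weights = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int4(weights)
recovered = dequantize_int4(q, scale)

# Each weight now costs 4 bits plus a small shared per-group scale
print("mean abs error:", np.abs(weights - recovered).mean())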

Modern phones also ship with dedicated AI accelerators—Apple calls theirs the Neural Engine, Qualcomm has the Hexagon NPU, and Google’s Tensor chips include a mobile TPU. These chips are optimized for the matrix multiplications that transformers depend on, running them an order of magnitude faster than the general-purpose CPU would.

But—and this is important—on-device does not mean on-par. A 3B quantized model on a phone is impressive engineering, but it is not replacing a 70B model in the cloud for complex medical reasoning. The honest architecture for MediPal is a hybrid: the on-device model handles quick, common tasks (symptom triage, medication reminders, basic Q&A) with zero latency and full privacy. When it encounters something beyond its capabilities—rare conditions, complex drug interactions, ambiguous imaging—it routes the query to a more powerful cloud model. The on-device model acts as a first-pass filter, and the cloud model is the specialist called in for hard cases.

This hybrid pattern is where the industry is converging: on-device for speed and privacy, cloud for capability. That split requires knowing what the small model is good at and what it isn’t—which, in practice, means building confidence calibration into the small model so it knows when to say “I should escalate this.”

Rest Stop

If you’ve made it this far, congratulations. You can stop here if you like.

You now have a mental model of the first wave of emerging patterns: multimodal models that combine vision and language through modular or unified architectures. World models that try to build internal simulations of reality. Synthetic data that lets us generate training sets from teacher models. Distillation that compresses large models into small ones. Small language models that prove data quality trumps parameter count. And on-device deployment that brings capable AI to phones through quantization and hybrid cloud architectures.

That’s a useful picture. It doesn’t tell the whole story, though.

We haven’t talked about what happens when MediPal needs to learn from patients across hospitals without sharing their data. We haven’t addressed what happens when MediPal gets updated with new medical guidelines and forgets everything it knew about the old ones. We haven’t explored the strange but effective practice of merging separately-trained models by averaging their weights. And we haven’t discussed the open-source ecosystem that makes all of this accessible to teams that aren’t Google or OpenAI.

The short version: federated learning lets you train across distributed data. Continual learning fights catastrophic forgetting. Model merging lets you combine specialists. And open source is accelerating everything. There. You’re about 60% of the way to the full picture.

But if the discomfort of not knowing what’s underneath is nagging at you, read on.

Federated Learning for LLMs — Training Without Sharing Secrets

MediPal is deployed across 50 hospitals. Each hospital has thousands of patient interactions that could improve the model. But patient data is regulated—HIPAA in the US, GDPR in Europe. We cannot collect all that data into one central server. The data has to stay where it is.

Federated learning solves this by flipping the script: instead of bringing data to the model, you bring the model to the data. Each hospital gets a copy of MediPal. Each hospital fine-tunes its copy on its local data. Then, instead of sharing data, they share only the model updates—the changes to the weights. A central server aggregates those updates and produces an improved global model. The actual patient conversations never leave the hospital.

# Federated learning: one round
#
# Start: global model with weights W_global
#
# Hospital A: fine-tunes on local data → gets W_A
# Hospital B: fine-tunes on local data → gets W_B
# Hospital C: fine-tunes on local data → gets W_C
#
# Central server receives only the DELTAS (changes):
# ΔW_A = W_A - W_global
# ΔW_B = W_B - W_global
# ΔW_C = W_C - W_global
#
# Aggregate (simplest version — FedAvg):
# W_new_global = W_global + (1/3)(ΔW_A + ΔW_B + ΔW_C)
#
# Send W_new_global back to all hospitals
# Repeat for many rounds

The most common aggregation algorithm is Federated Averaging (FedAvg): each client trains for a few local epochs, then the server averages the resulting weights. It’s surprisingly effective for its simplicity, but it hides some real challenges.
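
Here’s a minimal sketch of one FedAvg round in plain NumPy. One detail the uniform average above glosses over: standard FedAvg weights each client by how much data it trained on. The hospital weight vectors below are hypothetical stand-ins for locally fine-tuned models.

import numpy as np

def fedavg_round(client_weights, client_sizes):
    """Federated Averaging: average client models, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum((n / total) * w for w, n in zip(client_weights, client_sizes))

# Toy "models": each hospital's weights after local fine-tuning
w_hospital_a = np.array([0.10, 0.00, 0.05, 0.00])   # 5,000 local examples
w_hospital_b = np.array([0.00, 0.20, 0.00, 0.10])   # 1,000 local examples
w_hospital_c = np.array([0.05, 0.05, 0.05, 0.05])   # 4,000 local examples

w_new_global = fedavg_round(
    [w_hospital_a, w_hospital_b, w_hospital_c],
    client_sizes=[5000, 1000, 4000],
)
print(w_new_global)   # hospitals with more data pull the average harder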

The big one is data heterogeneity. Hospital A might be a children’s hospital. Hospital B might be a geriatric center. Hospital C might be a trauma center. Their data distributions are radically different. When you average model updates from these three hospitals, you get a model that’s mediocre at everything instead of good at anything. This is called the “non-IID problem” (non-independent and identically distributed), and it remains one of the hardest challenges in federated learning.

For LLMs specifically, full fine-tuning in a federated setting is impractical—sending billions of weight updates back and forth is expensive and slow. The practical solution is federated fine-tuning with parameter-efficient methods like LoRA. Each hospital trains only a small set of adapter weights (maybe 0.1% of the full model), and only those tiny updates get shared. This slashes communication costs by orders of magnitude while still allowing meaningful personalization.

I still get tripped up by the privacy guarantees of federated learning. The data never leaves the device, which sounds ironclad. But research has shown that model updates themselves can leak information about the training data—an adversary can sometimes reconstruct training examples from gradient updates alone. The defense is differential privacy: adding carefully calibrated noise to the updates before sharing them. This provably bounds how much any individual data point can influence the shared model. The tradeoff is that noise degrades model quality. Tuning that tradeoff—privacy versus utility—is more art than science.
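
The mechanics of that defense are easy to sketch, even though choosing the parameters is not. A hypothetical example: clip each hospital’s update to a maximum norm, then add Gaussian noise scaled to that clip norm before anything leaves the building. The clip norm and noise multiplier below are arbitrary illustration values, not recommendations.

import numpy as np

def privatize_update(delta, clip_norm=1.0, noise_multiplier=0.5, rng=None):
    """Clip an update's L2 norm, then add calibrated Gaussian noise."""
    rng = rng or np.random.default_rng()
    # 1. Clipping bounds how much any single client can move the model
    norm = np.linalg.norm(delta)
    delta = delta * min(1.0, clip_norm / max(norm, 1e-12))
    # 2. Noise hides whatever individual-level signal remains
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=delta.shape)
    return delta + noise

raw_update = np.array([0.8, -0.3, 1.9, 0.1])   # hypothetical local delta
safe_update = privatize_update(raw_update)
# Only safe_update is sent to the central server.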

Federated learning keeps data private. But it introduces a new problem: MediPal keeps getting updated. And every update risks overwriting what it learned before.

Continual Learning — The Catastrophic Forgetting Problem

Say we fine-tune MediPal on dermatology data. It becomes excellent at skin conditions. Then we fine-tune it on cardiology data. It becomes excellent at heart conditions. But now we test it on dermatology again, and it’s forgotten half of what it knew. The cardiology training overwrote the dermatology knowledge. This is catastrophic forgetting, and it is one of the most stubborn problems in deep learning.

It happens because neural networks store knowledge distributed across shared parameters. When you train on a new task, you change those shared parameters to fit the new data, and in doing so, you distort the representations that encoded the old data. It’s as if you had a single notebook where you wrote all your medical notes, and learning cardiology required erasing some of the dermatology pages to make room.

Here’s a concrete example of what catastrophic forgetting looks like in practice:

# Before cardiology fine-tuning
medipal("Describe treatment for contact dermatitis")
→ "Contact dermatitis is treated by identifying and avoiding the
   allergen. Topical corticosteroids reduce inflammation. For severe
   cases, oral prednisone tapers over 2-3 weeks..."
# Accurate, detailed, clinically sound.

# After 3 epochs of cardiology fine-tuning
medipal("Describe treatment for contact dermatitis")
→ "Contact dermatitis treatment involves... monitoring cardiac
   rhythms and... applying appropriate interventions for the
   presenting condition..."
# The model is leaking cardiology language into dermatology answers.
# The dermatology knowledge has been partially overwritten.

Three families of solutions have emerged, each with real tradeoffs.

The first is regularization-based. The idea: identify which weights are important for old tasks and penalize changes to them. Elastic Weight Consolidation (EWC) does this by computing a Fisher information matrix that estimates how important each weight is to previously learned tasks. During new training, a penalty term discourages large changes to important weights. The problem: computing and storing the Fisher matrix for billions of parameters is expensive, and it gets increasingly rigid as you add more tasks.
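
The penalty itself is a one-liner once you have the importance estimates. A minimal sketch, assuming we have already computed a diagonal Fisher importance value per parameter from the dermatology data; the dictionary names and the lambda value are illustrative, not tuned.

import torch

def ewc_penalty(model, old_params, fisher, lam=1000.0):
    """Quadratic penalty pulling important weights back toward their old values."""
    penalty = 0.0
    for name, param in model.named_parameters():
        # fisher[name]: how important this weight was for the OLD task
        # old_params[name]: its value right after dermatology training
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam / 2 * penalty

# During cardiology fine-tuning:
# loss = cardiology_loss(batch) + ewc_penalty(model, derm_params, derm_fisher)
# loss.backward()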

The second family is replay-based. Keep a small buffer of examples from old tasks and mix them into new training. This sounds crude, and it is. It’s also remarkably effective. Even replaying 1-5% of old data during new training dramatically reduces forgetting. For MediPal, we’d keep a curated set of dermatology examples and mix them into every future training run. The downside: storage grows with every task, and for privacy-sensitive domains, you might not be allowed to keep old examples at all.
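
Replay is simple enough that a sketch almost feels unnecessary, but the mixing ratio is the part people get wrong, so here it is anyway—a hypothetical batch builder using the 5% end of the range above.

import random

def build_batch(new_examples, replay_buffer, batch_size=32, replay_fraction=0.05):
    """Mix a small slice of old-task examples into every new-task batch."""
    n_replay = max(1, int(batch_size * replay_fraction))
    batch = random.sample(new_examples, batch_size - n_replay)
    batch += random.sample(replay_buffer, n_replay)
    random.shuffle(batch)
    return batch

# cardiology_batch = build_batch(cardiology_examples, dermatology_buffer)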

The third family is architecture-based. Instead of trying to reuse the same parameters for everything, allocate separate parameters for each task. LoRA adapters are naturally suited for this: train a separate small adapter for dermatology, a separate one for cardiology, a separate one for neurology. The base model stays frozen—untouched and unforgotten. At inference time, you load the appropriate adapter based on the task. This is elegant but requires knowing which task you’re doing at inference time, which isn’t always obvious.

I’ll confess something: I expected one of these approaches to be the definitive winner. None is. The field is genuinely unsettled. In practice, most production systems use a combination—replay plus adapter isolation—and accept that some forgetting will happen. Continual learning in LLMs is an active and humbling area of research.

That said, there’s a completely different approach to combining knowledge from multiple models, one that sidesteps the training process entirely. What if you could merge separately-trained models after the fact?

Model Merging — Frankenstein, but It Works

This is the part that made me do a double-take when I first encountered it. Take two separately fine-tuned models—one good at dermatology, one good at cardiology—and average their weights. That’s it. Average them. And the result is a single model that’s good at both. No additional training. No replay buffer. No adapter switching.

I didn’t believe it. It sounds like averaging a French chef and a Japanese chef and expecting the result to cook both cuisines. But it works, and the reason it works has to do with the structure of the loss landscape in neural networks.

The simplest version of model merging is called model soups, introduced by Wortsman et al. in 2022. The recipe is straightforward:

# Model Soups: simplest model merging
#
# Start with: base_model (pre-trained, same for all)
#
# Fine-tune separately:
# model_derm = fine_tune(base_model, dermatology_data)
# model_card = fine_tune(base_model, cardiology_data)
# model_neuro = fine_tune(base_model, neurology_data)
#
# Merge by averaging weights:
# merged_weights = (model_derm.weights + model_card.weights
#                   + model_neuro.weights) / 3
#
# That's it. merged_model now has capabilities from all three.
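
In real code, "average the weights" means averaging the checkpoints’ state dicts key by key. Here’s a minimal runnable sketch in PyTorch, with tiny stand-in state dicts; in practice each would come from torch.load() on a checkpoint fine-tuned from the same base model.

import torch

def merge_model_soup(state_dicts):
    """Uniform 'model soup': element-wise average of matching parameters."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Tiny stand-in state dicts (same keys, same shapes — same base model)
derm  = {"layer.weight": torch.tensor([[0.65, 0.10]]), "layer.bias": torch.tensor([0.1])}
card  = {"layer.weight": torch.tensor([[0.38, 0.12]]), "layer.bias": torch.tensor([0.3])}
neuro = {"layer.weight": torch.tensor([[0.51, 0.11]]), "layer.bias": torch.tensor([0.2])}

merged = merge_model_soup([derm, card, neuro])
# merged["layer.weight"] → tensor([[0.5133, 0.1100]])
# base_model.load_state_dict(merged)  # ship one model instead of three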

Why does this work? Because all three models started from the same base model. Fine-tuning moves each model a relatively short distance in weight space from that common starting point. If you picture the loss landscape as a terrain, all three fine-tuned models are in neighboring valleys. Averaging puts you somewhere in the middle of all three—and if the landscape is smooth (which it tends to be for well-trained large models), that middle point is also a good location.

The analogy I keep returning to is paint mixing. If you have three tubes of paint—red, blue, and green—all made by the same manufacturer with the same base, mixing them produces a predictable result. If they were made by different manufacturers with different bases, the result is unpredictable mud. This is why model merging requires a shared base model. Models trained from scratch on different data with different initializations cannot be meaningfully averaged.

Uniform averaging is the baseline. More sophisticated techniques improve on it.

TIES-Merging (Trim, Elect Sign, and Merge) addresses a specific problem: when two fine-tuned models push the same weight in opposite directions, averaging them cancels out, and you lose both signals. TIES resolves this by computing “task vectors” (the difference between each fine-tuned model and the base model), trimming small deltas that are noise, electing a single sign per parameter (the direction with the greater total magnitude across models wins), and then merging only the surviving, sign-consistent deltas.

# TIES-Merging walkthrough
#
# Base model weight for one parameter: 0.50
# Dermatology model pushed it to: 0.65  → task vector: +0.15
# Cardiology model pushed it to: 0.38   → task vector: -0.12
#
# Naive average: (0.65 + 0.38) / 2 = 0.515
# The two changes nearly cancel out — knowledge destroyed.
#
# TIES approach:
# 1. TRIM: drop task vectors smaller than a threshold
#    If threshold = 0.10, both survive (|+0.15| > 0.10, |-0.12| > 0.10)
#
# 2. ELECT SIGN: pick one direction per parameter
#    TIES sums the magnitudes of the positive and negative deltas across
#    all models and keeps the sign with the larger total.
#    Here: positive mass 0.15 > negative mass 0.12 → positive wins,
#    so the -0.12 vector is discarded.
#
# 3. MERGE: average the surviving vectors
#    Result: 0.50 + 0.15 = 0.65
#    The dermatology knowledge is preserved instead of being cancelled.

DARE (Drop And REscale) takes a different approach: randomly set a fraction of task vector elements to zero, then rescale the survivors to compensate. This sounds like it would destroy information, but it acts as a form of regularization that reduces interference between merged models. It’s similar in spirit to dropout during training—the randomness forces each surviving parameter to carry more independent information.
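
A minimal sketch of the DARE operation on a single task vector, with a made-up drop rate. In practice this drop-and-rescale is applied to each fine-tuned model’s task vector before the deltas are merged back onto the base.

import torch

def dare(task_vector, drop_rate=0.9):
    """DARE: randomly drop task-vector entries, rescale survivors by 1/(1-p)."""
    mask = (torch.rand(task_vector.shape) > drop_rate).float()
    return task_vector * mask / (1.0 - drop_rate)

# Task vector = fine-tuned weights minus base weights (hypothetical tensors)
base_weights = torch.randn(1000)
derm_weights = base_weights + 0.01 * torch.randn(1000)
task_vector  = derm_weights - base_weights

sparse_delta = dare(task_vector, drop_rate=0.9)
# Only ~10% of the delta survives, scaled up 10x — yet merging these sparse,
# rescaled deltas back onto the base tends to preserve the fine-tuned behavior.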

What makes model merging so exciting for MediPal is the workflow it enables. Train a dermatology specialist. Train a cardiology specialist. Train a pediatrics specialist. Merge them. Ship one model instead of three, with no adapter switching, no routing logic, no extra infrastructure. The merged model is used the same way as any single model. And if the merge doesn’t work well on some combination, you try a different merging technique or different weights. The iteration cost is minutes, not days of retraining.

I haven’t figured out a great way to explain why some merges work beautifully and others produce garbage. The honest answer is that our theoretical understanding of loss landscape geometry in large models is incomplete. We know that models fine-tuned from the same base tend to merge well. We know that merging models from very different domains is riskier than merging within similar domains. Beyond that, a lot of it is empirical. Try it, evaluate, iterate. Model merging is the most “let’s see what happens” technique in the modern LLM toolkit, and somehow that works.

The Open-Source Ecosystem — An Acceleration Nobody Predicted

Everything we’ve discussed in this section—multimodal models, SLMs, distillation, model merging—would be academic curiosities for most practitioners if it weren’t for the open-source ecosystem that makes them accessible.

When Meta released LLaMA in February 2023, something changed. Within weeks, the community had fine-tuned it into instruction-following models (Alpaca, Vicuna), quantized it to run on consumer GPUs and even laptop CPUs (GPTQ, GGML), and started merging variants on Hugging Face. It was an explosion of experimentation that no single company could have orchestrated.

By 2024, the open-source landscape had matured into a genuine ecosystem. Llama 3 from Meta. Mistral and Mixtral from Mistral AI. Qwen from Alibaba. DeepSeek from a Chinese startup that stunned the community with reasoning capabilities rivaling much larger closed models. Each release pushed the others forward. Each raised the floor of what was freely available.

The infrastructure around these models is equally important. Hugging Face became the GitHub of model weights—over 500,000 models hosted, searchable, downloadable. vLLM made high-throughput inference accessible. llama.cpp brought LLM inference to CPUs and consumer hardware through aggressive quantization. GGUF became a standard format for sharing quantized models. LoRA and QLoRA made fine-tuning possible on a single GPU. The LMSYS Chatbot Arena provided a crowdsourced evaluation platform that many trust more than any corporate benchmark.

For MediPal, the open-source ecosystem is what makes our project feasible. We can take Llama 3 8B as a base, fine-tune it with LoRA on our medical data, quantize it to 4-bit with GGUF, merge it with a separately trained specialist using TIES, and deploy it on-device with llama.cpp. Every single link in that chain is open source. Two years ago, any one of those steps would have required a team at a well-funded company. Now a solo developer can do it on a weekend with a rented A100.

I want to be direct about something, though: open-weight is not the same as open-source. Most of these models release weights under licenses that permit commercial use but don’t release training data or training code. You can use the model but can’t reproduce how it was made. There are also genuine safety concerns—releasing a capable model openly means anyone can fine-tune away its safety guardrails. The community is still working out where the lines should be drawn. This is an ongoing, imperfect, and deeply important negotiation between capability and responsibility.

The trajectory is clear, though. The gap between closed frontier models and the best open models has been shrinking every quarter. DeepSeek V2 and Llama 3 70B trade blows with GPT-4 on many benchmarks. For specific domains with fine-tuning, open models frequently win. The era when closed APIs were the only serious option for production systems is ending. It hasn’t ended yet—frontier closed models still lead on the hardest tasks—but the trend line is unmistakable.

Wrap-Up

If you’re still with me, thank you. I hope it was worth the time.

We started with MediPal needing to see images and hear speech, which pulled us into multimodal architectures—the duct-tape modular approach versus natively unified models. World models showed us where the field might be heading: from next-token prediction toward internal simulation. The need for training data led us to synthetic data generation, with its promise of infinite data and its trap of model collapse. Distillation compressed teacher knowledge into student models. Small language models proved that data quality beats parameter count. On-device deployment forced us through quantization math to get models onto phones. Federated learning let multiple hospitals improve a shared model without sharing patient data. Catastrophic forgetting showed us the cost of sequential learning and the incomplete solutions we have. Model merging offered a surprising shortcut—averaging separately-trained weights to combine capabilities without retraining. And the open-source ecosystem made all of this accessible to teams and individuals who aren’t OpenAI or Google.

My hope is that the next time you hear about a new technique—some novel merging method, a smaller model claiming to beat a larger one, a natively multimodal architecture—instead of it feeling like another item in an overwhelming stream of announcements, you’ll have the mental scaffolding to place it in context, evaluate it critically, and decide whether it matters for what you’re building.

Resources

The original Mixture of Experts concept for transformers is covered in Shazeer et al.’s “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (2017)—the paper that planted the seed for Mixtral and Switch Transformer years later.

For model merging, the wildly helpful starting point is Wortsman et al.’s “Model Soups” (2022), followed by Yadav et al.’s “TIES-Merging” (2023) and Yu et al.’s “Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch” (2023) which introduces DARE—the title alone is worth the read.

The Phi-3 Technical Report from Microsoft Research is the best case study on training small models with curated data. It’s surprisingly readable for a technical report.

Yann LeCun’s “A Path Towards Autonomous Machine Intelligence” (2022) lays out the JEPA vision for world models. Dense but visionary, and the clearest articulation of why next-token prediction might not be enough.

For a sobering look at synthetic data risks, Shumailov et al.’s “The Curse of Recursion: Training on Generated Data Makes Models Forget” (2023) demonstrates model collapse with rigorous experiments. Essential reading before you build a synthetic data pipeline.

The Hugging Face Open LLM Leaderboard and LMSYS Chatbot Arena are the two most trusted evaluation sources for tracking which open models are actually competitive and which are overhyped. Bookmark both.