ML System Design

Chapter 13: ML Systems & Production, Section 1 of 9

I avoided ML system design for longer than I'd like to admit. Every time a conversation turned to "serving architecture" or "feature store," I'd nod along and steer things back to model accuracy, the territory where I felt safe. I could discuss attention mechanisms for hours but couldn't explain how a prediction actually reaches a user at two in the morning. Finally, after watching my third beautifully trained model die on contact with production traffic, the discomfort of not knowing what happens around the model grew too great to ignore, and I dove in. Here is that dive.

ML system design is the discipline of architecting the entire lifecycle of a machine learning product — from the moment a business person says "we want to predict X" to the moment that prediction reaches a real user, and the moment after that, when you find out whether the prediction was any good. The term gained formal shape after Google published their landmark 2015 paper on technical debt in ML systems, and it has become a core interview topic at every major tech company since.

Before we start, a heads-up. We're going to be talking about data pipelines, feature stores, serving architectures, and feedback loops. You don't need to know any of it beforehand. We'll build everything from a tiny example — a movie recommendation system with three users and five movies — and add complexity one piece at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

Contents

The Restaurant That Only Hired a Chef

From Business Problem to ML Problem

Three Layers of Truth: Metrics That Matter

Where Signal Lives: Data Strategy

Features: Compute Once, Serve Everywhere

Choosing Your Weapon: Model Selection

The Training Pipeline

Rest Stop

Batch, Real-Time, or Both

The System That Eats Its Own Tail

Patterns That Survived Production

Walking the Full Design: Four Real Systems

The Flywheel

Wrap-Up

Resources and Credits

The Restaurant That Only Hired a Chef

Imagine you open a restaurant and pour every dollar into hiring the world's greatest chef. Michelin stars. Flawless technique. The food coming out of the kitchen is extraordinary.

But there's no host to seat guests. No waiter to take orders. No refrigerator — ingredients sit on the counter. No dishwasher — plates pile up. No menu, so customers don't know what's available. And nobody is checking whether the customers actually liked the food.

The chef is brilliant. The restaurant fails in a week.

This is what happens in most ML projects. The model — the chef — gets 95% of the attention. Everything around it — data collection, feature engineering, serving infrastructure, monitoring, feedback — gets treated as an afterthought. Google published a diagram in their 2015 paper that should be required reading for every ML engineer. The ML model code? A tiny rectangle in the center. Everything around it — data verification, feature extraction, configuration, serving, monitoring — absolutely dwarfs it.

I'll be honest: it took me two failed production deployments to fully internalize this. I kept thinking "if the model is good enough, the rest will sort itself out." It doesn't. A mediocre model inside a well-designed system will outperform a brilliant model inside a broken one. Every single time. The brilliant model that can't retrain when data shifts, can't serve within a latency budget, and can't tell you when it's wrong — that model is a liability, not an asset.

We'll keep coming back to this restaurant throughout. The chef matters. But the restaurant around the chef is what determines whether anyone actually eats.

From Business Problem to ML Problem

The first thing to do in any ML system design — and the part that most people rush through — is to not talk about models. Talk about the business.

Let's make this concrete. Imagine we work at a small streaming service called StreamFlix. The product manager walks over and says: "People aren't finding movies they like. Can ML help?" That's a business problem. Our task is to translate it into an ML problem.

Let's start absurdly small. StreamFlix has three users — Alice, Bob, and Carol — and five movies. Alice watched and liked "The Matrix" and "Inception." Bob watched "The Matrix" and "Toy Story." Carol watched "Inception" and "Frozen." Here's the full picture of what they've seen:

              Matrix  Inception  Toy Story  Frozen  Interstellar
Alice           ✓        ✓
Bob             ✓                    ✓
Carol                    ✓                     ✓

The business question is: what should StreamFlix show each user next? This is the translation step. We're turning "people aren't finding movies" into a ranking problem — for each user, score every unseen movie and show the highest-scoring ones first.

That single sentence — "score every unseen movie for each user and rank by score" — is the ML problem formulation. It tells us the input (a user-movie pair), the output (a relevance score), who consumes it (the homepage), and the task type (ranking). Without this clarity, I've seen teams wander for months. I once watched a team spend six months building a binary classifier — "will this user like this movie, yes or no?" — when the product needed a ranker. Those are fundamentally different problems with different loss functions, different evaluation, and different serving patterns.

The framework I now use every time: What is the input? What is the output? Who consumes it? How fast do they need it? And — the question most people forget — what happens when the model is wrong?

For StreamFlix, a wrong prediction means a user sees a bad recommendation. Annoying, but not catastrophic. For a fraud detection system, a wrong prediction means either stolen money or a blocked legitimate customer. The severity of being wrong shapes everything downstream — the model complexity, the threshold, whether humans review the output. We'll explore this more when we walk through real system designs later.

But first, we need to figure out how we'll know whether our system is any good.

Three Layers of Truth: Metrics That Matter

Here's a trap I fell into early in my career: I optimized a recommendation model until its NDCG — Normalized Discounted Cumulative Gain, a metric that measures ranking quality by giving more credit to relevant items placed higher in the list — looked fantastic on held-out data. Shipped it. Watched the click-through rate on the homepage drop. The model was excellent at predicting what users had watched in the past, but terrible at surfacing things they'd actually click on going forward.

The problem is that there are three layers of metrics, and they don't always agree.

The first layer is offline metrics — things you measure on historical data before deploying anything. For StreamFlix, this might be precision@k (of the top k movies we'd recommend, how many did the user actually watch?) or NDCG (how good is the ranking order?). These are the metrics ML engineers care about. They're cheap to compute and fast to iterate on.
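
Before moving on, here is a minimal sketch of what these two offline metrics actually compute, using made-up lists rather than real StreamFlix data:

# Offline ranking metrics in miniature

import numpy as np

def precision_at_k(recommended, relevant, k):
    # Of the top-k items we recommended, what fraction did the user watch?
    return len(set(recommended[:k]) & set(relevant)) / k

def ndcg_at_k(recommended, relevant, k):
    # Binary-relevance NDCG: credit decays logarithmically with position
    dcg = sum(1.0 / np.log2(pos + 2)
              for pos, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

recs = ["Toy Story", "Interstellar", "Frozen"]   # what we would have shown Alice
watched = ["Interstellar"]                       # what she actually watched
print(precision_at_k(recs, watched, k=3))        # 0.33
print(ndcg_at_k(recs, watched, k=3))             # ~0.63 (the hit sits at position 2)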

The second layer is online metrics — things you measure on live traffic after deployment. Click-through rate. Watch-through rate. Session length. These are the metrics product managers care about. They require an actual deployment, which makes them expensive to measure.

The third layer is business metrics — the things that keep the company alive. Revenue. Subscriber retention. Churn reduction. These are what executives care about, and they often lag by weeks or months.

Let me make this painfully concrete with our StreamFlix example. Suppose our model achieves a perfect NDCG of 1.0 offline — it perfectly predicts the order of movies users watched in the past. Wonderful. But when we deploy it, it keeps recommending "The Matrix" to everyone because that's the most-watched movie in our dataset. Bob clicks on it again because he liked it. Alice ignores it because she already watched it. The CTR goes up slightly (Bob clicked), but subscriber retention tanks (Alice gets bored of seeing the same thing and cancels). The offline metric was perfect. The business metric went the wrong direction.

The critical question — and the one that separates good ML teams from great ones — is: does improving the offline metric actually improve the online metric, which actually improves the business metric? You have to validate each link in this chain empirically, through A/B tests. If your NDCG goes up but CTR doesn't move, your offline metric is lying to you. It's a proxy that has drifted from reality.

I'm still developing my intuition for which offline metrics best predict which online outcomes. It varies by domain, by product, by the specific user population you're serving. The honest answer is that nobody has a universal formula. You have to test it.

This is unsatisfying. And it brings us to a harder question: do we even have the data to build this system?

Where Signal Lives: Data Strategy

Before committing months of engineering, we need to be brutally honest about three things. First: do we have enough data? If StreamFlix has three users and five movies, there's not enough signal for any model to learn from. We need a critical mass. For recommendation systems, a common rule of thumb is at least 10 interactions per user and 10 interactions per item before the data starts to be useful.

Second: is the signal actually in the data? Suppose StreamFlix only tracks which movies users clicked on, not which ones they watched to completion. Clicks are noisy — people click on things they end up hating. If we train on click data, our model learns clickbait, not quality. The signal we need (did the user enjoy the movie?) might not be in the data we have.

Third: can the organization actually act on predictions? If StreamFlix's homepage is hardcoded and the frontend team can't add dynamic recommendations for six months, the best model in the world sits in a warehouse.

Let me expand our StreamFlix example. Suppose we've grown to 10,000 users and 2,000 movies. Now we have enough data. But what data, specifically? We have explicit signals — ratings that users gave (1 to 5 stars). These are high quality but sparse. Most users rate fewer than 1% of movies. We also have implicit signals — what users watched, how much they watched, what they searched for, what they added to their list. These are noisier but abundant.

A user who watched 95% of a movie probably liked it. A user who quit after 8 minutes probably didn't. But what about a user who watched 50%? Maybe the movie was too long. Maybe they fell asleep. Maybe they loved it but had to catch a flight. Implicit signals are ambiguous, and learning to interpret them is more art than science.
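
One common, admittedly crude, way to handle that ambiguity is to keep only the confident extremes as labels and treat the middle as unknown. A sketch, with illustrative cutoffs rather than established ones:

# Turning an ambiguous implicit signal into a training label

def implicit_label(watch_fraction):
    if watch_fraction >= 0.8:
        return 1       # probably liked it
    if watch_fraction <= 0.3:
        return 0       # probably didn't
    return None        # ambiguous middle: often excluded or down-weighted

print(implicit_label(0.95))   # 1
print(implicit_label(0.50))   # None -- the "fell asleep or caught a flight" zone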

Here's the data strategy I'd build for StreamFlix. We have a data warehouse (something like BigQuery or Snowflake) storing all historical user interactions — every watch, click, search, and rating. This is the source of truth for training. We have a streaming layer (Kafka or Kinesis) capturing real-time events — what the user is doing right now. And we have a data validation system that checks incoming data for anomalies — did the watch-time column start emitting nulls because an upstream service changed its schema? If we don't catch these corruptions before they reach the model, the model learns garbage.

The cardinal rule: data should be immutable and versioned. If we can't reproduce the exact dataset a model was trained on, we can't debug that model. This sounds obvious until you're at 3 AM trying to figure out why the model retrained overnight started producing bizarre predictions, and nobody can tell you whether the training data changed.

With data in place, the next question is: what should the model actually look at?

Features: Compute Once, Serve Everywhere

Back to our StreamFlix example. We have raw data: Alice watched "The Matrix" on Tuesday at 9 PM on her phone and finished 92% of it. That's a raw event log. The model doesn't see raw events. It sees features — engineered signals derived from raw data.

For Alice, we might compute: total movies watched (2), average watch completion (88%), favorite genre (sci-fi, based on her history), days since last watch (3), typical watch time (evenings), device (mobile). For "The Matrix" we might compute: average rating (4.2), total watches (847), genre (sci-fi/action), release year (1999), trending score (is it being watched a lot this week?). And for the pair Alice + "The Matrix," we might compute: genre match (high — Alice likes sci-fi), collaborative signal (users similar to Alice watched this).
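
To make "engineered signals derived from raw data" concrete, here is a rough sketch of the user-level features as aggregations over a raw event log, using hypothetical column names and pandas:

# User features as aggregations over the raw event log

import pandas as pd

events = pd.DataFrame({            # one row per watch event (hypothetical schema)
    "user_id":    ["alice", "alice", "bob"],
    "movie_id":   ["matrix", "inception", "matrix"],
    "genre":      ["sci-fi", "sci-fi", "sci-fi"],
    "completion": [0.92, 0.84, 0.97],
})

user_features = events.groupby("user_id").agg(
    movies_watched=("movie_id", "count"),
    avg_completion=("completion", "mean"),
    favorite_genre=("genre", lambda g: g.mode().iloc[0]),
).reset_index()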

Here's the problem that kept biting me until I understood it. During training, I compute these features in a batch pipeline — PySpark or a SQL query over historical data. During serving, I need to compute the same features in real-time — a fast lookup when Alice opens the app. If I write the feature logic twice — once in PySpark for training, once in Java or Go for serving — they will diverge. Different null handling. Different float precision. Different timezone assumptions. The model sees different feature values in production than it saw during training. This is called training-serving skew, and it is the silent killer of ML systems.

The fix is a feature store — a system that computes features in one place and serves them to both training and inference. Think of it as the restaurant's prep station. The chef doesn't chop onions during service — the prep cook chops them once, stores them properly, and both the lunch service and the dinner service use the same onions. Same computation, no divergence.

# The feature store pattern: one computation, two access patterns
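# (Pseudocode: `feature_store` and `users_with_timestamps` are assumed to be
# defined elsewhere. The interface sketched here mirrors a Feast-style API.)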

# TRAINING: fetch historical features with point-in-time correctness
# "What did Alice's features look like at 9 PM on Tuesday?"
training_df = feature_store.get_historical_features(
    entity_df=users_with_timestamps,
    features=[
        "user:movies_watched_30d",
        "user:avg_completion_rate",
        "user:favorite_genre",
        "item:avg_rating",
        "item:trending_score",
    ],
)

# SERVING: fetch latest feature values, low-latency lookup
# "What do Alice's features look like right now?"
online_features = feature_store.get_online_features(
    features=["user:movies_watched_30d",
              "user:avg_completion_rate"],
    entity_rows=[{"user_id": "alice_42"}],
)
# Same logic produced both. That's the whole point.

The "point-in-time correctness" in the training path is a subtlety worth pausing on. When we build training data, we need Alice's features as they were at the time she watched the movie, not her features today. If we accidentally use today's features to predict a decision she made three months ago, we've introduced data leakage — the model looks unrealistically good offline because it's peeking into the future. Then it falls apart in production where no future data is available. I've been bitten by this one, and the debugging took days because the offline metrics looked amazing.

With features in place, we need to pick a model.

Choosing Your Weapon: Model Selection

This is the part most people want to start with. I understand the temptation — models are the exciting part. But notice how much groundwork we laid before we got here. The model is the chef. We've been building the restaurant.

For StreamFlix, here's how I'd think about model selection. Start with the dumbest thing that could work. For recommendations, that's "show the most popular movies." No ML required. This is our baseline, and it's load-bearing infrastructure — not a stepping stone we'll discard, but a fallback that keeps the product running when the fancy model crashes at 2 AM.

Next step up: collaborative filtering. Users who watched similar movies to Alice probably share her taste. We can find them by computing similarity between user watch histories, then recommend what those similar users watched that Alice hasn't seen yet. This is still relatively cheap to run and surprisingly effective. For our toy StreamFlix, Alice watched "The Matrix" and "Inception." Bob also watched "The Matrix." So Bob is somewhat similar to Alice. Bob watched "Toy Story." Maybe Alice would like "Toy Story" too.
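
Here is that logic as a tiny sketch over our three-user matrix, using plain cosine similarity and nothing more:

# User-based collaborative filtering on the toy StreamFlix matrix

import numpy as np

movies = ["Matrix", "Inception", "Toy Story", "Frozen", "Interstellar"]
interactions = np.array([
    [1, 1, 0, 0, 0],   # Alice
    [1, 0, 1, 0, 0],   # Bob
    [0, 1, 0, 1, 0],   # Carol
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

alice, others = interactions[0], interactions[1:]
sims = np.array([cosine(alice, u) for u in others])   # Bob: 0.5, Carol: 0.5

scores = sims @ others                 # similarity-weighted votes per movie
scores[alice == 1] = -np.inf           # never re-recommend what Alice has seen
print(movies[int(np.argmax(scores))])  # "Toy Story" (tied with "Frozen")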

When collaborative filtering hits its ceiling — it struggles with new users who have no history, and with new movies nobody has watched yet — we move to a learned ranking model. This takes all those features we engineered (user features, item features, pair features) and learns a function that predicts relevance. A gradient boosted tree like XGBoost is a strong default for tabular features. A two-tower neural network — one tower encoding the user, one encoding the item, trained so that the dot product of their embeddings predicts relevance — is the current industry workhorse for large-scale retrieval.
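
For reference, here is a skeletal two-tower model in PyTorch, with made-up dimensions and no training loop, just to show the shape of the idea:

# A skeletal two-tower model

import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Embedding(n_users, 64),
                                        nn.ReLU(), nn.Linear(64, dim))
        self.item_tower = nn.Sequential(nn.Embedding(n_items, 64),
                                        nn.ReLU(), nn.Linear(64, dim))

    def forward(self, user_ids, item_ids):
        u = self.user_tower(user_ids)     # (batch, dim)
        v = self.item_tower(item_ids)     # (batch, dim)
        return (u * v).sum(dim=-1)        # dot product = predicted relevance

model = TwoTower(n_users=10_000, n_items=2_000)
score = model(torch.tensor([42]), torch.tensor([7]))
# At serving time the item tower's outputs are precomputed and indexed,
# so retrieval reduces to a nearest-neighbor search over item embeddings.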

The key principle: match model complexity to available data and serving constraints. A 500-million-parameter neural ranker is absurd for StreamFlix with 10,000 users. A logistic regression is probably insufficient for Netflix with 200 million users. The right model is the one where adding more complexity stops improving the metric that actually matters — the online metric, not the offline one.

I still occasionally get tripped up by the temptation to reach for a more complex model when a simpler one would do. The antidote is to always have the baseline deployed and running. If the fancy model can't demonstrably beat it on the business metric, the fancy model doesn't ship.

The Training Pipeline

In a Kaggle competition, training happens once. You download a dataset, run your notebook, submit predictions, and move on. In production, training is a recurring pipeline — a piece of infrastructure that runs on a schedule, autonomously, while you sleep.

For StreamFlix, the training pipeline might look like this: every night at midnight, a scheduled job pulls the last 90 days of user interactions from the data warehouse, joins them with features from the feature store (using point-in-time correctness), trains the ranking model, evaluates it against the held-out week, and — if the evaluation passes a quality gate — registers the new model artifact in a model registry.

The model registry is worth calling out. It's a versioned catalog of every model artifact the team has ever produced. Think of it as Git, but for trained models. Each entry records the model version, the training data version, the hyperparameters, the evaluation metrics, and who approved it for deployment. When the new model in production starts behaving strangely, you can roll back to the previous version in minutes because it's sitting right there in the registry.

# A training pipeline in pseudocode

def nightly_training_pipeline():
    # 1. Pull fresh training data
    train_data = warehouse.query(
        "SELECT * FROM interactions WHERE date > today - 90"
    )

    # 2. Join with features (point-in-time)
    features = feature_store.get_historical_features(train_data)

    # 3. Train
    model = RankingModel()
    model.fit(features, labels=train_data["watched"])

    # 4. Evaluate against held-out data
    metrics = evaluate(model, holdout_data)

    # 5. Quality gate — only register if it beats the current model
    if metrics["ndcg"] > current_production_model_ndcg * 0.98:
        registry.register(model, metrics=metrics, data_version="v42")
    else:
        alert_team("New model failed quality gate", metrics)

That 0.98 threshold in the quality gate deserves attention. We're allowing the new model to be slightly worse on the offline metric, because offline metrics have noise. What we won't tolerate is a significant regression. If the new model is dramatically worse, something went wrong in the data or the training, and we want to know about it immediately — not after it's served bad recommendations to a million users.

One more thing about training that trips up newcomers: experiment tracking. When you run 50 experiments varying hyperparameters, feature sets, and model architectures, you need to record what you tried and what happened. MLflow and Weights & Biases are the standard tools. Without them, you end up with a folder of model files named model_v2_final_FINAL_really_final.pkl, and nobody can tell you which one is in production.
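
To give a sense of what that tracking looks like in practice, here is a sketch using MLflow; the experiment name, parameters, and metric values are invented for illustration:

# Experiment tracking, sketched with MLflow

import mlflow

mlflow.set_experiment("streamflix-ranker")

with mlflow.start_run(run_name="xgb_depth6_90d_window"):
    mlflow.log_params({"model": "xgboost", "max_depth": 6,
                       "train_window_days": 90, "feature_set": "v12"})
    # ... train and evaluate ...
    mlflow.log_metric("ndcg_at_20", 0.41)
    mlflow.log_metric("recall_at_50", 0.63)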

Rest Stop

If you've made it this far, you can stop if you want.

You now have a mental model that covers the core of ML system design: you can translate a business problem into an ML problem, define metrics at three layers, design a data strategy, engineer features through a feature store, select a model appropriate to the problem, and build a training pipeline that runs autonomously. That's a solid foundation — enough to have a productive conversation in a system design interview, and enough to avoid the most common mistakes teams make.

But it doesn't tell the complete story. We haven't talked about how the model actually serves predictions to users — batch versus real-time, the infrastructure that makes sub-200ms responses possible. We haven't confronted feedback loops, the phenomenon where your model's predictions warp the data it trains on. And we haven't walked through complete system designs for the problems that come up over and over in interviews and in practice — recommendation systems, search ranking, fraud detection, content moderation.

The short version: batch serving is cheap but stale, real-time serving is fresh but expensive, and almost every production system uses a hybrid. Feedback loops are genuinely dangerous and hard to detect. The retrieve→rank pattern is the workhorse of every major recommendation and search system. There. You're 70% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Batch, Real-Time, or Both

Let's go back to our StreamFlix restaurant. There are two ways to run the kitchen.

The first approach: the chef cooks every possible dish in advance. At midnight, the kitchen prepares personalized meals for every registered customer and stores them in a walk-in refrigerator. When a customer arrives the next day, the waiter opens the fridge and pulls out their pre-made plate. Fast service. Cheap to run. But if the customer came from the gym and wants something high-protein instead of the pasta you pre-made twelve hours ago, you're out of luck. That's batch serving.

The second approach: the chef cooks to order. Customer arrives, places an order, and the kitchen fires up. The food is fresh and personalized to the moment. But the customer waits, the kitchen needs to be staffed at all times, and the bills are higher. That's real-time serving.

For StreamFlix's recommendation system, batch serving means precomputing "top 20 recommendations" for every user overnight and storing them in a fast key-value store like Redis. When Alice opens the app, we look up her pre-computed list. Response time: under a millisecond. Infrastructure cost: low — we run the batch job during off-peak hours.
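
A sketch of that batch-serving lookup, assuming a local Redis instance and made-up keys and titles:

# Batch serving: precompute overnight, look up at request time

import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Nightly batch job writes each user's precomputed top recommendations
r.set("recs:alice_42",
      json.dumps(["Interstellar", "Inception", "Frozen"]),
      ex=60 * 60 * 24)    # expire after 24h so stale lists don't linger forever

# Request time: a single key lookup, sub-millisecond
recs = json.loads(r.get("recs:alice_42") or "[]")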

But here's where batch breaks down. Alice watched "Interstellar" an hour ago. Her tastes have shifted toward cerebral sci-fi. Our batch recommendations were computed twelve hours ago, before she watched it. We're still showing her the same list. The recommendations are stale.

Real-time serving computes recommendations when Alice opens the app. The server fetches her latest features — including the fact that she watched "Interstellar" an hour ago — runs the ranking model, and returns fresh recommendations. Response time: 50–200ms. Infrastructure: always-on model servers, a feature store with low-latency online access, and careful model optimization to stay within the latency budget.

For some problems, freshness is non-negotiable. Fraud detection at the moment of a credit card transaction. Content moderation before a post goes live. Search ranking as the user types. These need real-time inference. For others — a weekly email of "movies you might like" — batch is more than sufficient.

Here's what production actually looks like: almost every system at scale uses a hybrid. The computationally expensive first pass — candidate generation, where we narrow the full movie catalog down to a few hundred candidates — runs in batch. The lighter second pass — ranking those candidates by personalized relevance — runs in real-time. We get batch efficiency where freshness doesn't matter and real-time precision where it does.

Netflix, YouTube, and LinkedIn all run some version of this hybrid pattern. When I first learned this, it felt like cheating — aren't you supposed to pick one architecture? But the pragmatism makes sense. The expensive computation (scanning millions of items) doesn't need to happen every request. The cheap computation (scoring 500 items) does. Split the work where it makes economic sense.

I'm still developing my intuition for exactly where to draw the batch-vs-real-time line for a given system. The question I now ask: "If the user's behavior changed five minutes ago, does the prediction need to reflect that?" If yes, that component needs real-time. If not, batch is fine.

The System That Eats Its Own Tail

This is the debt nobody sees coming, and it's the one that scared me the most when I finally understood it.

StreamFlix deploys its recommendation model. The model has learned that "The Matrix" is popular, so it recommends "The Matrix" to a lot of users. Those users click on it — partly because they're interested, and partly because it's at the top of the page and they trust the system. The click gets logged as a positive interaction. The next night, the model retrains on this data and concludes that "The Matrix" is even more popular than it thought. So it recommends it even more aggressively. More clicks. More training signal. The loop tightens.

Meanwhile, "Interstellar" — a movie that many users would love — never gets recommended because it wasn't popular enough to surface in the first round. Nobody clicks on it because nobody sees it. The model never learns that users like it. It lives in a recommendation graveyard.

This is a feedback loop. The model's predictions influence user behavior, which generates the training data for the next model version. The system eats its own tail. The result is a rich-get-richer dynamic where popular items get more popular and niche items get buried — regardless of quality.

Direct feedback loops are the ones I described: model predicts → user acts on prediction → model trains on that action. Indirect feedback loops are sneakier. Suppose StreamFlix has a separate model that predicts churn risk. The recommendation model shows popular content to keep engagement up. The churn model sees that engaged users who watch popular content tend not to churn. So the churn model tells the product team: "everything is fine." But under the surface, users who prefer niche content are quietly leaving because they never see anything they like. The recommendation model's behavior infected the churn model's training data, and neither model flags the problem.

The fixes aren't glamorous, but they're critical. First: exploration traffic. Reserve 5–10% of recommendations for random or diverse picks that the model wouldn't have chosen. This ensures that unpopular items get some exposure and the model gets training signal on them. Second: diversity injection. After the model ranks items, apply a post-processing step that ensures the final list isn't dominated by a single genre or a single item. Third: feedback loop metrics. Track how concentrated the recommendations are. If the top 1% of items are getting 90% of the impressions, the loop is tightening. Sound the alarm.
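
The first and third fixes are simple enough to sketch. Assuming a ranked list, a catalog, and a dictionary of impression counts, something like:

# Exploration traffic and a concentration check, in miniature

import random

def with_exploration(ranked, catalog, epsilon=0.05, slots=20):
    # Reserve a fraction of slots for items the model would not have chosen
    n_explore = max(1, int(epsilon * slots))
    chosen = ranked[:slots - n_explore]
    pool = [m for m in catalog if m not in chosen]
    return chosen + random.sample(pool, n_explore)

def impression_concentration(impressions, top_frac=0.01):
    # Share of all impressions going to the most-shown top_frac of items
    counts = sorted(impressions.values(), reverse=True)
    n_top = max(1, int(top_frac * len(counts)))
    return sum(counts[:n_top]) / sum(counts)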

I'll confess something: detecting indirect feedback loops is genuinely hard. I haven't figured out a reliable way to do it beyond careful logging and skeptical analysis. The best defense I know is to assume they exist and build monitoring that can surface their effects.

Patterns That Survived Production

After working on enough ML systems, you start seeing the same architectural patterns recur. These aren't theoretical — they survived contact with real traffic, real failure modes, and real cloud bills. Let's walk through the four that matter most.

Retrieve → Rank

This is the single most important pattern in production ML, and we've been building toward it with our StreamFlix example all along.

The problem: StreamFlix has 2,000 movies. For each user, we want to score every movie. Our ranking model uses 200 features per user-movie pair and takes about 0.1ms per prediction. Scoring all 2,000 movies takes 200ms. That's right at the edge of our latency budget for a real-time request. But what if StreamFlix grows to 100,000 movies? Now scoring takes 10 seconds. The page never loads.

The solution: don't score everything. Split the problem into two phases.

Phase 1 — Retrieval: Use a cheap, fast model to narrow 100,000 movies down to 500 candidates. This model doesn't need to be precise — it needs to be fast and have high recall (it should include most of the good movies, even if it also includes some bad ones). A two-tower model works well here: encode the user as an embedding, encode each item as an embedding (pre-computed in batch), and find the 500 items whose embeddings are closest to the user's. This is an approximate nearest neighbor (ANN) search, and libraries like FAISS and ScaNN can do it in under 10ms over millions of items.
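
Here is roughly what the retrieval step looks like with FAISS. Random vectors stand in for trained tower outputs, and a flat (exact) index stands in for the approximate index you'd use at real scale:

# Phase 1 retrieval with an ANN-style index

import numpy as np
import faiss

dim = 32
item_vecs = np.random.rand(100_000, dim).astype("float32")   # item tower outputs, precomputed in batch
faiss.normalize_L2(item_vecs)

index = faiss.IndexFlatIP(dim)       # inner product == cosine after normalization
index.add(item_vecs)                 # exact here; IVF/HNSW indexes make it approximate

user_vec = np.random.rand(1, dim).astype("float32")           # user tower output, computed per request
faiss.normalize_L2(user_vec)

scores, candidate_ids = index.search(user_vec, 500)           # top-500 candidates for the ranker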

Phase 2 — Ranking: Now apply the expensive, feature-rich model to those 500 candidates. This model uses the full 200 features — user history, item metadata, context (time of day, device), cross-features (genre match, collaborative signals). Scoring 500 items at 0.1ms each takes 50ms. Well within budget.

There's often a Phase 3 that people forget: business rules. After the ranker scores items, we apply constraints — diversity (don't show five sci-fi movies in a row), freshness (include some new releases), legal restrictions (age-gating), and deduplication (don't show the same movie twice). This is the host seating guests — the chef picked the dishes, but the host arranges the dining experience.

Why split it? Because running a 200-feature neural ranker on 100,000 candidates would cost a small fortune per request. The retrieval stage makes the ranking stage economically feasible. That's the whole trick.

Cascade: Cheap First, Expensive Last

The cascade pattern extends the retrieve→rank idea to any system where you have a series of increasingly expensive models, each filtering traffic for the next. Think of it as a series of increasingly fine sieves. The coarse sieve is cheap and catches the easy cases. The fine sieve is expensive and handles the hard ones.

Content moderation is the canonical example. A social media platform might process billions of posts per day. Running a state-of-the-art transformer model on every post would require a staggering amount of compute. Instead:

# Content moderation as a cascade of sieves

def moderate(content):
    # Sieve 1: keyword + hash matching (~0.1ms)
    # Catches known-bad content: blocklisted words, known
    # CSAM hashes, exact-match spam. Handles ~60% of violations.
    if keyword_and_hash_filter(content):
        return "BLOCKED"

    # Sieve 2: lightweight classifier (~5ms)
    # A small model (distilled BERT or logistic regression on
    # TF-IDF features). Handles another ~30% of violations.
    score = fast_classifier.predict(content)
    if score > 0.95:
        return "BLOCKED"
    if score < 0.05:
        return "APPROVED"

    # Sieve 3: heavy model for the ambiguous middle (~100ms)
    # Full transformer, possibly multimodal (text + image).
    # Only sees the ~10% of traffic that made it past the
    # first two sieves.
    score = deep_model.predict(content)
    if score > 0.7:
        return "HUMAN_REVIEW"
    return "APPROVED"

The result: 90% of traffic is handled by cheap models. The expensive deep model processes the hardest 10%. The GPU bill drops by an order of magnitude while quality stays constant on the cases that matter. And the most ambiguous content — the 2% where even the deep model isn't confident — goes to human reviewers, who make the final call and whose decisions feed back into the next round of training data.

The sieve analogy recurs here: the first sieve has big holes (catches the obvious violations, lets everything else through). The second sieve has medium holes. The third has tiny holes. Each sieve is more expensive to operate, but sees less traffic. The economics work out beautifully.

Shadow Mode: Test Before You Trust

Before promoting a new model to production, run it in shadow mode. The new model receives real traffic and makes real predictions, but those predictions are logged, not served. The existing model still handles all users. You compare the shadow model's outputs against the live model and against ground truth.

Shadow mode is safer than A/B testing for the initial deployment because zero users see bad predictions. You discover problems — latency spikes, edge case failures, unexpected output distributions — before any user is affected. Only when the shadow model proves itself across days of real traffic do you promote it to a small percentage of users for a proper A/B test.
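
In code, shadow mode is little more than a second prediction that gets logged instead of returned. A sketch, assuming live_model and shadow_model objects exist; in practice the shadow call usually runs asynchronously so it adds no user-facing latency:

# Shadow mode: predict twice, serve once

import logging
import time

def serve(request):
    live = live_model.predict(request)            # this is what the user sees

    try:
        start = time.monotonic()
        shadow = shadow_model.predict(request)    # logged, never served
        latency_ms = (time.monotonic() - start) * 1000
        logging.info("shadow_compare id=%s live=%s shadow=%s latency_ms=%.1f",
                     request["id"], live, shadow, latency_ms)
    except Exception:
        logging.exception("shadow model failed")  # never take down live traffic

    return live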

Going back to our restaurant: it's like having a new chef prepare dishes alongside the existing chef for a week. The new chef's food goes to the tasting panel, not to customers. If the tasting panel approves, the new chef starts serving a few tables. Then more. Then all of them.

Human-in-the-Loop: Know When to Defer

For high-stakes decisions — medical diagnoses, loan approvals, content moderation edge cases — the model shouldn't have the final word. Route low-confidence predictions to human reviewers. The model handles the 85% of cases where it's confident. Humans handle the rest. This isn't a crutch — it's a design choice that acknowledges the limits of the model's training data.

The hard part is calibrating the confidence threshold. Set it too high and humans drown in reviews. Set it too low and the model makes confident-but-wrong calls on cases it shouldn't have touched. The threshold is a knob you tune continuously based on the cost of human review versus the cost of a wrong automated decision.
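
The routing itself is trivial; the tuning is where the work lives. A sketch:

# Confidence-based routing

def route(p_violation, auto_threshold=0.85):
    confidence = max(p_violation, 1 - p_violation)
    if confidence >= auto_threshold:
        return "AUTO_DECIDE"    # model acts on its own
    return "HUMAN_REVIEW"       # low confidence: defer to a person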

Walking the Full Design: Four Real Systems

Everything we've built so far — problem formulation, metrics, data strategy, features, models, serving, feedback loops, patterns — comes together when you design a complete system. Let's walk through four systems that come up over and over in interviews and in practice. For each one, I'll trace the path from business problem to architecture, using the framework we've developed.

Recommendation System (StreamFlix)

We've been building this throughout. Let me tie it all together.

Business problem: Users can't find movies they like. Subscriber retention is dropping.

ML formulation: For each user, rank all unseen movies by predicted relevance. Show the top 20 on the homepage.

Metrics: Offline — NDCG@20 and Recall@50. Online — CTR on recommendations and watch completion rate. Business — subscriber retention and revenue per user.

Architecture: Hybrid batch + real-time. Candidate generation (two-tower model producing embeddings, ANN index for fast retrieval) runs in batch, refreshed every few hours. Ranking (feature-rich gradient boosted tree or neural ranker using 200+ features) runs in real-time when the user opens the app. A business rules layer handles diversity and freshness. The feature store handles both training-time and serving-time feature access, eliminating training-serving skew.

Feedback loop defense: 5% exploration traffic. Diversity injection in the business rules layer. Weekly audit of impression concentration — if the top 1% of movies receive more than 50% of impressions, the exploration rate increases automatically.

Fallback: If the ranking model is down, serve the batch-computed candidate list (stale but functional). If the batch pipeline also failed, fall back to "most popular this week." The restaurant always has food — even if it's from the backup menu.

Search Ranking

Business problem: Users search for movies but the results are poor. "Action movies with robots" returns irrelevant results.

ML formulation: Given a query and a corpus of movies, rank movies by relevance to the query.

Search ranking introduces a new stage we haven't discussed: query understanding. Before we retrieve anything, we need to understand what the user means. "Action movies with robots" needs to be parsed into intent: genre=action, theme=robots. Misspelled queries need correction. Ambiguous queries ("The Ring" — the horror movie or the jewelry?) need disambiguation. This is often a separate NLP model or a set of heuristics that enriches the raw query before it hits the retrieval stage.

Architecture: Query understanding → retrieval (BM25 keyword matching combined with dense retrieval using query and movie embeddings) → ranking (a learned ranker using query features, movie features, and click-through signals) → re-ranking (a personalization layer that adjusts for the specific user's taste). Each stage is a finer sieve — the same cascade principle at work.

Key metric subtlety: In search, the user tells you what they want (the query), unlike in recommendations where you have to guess. This means the feedback signal is richer — a click on result #5 implies that results #1 through #4 were probably less relevant. This pairwise signal is gold for training learning-to-rank models, which are trained on pairs of documents rather than individual relevance scores.

Fraud Detection

Business problem: Fraudulent transactions cost the platform $150 on average (chargebacks plus investigation). False positives cost $5 (support tickets plus lost goodwill).

ML formulation: For each transaction, predict the probability of fraud. Block or flag high-probability transactions before payment authorization.
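
Those two costs from the business problem already suggest roughly where the blocking threshold should sit. A back-of-the-envelope calculation, assuming chargebacks and support tickets are the only costs that matter:

# Cost-based decision threshold

cost_fn = 150   # missed fraud: chargeback plus investigation
cost_fp = 5     # blocked legitimate customer: support ticket plus lost goodwill

# Block when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn)
threshold = cost_fp / (cost_fp + cost_fn)
print(round(threshold, 3))   # ~0.032 -- blocking is worth it at ~3% fraud probability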

Fraud has a fundamental challenge that StreamFlix doesn't: delayed labels. When a user watches a movie, we know almost immediately whether they liked it (did they finish it?). When a transaction happens, we don't know it was fraudulent until the chargeback arrives — days to weeks later. The model makes a decision in 100ms, but the ground truth takes 14 days to materialize. Training on incomplete labels is tricky. Some transactions labeled "not fraud" today will turn out to be fraudulent next week.

Architecture: Real-time serving is non-negotiable — we need to block fraud before the payment goes through. We use a cascade: a rules engine catches the obvious cases (blocked IP addresses, impossible shipping addresses, velocity limits like "5 purchases in 2 minutes from the same card"), a lightweight gradient boosted tree handles 85% of the remaining traffic, and a deep model handles the hardest 5%. Fallback: if the model server is unresponsive, the rules engine takes over alone — higher false positives, but no missed fraud.

Feature strategy: A mix of precomputed user-level features in the feature store (purchase history, device fingerprints, number of transactions in the last hour) and transaction-level features computed at request time (amount, merchant category, time of day). The velocity features — "how many purchases has this card made in the last 10 minutes?" — are the most predictive and the hardest to compute in real-time. They require a streaming aggregation system that maintains running counts and sums over sliding time windows.
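
Here is a toy version of that streaming aggregation, just to show the shape of the problem; production systems use something like Flink, Kafka Streams, or Redis with expiring keys rather than an in-memory dictionary:

# Sliding-window velocity counter, toy in-memory version

import time
from collections import defaultdict, deque

class VelocityCounter:
    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.events = defaultdict(deque)

    def record(self, card_id, ts=None):
        self.events[card_id].append(ts if ts is not None else time.time())

    def count(self, card_id, ts=None):
        now = ts if ts is not None else time.time()
        q = self.events[card_id]
        while q and q[0] < now - self.window:   # evict events outside the window
            q.popleft()
        return len(q)

velocity = VelocityCounter(window_seconds=600)
velocity.record("card_123")
n_recent = velocity.count("card_123")   # purchases in the last 10 minutes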

Training: Daily retraining on a rolling 90-day window, with a staggered labeling pipeline. Transactions with no chargeback after 7 days are provisionally labeled "not fraud." As chargebacks arrive over the following weeks, labels are corrected and the model retrains with the updated ground truth.

Content Moderation

Business problem: Harmful content (hate speech, violence, spam) damages user trust and creates legal liability. The platform processes millions of posts per day.

ML formulation: For each piece of content (text, image, or video), predict the probability that it violates platform policy. Block, flag for review, or approve.

Content moderation is uniquely hard for three reasons. First, it's multimodal — a benign image with hateful overlaid text, or a video that's fine in the first 30 seconds and violent in the last 10. The model needs to understand multiple modalities and their interactions. Second, it's adversarial — bad actors actively try to evade detection by misspelling slurs, using code words, modifying images to defeat classifiers. The model is playing a game against opponents who adapt. Third, context matters enormously — the word "kill" means very different things in "I'd kill for a pizza" versus a direct threat.

Architecture: The cascade pattern — keyword and hash filter → lightweight text/image classifier → heavy multimodal transformer model → human review. The human reviewers serve a dual purpose: they handle edge cases and they generate the labeled training data that improves the model over time. This is the feedback loop working in our favor — every human decision makes the automated system better.

Key design choice: Content moderation has an asymmetric cost structure. Missing harmful content (false negative) can cause real-world harm and legal liability. Over-blocking (false positive) frustrates users but is less dangerous. This asymmetry pushes the thresholds toward higher recall (catch more bad content) at the expense of precision (some good content gets flagged too). The human review layer absorbs the resulting false positives.

The Flywheel

ML systems aren't shipped and forgotten. The best ones create a flywheel: deploy a model → collect user interactions → use those interactions to improve features and labels → retrain a better model → deploy it → collect more interactions. Each cycle makes the system smarter. The data moat deepens.

At StreamFlix, the flywheel looks like this. V1 launches with collaborative filtering. Users click on recommendations, generating engagement data. That data reveals patterns the V1 model couldn't capture — maybe users who watch documentaries on weekday evenings prefer different genres on weekends. V2 adds time-of-day features and a more expressive model. The improved recommendations drive higher engagement, which generates more data, which enables V3.

The flywheel is why the first version doesn't need to be perfect. It needs to be good enough to generate useful data. The data makes the second version better. The second version generates better data. And so on.

This is also why monitoring matters so much. If you don't measure how V1 is performing in production — which recommendations get clicked, which get ignored, where users bounce — you have no data to build V2 from. The monitoring layer isn't overhead. It's the engine of improvement.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a restaurant that only hired a chef, and used it to understand why the model is the smallest part of a production ML system. We built StreamFlix from three users and five movies into a full system — formulating the business problem as a ranking task, defining metrics at three layers, designing a data strategy, engineering features through a feature store, selecting models from baselines to two-tower networks, building an automated training pipeline, choosing between batch and real-time serving, defending against feedback loops, and applying the retrieve→rank and cascade patterns. Then we walked through four complete system designs — recommendations, search, fraud detection, and content moderation — to see how the same principles play out in different domains.

My hope is that the next time you face a system design interview or a design review at work, instead of jumping to "what model should we use?", you'll start with the restaurant — the business problem, the data, the features, the serving constraints, the monitoring, the fallback plan — and treat the model as one ingredient in a much larger recipe, having a pretty solid mental model of what's going on around it.

Resources and Credits

"Hidden Technical Debt in Machine Learning Systems" (Sculley et al., 2015) — The paper that launched a thousand post-mortems. If you read one ML engineering paper, make it this one.

"Rules of Machine Learning: Best Practices for ML Engineering" (Martin Zinkevich, Google) — 43 practical rules, each earned through pain. Rule #1: "Don't be afraid to launch a product without machine learning." Wildly underappreciated.

Designing Machine Learning Systems by Chip Huyen — The most complete book on production ML systems I've found. It covers everything from data engineering to monitoring with real-world case studies. Insightful and practical.

Made With ML (madewithml.com) — Goku Mohandas's course on MLOps. The system design walkthroughs are excellent and include actual code.

"System Design for Recommendations and Search" by Eugene Yan — A deeply practical series of posts that covers retrieve→rank, feature stores, and evaluation patterns at scale. Unforgettable for anyone building recommendation systems.

Machine Learning System Design Interview by Ali Aminian and Alex Xu — Structured walkthroughs of common system design problems. Excellent for interview preparation, though the real value is the thinking framework, not the specific answers.