Model Development

Chapter 13: ML Systems & Production · Section 3 of 9

I avoided thinking seriously about the model development process for longer than I'd like to admit. I could train models. I could get decent accuracy. But every few weeks I'd find myself staring at a folder full of checkpoints named model_final.pt, model_final_v2.pt, model_ACTUALLY_final.pt, and I couldn't tell you which one was best, what hyperparameters produced it, or whether the data it trained on was even the same data I had now. I'd re-run experiments I'd already run, because I had no record of the results. I'd ship a model, and when someone asked "why did you pick this one over the alternatives?" I'd mumble something unconvincing. Finally the discomfort of building on quicksand grew too great, and I dove in. Here is that dive.

Model development — the actual practice of going from "I have data and a problem" to "I have a model I'm confident enough to ship" — is the messy middle of machine learning. It's the part between the clean diagrams in textbooks and the polished metrics in blog posts. It involves experiment tracking, reproducibility, hyperparameter optimization, model registries, versioning, A/B testing, offline evaluation, baseline models, and documentation. Each of these sounds like a separate topic. In practice, they're all woven into a single loop that you run dozens or hundreds of times before you ship anything.

Before we start, a heads-up. We're going to be working through a concrete example — building a churn prediction model for a subscription service — and we'll use it as a vehicle to touch every part of this lifecycle. You don't need prior experience with any specific tool. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The map for this journey:

The Baseline Nobody Wants to Build

Experiment Tracking: The Lab Notebook You Wish You'd Started Earlier

Reproducibility: Same Code, Same Data, Different Results

Hyperparameter Optimization: Searching Smarter

Rest Stop

The Model Registry: Where Models Go to Grow Up

Offline Evaluation: Measuring What Matters Before Anyone Sees It

A/B Testing: Letting Users Vote

Model Cards and Documentation: The README for Your Model

The Full Lifecycle

Resources and Credits

The Baseline Nobody Wants to Build

Here is our scenario. We work at a subscription-based music streaming service — let's call it TuneKeep — and we need to predict which users will cancel their subscription next month. We have 50,000 users, 18 months of behavioral data, and a mandate from the business team: "Build something that catches churners so we can send them offers."

The temptation — and I fell for this more times than I care to count — is to go straight for the interesting model. Gradient-boosted trees. A neural network. Something with "attention" in the name. But here's the thing: without a baseline, you have no idea whether your sophisticated model is genuinely good or whether a coin flip would do the same job.

A baseline model is the dumbest thing that could work. For TuneKeep, that might be a rule: "if a user hasn't listened to any music in the last 14 days, predict churn." No machine learning. No features beyond one. That rule, applied to our historical data, catches 41% of churners with 67% precision. That's our floor. Everything else gets measured against it.
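Written out, the rule is a single comparison. A minimal sketch, assuming the dataframe carries the days_since_listen and churned columns used throughout this section:

from sklearn.metrics import precision_score, recall_score

# heuristic baseline: no listening in the last 14 days -> predict churn
rule_pred = (df["days_since_listen"] >= 14).astype(int)

print(f"Precision: {precision_score(df['churned'], rule_pred):.2f}")
print(f"Recall:    {recall_score(df['churned'], rule_pred):.2f}")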

The next baseline is the simplest statistical model that learns from data. Logistic regression with three features: days since last listen, total listening hours last month, and number of playlist additions.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

features = ["days_since_listen", "hours_last_month", "playlist_adds"]

baseline = LogisticRegression()
baseline.fit(X_train[features], y_train)
y_pred = baseline.predict(X_val[features])

print(f"Baseline F1: {f1_score(y_val, y_pred):.3f}")
# Output: Baseline F1: 0.583

0.583 F1. Not spectacular. But now we have a number. Every future model earns its right to exist by beating 0.583, and any improvement that doesn't clear that bar is noise pretending to be signal.

I'll be honest — early in my career, I skipped baselines because they felt like a waste of time. They're not. They're the anchor that keeps you from drifting into self-deception. I've watched teams spend months tuning a complex model only to discover that a three-line rule matched its performance. The baseline would have caught that in an afternoon.

The limitation of our baseline is obvious: we picked three features by gut feel, and logistic regression assumes a linear decision boundary. We need a way to explore more features and more complex models without losing track of what we tried. That's where experiment tracking comes in.

Experiment Tracking: The Lab Notebook You Wish You'd Started Earlier

It's Monday. We train our baseline. By Friday, we've run 47 experiments — different feature sets, different models, different hyperparameters. One of them hit 0.72 F1. But which one? What learning rate? What features? What version of the data?

If we can't answer those questions, we can't reproduce our best result. And if we can't reproduce it, we can't ship it. We can't even prove it happened.

An experiment tracker is a system that records everything about each training run — the inputs, the configuration, the outputs — so you can look back and reconstruct any result. Think of it like a lab notebook for a chemist. Except the chemist's notebook doesn't also need to track which version of the periodic table she was using, what temperature the room was, and whether she shuffled the reagents before starting. Ours does.

The things most people log: hyperparameters, final metrics, model checkpoints. The things that separate production-grade tracking from notebook experiments: the data version (a hash or DVC pointer to the exact slice of data), the code version (a git commit SHA, not "I think I used the feature_v3 branch"), the environment (library versions, GPU model, CUDA version — because results genuinely differ across GPU architectures), and training wall time (because a 2% accuracy gain that costs 10x more compute may not be worth it).

MLflow

The most widely adopted open-source experiment tracker. You self-host it, you own your data, and it integrates with a broader ecosystem that includes model registry and serving. The right choice for teams that care about data sovereignty or work in air-gapped environments.

Here's what tracking our TuneKeep experiments looks like with MLflow. Each with mlflow.start_run() block is one experiment — one row in our lab notebook.

import time

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

mlflow.set_experiment("tunekeeper-churn")

with mlflow.start_run(run_name="rf_v1_engagement_features"):
    mlflow.log_params({
        "n_estimators": 200,
        "max_depth": 10,
        "min_samples_leaf": 5,
        "feature_set": "v2_with_engagement",
        "data_version": "d4a8e3f",       # hash of training CSV
        "code_version": "abc1234",        # git SHA
    })

    model = RandomForestClassifier(n_estimators=200, max_depth=10,
                                   min_samples_leaf=5, class_weight="balanced")
    start = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    y_pred = model.predict(X_val)

    mlflow.log_metrics({
        "f1": f1_score(y_val, y_pred),
        "precision": precision_score(y_val, y_pred),
        "recall": recall_score(y_val, y_pred),
        "train_time_seconds": elapsed,
    })
    mlflow.sklearn.log_model(model, "model")

That call to mlflow.set_experiment groups all our TuneKeep churn runs together. Within the run, log_params captures the full configuration — not only the model hyperparameters, but the data version and code version too. log_metrics records the results. And log_model saves the trained artifact itself, so we can load and serve it later without hunting for a pickle file in some forgotten directory.

The power isn't in any single logged value. It's in the combination. Three weeks from now, when someone asks "why did we choose the Random Forest with engagement features over the gradient-boosted one with behavioral features?" we can pull up both runs, compare them side by side, and point at the numbers. No guessing. No "I think."

Weights & Biases

Where MLflow is the reliable workhorse, Weights & Biases (W&B) is the one with the beautiful dashboards. It's a hosted service — SaaS by default, self-hosted option for enterprise — with real-time visualization, automatic sweep plots, and the ability to log rich media (images, audio, custom plots) alongside your loss curves. For deep learning work where you want to see sample predictions evolve across training, it's hard to beat.

import wandb

# assumes model, train_loader, val_loader, and the train/eval helpers
# are defined elsewhere
wandb.init(project="tunekeeper-churn", name="lstm_v1",
           config={"model": "lstm", "lr": 1e-3, "batch_size": 64,
                   "hidden_dim": 128, "data_version": "d4a8e3f"})

for epoch in range(50):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_f1 = evaluate(model, val_loader)

    wandb.log({"train_loss": train_loss, "val_loss": val_loss,
               "val_f1": val_f1, "epoch": epoch})

wandb.finish()

The interface difference is real. With MLflow, you query your experiment store and build comparisons yourself. With W&B, you open a browser and the comparison dashboards are already there — interactive, filterable, shareable with teammates who can leave comments on individual runs.

Many production teams use both. W&B for interactive exploration during development, MLflow's model registry for the formal handoff to deployment. They solve different problems at different stages of the lifecycle, and the overhead of running both is lower than the cost of choosing wrong.

But all the experiment tracking in the world is worthless if we can't actually reproduce a result we logged. And that turns out to be harder than it sounds.

Reproducibility: Same Code, Same Data, Different Results

I'll be honest — the first time I ran the exact same training script twice and got different F1 scores, I assumed I'd made a mistake. I hadn't. The script was correct. The data was identical. The hyperparameters were frozen. But the model initialized its weights randomly, the data loader shuffled batches in a different order, dropout chose different neurons to mask, and the GPU used a non-deterministic algorithm for a particular matrix operation. Four invisible coin flips, each changing the final result by a fraction of a percent.

For TuneKeep, this matters. If our best model scored 0.72 on Monday but scores 0.70 on Wednesday with the same code and data, we can't tell whether our "improvement" from last week's experiment was genuine or whether we got lucky with the random seed.

Reproducibility means controlling every source of randomness so that the same inputs always produce the same outputs. There are three layers to this, and each one is harder than the last.

Layer 1: Seeds. Every library that uses randomness has a seed — a starting value for its pseudorandom number generator. Pin all of them.

import random

import numpy as np
import torch

def set_all_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

That cudnn.deterministic = True line forces the GPU to use deterministic algorithms for operations like convolutions, at the cost of speed. Without it, cuDNN picks the fastest algorithm available, which sometimes uses non-deterministic reductions. The benchmark = False line disables cuDNN's auto-tuner, which would otherwise pick different algorithms for different input sizes across runs.

Even with all of this, some CUDA operations — like atomicAdd, which many scatter and gather operations use internally — are inherently non-deterministic because the order of floating-point additions depends on thread scheduling. The call torch.use_deterministic_algorithms(True) forces PyTorch to error out instead of silently giving non-deterministic results. Some operations have no deterministic implementation at all. You discover this at runtime, which is its own kind of fun.
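In code, that strict mode is one call. One caveat worth knowing from the PyTorch docs: on CUDA 10.2 and later, cuBLAS also needs an environment variable set before CUDA initializes, or deterministic mode will raise an error:

import os

# must be set before CUDA initializes (PyTorch requirement for deterministic cuBLAS)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
torch.use_deterministic_algorithms(True)  # raise instead of silently running non-deterministic ops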

Layer 2: Environment pinning. Even with seeds locked, running the same code with NumPy 1.24 versus 1.26 can produce different results because internal implementations change. The fix is to freeze the entire software environment.

# Pin exact versions — not >=, not ~=, exact ==
# requirements.txt
torch==2.1.0
numpy==1.24.3
scikit-learn==1.3.2
pandas==2.1.4

This works for libraries, but it doesn't capture the operating system, the GPU driver, or the CUDA toolkit version. For that, we need the third layer.

Layer 3: Docker. A Docker container bundles everything — the OS, the drivers, the Python version, every library — into a single reproducible image. Think of it as a photograph of your entire computing environment, frozen in time.

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3.10 python3-pip
COPY requirements.txt .
RUN python3 -m pip install -r requirements.txt

COPY . /app
WORKDIR /app
ENTRYPOINT ["python3", "train.py"]

That FROM nvidia/cuda:12.1.0 line pins the exact CUDA version. The base image is tagged, not :latest. The requirements file uses exact versions. If we build this image, push it to a registry, and tag it with our experiment ID, we can recreate the exact training environment months or years later.

I'm still developing my intuition for where the cost-benefit line falls between "pin everything in Docker" and "pin seeds and requirements and accept small variations." For model comparison during development, seeds plus pinned libraries is usually enough — the variations are smaller than the real differences between models. For regulatory environments where you need to prove you can reproduce an exact result, Docker is the floor, not the ceiling.

With reproducibility handled, we face a new problem. Our TuneKeep Random Forest has hyperparameters — n_estimators, max_depth, min_samples_leaf — and we picked them by gut feel. There might be a much better combination we haven't tried. But with five hyperparameters, each having ten possible values, that's 100,000 combinations. We need a smarter way to search.

Hyperparameter Optimization: Searching Smarter

A hyperparameter is a setting we choose before training begins — learning rate, tree depth, regularization strength — as opposed to a parameter, which the model learns from data (like weights in a neural network). Hyperparameter optimization (HPO) is the process of finding the combination of hyperparameters that gives the best performance.

The naive approach is grid search: define a grid of values for each hyperparameter and try every combination. For our TuneKeep model, we might try max_depth in [5, 10, 15] and n_estimators in [100, 200, 500]. That's 9 combinations. Manageable. But add min_samples_leaf in [1, 5, 10], learning_rate in [0.01, 0.1, 0.3], and subsample in [0.7, 0.8, 1.0] — now we have 243 combinations. The curse of dimensionality hits hyperparameter search the same way it hits everything else.

Random search improves on this by sampling randomly from the ranges. The surprising insight, shown by Bergstra and Bengio in 2012, is that random search often finds better configurations than grid search with the same computational budget. The reason: in most problems, one or two hyperparameters matter far more than the rest. Grid search wastes most of its budget testing different values of unimportant hyperparameters. Random search, by sampling randomly, is more likely to explore different values of the important ones.
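Before moving to smarter methods, here's a minimal sketch of random search with scikit-learn's RandomizedSearchCV; the distributions and trial budget are illustrative:

from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(3, 20),
        "learning_rate": loguniform(1e-3, 0.3),   # log scale, like Optuna's log=True below
    },
    n_iter=50,        # 50 random draws instead of a full grid
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)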

But both approaches are blind. They don't learn from previous trials. If the last 20 experiments showed that learning rates below 0.001 always perform poorly, grid and random search will keep trying them anyway.

Optuna and the TPE Algorithm

Optuna is a Python library for hyperparameter optimization that learns from its own history. Its default algorithm, TPE (Tree-structured Parzen Estimator), works like this.

Imagine we've run 30 trials so far. TPE takes the results and splits them into two groups: the "good" trials (say, the top 20% by validation score) and the "bad" trials (the rest). It then asks: what do the hyperparameters of good trials look like, and what do the hyperparameters of bad trials look like? For each hyperparameter, it fits a probability distribution to each group — one distribution describing "values that led to good results," another describing "values that led to bad results." To choose the next trial, it samples a batch of candidates from the good distribution and keeps the one that is most likely under the good distribution relative to the bad one. That ratio — good likelihood divided by bad likelihood — is the acquisition function that guides the search.

The beauty of this approach is that it builds a model of what works without building a model of the entire objective surface. It doesn't try to predict what score a given configuration will achieve. It only tries to distinguish configurations that are likely to be good from those that are likely to be bad. That distinction is easier to learn, which is why TPE works well even with a small number of trials.

Here's what it looks like for TuneKeep:

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 20),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
    }
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)
    return f1_score(y_val, model.predict(X_val))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(f"Best F1: {study.best_value:.3f}")
print(f"Best params: {study.best_params}")

That log=True on the learning rate tells Optuna to sample on a logarithmic scale — so it explores 0.001 to 0.01 as thoroughly as 0.01 to 0.1. Without it, the search would spend most of its budget in the 0.1–0.3 range, which dominates a linear scale even though it covers only a narrow slice of the useful search space.

After 100 trials, Optuna finds an F1 of 0.74 — beating our hand-tuned 0.72 without us making a single decision about what to try next.

Ray Tune and Early Stopping

When each trial takes hours — training a deep learning model, for instance — running 100 of them sequentially is painful. Ray Tune solves this with two ideas: parallelism and early stopping.

Parallelism is straightforward: Ray distributes trials across multiple GPUs or machines. The interesting part is the scheduler. Ray Tune's ASHA (Asynchronous Successive Halving Algorithm) starts many trials with a small resource budget — say, 5 epochs each. After 5 epochs, it looks at the results and kills the bottom half. The survivors get 10 more epochs. Kill the bottom half again. Continue until only a few trials remain, having received the full training budget.

The insight is that a model that performs terribly after 5 epochs rarely becomes the best model after 50 epochs. By cutting losers early, ASHA explores more configurations with the same total compute. In practice, it can evaluate 10x more configurations than a naive approach with the same GPU budget.

The "asynchronous" part means trials don't have to wait for each other. If trial 17 finishes its first 5 epochs while trial 3 is still on epoch 2, ASHA can immediately decide trial 17's fate. This keeps GPU utilization high and avoids the idle time that plagues synchronous approaches.

One subtlety that tripped me up: ASHA assumes that relative ranking between trials is stable across training. If model A is worse than model B after 5 epochs, ASHA assumes A will still be worse after 50 epochs. This is usually true for the same architecture with different hyperparameters, but it can break when comparing architectures with very different learning dynamics. A large model might start slow and overtake a small model later. I haven't found a universal rule for when to trust early stopping and when to let everything run to completion — it depends on how much the learning curves cross over in your specific domain.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a mental model for the core development loop: start with a baseline, track every experiment with MLflow or W&B, pin your seeds and environment for reproducibility, and use Optuna or Ray Tune to search hyperparameters instead of guessing. For our TuneKeep churn model, we went from a 0.583 baseline to 0.74 through tracked, reproducible experiments with automated hyperparameter search.

That's a solid workflow for building models on your own machine. It doesn't tell the full story, though. Once a model is good enough to show other people — teammates, stakeholders, production systems — we need infrastructure for versioning it, evaluating it against real users, and documenting what it does and doesn't do well. That's the bridge between "I trained a good model" and "we deployed a reliable system."

The short version of what comes next: model registries give you version control for trained models (like Git for weights), offline evaluation strategies tell you whether improvements are real before you expose users to them, A/B testing lets users vote on which model is better, and model cards document the model for everyone who will touch it after you.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

The Model Registry: Where Models Go to Grow Up

Our Optuna sweep found a good configuration. We've trained the final model. Now what? We need somewhere to put it — not a file on our laptop, but a central place where the model is versioned, has metadata attached, and can be promoted through stages on its way to production.

A model registry is to trained models what Git is to source code. Every registered model gets a name, a version number, and a lifecycle stage. The standard stages flow like this:

None → Staging → Production → Archived

When we first register our TuneKeep churn model, it enters as version 1, stage "None" — it exists, but no one trusts it yet. After offline evaluation passes, we promote it to "Staging," which means it's a candidate for production but still being tested. After it survives an A/B test, we promote it to "Production." When a newer version replaces it, we move it to "Archived."

import mlflow

# Register the model from our best experiment run
model_uri = f"runs:/{best_run_id}/model"
registered = mlflow.register_model(model_uri, "tunekeeper-churn-predictor")

# Promote to staging after offline eval passes
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="tunekeeper-churn-predictor",
    version=registered.version,
    stage="Staging"
)

Each version carries metadata: which experiment run produced it, what data it trained on, what its evaluation metrics were, who promoted it, and when. This is the audit trail. Six months from now, when the model starts behaving oddly and someone asks "what changed?", the registry tells you that version 3 was promoted to production on March 15 after passing an A/B test with 2.1% lift in retention, and it was trained on data from January through February.

Model versioning is more nuanced than code versioning. With code, version 2 is a strict replacement for version 1 — you wouldn't run both simultaneously. With models, you often run multiple versions in parallel during A/B tests, or serve different versions to different user segments. The registry needs to support this: querying which versions are in production, rolling back to a previous version if the new one underperforms, and keeping old versions around (not deleted) in case you need to compare or restore them.
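Both operations are a few lines with the MLflow client; the version number in the rollback is illustrative:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Which version is serving right now?
for mv in client.get_latest_versions("tunekeeper-churn-predictor", stages=["Production"]):
    print(mv.version, mv.current_stage)

# Roll back by re-promoting an earlier, still-stored version (say, version 2)
client.transition_model_version_stage(
    name="tunekeeper-churn-predictor", version=2, stage="Production"
)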

A principle I learned the hard way: never overwrite a model artifact. Every version is immutable once registered. If you need to retrain, register a new version. This sounds wasteful until the day you need to roll back, and the model you want to restore has been overwritten by the one that's failing. Immutable artifacts are insurance you're grateful for when you need them.

The registry gives us a reliable place to store and promote models. But promoting a model to "Staging" requires evidence that it's good enough. That evidence comes from offline evaluation — and not all offline evaluation strategies are created equal.

Offline Evaluation: Measuring What Matters Before Anyone Sees It

Offline evaluation means measuring a model's quality on held-out data, before exposing it to real users. It's cheap, fast, and the primary feedback loop during development. It's also deceptive, because the gap between "good offline" and "good in production" can be enormous.

The classic approach is a random train/test split. Shuffle the data, hold out 20%, train on the rest, measure on the held-out portion. For many problems, this is fine. For TuneKeep, it's dangerously misleading.

The reason: our data is temporal. Users who churned in January behave differently from users who churned in June, because the product changed, a competitor launched, and macroeconomic conditions shifted. A random split mixes January churners into the training set and June churners into the test set. The model gets to train on the future before predicting the past. This is temporal leakage — a form of data leakage specific to time-ordered data — and it inflates offline metrics to levels that production performance will never reach.

The fix is a temporal split: train on everything before a cutoff date, test on everything after it.

# Temporal split: train on Jan-Oct, test on Nov-Dec
cutoff = "2024-10-31"
train_df = df[df["snapshot_date"] <= cutoff]
test_df  = df[df["snapshot_date"] > cutoff]

A single temporal split gives us one estimate. To get a more robust picture, we use backtesting — also called rolling-window or expanding-window validation. We simulate what would have happened if we'd deployed the model in March, then April, then May, retraining each time on all data available up to that point.

# Expanding window backtest
months = ["2024-03", "2024-04", "2024-05", "2024-06", "2024-07"]
for test_month in months:
    train = df[df["month"] < test_month]
    test  = df[df["month"] == test_month]
    model.fit(train[features], train["churned"])
    score = f1_score(test["churned"], model.predict(test[features]))
    print(f"Test month {test_month}: F1 = {score:.3f}")

If performance is stable across all five windows, we can be more confident that the model generalizes. If it swings wildly — 0.74 in March, 0.61 in May, 0.73 in July — the model might be sensitive to seasonal patterns or distribution shifts, and we need to understand why before we ship it.

There's another trap I fell into early on. Model A scores 0.72 F1. Model B scores 0.74 F1. Is B better? Maybe. With a test set of 2,000 examples, that 0.02 difference might be noise. A paired bootstrap test helps: resample the test set thousands of times, compute the metric difference each time, and check whether the difference consistently favors B.

import numpy as np

def paired_bootstrap(y_true, preds_a, preds_b, metric_fn, n_boot=10000):
    # resample the test set with replacement; measure how often B beats A
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = np.random.choice(n, size=n, replace=True)
        diff = metric_fn(y_true[idx], preds_b[idx]) - metric_fn(y_true[idx], preds_a[idx])
        diffs.append(diff)
    diffs = np.array(diffs)
    # one-sided p-value: fraction of resamples where B fails to beat A
    p_value = np.mean(diffs <= 0)
    return {"mean_diff": diffs.mean(), "p_value": p_value}

If the p-value is above 0.05, the difference isn't statistically significant. I've learned to resist the temptation to claim victory on a 0.02 improvement without this check. "The new model is better" sounds convincing in a slide deck. "The new model is 0.02 better and the confidence interval includes zero" does not. Both statements describe the same result — only one is honest.

Offline metrics tell us whether a model learned something useful. They don't tell us whether users will notice or whether it will move the business metric we care about. For that, we need to put the model in front of real people.

A/B Testing: Letting Users Vote

An A/B test splits live traffic between two models — the current production model (control, "A") and the new candidate (treatment, "B") — and measures which one performs better on a business metric. For TuneKeep, the metric might be 30-day retention among users who received a "stay" offer, because the point of our churn model is to target those offers at the right people.

The mechanics are deceptively straightforward: randomly assign 90% of users to the old model and 10% to the new one. Serve predictions accordingly. Wait. Measure.

The pitfalls, however, have eaten entire teams alive. Three mistakes come up over and over again.

Peeking. You check the dashboard daily. On day 3, the new model shows a 4% lift with p < 0.05. You celebrate and ship it. The problem: checking repeatedly and stopping as soon as you see significance massively inflates false positive rates. If you check every day for 14 days, your effective false positive rate is closer to 25% than 5%. The fix is to decide your sample size upfront — based on the minimum effect size you care about — and not look until you've collected enough data. This requires discipline that feels unnatural, and I still find it hard.
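That upfront sample-size decision is a standard power calculation. A sketch with statsmodels, where the 60% baseline retention and the 2-point minimum detectable lift are assumed numbers, not TuneKeep's real ones:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# assumed: 60% baseline retention, and we only care about lifts of 2+ points
effect = proportion_effectsize(0.62, 0.60)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(f"Collect ~{n_per_arm:.0f} users per arm before looking at the result")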

Wrong randomization unit. If you randomize by request instead of by user, the same user might see model A on Monday and model B on Tuesday. Their experience is contaminated by both models, and you can't attribute outcomes to either one. Always randomize by the entity whose behavior you're measuring — users, in our case.
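A common way to get stable user-level assignment is deterministic hashing; the salt string here is illustrative:

import hashlib

def assign_variant(user_id: str, salt: str = "tunekeeper-churn-ab") -> str:
    # the same user always lands in the same bucket, across requests and days
    bucket = int(hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 10 else "control"   # 10% to the new model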

Running too short. Some effects take time to manifest. A recommendation model might show no short-term difference in click-through rate, but over three weeks, the better model increases session length because users develop trust in the recommendations. If we stopped the test after four days, we'd miss this entirely.

Before a full A/B test, many teams run a shadow deployment: the new model receives live traffic and produces predictions, but those predictions are never shown to users. The production model still serves all real responses. This lets you compare the new model's predictions against the old model's on real data, catch crashes and latency issues, and validate that nothing breaks — all without risking user experience. Think of it as a dress rehearsal before opening night.
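In outline, the serving path for a shadow deployment is short. A sketch where production_model, candidate_model, and log_shadow stand in for your own serving and logging code:

def handle_scoring_request(user_id, features):
    served = production_model.predict_proba(features)[:, 1]   # what the user actually gets
    shadow = candidate_model.predict_proba(features)[:, 1]    # computed and logged, never served
    log_shadow(user_id, served=served, shadow=shadow)         # compare the two offline later
    return served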

After shadow deployment looks clean, a canary release exposes the new model to a tiny fraction of traffic — say, 1%. Monitor closely. If metrics hold, ramp to 5%, then 10%, then 50%, then 100%. This graduated rollout limits the blast radius if something goes wrong. It's the bridge between "offline metrics look good" and "we're confident enough to serve everyone."

The progression — shadow → canary → full A/B test → rollout — is the standard path from "this model exists" to "this model serves all users." Each stage adds confidence while limiting risk.

Model Cards and Documentation: The README for Your Model

Here is something that caught me off guard. The model works. It passed offline evaluation. It survived an A/B test. It's in production. Six months later, a new team member asks: "What does this model do? What data was it trained on? Where does it fail?" And nobody can answer confidently, because the person who trained it left the company, and the experiment logs show metrics but not context.

A model card, proposed by Margaret Mitchell and collaborators at Google in 2019, is a structured document that accompanies a trained model — a README, but for a model instead of a codebase. The idea emerged from a real problem: as ML models get deployed in sensitive contexts (hiring, lending, content moderation), the people making deployment decisions need to understand what the model does and where it fails, even if they weren't involved in building it.

A model card for our TuneKeep churn predictor would include:

Model details: GradientBoostingClassifier, trained January 2024, version 3, trained by the retention team. Input: 23 user behavioral features from the last 30 days. Output: probability of churn in the next 30 days.

Intended use: Prioritize retention offers for users likely to churn. Not intended for automated account termination or pricing decisions.

Training data: 50,000 TuneKeep users, Jan 2023 – Oct 2023. Class balance: 12% churners, 88% retained. Class imbalance addressed with balanced class weighting during training, not by resampling.

Evaluation results: F1 = 0.74, precision = 0.71, recall = 0.77 on Nov–Dec 2023 temporal holdout. Performance broken down by user segment: power users (F1 = 0.81), casual listeners (F1 = 0.68), free-tier users (F1 = 0.59).

Limitations and failure modes: Performs poorly on users with fewer than 7 days of activity history (cold-start problem). Does not account for external factors like competitor launches or credit card expiration. Performance degrades by ~3% F1 per month without retraining.

Ethical considerations: Model uses listening behavior and engagement metrics, not demographic data. However, engagement patterns correlate with age and income; disparate impact across demographic groups has not been formally tested.

That last section is the one most teams skip. It's also the most important one. I'll be honest — writing "we haven't tested for disparate impact" feels uncomfortable. It's supposed to feel uncomfortable. The model card isn't meant to make you look good. It's meant to tell the truth about what the model does and doesn't do, so the next person who touches it can make informed decisions.
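One practical place to keep the card is next to the model itself. A sketch using the MLflow client's model-version description, with the card text abbreviated:

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.update_model_version(
    name="tunekeeper-churn-predictor",
    version=3,
    description=(
        "Churn predictor v3. Intended use: prioritize retention offers; "
        "not for account termination or pricing. F1=0.74 on Nov-Dec 2023 holdout. "
        "Known gaps: cold-start users (<7 days history); disparate impact untested."
    ),
)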

Beyond model cards, the everyday documentation that makes a difference is unglamorous: a decision log ("We tried LSTM but Random Forest was 0.03 better F1 and 20x faster to train, so we went with RF"), a data dictionary ("days_since_listen: integer, capped at 90, 0 means listened today"), and a runbook ("To retrain, run scripts/train.py with the latest monthly snapshot. Expected training time: 12 minutes on a single GPU. If F1 drops below 0.70, escalate to the retention ML team").

None of this is exciting. All of it saves dozens of hours when something breaks at 2 AM and the person on call wasn't the one who built the model.

The Full Lifecycle

Let's zoom out and trace the complete path our TuneKeep model traveled. This is the model development lifecycle — not as a textbook diagram, but as the sequence of things that actually happened.

We started with a business problem: predict churn to target retention offers. We built a baseline — a heuristic rule, then logistic regression — and established an F1 floor of 0.583. We set up experiment tracking with MLflow, so every attempt was logged with its configuration, data version, and results. We pinned our seeds and environment for reproducibility, ensuring that improvements were real and not random noise.

We ran an Optuna hyperparameter search, exploring 100 configurations of a GradientBoostingClassifier, and found a configuration that scored 0.74 F1. We registered the trained model in the MLflow model registry as version 1, stage "None."

We ran offline evaluation using temporal splits and backtesting, confirming that the 0.74 held across multiple months. We used a paired bootstrap test to verify the improvement over the baseline was statistically significant. We promoted the model to "Staging."

We deployed a shadow version alongside the existing rule-based system, confirmed there were no crashes or latency issues, then ran a canary deployment to 5% of users. After a week with stable metrics, we launched a full A/B test: 50/50 split between the old rule-based system and the new model. After three weeks, the new model showed a 2.1% lift in 30-day retention among targeted users, with p < 0.01. We promoted the model to "Production."

We wrote a model card documenting the model's details, training data, performance across user segments, known limitations, and ethical considerations. We set up monthly retraining on fresh data, with automatic offline evaluation that flags any significant performance degradation before the new version gets promoted.

That loop — baseline → track → reproduce → optimize → register → evaluate offline → test online → document → retrain — is the lifecycle. It's not linear. We went back to the experiment tracking phase three times when results didn't make sense. We rebuilt the feature set once when we discovered temporal leakage. The diagram in a textbook shows this as a clean circle. In practice, it's more like a spiral with detours.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a folder of unlabeled checkpoints and no way to tell which model was best. We built baselines to anchor our expectations, set up experiment tracking so no result was ever lost, pinned every source of randomness for reproducibility, used TPE and ASHA to search hyperparameters without wasting compute, registered models in a versioned registry with promotion stages, evaluated them with temporal splits and bootstrap tests, tested them against real users with shadow deployments and A/B tests, and documented everything in model cards so the next person wouldn't have to start from scratch.

My hope is that the next time you train a model and it comes out well, instead of renaming it model_ACTUALLY_final_v3.pt and praying you remember what produced it, you'll log it, version it, test it, and document it — having a pretty good mental model of the infrastructure that turns a trained model into a reliable system.

Resources and Credits

MLflow Documentation — The definitive reference for experiment tracking, model registry, and serving. Thorough and well-organized.

Weights & Biases Docs — Excellent tutorials, especially for deep learning visualization workflows. The "Reports" feature alone is worth exploring.

Optuna Documentation — Clear explanation of TPE internals and practical recipes. The visualization tools for understanding search behavior are wildly helpful.

Model Cards for Model Reporting (Mitchell et al., 2019) — The O.G. paper on model documentation. Short, readable, and quietly influential on the entire industry.

A System for Massively Parallel Hyperparameter Tuning (Li et al., 2020) — The ASHA paper. Insightful if you want to understand the theory behind early stopping schedulers.

ML Reproducibility Checklist (Pineau et al.) — A one-page checklist that I print and tape above my desk. If you take one thing from this section, make it this.