Data Infrastructure

Chapter 13: ML Systems & Production, Section 2 of 9

I spent the first two years of my ML career believing the model was the product. I’d spend weeks squeezing out an extra 0.3% AUC with fancier architectures, then deploy and watch performance quietly degrade over months. Every post-mortem told the same story: it wasn’t the model. It was the data. A schema change nobody caught. A feature computed differently in training vs. serving. A labeling batch where the annotators used outdated guidelines. I kept treating data infrastructure as the boring plumbing I’d “deal with later,” and it kept biting me. Here is that overdue dive into the plumbing.

Data infrastructure is the set of systems that ingest, store, transform, validate, label, version, serve, and protect the data that ML models consume. It includes everything from where your bytes physically live (warehouses, lakes, lakehouses) to how features reach a model at serving time (feature stores), how pipelines are orchestrated (Airflow, Prefect, Dagster), how labels get created at scale (Labelbox, active learning), and how you ensure nothing silently rots along the way (Great Expectations, TFDV). The field has matured rapidly since roughly 2019, when companies began publishing the hard lessons of running ML in production.

Before we start, a heads-up. We’re going to be touching on distributed systems, streaming, orchestration, even a bit of cryptography when we reach differential privacy. You don’t need to know any of it beforehand. We’ll add what we need, one piece at a time.

This isn’t a short journey, but I hope you’ll be glad you came.

Contents

The running example: BiteBridge

Where data lives: lakes, warehouses, and lakehouses

How data moves: batch pipelines and streaming

Who tells the pipes when to run: orchestration

Rest stop

The training-serving skew trap, and feature stores

Versioning data like code

Getting labels: from spreadsheets to active learning

Trusting your data: validation and quality gates

Finding your data: data catalogs

Protecting your data: privacy

Wrap-up

Resources

The Running Example: BiteBridge

Throughout this section, we’ll build the data infrastructure for a tiny food delivery startup called BiteBridge. Right now, BiteBridge has three restaurants, five regular customers, and one overwhelmed engineer (you). The app tracks three events: order_placed, delivery_completed, and rating_submitted.

Our goal is to build a recommendation model that predicts which restaurant a user will order from next. To get there, we need to store these events somewhere, compute features from them, label historical data, validate everything, and serve features to the model at prediction time. Each section of this journey will add a piece of that puzzle.

Starting this small is deliberate. When you can trace every byte through every system with three restaurants, the architecture translates directly to three thousand. The problems don’t change. They magnify.

Where Data Lives: Lakes, Warehouses, and Lakehouses

The first decision any data team faces is where to put the data. This sounds trivial until you realize the choice cascades into everything downstream — what queries are fast, what’s cheap, and how much pain you’ll endure as your five customers become five million.

Imagine BiteBridge’s data as ingredients arriving at a kitchen. A data warehouse is a meticulously organized pantry: every jar labeled, every shelf designated, everything pre-sorted before it goes in. You define the schema before writing any data — a pattern called schema-on-write. Tools like BigQuery, Snowflake, and Redshift work this way. The query engine is built in, so analytical queries are fast. The tradeoff: you can’t store anything that doesn’t fit the predefined shelf layout. Images, audio files, free-form JSON logs — none of these fit naturally into a warehouse’s rigid table structure.

A data lake, by contrast, is a walk-in freezer where you dump everything in whatever container it arrived in. Raw JSON, Parquet files, images, log files — all tossed in. You figure out the structure later when you read it, which is called schema-on-read. Amazon S3 or Azure Data Lake Storage are common homes for lakes. The cost per gigabyte is rock-bottom because you’re using commodity object storage. But there’s a problem. Without structure, a data lake can degrade into what people grimly call a data swamp — terabytes of files that nobody understands, nobody trusts, and nobody can query efficiently.
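
To make schema-on-read concrete, here's a minimal sketch: the raw JSON events sit in the lake with no declared structure, and we impose a schema only at read time. The field names and path are illustrative, mirroring the lake layout used later in this section.

# Schema-on-read: structure is applied when reading, not when writing
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("schema_on_read_demo").getOrCreate()

order_schema = StructType([
    StructField("user_id", StringType()),
    StructField("restaurant_id", StringType()),
    StructField("order_total", DoubleType()),
    StructField("event_timestamp", TimestampType()),
])

# The same raw files could be re-read tomorrow with a different schema
orders = spark.read.schema(order_schema).json("s3://bitebridge-lake/raw/orders/")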

I’ll be honest — when I first encountered these two patterns, I mentally filed it as “warehouse = structured, lake = unstructured” and moved on. That was an oversimplification. The real issue is that ML teams need both: structured features for model training (warehouse strength) and raw data for reprocessing when you discover a new feature six months later (lake strength). Running both systems in parallel means syncing data between them, which creates its own category of headaches.

The data lakehouse emerged to end this split. It layers open table formats like Delta Lake or Apache Iceberg on top of cheap object storage, giving you ACID transactions, schema enforcement, and warehouse-style query performance on lake-priced storage. For BiteBridge, this means we can dump raw order_placed events into S3, then query them with SQL as if they were in a warehouse, and even “time travel” to see exactly what the data looked like on any given date.

# BiteBridge: time-travel query with Delta Lake
# "What did user features look like the day before that bad model deploy?"
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bitebridge_features").getOrCreate()

# Current version of the features table
current = spark.read.format("delta").load("s3://bitebridge-lake/user_features")

# Exact data from January 8th — before the bad deploy on January 9th
historical = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-08")
    .load("s3://bitebrige-lake/user_features")
)

That time-travel capability alone has saved me more debugging hours than any other single tool. When your model performance suddenly drops, the first question is always “did the data change?” With a lakehouse, you can answer that in minutes instead of days.

The limitation of the lakehouse pattern, and it took me a while to appreciate this, is that it solves storage and query but says nothing about how data gets there. We have a well-organized kitchen, but no delivery trucks. For that, we need pipelines.

How Data Moves: Batch Pipelines and Streaming

BiteBridge has three events flowing in: orders, deliveries, ratings. Right now, with five customers, we could process them by hand. But we’re building infrastructure that works when those five become five thousand. The question isn’t whether to automate data movement. It’s how frequently the data needs to arrive.

A batch pipeline runs on a schedule — hourly, daily, weekly. It wakes up, processes a chunk of accumulated data, writes the results somewhere, and shuts down. For BiteBridge’s recommendation model, yesterday’s orders are plenty fresh. We don’t need to know about an order placed 3 seconds ago to recommend a restaurant for tomorrow’s dinner. A nightly batch pipeline that computes “orders in the last 30 days per user per restaurant” does the job.

# BiteBridge: nightly batch pipeline with PySpark
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nightly_features").getOrCreate()

orders = spark.read.format("delta").load("s3://bitebridge-lake/raw/orders/")

user_restaurant_features = (
    orders
    .filter(F.col("order_date") >= F.date_sub(F.current_date(), 30))
    .groupBy("user_id", "restaurant_id")
    .agg(
        F.count("*").alias("orders_30d"),
        F.avg("order_total").alias("avg_order_value"),
        F.max("order_date").alias("last_order_date"),
    )
)
user_restaurant_features.write.mode("overwrite") \
    .format("delta") \
    .save("s3://bitebridge-lake/features/user_restaurant_stats/")

Batch is the workhorse of ML infrastructure. It is predictable, debuggable, and you only pay for compute when the job runs. If something goes wrong, you re-run it. That simplicity matters more than most architects want to admit.

But imagine BiteBridge adds a fraud detection model. A user places an order with a stolen credit card. You need to score that transaction before the delivery driver leaves the restaurant — within seconds, not hours. Batch fails here. You need a streaming pipeline that processes events continuously as they arrive.

Apache Kafka is the backbone of most streaming architectures. Think of it as an infinitely long conveyor belt. Producers (your app, your payment processor) drop events onto the belt, organized by topic. Consumers (your fraud model, your analytics dashboard) read events off the belt at their own pace. Crucially, events stay on the belt for days or weeks, so a slow consumer doesn’t lose data — it catches up later. This decoupling is Kafka’s superpower.
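
Here's a rough sketch of that decoupling using the kafka-python client. The broker address, topic name, and the downstream score_for_fraud call are illustrative assumptions, not part of any real BiteBridge setup.

# BiteBridge: dropping order events onto the belt and reading them back
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("order_placed", {"user_id": "u001", "restaurant_id": "r002", "order_total": 18.0})
producer.flush()

# A consumer reads at its own pace; the event stays on the topic for other consumers too
consumer = KafkaConsumer(
    "order_placed",
    bootstrap_servers="localhost:9092",
    group_id="fraud-scoring",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for event in consumer:
    score_for_fraud(event.value)  # hypothetical downstream fraud model call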

Apache Flink sits downstream of Kafka and does the heavy computation — windowed aggregations, joins across streams, complex event processing. If Kafka is the conveyor belt, Flink is the assembly line worker who watches the belt and computes “total spend by this user in the last 5 minutes” in real time. Flink handles the ugly realities of streaming: events arriving out of order, late-arriving data, exactly-once processing guarantees, and backpressure when data arrives faster than you can process it.

-- Conceptual Flink SQL: real-time fraud features over a Kafka stream
-- Count orders per user in a sliding 5-minute window
SELECT
    user_id,
    COUNT(*) AS orders_5min,
    SUM(order_total) AS spend_5min,
    COUNT(DISTINCT restaurant_id) AS unique_restaurants_5min
FROM orders_stream
GROUP BY
    user_id,
    HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE)

I’ll be honest: I’m still developing my intuition for when streaming is genuinely necessary vs. when it’s overengineering. The heuristic I’ve landed on is this — if the cost of a wrong decision made on stale data (e.g., approving a fraudulent transaction) exceeds the cost of running always-on streaming infrastructure, stream. For everything else, batch. Most teams overestimate how many of their features need to be real-time. BiteBridge’s recommendation model runs happily on nightly batch. The fraud model needs streaming. That’s two different pipelines, and that’s fine.

The limitation of both batch and streaming pipelines is that they solve data movement but not data orchestration. When your nightly batch has five steps that depend on each other, and step three fails at 3 AM, who restarts it? Who makes sure step four doesn’t run on half-written data? That’s where orchestrators come in.

Who Tells the Pipes When to Run: Orchestration

The first orchestrator every team uses is cron. A cron job at 2 AM runs a Python script. It works right up until it doesn’t — the script fails silently, nobody notices for a week, and by then the model has been training on stale data. Cron doesn’t know about dependencies. It doesn’t retry. It doesn’t alert. It’s a timer, not an orchestrator.

Apache Airflow replaced cron for most data teams around 2017. You define your pipeline as a DAG (Directed Acyclic Graph) — a fancy term for “a set of tasks where some must finish before others can start.” For BiteBridge, the DAG might be: ingest raw orders → compute features → validate features → train model. If ingestion fails, everything downstream halts automatically and you get paged. Airflow is battle-tested, has a massive ecosystem of connectors, and is the de facto standard at large companies.
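
A minimal sketch of that four-step DAG in Airflow might look like this. The task callables are empty stand-ins for the real pipeline steps.

# BiteBridge: the four-step DAG in Airflow
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stand-ins for the real pipeline steps
def ingest_orders(): ...
def compute_features(): ...
def validate_features(): ...
def train_model(): ...

with DAG(
    dag_id="bitebridge_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
    features = PythonOperator(task_id="compute_features", python_callable=compute_features)
    validate = PythonOperator(task_id="validate_features", python_callable=validate_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    # If ingestion fails, nothing downstream runs
    ingest >> features >> validate >> train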

The catch: Airflow was designed for data engineering, not ML. Dynamic workflows (where the number of tasks depends on the data) are awkward. Testing a DAG locally requires spinning up a scheduler. And the XCom system for passing data between tasks makes you feel like you’re working against the tool rather than with it.

Prefect and Dagster are the modern challengers. Prefect wraps plain Python functions with decorators, making pipelines feel like normal code. Dagster goes further with its software-defined assets model — instead of defining “tasks that run,” you define “datasets that exist,” and Dagster figures out what to run to produce them. For ML teams, Dagster’s asset-centric thinking maps naturally to the way we think about features, training sets, and models.

# BiteBridge: daily pipeline in Prefect
from prefect import flow, task

@task(retries=2, retry_delay_seconds=300)
def ingest_orders(date: str):
    raw = fetch_orders_from_api(date)
    write_to_lake(raw, f"s3://bitebridge-lake/raw/orders/{date}/")
    return raw

@task
def compute_features(raw_orders):
    features = build_user_restaurant_features(raw_orders)
    validate_features(features)  # halt if validation fails
    return features

@task
def retrain_if_needed(features):
    model = train_model(features)
    metrics = evaluate(model)
    if metrics["ndcg"] < 0.65:
        raise ValueError(f"NDCG {metrics['ndcg']:.3f} below threshold")
    return model

@flow(name="bitebridge-daily")
def daily_pipeline(date: str):
    raw = ingest_orders(date)
    features = compute_features(raw)
    retrain_if_needed(features)
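
For contrast, here's roughly the same pipeline in Dagster's asset-centric style described above: instead of tasks that run, you declare datasets that should exist, and Dagster derives the execution order from the function signatures. The helper functions are the same hypothetical stand-ins used in the Prefect flow.

# BiteBridge: the same pipeline as Dagster software-defined assets
from dagster import asset, materialize

@asset
def raw_orders():
    return fetch_orders_from_api("2024-01-15")  # hypothetical helper, as above

@asset
def user_features(raw_orders):
    # Dagster infers that this asset depends on raw_orders from the parameter name
    return build_user_restaurant_features(raw_orders)  # hypothetical helper

@asset
def recommendation_model(user_features):
    return train_model(user_features)  # hypothetical helper

# Ask for the assets you want to exist; Dagster works out what to run
materialize([raw_orders, user_features, recommendation_model])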

The specific orchestrator matters less than having any orchestrator. If you’re starting fresh, Prefect or Dagster will cause you less pain than Airflow. If your company already runs Airflow, don’t rewrite — it works fine. The important thing is that your pipelines have dependency management, retry logic, alerting, and a dashboard where you can see what ran and what failed.

With storage, pipelines, and orchestration in place, BiteBridge can move data from event to feature table reliably. But there’s a subtle trap waiting, and it won’t show up until we try to serve the model.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a mental model of the data infrastructure foundation: data lives in a lakehouse (combining the flexibility of a lake with the reliability of a warehouse), moves through batch or streaming pipelines depending on freshness requirements, and is orchestrated by tools like Airflow or Prefect so nothing runs out of order. That mental model is genuinely useful — it covers the architecture of most production ML systems.

What it doesn’t cover is the devil in the details. How do you keep features consistent between training and serving? How do you version data the way Git versions code? How do you get labeled training data without bankrupting your startup? How do you catch silent data corruption before it poisons your model? And how do you do all of this without violating your users’ privacy?

The short version: feature stores for consistency, DVC for versioning, active learning for efficient labeling, Great Expectations for validation, and differential privacy for protection. There. You’re 60% of the way.

But if the discomfort of not knowing what’s underneath is nagging at you, read on.

The Training-Serving Skew Trap, and Feature Stores

Here’s a war story. BiteBridge’s recommendation model trains offline with a feature called avg_order_value_30d. The training pipeline computes it with a Spark job over the lakehouse. The serving system, wanting low latency, computes the same feature in a separate Python microservice that reads from a Redis cache. Two codebases. Two definitions of “30 days.” The training pipeline uses calendar days. The microservice uses rolling 720 hours. For most users, the difference is negligible. But for users who order heavily on weekends, the two numbers diverge enough to skew predictions. The model performs worse in production than offline — but not by enough to trigger an alert, so nobody notices for three months.

This is training-serving skew, and I lost more sleep to it than to any model architecture problem. It’s insidious because the pipeline runs, the model trains, metrics look fine offline, and yet production performance slowly degrades.

A feature store kills this problem by giving you a single place to define, compute, store, and serve features. It has two storage layers: an offline store (typically a data warehouse or lakehouse) that holds the full history of feature values for training, and an online store (typically Redis or DynamoDB) that holds the latest values for low-latency serving. Critically, the same feature definition populates both stores. One codebase. No skew.

The piece that confused me the longest was point-in-time correctness. When training a model, you need to know what the features looked like at the moment the event happened, not what they look like today. If a user churned on January 15th, your training example should use their features from January 14th. Using today’s features to label a historical event is data leakage — the model peeks into the future. Feature stores handle this with point-in-time joins: given a list of entities and timestamps, they look up the feature values that were valid at each timestamp.

Let’s see this for BiteBridge using Feast, the most popular open-source feature store.

# bitebridge/feature_repo/features.py
from datetime import datetime, timedelta

import pandas as pd

from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# An entity is the "thing" features describe
user = Entity(name="user_id", join_keys=["user_id"])

# Source: where Feast reads the feature data
user_stats_source = FileSource(
    path="s3://bitebrige-lake/features/user_stats.parquet",
    timestamp_field="event_timestamp",
)

# Feature view: a named, versioned group of features
user_stats_fv = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),  # features older than 1 day are stale
    schema=[
        Field(name="orders_30d", dtype=Int64),
        Field(name="avg_order_value_30d", dtype=Float32),
        Field(name="unique_restaurants_30d", dtype=Int64),
    ],
    source=user_stats_source,
)
# Training: point-in-time correct feature retrieval
store = FeatureStore(repo_path="bitebridge/feature_repo")  # the repo defined above

entity_df = pd.DataFrame({
    "user_id": ["u001", "u002", "u003"],
    "event_timestamp": [
        datetime(2024, 1, 14),  # the day before u001 churned
        datetime(2024, 1, 20),
        datetime(2024, 2, 1),
    ]
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_stats:orders_30d",
        "user_stats:avg_order_value_30d",
    ],
).to_df()

# Serving: latest features, same definition, no skew
online_features = store.get_online_features(
    features=["user_stats:orders_30d", "user_stats:avg_order_value_30d"],
    entity_rows=[{"user_id": "u001"}],
).to_dict()

Same feature definitions. Same code path for both training and serving. That’s the whole point.

Tecton is the managed, commercial cousin of Feast (built by Feast’s original creators). It adds streaming feature transformations, built-in monitoring, and orchestration. If Feast is a self-assembled kitchen, Tecton is a restaurant-grade kitchen that comes pre-installed. The tradeoff is vendor lock-in and cost.

I should admit an oversimplification I made earlier. When I described the lakehouse as the answer to BiteBridge’s storage needs, I glossed over the fact that a lakehouse is great for analytical queries but terrible for point lookups. Looking up one user’s features in a Delta Lake table takes hundreds of milliseconds. A production serving endpoint needs single-digit milliseconds. That’s why feature stores have a separate online store backed by Redis or DynamoDB — it’s a fundamentally different access pattern.

Feature stores solve the consistency problem. But they assume the data in them is correct and reproducible. What happens when someone accidentally overwrites last month’s feature table? That’s where versioning comes in.

Versioning Data Like Code

Every engineer versions code with Git. But for some reason, the data that determines 80% of model performance often lives unversioned in a bucket somewhere with a filename like training_data_final_v3_FINAL.parquet. I’ve been that engineer. It’s embarrassing in retrospect.

The core problem: your model’s performance depends on the exact dataset it trained on. Six months from now, when accuracy drops, you need to answer “did the model change, or did the data change?” Without versioned data, you cannot reproduce any previous result. You’re flying blind.

DVC (Data Version Control) solves this by wrapping Git. It stores lightweight pointer files (a few bytes, containing a hash) in your Git repository, while the actual data — which might be gigabytes — lives in remote storage like S3 or GCS. When you git checkout a previous commit, the pointer file changes, and dvc checkout downloads the exact dataset from that commit. You get Git semantics — branches, tags, diffs — for datasets.

# BiteBridge: version a new training dataset
dvc init
dvc remote add -d storage s3://bitebridge-lake/dvc-store

# Track the dataset (DVC computes a hash, stores a pointer file)
dvc add data/training_set_2024q1.parquet
git add data/training_set_2024q1.parquet.dvc data/.gitignore
git commit -m "Training data: Q1 2024, 12k orders, 3 restaurants"
git tag dataset-v1

# Three months later, after adding two new restaurants...
dvc add data/training_set_2024q2.parquet
git add data/training_set_2024q2.parquet.dvc
git commit -m "Training data: Q2 2024, 48k orders, 5 restaurants"
git tag dataset-v2

# Reproduce the old experiment exactly
git checkout dataset-v1
dvc checkout      # pulls the exact Q1 dataset from S3
python train.py   # guaranteed identical to the original run

What the pointer file actually contains is a hash of the data — think of it as a fingerprint. If even one byte changes in the dataset, the hash changes, and DVC knows it’s a different version. The actual data never enters Git (which would choke on gigabyte files), but the identity of the data is tracked with full Git history.

DVC covers individual files and experiments well. But what about versioning an entire data lake, where multiple pipelines write concurrently? lakeFS takes a different approach: it gives your object storage Git-like branching. Create a branch, write experimental data to it, validate, merge to main. Your production pipeline never sees half-written data. For BiteBridge, this means a data engineer can test a new feature computation on a lakeFS branch without risking corruption of the production feature table.

Versioned data solves reproducibility. But there’s a resource that all the version control in the world can’t conjure from thin air: labels.

Getting Labels: From Spreadsheets to Active Learning

Supervised learning needs labels. For BiteBridge’s recommendation model, the “label” is which restaurant a user actually ordered from. That comes free — it’s in the order event. But imagine BiteBridge adds a food photo quality detector: is this photo appetizing enough to show on the app? Now someone needs to look at every photo and rate it. With five restaurants and ten photos each, one person can knock it out in an afternoon. With five thousand restaurants and twenty photos each, you need a system.

At scale, labeling means distributing tasks to multiple annotators, measuring how much they agree with each other (a metric called inter-annotator agreement, often quantified as Cohen’s kappa), embedding known-answer “gold questions” to catch people clicking randomly, and versioning the labeling guidelines so you know which instructions produced which labels. Labelbox and Scale AI are the dominant platforms for this workflow — Labelbox for teams that want to manage annotators in-house, Scale AI for teams that want to outsource the whole operation.
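
Measuring agreement is the easy part once two annotators have rated the same items. A minimal sketch with scikit-learn, using made-up ratings:

# Two annotators rate the same eight photos (1 = appetizing, 0 = not)
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance-level agreement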

The deeper problem is cost. Labeling is expensive, and not all labels are equally valuable. If your model already confidently classifies 95% of food photos, paying annotators to label those easy cases is waste. The photos that matter are the ambiguous ones — the ones the model is uncertain about.

Active learning exploits this insight. It trains a model on whatever labels you have, then asks: “which unlabeled examples would help me the most?” The simplest strategy is uncertainty sampling — find the examples where the model’s predicted probability is closest to 50/50 (for binary classification) and send those to annotators.

# BiteBridge: active learning for food photo quality
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_photos_for_labeling(labeled_X, labeled_y, unlabeled_X, budget=50):
    """Pick the `budget` most uncertain photos for human review."""
    model = LogisticRegression().fit(labeled_X, labeled_y)
    probs = model.predict_proba(unlabeled_X)

    # Uncertainty: how close to 50/50 is the model's prediction?
    uncertainty = 1 - np.max(probs, axis=1)

    # Return indices of the most uncertain photos
    most_uncertain = np.argsort(uncertainty)[-budget:]
    return most_uncertain

The wrinkle that took me a while to appreciate: active learning has a cold-start problem. If your initial model is garbage (because you have very few labels), it selects garbage examples — they’re “uncertain” for the wrong reasons. The fix is to bootstrap with around 200 randomly selected labels before switching to active selection, and to periodically sprinkle in random samples so the model doesn’t learn only about the decision boundary while ignoring the rest of the distribution.
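
One way to implement that random-plus-uncertain mix, reusing the selection function above (the 10% random fraction is an arbitrary illustrative choice):

def select_batch(labeled_X, labeled_y, unlabeled_X, budget=50, random_frac=0.1):
    """Mostly uncertainty sampling, plus a slice of random exploration."""
    n_random = int(budget * random_frac)
    uncertain_idx = select_photos_for_labeling(
        labeled_X, labeled_y, unlabeled_X, budget=budget - n_random
    )
    # Random picks from whatever the uncertainty pass didn't already select
    remaining = np.setdiff1d(np.arange(len(unlabeled_X)), uncertain_idx)
    random_idx = np.random.choice(remaining, size=n_random, replace=False)
    return np.concatenate([uncertain_idx, random_idx])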

Other selection strategies exist beyond uncertainty sampling: query-by-committee trains multiple models and selects examples where they disagree most, and diversity sampling spreads selections across the feature space to avoid labeling a hundred nearly identical photos. In practice, uncertainty sampling gets you 80% of the benefit with 20% of the complexity.

There’s another path entirely: what if domain experts could encode their knowledge as rules instead of labeling individual examples? A rule like “if the photo is too dark, it’s low quality” is noisy — it’ll get some wrong — but combine enough noisy rules with a system like Snorkel, and you get surprisingly usable labels. Snorkel’s label model learns which rules are accurate and which are correlated, producing probabilistic labels without any ground-truth examples. You still need a small manually-labeled validation set to check the output, but the approach can reduce labeling effort by an order of magnitude.
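
Here's a sketch of what that looks like with Snorkel. The photo metadata fields (brightness, blur_score, uploaded_by) and the thresholds are hypothetical; the real point is the pattern of writing rules, applying them, and fitting a label model.

# Snorkel: noisy rules in, probabilistic labels out
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, LOW_QUALITY, HIGH_QUALITY = -1, 0, 1

@labeling_function()
def lf_too_dark(photo):
    return LOW_QUALITY if photo.brightness < 40 else ABSTAIN

@labeling_function()
def lf_blurry(photo):
    return LOW_QUALITY if photo.blur_score > 0.7 else ABSTAIN

@labeling_function()
def lf_professional_upload(photo):
    return HIGH_QUALITY if photo.uploaded_by == "restaurant_admin" else ABSTAIN

# Hypothetical photo metadata; in practice this comes from the image pipeline
photos_df = pd.DataFrame({
    "brightness": [25, 80, 60],
    "blur_score": [0.2, 0.9, 0.1],
    "uploaded_by": ["customer", "customer", "restaurant_admin"],
})

applier = PandasLFApplier(lfs=[lf_too_dark, lf_blurry, lf_professional_upload])
L_train = applier.apply(df=photos_df)

# The label model learns which rules to trust and how they overlap
label_model = LabelModel(cardinality=2)
label_model.fit(L_train, n_epochs=500)
probabilistic_labels = label_model.predict_proba(L_train)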

Labeled, versioned data flowing through orchestrated pipelines into a feature store — this is starting to look like real infrastructure. But there’s a hole we haven’t addressed. How do we know the data is correct?

Trusting Your Data: Validation and Quality Gates

Here’s the thing about data bugs: they are silent. The pipeline runs. The model trains. No errors. Metrics look fine. Three months later, someone notices that recommendations have been terrible since February, and it turns out an upstream schema change renamed order_total to total_amount. The feature pipeline kept running — it filled the missing column with nulls, which the downstream feature code quietly treated as zero. The model trained happily on a dataset where every order appeared to have zero value. No one was alerted because nothing technically failed.

I’m still not over this one. It happened to me, and the fix took thirty minutes. The three months of degraded predictions were not recoverable.

Data validation is not optional. It is a quality gate — a checkpoint in the pipeline that halts everything if the data doesn’t meet explicit expectations.

Great Expectations is the most popular validation framework. You write “expectations” in code — statements about what your data should look like — and the tool checks them on every pipeline run. Think of expectations as unit tests, but for data instead of code.

# BiteBridge: data quality gate before feature computation
import great_expectations as gx

context = gx.get_context()

# Define what "correct" looks like for our order data (GX 1.x-style API)
suite = context.suites.add(gx.ExpectationSuite(name="order_data_quality"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_total")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="order_total", min_value=0.01, max_value=500.0
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeGreaterThan(value=100)
)
# If any expectation fails, the pipeline halts. No silent corruption.

Google’s TensorFlow Data Validation (TFDV) takes a complementary approach: instead of writing rules manually, it auto-generates a schema from your training data, then flags anomalies when new data deviates. A column that used to have 3 categories now has 47? Flagged. The mean of a feature shifted by three standard deviations? Flagged. TFDV is especially good at catching distribution drift — slow, silent changes in the data that gradually poison your model.

# TFDV: auto-detect schema anomalies
import tensorflow_data_validation as tfdv

# Generate baseline from known-good training data
train_stats = tfdv.generate_statistics_from_csv("data/orders_train.csv")
schema = tfdv.infer_schema(train_stats)

# Validate new batch against baseline
new_stats = tfdv.generate_statistics_from_csv("data/orders_2024_02.csv")
anomalies = tfdv.validate_statistics(new_stats, schema)
tfdv.display_anomalies(anomalies)
# Output: "restaurant_id: Unexpected string values ['NEW_REST_006', 'NEW_REST_007']"

Great Expectations and TFDV are not either/or. Great Expectations is better for explicit business rules (“order_total must be positive”). TFDV is better for detecting statistical drift (“the distribution of this feature changed”). In a mature pipeline, you run both.

The minimum you should validate in any ML pipeline: schema integrity (column names and types haven’t changed), completeness (null rates within bounds), volume (row counts in expected range — catches silent upstream outages), distribution stability (feature statistics haven’t drifted dramatically), freshness (the most recent timestamp is actually recent), and uniqueness (primary keys are unique). If your pipeline doesn’t check at least these, you don’t have data quality — you have hope.

Data that we can store, move, validate, and version. Features that are consistent between training and serving. Labels that are efficiently acquired. This covers the operational side. But BiteBridge now has multiple teams — a recommendation team, a fraud team, a search team — and they’re all wondering the same thing: “does the data I need already exist somewhere?”

Finding Your Data: Data Catalogs

At a five-person company, you find data by asking the one engineer who built the pipeline. At fifty people, you ask around on Slack. At five hundred, nobody knows where anything is, teams build redundant pipelines because they don’t know the data already exists, and half of the undocumented tables in the warehouse are abandoned experiments that nobody dares delete.

A data catalog is a searchable inventory of every dataset, table, feature, and model in the organization. It stores metadata: what columns a table has, who owns it, when it was last updated, what pipelines produce it, what models consume it. Think of it as a library catalog, except the books are datasets and the catalog also knows which books were derived from which other books (that’s data lineage).

For BiteBridge, a catalog means the fraud team can search “user order count” and discover that the recommendation team already computes exactly that feature. No duplicate pipeline. No diverging definitions. The major open-source catalogs are DataHub (originally from LinkedIn, strong on lineage and ML metadata), OpenMetadata (newer, excellent governance features), and Amundsen (from Lyft, focused on search and discovery). All three track datasets, tables, features, and models as first-class entities, with lineage graphs showing how data flows from source to consumption.

I’ll be honest — data catalogs feel like the least exciting part of data infrastructure. They don’t make models faster or more accurate. But every organization I’ve seen past about 50 data practitioners hits a wall where the cost of not knowing what data exists exceeds the cost of maintaining a catalog. It’s the kind of infrastructure you regret not having when you need it.

The catalog helps teams find and share data. But some data can’t be shared freely. BiteBridge knows where its users live, what they eat, how often they order. That data is powerful for recommendations and deeply personal. The final piece of data infrastructure is protecting it.

Protecting Your Data: Privacy

BiteBridge’s recommendation model needs to learn from user behavior. But user behavior is private. “This person orders from the vegan restaurant every Friday” reveals dietary preferences. “This person ordered at 3 AM from a restaurant near a hospital” reveals more than anyone consented to share. Training a model on private data risks leaking that information, either through the model’s outputs or through the training data itself being accessed by someone who shouldn’t see it.

Two complementary techniques address this: differential privacy (protect data during computation) and federated learning (protect data during collection).

Differential privacy adds carefully calibrated random noise to computations so that the output looks almost the same whether or not any single individual’s data is included. The intuition: imagine surveying people about an embarrassing question. Instead of asking directly, you have each person flip a coin privately. If heads, they answer truthfully. If tails, they answer randomly. Any individual response is meaningless — “maybe they were telling the truth, maybe they flipped tails” — but in aggregate, with enough responses, the true statistics emerge from the noise.
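
That coin-flip scheme is called randomized response, and it's easy to simulate. The 30% true rate and the sample size below are made up:

# Randomized response: each individual answer is deniable, the aggregate is recoverable
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
true_answers = rng.random(n) < 0.30        # 30% of people would truthfully answer "yes"

heads = rng.random(n) < 0.5                # private coin flip per person
random_answers = rng.random(n) < 0.5       # answer randomly on tails
reported = np.where(heads, true_answers, random_answers)

# P(report yes) = 0.5 * true_rate + 0.5 * 0.5, so invert the formula:
estimated_true_rate = 2 * reported.mean() - 0.5
print(f"True rate: 0.30, estimate from noisy reports: {estimated_true_rate:.3f}")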

The formal guarantee is controlled by a parameter called epsilon (ε), the privacy budget. Smaller epsilon means more noise and stronger privacy. Larger epsilon means less noise and more accurate results. This is the fundamental tradeoff, and I’m still developing my intuition for what “good” epsilon values look like in practice. Apple uses ε values around 1-4 for their on-device data collection. The U.S. Census Bureau used ε around 19 for the 2020 census (which was controversial among privacy researchers for being too high).

# Differential privacy: adding noise to a simple aggregate
import numpy as np

def private_mean(values, epsilon=1.0):
    """Compute mean with differential privacy guarantee."""
    true_mean = np.mean(values)
    # Real DP requires value bounds fixed in advance (data-independent);
    # deriving them from the data, as here, is a simplification for illustration.
    sensitivity = (max(values) - min(values)) / len(values)
    # Laplace noise scaled by sensitivity / epsilon
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_mean + noise

# BiteBridge: private average order value
order_values = [12.50, 18.00, 9.75, 22.00, 15.50]
private_avg = private_mean(order_values, epsilon=1.0)
# True mean: 15.55, Private mean: ~15.55 ± some noise
# Any single user's order value is protected

Each query against the data spends a bit of the privacy budget. Once the budget is exhausted, you can’t safely release more information. This is why differential privacy requires thinking about the total set of analyses you’ll run, not individual queries.
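
The simplest accounting rule, sequential composition, just adds up the epsilons of everything you run against the same data. A toy ledger (the queries and budget values are illustrative):

# Toy privacy-budget ledger using basic sequential composition
TOTAL_BUDGET = 1.0

planned_queries = {
    "avg_order_value": 0.4,
    "orders_per_user_histogram": 0.4,
    "late_night_order_rate": 0.3,   # this one would push us over budget
}

spent = 0.0
for name, eps in planned_queries.items():
    if spent + eps > TOTAL_BUDGET:
        print(f"Refusing {name}: it would spend {spent + eps:.1f} of a {TOTAL_BUDGET} budget")
        continue
    spent += eps
    print(f"Ran {name} with epsilon={eps}, total spent so far: {spent:.1f}")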

Federated learning takes a different approach entirely: the data never leaves the user’s device. Instead of sending BiteBridge’s users’ order histories to a central server, each user’s phone trains a local model on its own data and sends only the model updates (gradients or weight deltas) to the server. The server aggregates updates from many users to improve the global model, but never sees any individual’s data. Google pioneered this for the Gboard keyboard — the model learns from millions of users’ typing patterns without Google ever seeing what anyone typed.
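
Here's a minimal sketch of one flavor of this, FedAvg-style averaging, with a linear model represented as a NumPy weight vector. The client datasets and hyperparameters are simulated stand-ins for real on-device data:

# One round of federated averaging: raw data never leaves the simulated "devices"
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """A client fine-tunes the global model on its own private data."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient step
        w -= lr * grad
    return w

def federated_round(global_weights, client_datasets):
    """The server only ever sees weight vectors, never the underlying orders."""
    updates = [local_update(global_weights, X, y) for X, y in client_datasets]
    return np.mean(updates, axis=0)

# Three simulated clients, each with a tiny private dataset
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
weights = np.zeros(3)
for _ in range(10):
    weights = federated_round(weights, clients)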

The combination of federated learning and differential privacy is powerful: data stays on-device (federated), and even the model updates are noised before leaving the device (differential privacy). This double protection is the gold standard for privacy-sensitive ML, used in production by Apple, Google, and Meta.

My favorite thing about differential privacy is its candor: the math pins down exactly what a given epsilon guarantees, yet no one is fully certain how to choose the “right” epsilon for a given application. It remains an active area of research, and the practical guidance is still evolving. That admitted uncertainty is honest; it reflects the genuine difficulty of balancing utility and privacy.

Wrap-Up

If you’re still with me, thank you. I hope it was worth the journey through the plumbing.

We started with a three-restaurant delivery app and a question: where do we put the data? That led us from data lakes to warehouses to lakehouses. We built pipelines (batch and streaming) to move data, then orchestrators to keep the pipelines honest. We discovered training-serving skew and used feature stores to kill it. We versioned data with DVC so we could reproduce any experiment. We built a labeling system that uses active learning to spend annotation budgets wisely. We put quality gates in every pipeline with Great Expectations and TFDV. We cataloged our datasets so teams could find and share them. And we protected user privacy with differential privacy and federated learning.

My hope is that the next time you look at an ML system and feel the pull to tinker with the model architecture, you’ll first ask: “is the data infrastructure solid?” Because in my experience, fixing the plumbing has always delivered more impact than building a fancier faucet.

Resources