Nice to Know

Chapter 13: ML Systems & Production · Section 9 of 9

I avoided this collection of topics for a while because each one felt like it belonged to someone else's job. Technical debt? That's for the platform team. Regulations? That's legal. Carbon footprint? That's a policy discussion. Then I spent a year shipping models into production, and every single one of these things bit me personally. Here is that dive.

These are the unglamorous realities of production ML — the things that don't show up in model architecture papers but absolutely determine whether your system survives contact with the real world. Technical debt that grows silently, data errors that compound like interest, feedback loops that quietly go haywire, regulations that reshape what you're allowed to build, and team dynamics that can tank a project faster than any bug.

Before we start, a heads-up. We'll be touching on research papers, regulatory frameworks, and organizational dynamics. You don't need background in any of them. We'll build up what we need as we go.

This isn't a short journey, but I hope you'll be glad you came.

ML Technical Debt
Data Cascades
Feedback Loops Gone Wrong
The AI Incident Database
Rest Stop
Chaos Engineering for ML
The Carbon Cost of ML
The Regulatory Landscape
ML Team Anti-Patterns
Build vs. Buy
Resources

ML Technical Debt

In 2015, D. Sculley and colleagues at Google published a paper called "Hidden Technical Debt in Machine Learning Systems." It became one of the most cited papers in ML engineering — not because it introduced a new algorithm, but because it named the thing everyone was feeling. The paper's central diagram is almost comically blunt: it shows the "ML Code" as a tiny black rectangle in the middle of a massive system, surrounded by enormous blocks labeled configuration, data collection, feature extraction, serving infrastructure, and monitoring. The model code — the thing everyone obsesses over — is the smallest piece.

Let's make this concrete with a running example. Imagine you're building a fraud detection system at a mid-size fintech company. You start with a notebook, a clean dataset, and an XGBoost model that gets 0.96 AUC. Beautiful. Ship it. Six months later, the system is a different beast entirely.

The first form of debt the paper names is glue code — the duct tape holding your system together. Your fraud model doesn't exist in isolation. It needs to pull features from a transaction database, call an external IP geolocation API, read from a feature store that a different team maintains, and push predictions into a downstream rules engine. None of these systems speak the same language. So you write glue. Converters, adapters, serializers, retry logic. In the fraud system, you end up with roughly 2,000 lines of model code and 15,000 lines of glue. The Sculley paper found this ratio is typical — and often worse.
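
To make "glue" concrete, here's a hypothetical adapter of the kind you end up writing dozens of: it wraps the external geolocation API, converts the response into the shape the feature pipeline expects, and retries on failure. The endpoint and field names are invented for illustration, not taken from any real service.

```python
# One small piece of glue: adapt an external geolocation API to our feature format.
# The URL and response fields are hypothetical.
import time
import requests

def fetch_ip_country(ip: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            resp = requests.get(f"https://geo.example.com/v1/lookup/{ip}", timeout=2)
            resp.raise_for_status()
            # Upstream returns {"countryCode": "DE"}; downstream wants lowercase ISO-2.
            return resp.json()["countryCode"].lower()
        except (requests.RequestException, KeyError):
            time.sleep(2 ** attempt)   # exponential backoff, then give up
    return "unknown"                   # downstream rules engine treats this as missing
```

Multiply this by every system boundary in the diagram and the 15,000-line figure stops looking surprising.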

The second form is pipeline jungles. Over months, your fraud system accumulates data pipelines: one for real-time transaction features, one for historical aggregates computed nightly, one for the external enrichment data, one that a colleague wrote to handle a special case during the holidays and never removed. Each pipeline has its own schedule, its own failure modes, its own implicit assumptions about data format. Nobody has a complete mental model of how they all interact. When the nightly aggregate job runs 20 minutes late one Tuesday, three downstream pipelines silently consume stale data and nobody notices for a week.
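
A freshness check on the consumer side is the kind of small guard that prevents that silent-staleness failure. Here's a minimal sketch, assuming the nightly aggregates land as files and a six-hour staleness SLA; both the paths and the SLA are illustrative.

```python
# Refuse to consume stale aggregates instead of silently reading old data.
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_STALENESS = timedelta(hours=6)  # illustrative SLA

def assert_fresh(partition_path: str) -> None:
    mtime = datetime.fromtimestamp(Path(partition_path).stat().st_mtime, tz=timezone.utc)
    age = datetime.now(timezone.utc) - mtime
    if age > MAX_STALENESS:
        raise RuntimeError(
            f"{partition_path} is {age} old, exceeds freshness SLA of {MAX_STALENESS}"
        )
```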

Then there's configuration debt. The fraud model has dozens of configuration parameters that live outside the code: feature thresholds, sampling rates, model version pointers, serving timeouts, fallback logic. These rarely get the same rigor as code — no code review, no version control, sometimes no documentation. The paper points out that a single misconfigured threshold can silently degrade a model's performance by 20% and go undetected for months, because nobody thought to monitor the config values themselves.
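
One cheap countermeasure is to validate configuration at deploy time with the same seriousness you'd give code. A minimal sketch follows; the config keys, bounds, and feature-store lookup are hypothetical.

```python
# Validate config values and feature references before a deploy, not after an incident.
import json

EXPECTED_BOUNDS = {
    "fraud_score_threshold": (0.0, 1.0),
    "sampling_rate": (0.001, 1.0),
    "serving_timeout_ms": (10, 2000),
}

def validate_config(path: str, feature_store_columns: set[str]) -> dict:
    with open(path) as f:
        config = json.load(f)

    errors = []
    for key, (lo, hi) in EXPECTED_BOUNDS.items():
        value = config.get(key)
        if not isinstance(value, (int, float)) or not (lo <= value <= hi):
            errors.append(f"{key}={value!r} outside expected range [{lo}, {hi}]")

    # Catch the "renamed feature store column" failure mode described below.
    for feature in config.get("features", []):
        if feature not in feature_store_columns:
            errors.append(f"feature {feature!r} not found in feature store")

    if errors:
        raise ValueError("config validation failed:\n" + "\n".join(errors))
    return config
```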

I'll be honest — the first time I read that paper, I thought it was exaggerating. Then I inherited a production ML system and found configuration files that hadn't been updated in two years, still pointing to a feature store column that had been renamed. The model was running fine, pulling zeros for that feature, and compensating with other signals. Technical debt isn't dramatic. It's quiet.

The uncomfortable takeaway: you don't accumulate ML technical debt through carelessness. You accumulate it by doing exactly what's rational in the moment — shipping fast, reusing what's available, deferring cleanup. The debt compounds not in proportion to time, but in proportion to how many other systems interact with yours. And in production ML, everything interacts with everything.

Data Cascades

If technical debt is the slow rot of code and configuration, data cascades are the slow rot of the thing the code operates on. In 2021, Sambasivan and colleagues at Google published a study interviewing 53 ML practitioners across high-stakes domains — healthcare, agriculture, conservation. They coined the term data cascades: compounding events where data quality issues in one stage of the ML pipeline propagate and amplify through subsequent stages, creating problems that are difficult to trace back to their origin.

The title of the paper tells the whole story: "Everyone wants to do the model work, not the data work."

Back to our fraud detection system. Suppose the transaction data you're ingesting has a subtle issue: for transactions in certain European countries, the currency field occasionally arrives as an empty string instead of the ISO currency code. This happens for maybe 0.3% of transactions. Your data pipeline, being robust, fills in a default — USD. Now your feature that computes "transaction amount relative to average for this currency" is slightly wrong for those transactions. Not catastrophically wrong. Wrong enough that the model learns a subtle bias: European transactions look slightly more like fraud than they actually are, because their amounts appear anomalous relative to the wrong currency baseline.

That 0.3% error in the raw data cascades through feature engineering, training, and prediction. Six months later, European fraud analysts notice the model flags their region disproportionately. They file a ticket. An engineer investigates. Three weeks of detective work later, someone traces it back to that currency field default. The original data quality issue was invisible — no test caught it, no monitor flagged it, no dashboard showed it — because it was technically a valid value.
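
A narrow schema check at ingestion would have turned that three-week investigation into a failed pipeline run on day one. Here's a minimal sketch, assuming the transactions arrive as a pandas DataFrame with a currency column; the column name and the abbreviated currency list are illustrative (in practice you'd load the full ISO 4217 set).

```python
# Fail loudly on invalid currency codes instead of silently defaulting to USD.
import pandas as pd

VALID_CURRENCIES = {"USD", "EUR", "GBP", "CHF", "SEK", "PLN"}  # abbreviated for illustration

def check_currency_column(transactions: pd.DataFrame) -> None:
    currency = transactions["currency"].fillna("")
    bad = ~currency.isin(VALID_CURRENCIES)
    if bad.any():
        raise ValueError(
            f"{bad.sum()} transactions ({bad.mean():.2%}) have missing or unknown "
            f"currency codes, e.g. {currency[bad].unique()[:5].tolist()}"
        )
```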

The Sambasivan study found that 92% of the practitioners they interviewed had experienced data cascades. Not once. Repeatedly. The cascades were almost always discovered late, usually by downstream users noticing something "felt off" about the model's behavior. The time from a cascade's origin to its discovery was typically weeks to months.

The deeper problem is cultural. The study found a persistent hierarchy in ML teams where data work — collection, cleaning, labeling, validation — is treated as lower-status activity. The "real" work is architecture and training. This means the people closest to data quality issues often lack the organizational power to delay a launch over a data concern. I've watched this happen in real time: a data engineer raises a flag about inconsistent labels, gets told "we'll clean it up in v2," and v2 never comes because the model is already in production and producing revenue.

This isn't a technology problem. It's a values problem. And it shows up in the hiring pipeline too — teams will interview for months to find the right senior researcher, then hand the data labeling to the cheapest vendor they can find.

Feedback Loops Gone Wrong

A feedback loop in ML is what happens when a model's predictions influence the data that future versions of the model will train on. In the best case, this creates a virtuous cycle — the data flywheel we discussed earlier in this chapter. In the worst case, the model starts reinforcing its own mistakes until they become self-fulfilling prophecies.

The most chilling example is predictive policing. A model is trained on historical arrest data to predict where crimes will occur. Police are dispatched to those predicted locations. More arrests happen there — not necessarily because more crime occurs there, but because more officers are looking. That new arrest data confirms the model's predictions, so the next model version doubles down on those neighborhoods. The cycle accelerates. Entire communities get locked into a loop of over-policing that originated not from actual crime patterns, but from historical enforcement bias baked into the training data.

This isn't hypothetical. When researchers simulated a predictive policing algorithm on Oakland, California's drug-crime records, the system consistently directed officers to predominantly Black and low-income neighborhoods, not because crime rates were objectively higher, but because those neighborhoods had historically heavier police presence — and therefore more documented incidents.

Recommendation systems have a milder but pervasive version of the same problem. Consider a music streaming platform. A user listens to a jazz track once, maybe by accident. The recommender notes this and suggests more jazz. The user, seeing jazz everywhere, listens to another track — maybe out of curiosity, maybe because nothing else is being shown. The recommender interprets this as strong signal. Within weeks, this person's entire feed is jazz. Their actual musical taste hasn't changed, but their recommendation profile has been captured by a filter bubble — a self-reinforcing loop where the system's suggestions shape the very behavior the system measures.

YouTube's recommendation algorithm faced public scrutiny for a more dangerous version of this: users who watched one mildly political video would get recommended increasingly extreme content, because extreme content drove higher engagement metrics, and higher engagement confirmed the recommendation. The algorithm wasn't designed to radicalize anyone. It was designed to maximize watch time, and radicalization happened to be an effective strategy for that objective.

Our fraud system isn't immune either. Suppose the model flags certain transaction patterns aggressively, and human reviewers — trusting the model — confirm those flags at a high rate. The model learns that those patterns are indeed fraudulent (because its predictions keep getting confirmed), and flags them even more aggressively next cycle. Meanwhile, patterns the model misses never get reviewed, never get labeled as fraud, and therefore never appear in the training data. The model develops blind spots that it can never correct because it never gets to see its own mistakes.
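
One common mitigation is to reserve a small exploration budget: route a random sample of transactions to human review regardless of the model's score, so the labeled data includes cases the model didn't flag. A minimal sketch, with an illustrative threshold and a 1% exploration rate:

```python
# Mix model-driven review with a random audit of unflagged traffic.
import random

REVIEW_THRESHOLD = 0.9   # score above which the model always flags (illustrative)
EXPLORATION_RATE = 0.01  # fraction of everything else reviewed anyway (illustrative)

def route_for_review(fraud_score: float) -> bool:
    if fraud_score >= REVIEW_THRESHOLD:
        return True                              # model-driven review
    return random.random() < EXPLORATION_RATE    # label the model's blind spots
```

The exploration traffic costs reviewer time, which is exactly the point: you pay a small, known cost to keep the training data honest about what the model misses.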

I'm still developing my intuition for how to systematically detect feedback loops before they cause damage. The honest answer is that most organizations discover them only after something visibly breaks — a news article, a customer complaint, an internal audit that reveals the bias has been compounding for months.

The AI Incident Database

Given how reliably ML systems fail in the real world, it's surprising that for most of the field's history there was no systematic way to learn from those failures. Aviation has the NTSB. Medicine has adverse event reporting. Software engineering has post-mortem culture. ML had... blog posts and Twitter threads.

The AI Incident Database (AIID), maintained by the Partnership on AI, was created to fill that gap. It's a public, searchable collection of incidents where AI systems caused or contributed to real-world harm. Anyone can submit an incident. Editors review and categorize each submission with metadata: the type of harm, the domain, the technology involved, the affected population.

Some entries are well-known. The COMPAS recidivism prediction system — a risk scoring tool used in the American criminal justice system — was found to systematically rate Black defendants as higher risk than they actually were, and white defendants as lower risk. Amazon built a resume screening tool that taught itself to penalize applications containing the word "women's" (as in "women's chess club") because it was trained on a decade of hiring data that skewed overwhelmingly male. And the algorithm behind Apple's credit card was widely reported to offer women systematically lower credit limits than their husbands, even in cases where the women had higher credit scores.

What makes the AIID valuable isn't that these individual stories are new — most were covered extensively in the press. The value is in the aggregate. When you browse hundreds of incidents, patterns emerge. A striking number involve the same failure mode: a model trained on historical data inherits historical biases, gets deployed at scale, and amplifies those biases in ways the developers never anticipated. Another common pattern: a system works well in testing but encounters edge cases in the real world that were never represented in the training or evaluation data.

The database also reveals how long incidents take to be discovered — often months or years. And how often the people most affected by the failure are the least empowered to report it. Criminal defendants don't file bug reports against sentencing algorithms.

I'll be honest — spending an hour browsing the AIID is a sobering experience. Not because the failures are exotic or unusual, but because they're so predictable in hindsight. Almost every one could have been caught with more diverse evaluation, more domain expertise in the room, or more attention to who the system was being deployed on rather than who it was being built for.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a solid grasp of the four biggest systemic risks in production ML: technical debt that accumulates silently, data quality issues that cascade through pipelines, feedback loops that make models reinforce their own mistakes, and a growing public record of real-world incidents that shows how predictable these failures are in hindsight.

That mental model is genuinely useful. It'll change how you evaluate production systems, how you think about monitoring, and what questions you ask during design reviews.

What's ahead is the operational side — how to proactively test for these failures (chaos engineering), the environmental cost of the systems we build, the regulatory frameworks that are reshaping what's permissible, and the human dynamics that make or break ML teams.

If the discomfort of not knowing what's underneath is nagging at you, read on.

Chaos Engineering for ML

Netflix broke things on purpose and called it engineering. In the early 2010s, they built Chaos Monkey — a tool that randomly terminates production servers to verify that the system can handle the loss. The philosophy behind it: if your system can't tolerate expected failures, you'd rather find that out on a Tuesday morning than during a holiday traffic spike.

ML systems have their own failure modes, and most of them are more subtle than a server going down. What happens to your fraud detection system when the feature store returns stale data? When the model serving endpoint is 500ms slower than usual? When a new deployment receives input data with a slightly different schema than expected? When a third-party API that provides enrichment features goes offline?

Chaos engineering for ML means deliberately injecting these failures in controlled conditions and observing what happens. Netflix extended their tools — ChAP (Chaos Automation Platform) and FIT (Failure Injection Testing) — to cover the systems supporting their recommendation models. They don't just test "does the server stay up?" They test "when the primary recommendation model is unavailable, does the system gracefully fall back to the simpler baseline model, and what's the user impact?"

The practice follows a structured loop. First, define your steady state — maybe "99.9% of recommendation requests complete in under 100ms." Then hypothesize a failure: "If the primary model server goes down, the fallback model will serve within 200ms." Inject the failure. Measure the result. If reality doesn't match the hypothesis, you've found a vulnerability before your users did.

For our fraud system, this might look like: deliberately feeding the model a batch of transactions with missing features to confirm the fallback logic activates correctly. Or throttling the model serving endpoint to simulate network degradation and verifying that the system queues transactions for review rather than silently dropping them. Or deploying a model version with corrupted weights to verify that the health check catches it before any predictions are served.
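
Here's what the first of those experiments might look like written as a test. The fraud_service fixture, its score_batch method, and the model names are assumptions about how such a service could be structured, not a real API.

```python
# A sketch of one chaos experiment as a pytest test: inject a missing feature
# and check that the fallback path, not the primary model, handles it.
import math

def test_missing_feature_triggers_fallback(fraud_service):
    # Hypothesis: a transaction missing a required feature is scored by the
    # rules-based fallback instead of being silently scored by the primary model.
    batch = [
        {"amount": 120.0, "currency": "EUR", "ip_country": None},  # injected failure
        {"amount": 40.0, "currency": "USD", "ip_country": "US"},   # healthy control
    ]
    degraded, control = fraud_service.score_batch(batch)

    assert degraded["model_used"] == "rules_baseline"      # fallback activated
    assert not math.isnan(degraded["score"])               # no garbage score served
    assert control["model_used"] == "xgboost_primary"      # healthy input unaffected
```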

Most ML teams I've talked to don't do any of this. They test that the model works. They don't test what happens when it doesn't. The gap between those two things is where production incidents live.

The Carbon Cost of ML

In 2019, Strubell and colleagues at UMass Amherst published a paper that put hard numbers on something the field had been vaguely uncomfortable about. Training a single large NLP model — a Transformer with architecture search — produced roughly 284 tonnes of CO₂ equivalent. That's about five times the lifetime emissions of an average American car, including the manufacturing.

Those numbers were from 2019. Models have gotten larger since.

The carbon cost of ML comes from two sources: training and inference. Training is the dramatic number — the one that makes headlines. But inference is the quieter, larger cost. A model gets trained once (or maybe retrained weekly), but it serves predictions millions of times a day, every day, for years. The aggregate energy consumption of inference across a fleet of models often exceeds the training cost by an order of magnitude.

For our fraud system, this isn't an abstract concern. Suppose the model processes 50 million transactions per day. Each inference call uses a fraction of a GPU-second, but multiplied by 50 million, that's real energy. Add the nightly retraining job, the data pipelines, the feature computation, the monitoring infrastructure — the system's total energy footprint is substantially larger than the model training alone.
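
A back-of-envelope calculation makes the point. Every number below is an illustrative assumption, not a measurement of any real system:

```python
# Rough inference-energy estimate under stated assumptions.
inferences_per_day = 50_000_000
joules_per_inference = 0.5        # assumed energy per scoring call (CPU, a few ms)
grid_kg_co2_per_kwh = 0.4         # assumed grid carbon intensity

kwh_per_year = inferences_per_day * joules_per_inference * 365 / 3.6e6
tonnes_co2_per_year = kwh_per_year * grid_kg_co2_per_kwh / 1000

print(f"~{kwh_per_year:,.0f} kWh/year, ~{tonnes_co2_per_year:.1f} tonnes CO2e/year")
# At these assumptions: inference alone, before retraining, pipelines,
# and monitoring are added to the total.
```

The absolute number matters less than the exercise: adding inference, retraining, pipelines, and monitoring into one figure is what nobody had done for the system I mention below.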

Tools like CodeCarbon (a Python library that estimates the CO₂ emissions of your compute) and ML CO2 Impact (a web calculator) make it possible to measure this. Carbon-aware scheduling takes it further — timing your training jobs to run when the electrical grid in your data center's region is drawing more from renewable sources. Training a model at 2 AM in a region powered by wind farms produces meaningfully less carbon than training it at 6 PM in a region running on natural gas peakers.
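
Wrapping the nightly retrain with CodeCarbon looks roughly like this. The project name and the training function are placeholders, and the tracker's options may differ between library versions, so treat this as a shape rather than a recipe.

```python
# Estimate the emissions of a training run with CodeCarbon (pip install codecarbon).
from codecarbon import EmissionsTracker

def train_fraud_model():
    ...  # stand-in for the real training job

tracker = EmissionsTracker(project_name="fraud-model-nightly-retrain")
tracker.start()
try:
    train_fraud_model()
finally:
    emissions_kg = tracker.stop()  # estimated kilograms of CO2-equivalent
    print(f"Estimated emissions for this run: {emissions_kg:.3f} kg CO2e")
```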

I'll admit this is a topic where my thinking has evolved. A few years ago, I would have filed "ML carbon footprint" under "nice sentiment but not my problem." Then I saw the numbers for a production system I was responsible for, and the total annual compute bill — which is a rough proxy for energy consumption — was more than I expected by a factor of four. Most of that was inference, not training. Nobody had ever added it up before because no one thought to ask.

The field is slowly moving toward requiring that papers report not only accuracy but also compute cost and estimated emissions. This is analogous to how drug trials now report not only efficacy but also side effects. The question isn't whether your model is accurate. It's whether the accuracy is worth the cost.

The Regulatory Landscape

For most of ML's history, the regulatory environment was essentially "ship whatever you want and hope for the best." That era is ending. Two regulatory frameworks are reshaping production ML right now, and understanding them isn't a nice-to-have — it's a prerequisite for building systems that won't get you or your company into legal trouble.

The EU AI Act, formally adopted in 2024, classifies AI systems into four risk tiers. At the top: unacceptable risk — systems that are outright banned. These include social scoring by governments, real-time biometric surveillance in public spaces (with narrow exceptions), manipulation of vulnerable populations, and certain forms of predictive policing. If your system falls here, there is no compliance pathway. You don't build it.

The next tier is high risk, and this is where most production ML teams need to pay attention. High-risk systems include AI used in hiring, credit scoring, education, law enforcement, critical infrastructure, and medical devices. If your fraud detection system makes automated decisions about people's access to financial services, it may qualify; the Act carves out a narrow exception for systems used purely to detect financial fraud, so the classification depends on exactly how the system is used. High-risk systems face mandatory requirements: formal risk assessments, high-quality training data with documented provenance, human oversight mechanisms, technical documentation sufficient for third-party audit, and ongoing monitoring after deployment. Noncompliance carries heavy fines: the Act's maximum penalties reach €35 million or 7% of global revenue, whichever is higher, with violations of the high-risk requirements in a lower tier (up to €15 million or 3%).

The remaining tiers are limited risk (transparency obligations — tell users they're interacting with AI) and minimal risk (essentially unregulated — most research and open-source tools fall here).

On the other side of the Atlantic, the FDA has been developing its regulatory framework for ML in healthcare through the lens of Software as a Medical Device (SaMD). If your ML model diagnoses, monitors, or treats a medical condition, it's a medical device and needs FDA clearance. The classification follows the standard Class I / II / III structure, with most ML-based SaMDs falling into Class II. The interesting regulatory challenge is what happens when a model updates itself — traditional devices don't change after approval, but ML models retrain on new data. The FDA's response has been to develop Algorithm Change Protocols: pre-approved plans that describe how a model will be updated, what performance thresholds must be maintained, and how changes will be validated. The model can evolve, but only within the bounds the protocol defines.

Both frameworks share a common thread: they don't regulate the technology itself. They regulate the impact of the technology on people. Your model can use whatever architecture it wants. But if it makes decisions about people's health, freedom, employment, or financial access, you need to demonstrate that it does so responsibly, transparently, and with accountability.

My favorite thing about this regulatory landscape is how much of it is still genuinely uncertain. The EU AI Act was passed, but many of its provisions won't be enforced until 2025-2027, and interpretive guidance is still being written. The FDA's approach to continuously learning models is evolving in real time. If you're building production ML in a regulated domain, part of your job is monitoring the regulations themselves, because they're a moving target.

ML Team Anti-Patterns

Every production ML failure I've seen — every single one — was ultimately a people problem wearing a technology costume. The model didn't fail because gradient descent stopped working. It failed because the team was organized in a way that made failure inevitable.

The most common anti-pattern is what I call "all researchers, no plumbers." The team is composed entirely of data scientists and ML researchers with PhDs. They build beautiful models in notebooks. Nobody can deploy them. Nobody wants to deploy them, because deployment is "engineering work" and beneath the team's self-image. The models rot in notebooks while the company waits. In our fraud system scenario, this looks like a team that can achieve state-of-the-art AUC on the benchmark dataset but can't get a model into production without a three-month engineering effort by a different team.

The inverse is equally destructive: "all plumbers, no researchers." A team of software engineers who can build beautiful infrastructure but don't have the statistical grounding to know when a model is subtly broken. They'll ship a model that achieves 95% accuracy on the test set without noticing that the test set has a distribution shift from production, or that the 5% the model gets wrong is concentrated entirely in one demographic group.

Then there's premature specialization. A startup with 8 people hires a dedicated computer vision specialist, a dedicated NLP specialist, and a dedicated recommendation systems specialist — before the company has even validated which ML problem matters most for their product. Six months later, the product pivots to a different ML problem, and two of those three specialists are doing work they weren't hired for. Early-stage teams need generalists who can handle whatever ML problem emerges. Specialization comes later, after the problem space stabilizes.

No MLOps representation is the silent killer. A team of 10 has 8 data scientists, 1 engineering manager, and 1 backend engineer. Nobody owns the deployment pipeline, monitoring infrastructure, data validation, or model versioning. These tasks get done ad hoc by whoever has capacity, which means they get done poorly or not at all. By the time the team realizes they need dedicated MLOps expertise, the technical debt is already substantial.

The final anti-pattern is organizational isolation. The ML team sits in its own corner, separate from product, engineering, design, and domain experts. They build what they think is needed, not what is needed. In our fraud system, this means the ML team optimizes for AUC while the fraud analysts care about precision at the top of the ranked list. The ML team measures success differently from the people who actually use the model, and nobody notices because they don't talk to each other regularly enough.

The fix for all of these is unsexy: cross-functional teams with clear ownership, a healthy balance of research and engineering talent, and enough organizational proximity to the end users that the team can't accidentally build the wrong thing. I've never seen this done perfectly. I've seen it done well enough.

Build vs. Buy

At some point, every ML team faces this question: do we build our own ML infrastructure, or do we buy an existing platform? The answer is always "it depends," and the thing it depends on is almost never what people think.

The instinct — especially in teams with strong engineering talent — is to build. Custom solutions feel more flexible. They're tailored to your exact needs. And there's an undeniable appeal to owning the whole stack. But custom infrastructure carries a maintenance tax that compounds over time. Every component you build is a component you maintain, debug, upgrade, and staff. That feature store you built in-house works great until the engineer who designed it leaves and nobody else understands the custom serialization format.

The vendor route — AWS SageMaker, Google Vertex AI, Databricks, MLflow as managed service — trades flexibility for speed and reduced maintenance burden. You get a working system in weeks instead of months. But you also get someone else's opinions about how ML should be done, baked into the platform's abstractions. When your use case doesn't fit those abstractions, you're fighting the platform instead of using it.

For our fraud system, the decision matrix looks like this. The data pipeline is mostly standard ETL with some real-time streaming — buying this (Kafka managed service, cloud data warehouse) is usually right. Feature engineering is domain-specific and tightly coupled to fraud patterns your team has developed — building custom feature computation logic is usually right. Model training is mostly standard supervised learning — buying a managed training service is reasonable. Model serving with low-latency requirements and custom fallback logic — this often requires building, because vendor serving solutions rarely support the exact failure modes you need to handle.

The practical answer, for most teams, is a hybrid: buy the undifferentiated heavy lifting (compute, storage, orchestration, experiment tracking), build the components that are core to your competitive advantage or that require domain-specific customization. The mistake most teams make isn't choosing build or buy — it's failing to reevaluate the decision as the team and the product evolve. The custom feature store that made sense when you had 3 features and 1 model becomes a liability when you have 200 features and 15 models. The vendor platform that seemed constraining at 2 engineers becomes essential at 20.

One heuristic I keep coming back to: if the component you're considering building would be boring — if it doesn't involve any proprietary logic, domain-specific innovation, or competitive advantage — buy it. Save your engineering hours for the things that make your system uniquely good at its job.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with the silent accumulation of technical debt in ML systems — the glue code, pipeline jungles, and configuration rot that grow beneath the surface. We followed data cascades through pipelines, watched feedback loops turn models into self-fulfilling prophecies, and walked through the sobering catalog of real-world failures in the AI Incident Database. We talked about deliberately breaking things with chaos engineering, confronted the carbon cost of the systems we build, navigated the emerging regulatory landscape across the EU and FDA, examined the team dynamics that make or break production ML, and wrestled with the build-vs-buy decision that every team faces.

My hope is that the next time you encounter one of these topics in a design review, a conference talk, or a late-night production incident, instead of nodding along and hoping nobody asks you to elaborate, you'll have a genuine mental model of what's happening under the hood — and a pretty good sense of which rabbit holes are worth diving into deeper.

Resources

"Hidden Technical Debt in Machine Learning Systems" by Sculley et al. (2015) — the O.G. paper on ML technical debt. Short, readable, and it will make you look at every production ML system differently.

"Everyone wants to do the model work, not the data work" by Sambasivan et al. (2021) — the data cascades paper. Insightful interviews with real practitioners in high-stakes domains. The title alone is worth the citation.

The AI Incident Database at incidentdatabase.ai — maintained by the Partnership on AI. Browse it for an hour. It's the most effective way to develop intuition for how ML systems fail in the real world.

"Energy and Policy Considerations for Deep Learning in NLP" by Strubell et al. (2019) — the paper that put hard numbers on ML's carbon footprint. Sparked an important conversation about compute costs that's still ongoing.

The EU AI Act full text — dense regulatory language, but the risk classification framework is wildly important for anyone building production ML that touches European users.

Netflix Tech Blog: Chaos Engineering — their posts on FIT and ChAP are the best practical guides to applying chaos engineering principles to systems that include ML components.