Data Quality & Preparation

Chapter 3: Data Fundamentals
Labeling · Augmentation · Validation · Drift

I spent an embarrassing amount of time tuning hyperparameters on a sentiment classifier before discovering that roughly eight percent of the training labels were wrong. Not subtle edge cases — outright flipped labels. Positive reviews marked negative. Sarcastic comments marked positive. I swapped in a fancier architecture. I tried curriculum learning. I adjusted class weights. Nothing moved the needle beyond 87% accuracy. Then a colleague suggested we audit a random sample of the labels. Three days of manual review and re-labeling later, the same baseline model hit 94%. That was the week data quality stopped being someone else's problem for me. This chapter is the deep dive that followed.

Data quality and preparation is the unglamorous machinery that sits between raw data and a trained model. It encompasses everything from how labels are created and validated, to how training examples are synthetically expanded, to how you catch silent data corruption before it poisons a production system. The field has matured rapidly — tools like Snorkel, Cleanlab, Great Expectations, and Evidently AI have turned what used to be ad-hoc scripts into principled frameworks.

Before we start, a heads-up. We'll be writing Python, computing some statistics, and working with a few libraries you may not have seen before. You don't need to know any of them beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

The Labeling Problem
Writing Annotation Guidelines That Actually Work
Measuring Annotator Agreement
Weak Supervision — When You Can't Afford to Label Everything
Active Learning — Labeling the Examples That Matter Most
Finding Bad Labels After the Fact
Data Augmentation — Manufacturing Training Signal
Mixing Strategies and Text Augmentation
Test-Time Augmentation
Rest Stop
Data Validation — Unit Tests for DataFrames
Drift Detection — When Production Stops Looking Like Training
Data Contracts
Wrap-Up
Resources

The Labeling Problem

Imagine you're building a spam classifier for a small email startup. You have 50,000 unlabeled emails sitting in a database. Your model needs to learn the difference between spam and not-spam, and for that it needs labels — a human judgment attached to each email saying which category it belongs to. That label is the ground truth your model tries to approximate. Without it, supervised learning doesn't start.

Think of labels as ingredients in a kitchen. A brilliant chef with rotten tomatoes produces a terrible sauce. Doesn't matter how precise the knife work is, doesn't matter how perfectly calibrated the stove — the ingredient quality sets a ceiling that technique can never exceed. The same thing happens with models. Your architecture, your optimizer, your learning rate schedule — all of those are the chef's skill. But if the labels are wrong, the model is learning to reproduce mistakes. I'll be honest: this analogy felt overly simple to me until I lived through it. Now it's the first thing I check.

The core tension is this: labels are expensive. A trained annotator might label 200 emails per hour. At that rate, labeling 50,000 emails takes 250 hours of human labor. And that's for a relatively easy binary task. Medical image annotation, where a radiologist examines each scan? That's orders of magnitude slower and more expensive. So the entire field of data preparation revolves around a question: how do we get the most accurate labels for the least cost?

Writing Annotation Guidelines That Actually Work

The most underrated step in any labeling project is writing the rulebook before anyone starts labeling. These are called annotation guidelines — a document that tells annotators exactly how to make decisions, especially on ambiguous cases.

Back to our spam classifier. "Label each email as spam or not-spam" sounds clear enough. But then an annotator encounters a marketing email from a service the user actually signed up for. Is that spam? What about a phishing email that looks like a legitimate bank notification? What about a newsletter the user hasn't opened in six months? Without explicit rules for these boundary cases, every annotator makes a different call. The labels become a reflection of individual judgment rather than a consistent task definition.

Good guidelines include three things: positive examples for each class ("this IS spam because..."), negative examples that look like they might belong but don't ("this looks like spam but isn't because the user opted in"), and an escalation process for genuine ambiguity ("if you can't decide after 30 seconds, flag it for review"). I've found that spending a single day writing thorough guidelines saves weeks of re-labeling later. That's not an exaggeration — I've seen labeling projects restart from scratch because no one wrote guidelines up front.

On the tooling side: Label Studio is open-source and handles images, text, audio, and video. Prodigy, built by the spaCy team, is NLP-optimized with built-in active learning. CVAT is best for bounding boxes and video tracking. The choice depends on your data modality and whether you need a managed workforce.

Measuring Annotator Agreement

If one person labels your entire dataset, you have a problem you can't see. Maybe that person labels conservatively and misses real spam. Maybe they're aggressive and flag legitimate emails. You'd never know. The fix is to have at least two annotators label the same subset of examples, then measure how often they agree.

Raw agreement percentage is tempting but misleading. If 95% of emails are not-spam, two annotators who both guess "not-spam" every time agree 95% of the time — and they've done nothing useful. We need a metric that corrects for chance agreement, and that metric is Cohen's Kappa.

The formula is κ = (p_observed − p_expected) / (1 − p_expected). The numerator measures how much better the annotators agree compared to pure chance, and the denominator normalizes it so perfect agreement gives κ = 1. A κ below 0.60 tells you the task definition or guidelines need work — annotators are making different judgment calls too often. Above 0.80, you're in strong territory. Between those, you're in a gray zone where improving guidelines might still help.

from sklearn.metrics import cohen_kappa_score

annotator_1 = ['spam','spam','ham','spam','ham','ham','spam','spam','ham','spam']
annotator_2 = ['spam','spam','ham','ham','ham','ham','spam','spam','ham','spam']

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.3f}")  # 0.800

That code compares two lists of label decisions, element by element. The score of 0.80 here tells us the annotators agree substantially beyond what chance would predict. When you have three or more annotators, Fleiss' Kappa generalizes the same idea — it still measures agreement corrected for chance, but across N raters instead of two.
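
If you want to compute it, statsmodels has an implementation. Here's a minimal sketch with made-up ratings from three annotators, encoded as 1 for spam and 0 for ham:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are items, columns are annotators; 1 = spam, 0 = ham (made-up ratings)
ratings = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
])

# aggregate_raters turns (items x raters) codes into (items x categories) counts
table, _ = aggregate_raters(ratings)
print(f"Fleiss' Kappa: {fleiss_kappa(table):.3f}")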

There's a quality control trick worth knowing: honeypot examples. These are items with known correct answers that you slip into the annotation queue without telling the annotators. If someone consistently gets honeypots wrong, their entire batch needs review. Pair this with an adjudication process — two annotators label everything, auto-accept where they agree, and route disagreements to a senior reviewer. The combination catches errors without doubling your costs.
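
The routing logic itself is simple enough to sketch in a few lines of plain Python, reusing the two annotator lists from above:

def adjudicate(labels_a, labels_b):
    # Auto-accept where the two annotators agree, route everything else to a senior reviewer
    accepted, needs_review = [], []
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        if a == b:
            accepted.append((i, a))
        else:
            needs_review.append(i)
    return accepted, needs_review

accepted, disputed = adjudicate(annotator_1, annotator_2)
print(f"Auto-accepted: {len(accepted)}, sent to review: {len(disputed)}")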

Weak Supervision — When You Can't Afford to Label Everything

Let's return to our 50,000 emails. Even with great guidelines, manual labeling is slow. What if, instead of labeling each email by hand, we wrote small programs that guess labels using heuristics? Each program would be noisy and incomplete on its own — maybe it only handles a narrow slice of cases — but what if we combined dozens of these noisy guesses into something useful?

That's the core idea behind weak supervision, and Snorkel is the framework that made it practical. You write labeling functions — small Python functions that encode rules, keyword matches, pattern heuristics, or even the output of a cheap external model. Each function can vote spam, vote ham, or abstain (say nothing) for any given email.

from snorkel.labeling import labeling_function

SPAM, HAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_free_money(x):
    # Emails mentioning "free money" are almost certainly spam
    return SPAM if "free money" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_email(x):
    # Very short emails from contacts tend to be legitimate
    return HAM if len(x.text.split()) < 5 else ABSTAIN

@labeling_function()
def lf_known_sender(x):
    # Emails from the user's contact list are ham
    return HAM if x.sender in known_contacts else ABSTAIN

@labeling_function()
def lf_suspicious_url(x):
    # Emails with shortened URLs are likely spam
    return SPAM if "bit.ly" in x.text or "tinyurl" in x.text else ABSTAIN

Each labeling function is intentionally narrow. lf_contains_free_money only fires on emails containing that exact phrase — it abstains on everything else. lf_known_sender only knows about the contact list — it says nothing about emails from strangers. Individually, each function is incomplete and noisy. The power comes from combining them.

Snorkel's LabelModel takes all the votes from all your labeling functions and learns which functions to trust. Functions that agree with each other and with high-confidence examples get more weight. Functions that contradict reliable ones get discounted. The output is a set of probabilistic labels — not hard spam/ham decisions, but confidence scores like "82% likely spam." You can use these directly as soft labels or threshold them into hard labels for training.
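
Wiring the pieces together looks roughly like this. It's a sketch, assuming your emails sit in a pandas DataFrame called df_emails with text and sender columns:

from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

lfs = [lf_contains_free_money, lf_short_email, lf_known_sender, lf_suspicious_url]

# Run every labeling function on every email; the result is an (n_emails x n_lfs) vote matrix
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_emails)

# Learn how much to trust each function from its agreements and conflicts with the others
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=42)

# Probabilistic labels: each row is [P(ham), P(spam)]
probs = label_model.predict_proba(L_train)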

The first time I saw Snorkel's noisy, heuristic-generated labels outperform labels I'd carefully curated by hand, I didn't believe the result. I re-ran the experiment twice. The explanation is statistical: twenty imperfect signals, properly weighted, can outperform a small set of perfect signals when the imperfect signals cover far more data. Production teams have reported 70% reductions in manual labeling effort using this approach.

The tradeoff is real though. Weak supervision works beautifully for tasks with clear heuristic signals — spam detection, information extraction, medical coding. For highly subjective tasks like sarcasm detection or aesthetic judgment, where the "rules" are too subtle to encode in functions, it produces unreliable labels. Know when to use it and when to invest in manual annotation.

Active Learning — Labeling the Examples That Matter Most

Here's a different angle on the labeling cost problem. You have a million unlabeled emails and budget for 10,000 labels. Which 10,000 do you label? Random sampling is the naive approach, but it wastes budget on easy examples — emails that are obviously spam or obviously legitimate. The model doesn't learn much from those.

Active learning flips the selection process. Instead of picking random emails to label, you ask the model which examples it's most confused about, then label those first. Think of it this way: if the model is already 99% sure an email is spam, labeling that email teaches it almost nothing. But if the model is stuck at 52% confidence — essentially a coin flip — that email is sitting right on the decision boundary, and labeling it provides maximum information.

The simplest version of this is uncertainty sampling. Train an initial model on whatever labels you have, run it on the unlabeled pool, and pick the examples where the model's confidence is lowest.

import numpy as np

def uncertainty_sample(model, X_unlabeled, n=100):
    probs = model.predict_proba(X_unlabeled)
    # For each example, how uncertain is the model?
    # If max probability is 0.51, the model is nearly guessing
    uncertainty = 1 - np.max(probs, axis=1)
    # Return indices of the n most uncertain examples
    return np.argsort(uncertainty)[-n:]

The predict_proba call gives us the model's confidence for each class. Subtracting the maximum from 1 gives us a measure of confusion — high values mean the model is torn between classes. We sort by that confusion and take the top n.

The full loop looks like this: train a model on your initial labeled set, score the unlabeled pool, select the most uncertain examples, send them to annotators, add the new labels to your training set, retrain, and repeat. Each cycle, the model gets smarter about exactly the examples it was struggling with.
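
Here's a sketch of that loop. The label_examples helper is hypothetical; it stands in for whatever annotation tool or workforce you route queries to:

import numpy as np

# X_labeled / y_labeled: the seed set; X_unlabeled: the pool (assumed to exist already)
for round_num in range(10):
    model.fit(X_labeled, y_labeled)

    # Ask the current model which examples it finds most confusing
    query_idx = uncertainty_sample(model, X_unlabeled, n=100)

    # label_examples is a hypothetical helper: route the queries to your annotators
    new_labels = label_examples(X_unlabeled[query_idx])

    # Fold the fresh labels in and shrink the pool before the next round
    X_labeled = np.concatenate([X_labeled, X_unlabeled[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_unlabeled = np.delete(X_unlabeled, query_idx, axis=0)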

A more sophisticated variant is query by committee: train multiple models (an "ensemble" or "committee") and pick examples where the committee members disagree most. If three random forests and two neural networks all disagree about whether an email is spam, that email is genuinely ambiguous and worth labeling.
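
A sketch of the committee idea, assuming integer class labels, using disagreement with the majority vote as the selection score:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def committee_sample(X_labeled, y_labeled, X_unlabeled, n=100):
    committee = [
        RandomForestClassifier(n_estimators=100),
        GradientBoostingClassifier(),
        LogisticRegression(max_iter=1000),
    ]
    # Each committee member votes on every unlabeled example
    votes = np.array([m.fit(X_labeled, y_labeled).predict(X_unlabeled) for m in committee])

    # Disagreement = fraction of members that differ from the per-example majority vote
    majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    disagreement = (votes != majority).mean(axis=0)
    return np.argsort(disagreement)[-n:]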

Real-world results are striking. Teams using active learning routinely reach the same accuracy with 50–80% fewer labels compared to random sampling. In one medical imaging study, uncertainty sampling matched the performance of 450 random labels using only 200 targeted ones — a 55% reduction in annotation effort. The catch is infrastructure. You need a pipeline that can retrain models, score unlabeled pools, and route selected examples to annotators in a tight loop. That's engineering work, but the label savings usually justify it.

Finding Bad Labels After the Fact

Sometimes the damage is already done. You've collected labels — maybe through crowdsourcing, maybe through a legacy process — and you suspect some fraction of them are wrong. Auditing all of them manually is too expensive. Can you find the bad ones automatically?

This is exactly what Cleanlab does, built on a method called Confident Learning. The core idea is elegant: if you train a model and it consistently predicts a different class than the given label with high confidence, that label is probably wrong. The model isn't confused — the label is.

The workflow is straightforward. Train a classifier on your (noisy) labeled dataset. Use cross-validation to get out-of-sample predicted probabilities for every example — this avoids the model memorizing its own training labels. Then pass those probabilities and the given labels to Cleanlab, which estimates a joint distribution of "true label" vs. "given label" and flags the likely errors.
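
Getting those out-of-fold probabilities is a one-liner with scikit-learn. A sketch, assuming a feature matrix X and the noisy labels from before:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Every example is scored by a model trained on folds that never saw its label
predicted_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

With predicted_probs in hand, Cleanlab can do its comparison: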

from cleanlab import Datalab

# predicted_probs: shape (n_samples, n_classes) from cross-validated model
# labels: the given (possibly noisy) labels

lab = Datalab(data={"labels": labels}, label_name="labels")
lab.find_issues(pred_probs=predicted_probs)

# Get indices of likely label errors, ranked by severity
issue_df = lab.get_issues()
errors = issue_df[issue_df["is_label_issue"]].sort_values("label_score", ascending=True)
print(f"Found {len(errors)} likely label errors out of {len(labels)} examples")

The find_issues method looks at each example and compares the model's predicted probability distribution against the given label. If the model assigns 95% probability to "ham" but the label says "spam," that's a strong signal the label is wrong. Cleanlab ranks these by severity so you can review the worst offenders first.

In practice, real-world datasets often contain 5–10% label errors. Fixing even the worst 2–3% typically produces a noticeable accuracy boost — sometimes more than any architecture change would give you. I still get a knot in my stomach thinking about the months I spent debugging a model when the problem was sitting right there in the labels.

Data Augmentation — Manufacturing Training Signal

Let's shift gears from label quality to data quantity. Even with perfect labels, more training data usually helps. But collecting and labeling new data is expensive. Data augmentation offers a shortcut: create new training examples by applying transformations to existing ones.

Going back to our kitchen analogy — augmentation is like a chef who can't get more tomatoes but learns to slice them ten different ways. Each slice presents the same tomato from a different angle, and a model trained on all those angles learns to recognize "tomato" regardless of how it's cut. Technically, augmentation acts as a regularizer. It smooths the loss landscape in a similar way to dropout or weight decay, but from the data side instead of the model side.

Consider a model trained on 1,000 photos of cats. Without augmentation, it memorizes "tabby on blue couch" — the specific pixels, the specific background. Flip the image horizontally and the model hesitates. It learned the couch, not the cat. Augmentation forces the model to find features that survive transformations: the shape of ears, the pattern of whiskers, the curve of the spine. Those are the features that generalize.

import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=15),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

RandomResizedCrop crops a random region of the image and resizes it — forcing the model to recognize objects at different scales and positions. RandomHorizontalFlip mirrors the image with 50% probability. RandomRotation tilts it slightly. ColorJitter randomly shifts brightness, contrast, saturation, and hue — because the same cat under fluorescent light looks different than under sunlight, but it's still a cat. Normalize at the end standardizes pixel values to what pretrained models expect.

One critical rule: never augment your validation or test set. Those must reflect the real data distribution. If you augment them, your evaluation metrics become meaningless — you'd be measuring how well the model handles synthetic transformations, not real-world inputs.
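
In practice that means keeping a separate, fully deterministic transform for evaluation, with resizing and normalization only, nothing random:

import torchvision.transforms as T

eval_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])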

Mixing Strategies and Text Augmentation

Standard transforms modify a single image. Mixing strategies go further — they combine two images into one and mix their labels proportionally.

CutMix takes a rectangular patch from one image and pastes it onto another. If the patch covers 30% of the image area, the label becomes 70% "original class" + 30% "patch class." This forces the network to attend to all regions of the image, not fixate on a single discriminative area. It's particularly effective for tasks where occlusion matters — object detection, segmentation — because the model learns to make predictions even when part of the object is obscured.

MixUp takes a different approach: it linearly blends two entire images together (and their labels by the same ratio), producing ghostly superimpositions. This creates smoother decision boundaries and acts as a strong regularizer against overconfidence. In practice, MixUp is the better choice for classification with noisy labels, while CutMix excels when you need localization. Both can be combined, and modern training recipes like those used for Vision Transformers routinely stack both.
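
The mechanics of MixUp fit in a few lines. Here's a sketch of a single training step, assuming a batch of images x and integer labels y:

import numpy as np
import torch
import torch.nn.functional as F

def mixup_step(model, x, y, alpha=0.2):
    # Mixing ratio drawn from a Beta distribution; alpha controls how aggressive the blend is
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))

    # Blend each image with a randomly chosen partner from the same batch
    mixed_x = lam * x + (1 - lam) * x[perm]
    logits = model(mixed_x)

    # Mix the losses by the same ratio instead of building explicit soft labels
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])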

For text, augmentation is trickier because language has structure that random perturbation can destroy. Back-translation — translating to another language and back — produces the highest-quality paraphrases but is slow. Synonym replacement via WordNet is fast but shallow. LLM paraphrasing (asking a language model to rephrase) gives the best quality-to-effort ratio in 2024, though at a per-example cost that adds up.
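
To make the fast-but-shallow end concrete, here's a sketch of synonym replacement using NLTK's WordNet (it assumes you've run nltk.download('wordnet') once):

import random
from nltk.corpus import wordnet

def synonym_replace(sentence, n=2):
    words = sentence.split()
    for _ in range(n):
        idx = random.randrange(len(words))
        synsets = wordnet.synsets(words[idx])
        if synsets:
            # Grab a lemma from the first synset as a crude synonym
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
            words[idx] = random.choice(lemmas)
    return " ".join(words)

print(synonym_replace("the delivery was quick and the product works great"))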

For tabular data, augmentation is less mature. SMOTE synthesizes minority-class examples by interpolating between neighbors — useful for class imbalance. Gaussian noise injection adds small random perturbations to numeric features. Both act as regularizers, though neither is as transformative as image augmentation tends to be.
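
A sketch with the imbalanced-learn library, assuming a feature matrix X and labels y where the positive class is rare:

from imblearn.over_sampling import SMOTE

# Synthesize new minority-class rows by interpolating between nearest neighbors
X_resampled, y_resampled = SMOTE(k_neighbors=5).fit_resample(X, y)
print(f"Rows before: {len(y)}, after: {len(y_resampled)}")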

The One Rule of Augmentation

Every transformation must preserve the label. Flipping a car horizontally? Fine — a car is still a car. Rotating a handwritten "6" by 180°? You've turned it into a "9." The transformation broke the label. If you're unsure whether a transform preserves your label, don't apply it. This requires domain knowledge, and there's no shortcut around that.

Test-Time Augmentation

Augmentation isn't limited to training. At inference time, you can run the same input through several augmented versions — a few different crops, a horizontal flip, a slight rotation — and average the model's predictions. This is called test-time augmentation (TTA), and it typically buys you a 1–2% accuracy improvement with no retraining required.

import torch

model.eval()  # make sure dropout and batch norm are in inference mode
predictions = []
with torch.no_grad():
    for _ in range(5):
        # Each pass applies a fresh random augmentation to the same input
        augmented = train_transform(image)
        pred = model(augmented.unsqueeze(0))
        predictions.append(pred)
final_pred = torch.stack(predictions).mean(dim=0)

Each pass through the loop applies a different random augmentation. The five predictions represent the model's judgment from five slightly different perspectives, and averaging them cancels out noise. The tradeoff is inference latency — five forward passes take five times as long. Use TTA when accuracy matters more than speed, like in medical diagnosis or competition submissions.

Rest Stop

Congratulations on making it this far. If you want to stop here, you can. You now understand labeling strategies (manual, weak supervision, active learning), how to audit label quality with Cleanlab, and how data augmentation manufactures training signal. That's a solid mental model for data preparation, and it covers about 70% of what you'll use in practice.

What we haven't covered yet is what happens after you deploy. Your data is clean today, but production data changes. Columns shift format, distributions drift, upstream providers alter their schemas. The remaining sections cover how to catch these problems before they poison your model. The short version: write automated checks for your data the same way you write unit tests for your code.

But if the thought of silent data corruption creeping into your production system makes you uncomfortable, read on.

Data Validation — Unit Tests for DataFrames

Here's a scenario that still haunts me. A team spends three months building an elaborate recommendation model. They deploy it. Accuracy craters within two weeks. The root cause? An upstream service changed a date column from YYYY-MM-DD to DD/MM/YYYY. The model was fine. The data wasn't. And nobody had a single check in place to catch it.

Fixing a data issue at ingestion costs almost nothing — a schema check, a range assertion, a null count. Fixing it after the model has been retrained on corrupted data and pushed to production? That costs incident response, rollbacks, and lost trust. The entire field of data validation exists to catch problems early, before they compound. Think of it as building inspection for our kitchen — checking the ingredients before the chef starts cooking, not after the dish is served.

Great Expectations is the most established framework for this. You write declarative "expectations" about what your data should look like — the data equivalent of unit tests.

import great_expectations as gx

context = gx.get_context()

# Get a Validator for your DataFrame (the exact call varies by GX version; this is the fluent API)
validator = context.sources.pandas_default.read_dataframe(dataframe=df)

# Each expectation is a single assertion about your data
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
validator.expect_column_values_to_be_in_set(
    "status", ["active", "inactive", "pending"]
)
validator.expect_table_row_count_to_be_between(min_value=1000, max_value=500000)

results = validator.validate()

Each expectation reads like a sentence: "expect column values to not be null," "expect column values to be between 0 and 150." When validate() runs, it checks every expectation and returns a pass/fail report with diagnostics for each failure. Expectations are stored as JSON — version-controllable, shareable across teams, and executable in CI/CD pipelines. Checkpoints in Great Expectations let you wire these checks into Airflow, Prefect, or any orchestrator so that a pipeline step fails loudly if the data doesn't meet expectations.

Pandera offers a lighter alternative when you don't need the full Great Expectations infrastructure — no HTML reports, no checkpoints, no data docs. Instead, you get a Python-native schema that validates a DataFrame in one line.

import pandera as pa

schema = pa.DataFrameSchema({
    "age": pa.Column(int, pa.Check.in_range(0, 150)),
    "income": pa.Column(float, pa.Check.greater_than(0)),
    "status": pa.Column(str, pa.Check.isin(["active", "inactive"])),
})

validated_df = schema.validate(df)  # raises SchemaError on failure

Use Great Expectations when you need a team-wide data quality platform with documentation and monitoring. Use Pandera when you need a quick schema guard inside a training script. Both are better than no validation at all, which is where a surprising number of production ML systems still sit.

Drift Detection — When Production Stops Looking Like Training

Data drift is when the statistical properties of your input data change over time. Customer behavior shifts with seasons. A sensor degrades gradually. A data provider quietly changes their schema. Your model was trained on one distribution, and production is silently serving it a different one. The model doesn't crash — it politely produces worse and worse predictions, and nobody notices until a business metric drops.

There are three standard ways to detect drift in numeric features. The Kolmogorov-Smirnov (KS) test measures the maximum difference between two cumulative distribution functions — your training data's distribution and your production data's distribution. A large KS statistic (with a small p-value) means the distributions look different. The Population Stability Index (PSI) bins both distributions and computes a divergence score: PSI below 0.1 means no meaningful drift, between 0.1 and 0.2 is moderate, and above 0.2 signals significant drift that needs investigation. PSI is particularly common in financial services and credit scoring. For categorical features, the Chi-square test compares observed category frequencies against expected ones.

from scipy.stats import ks_2samp

stat, p_value = ks_2samp(train_df["age"], prod_df["age"])
if p_value < 0.05:
    print(f"Drift detected in 'age' (KS stat={stat:.3f}, p={p_value:.4f})")

The ks_2samp function takes two arrays of values — the feature's distribution in training and production — and returns a test statistic and p-value. A p-value below 0.05 means the two distributions are statistically different at the 95% confidence level. In practice, you'd run this for every feature and flag the ones that drift.
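
PSI isn't in scipy, but it's short enough to write yourself. A sketch using ten quantile bins taken from the training distribution:

import numpy as np

def psi(reference, current, bins=10):
    # Bin edges come from the reference (training) distribution
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)

    # Clip so empty bins don't blow up the log
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))

print(f"PSI for 'age': {psi(train_df['age'], prod_df['age']):.3f}")  # above 0.2: investigate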

Evidently AI wraps these statistical tests (and more) into a monitoring framework purpose-built for ML. You pass it a reference dataset and a current dataset, and it generates reports showing which features drifted, by how much, and whether the target distribution shifted too.

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")

I'm still developing my intuition for where to set drift thresholds in production. Too sensitive and you drown in false alarms every time normal seasonal variation occurs. Too lenient and you miss genuine distribution shifts until they cause real damage. The teams I've seen handle this well run drift checks daily, alert on sustained drift (multiple days, not a single spike), and pair statistical detection with business metric monitoring. If both the KS test and your conversion rate move at the same time, something real is happening.

Data Contracts

Drift detection catches problems after they happen. Data contracts try to prevent them. A data contract is a formal agreement between a data producer (the team that generates a table, an API, or an event stream) and a data consumer (the team that uses it for training or inference). It specifies what columns exist, what types they have, what ranges are valid, and what the update cadence is — essentially a schema plus an SLA.

This is a newer concept, growing out of the data mesh movement that treats data as a product owned by domain teams. The idea is that if the upstream team wants to change a column format, the contract requires them to notify consumers and negotiate the change — instead of silently breaking downstream ML pipelines.

In practice, data contracts are still maturing. Some teams implement them as Protobuf or Avro schemas in a central registry. Others use Great Expectations suites as executable contracts. The tooling is fragmented, but the principle is sound: make data expectations explicit, shared, and enforced at the boundary between teams. It's the data equivalent of API versioning, and once you've been burned by a silent schema change at 2am, you become a convert.

Data Quality in Practice

These tools work best when you treat data tests as first-class citizens in your pipeline — not afterthoughts, but hard gates that block training if the data is broken.

def train_pipeline():
    df = load_data()

    # Gate 1: Schema validation — reject if structure is wrong
    schema.validate(df)

    # Gate 2: Drift check — warn or block if distribution shifted
    check_drift(reference_df, df, threshold=0.2)

    # Gate 3: Label audit — flag suspicious labels for review
    audit_labels(df, model_probs)

    # Only after all gates pass: preprocess and train
    model = train(preprocess(df))

Each gate catches a different class of problem. Schema validation catches structural issues — missing columns, wrong types, unexpected nulls. Drift detection catches distributional issues — the data is structurally correct but statistically different from what the model expects. Label auditing catches annotation issues — the data looks normal but some labels are wrong. Together, they form a layered defense that catches problems at the earliest, cheapest point of intervention.
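
For concreteness, here's a sketch of what a drift gate like check_drift might contain: loop over numeric features, run the KS test, and fail loudly so corrupted data never reaches training:

from scipy.stats import ks_2samp

def check_drift(reference_df, current_df, threshold=0.2, alpha=0.05):
    drifted = []
    for col in reference_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(reference_df[col], current_df[col])
        if p_value < alpha and stat > threshold:
            drifted.append(col)
    if drifted:
        # Raise so the pipeline step fails instead of silently training on shifted data
        raise ValueError(f"Drift detected in: {drifted}")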

Wrap-Up

If you're still with me, thank you. I hope it was worth the journey.

We started with a question that sounds trivial — how do you attach a label to a piece of data? — and ended up building an entire defense system. We wrote annotation guidelines, measured annotator agreement with Cohen's Kappa, used Snorkel's labeling functions to scale labeling without scaling humans, deployed active learning to spend our annotation budget where it matters most, audited existing labels with Cleanlab, manufactured training signal through augmentation, and set up validation gates and drift detection to protect production systems from silent data corruption.

My hope is that the next time you find yourself reaching for a more complex model architecture to squeeze out another percentage point of accuracy, you'll pause and ask a different question first: have I checked the data? Because in my experience — and I learned this the hard way — the answer to "why isn't my model performing better?" is almost always sitting in the data, not the model.

Resources