Data Cleaning & Feature Engineering

Chapter 3: Data Fundamentals (Section 3 of 10)
TL;DR

Data cleaning and feature engineering are where the real model performance lives. Not in architecture choices, not in hyperparameter tuning — in the work you do before any of that begins. Cleaning decides what your model sees. Feature engineering decides how it sees it. A logistic regression with well-crafted features will crush a neural network fed raw garbage. Every imputation, encoding, and transformation you apply is a modeling decision, whether you realize it or not.

I avoided data cleaning for longer than I'd like to admit. It felt like the unglamorous part of machine learning — the dishes before the cooking, the part nobody writes blog posts about. Every time a tutorial said "we assume the data is already clean," I'd nod and move on to the model. Eventually, the discomfort of watching my models fail on real data — data with holes, weird types, and values that made no sense — grew too great to ignore. This section is the dive I kept putting off.

Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete records in a dataset. Feature engineering is the process of using domain knowledge to create new variables — or transform existing ones — so that patterns become visible to algorithms. Together, they account for the majority of performance gains in tabular machine learning. This isn't a new insight. Kaggle grandmasters have been saying it for over a decade: the model is secondary to the features you feed it.

Before we start, a heads-up. We're going to be working with pandas DataFrames, scikit-learn transformers, and a bit of numpy. But you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

The mess we start with
Missing data — why values vanish and what to do about it
Outliers — errors vs. rare truths
Feature scaling — when scale lies to your model
Taming skewed distributions
Rest stop
Categorical encoding — translating words into numbers without lying
Creating features — where the real gains live
Data leakage — the silent model killer
Wrap-up

The Mess We Start With

Imagine we're building a model to predict monthly rent for apartments. We've scraped listings from six neighborhoods and assembled a small dataset. Here's what arrives on our desk:

import pandas as pd
import numpy as np

data = {
    "sqft":          [850, 1200, np.nan, 640, 3400, 900],
    "neighborhood":  ["Midtown", "midtown", "Eastside", "Westpark", "Eastside", "MIDTOWN"],
    "has_parking":   ["yes", "Yes", "no", 1, "no", "YES"],
    "listed_date":   ["2024-01-15", "2024-03-22", "2024-06-01", "2024-02-10", "2024-07-14", "2024-01-30"],
    "rent":          [1800, 2400, 1600, 1100, 15, 1950]
}
df = pd.DataFrame(data)

Six rows. Already a disaster. One apartment has no square footage recorded. "Midtown" appears three different ways. The parking column is a mix of strings and integers. And one apartment is listed at $15/month — either a typo or the deal of the century.

This is the natural state of real-world data. I used to think messy data was a sign that someone upstream messed up. I've since learned it's the default. Clean data is the exception. The question is never "is this data messy?" — it's "how messy, and in what ways?"

Think of it like cooking. You wouldn't toss unwashed, uncut vegetables into a pan and hope for the best. The prep work — washing, peeling, dicing — isn't the exciting part, but skip it and the whole dish falls apart. Data cleaning is that prep work. Feature engineering is choosing which spices to add and in what combinations. We'll come back to our little apartment dataset throughout this journey.

Missing Data — Why Values Vanish and What To Do About It

Our apartment at row 2 has no square footage. Before reaching for an imputer, we need to ask a question that most tutorials skip: why is this value missing? The answer determines which fix is safe and which will silently poison your model.

There's a framework for this, introduced by statistician Donald Rubin in the 1970s, and it classifies missing data into three categories.

MCAR — Missing Completely at Random. The missingness has nothing to do with the apartment or its rent. A web scraper timed out. A sensor glitched. Pure bad luck. If your data is MCAR, dropping those rows is safe — you lose statistical power but introduce no bias. It's like randomly tearing pages out of a phone book. You have less information, but what remains is still representative.

MAR — Missing at Random. The missingness depends on other observed variables, not the missing value itself. Maybe older listings are less likely to include square footage because the listing platform didn't require it before 2023. The data isn't missing by accident — there's a pattern — but it's a pattern we can see and account for. Condition your imputation on those related variables and you're in reasonable shape.

MNAR — Missing Not at Random. This is the dangerous one. The missing value is related to the value itself. Tiny apartments might not list their square footage because the landlord knows it looks bad. High earners skip the income question on surveys. No statistical trick fully compensates for this. You can impute and add a missing indicator flag — a binary column that says "this value was missing" — so the model can at least learn from the pattern of missingness itself. But be honest about residual bias.
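
The indicator flag itself is a one-liner; the only rule worth remembering is to create it before any imputation runs, while the NaNs are still visible:

# Record which rows had no sqft before imputation overwrites the NaNs
df["sqft_was_missing"] = df["sqft"].isna().astype(int)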

I'll be honest — in practice, distinguishing MAR from MNAR often requires domain knowledge more than statistical tests. You can run Little's MCAR test to check whether data is plausibly MCAR, and you can visualize missingness patterns with heatmaps. But the final call on "why is this missing?" usually comes from understanding the data collection process, not from an algorithm.
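
Inspecting the pattern is cheap, though. A per-column count is one line of pandas, and a quick heatmap of the boolean mask (using seaborn here, which isn't otherwise part of this chapter) shows whether missing values cluster in particular rows or columns:

print(df.isna().sum())               # missing values per column

import seaborn as sns
sns.heatmap(df.isna(), cbar=False)   # one cell per value; missing cells stand out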

Back to our apartment. The missing sqft could be MCAR (scraper glitch), MAR (older listing format), or MNAR (tiny apartment hiding its size). Let's say we investigate and find it's a scraper issue — MCAR. With only one missing value out of six, we could drop the row. But let's try imputation instead, since in real datasets the proportion is rarely this clean.

from sklearn.impute import SimpleImputer

# Median is robust to outliers — the mean would be pulled by that 3400 sqft outlier
imputer = SimpleImputer(strategy="median")
df["sqft"] = imputer.fit_transform(df[["sqft"]])

Median imputation replaces the missing value with the middle value of the non-missing entries. It's blunt but robust — unlike the mean, it won't get dragged around by that 3,400 sqft penthouse.

When columns are correlated — when knowing the neighborhood and parking situation tells you something about the square footage — there's a more sophisticated approach called MICE (Multiple Imputation by Chained Equations). MICE works by treating each column with missing values as a prediction target, using all other columns as predictors. It cycles through every incomplete column, re-imputing each one using the latest filled-in values from the others, and repeats this for several iterations until the imputations stabilize. Think of it as a group of friends collectively filling in each other's blanks, with each pass getting more accurate as they learn from one another.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

mice = IterativeImputer(max_iter=10, random_state=42)
df[["sqft"]] = mice.fit_transform(df[["sqft", "rent"]])

MICE is powerful, but it's overkill when fewer than 5% of values are missing and there's no particular structure to the missingness. It's also computationally heavier and can fail to converge when there's too little data to train meaningful internal regressions.

One thing you must never impute: the target variable. If we filled in missing rent values with estimates, we'd be training on fabricated labels. The model would learn patterns that exist in our imagination, not in the real world. Drop those rows instead.
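
In pandas that's one line, since rent is our target column:

# Drop rows with a missing target instead of inventing labels for them
df = df.dropna(subset=["rent"])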

Outliers — Errors vs. Rare Truths

That $15/month apartment is staring at us. An outlier is either a data error or a real but rare observation, and your response should differ dramatically depending on which one it is.

The classic detection method is the IQR rule. Compute the interquartile range — the distance between the 25th and 75th percentiles — and flag anything more than 1.5× that distance beyond either end. It's distribution-free, meaning it doesn't assume your data is normally distributed.

Q1, Q3 = df["rent"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

outliers = df[(df["rent"] < lower_fence) | (df["rent"] > upper_fence)]

Our $15 apartment gets flagged immediately. But the IQR method has limitations I didn't appreciate for a long time. It struggles with multimodal data — if your apartments cluster into "studios" and "penthouses," the IQR treats the entire range as one group and can flag legitimate values in the smaller cluster. It also over-flags in heavily skewed distributions, which are exactly the distributions where outliers are most interesting.

The harder question is what to do once you've found them. If the $15 is a data entry error — someone typed 15 instead of 1,500 — remove or correct it. If that 3,400 sqft apartment is genuinely a penthouse, removal destroys real signal. Instead, you can Winsorize: cap values at a percentile boundary (say, the 1st and 99th), so extreme values get pulled inward without being deleted.

lower, upper = df["rent"].quantile([0.01, 0.99])
df["rent_clipped"] = df["rent"].clip(lower, upper)

Here's the nuance that matters in practice: your model type should influence your outlier strategy. Tree-based models (random forests, gradient boosting) handle outliers naturally — they split on thresholds, so an extreme value lands in its own bucket without distorting anything else. Linear models and distance-based models (k-NN, SVM) are a different story. A single extreme value can dominate the gradient or warp the distance metric. For those, transform or clip.

Going back to our kitchen analogy: removing an outlier is like throwing away an ingredient you're not sure about. Clipping is like trimming off the bruised part. Both are valid, depending on what you're dealing with. But if you throw away the saffron because it looks weird, you've ruined the dish.

Feature Scaling — When Scale Lies to Your Model

Suppose we add a "distance to downtown" feature to our apartments. Square footage ranges from 640 to 3,400. Distance ranges from 0.5 to 12 miles. If we feed these raw to a k-NN model, the distance calculation will be dominated by whichever feature has larger numbers — the model effectively ignores miles because the sqft numbers are hundreds of times bigger.

This isn't a flaw in the algorithm. The algorithm is doing exactly what we told it to: compute distances. The flaw is that we let the scale of measurement — an arbitrary human choice — influence the result.

StandardScaler subtracts the mean and divides by the standard deviation, centering every feature at zero with unit variance. It's the default for most situations, and it works well when features are roughly Gaussian (bell-shaped). MinMaxScaler squishes values to a [0, 1] range, which neural networks often prefer. RobustScaler uses the median and IQR instead of mean and standard deviation, making it resistant to those outliers we were worrying about earlier.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses train's mean and std

Notice something critical in that code: we fit the scaler on training data only, then use those learned parameters to transform both train and test. If we fit on the full dataset, the test set's statistics leak into training. This is a form of data leakage — we'll confront it head-on later.

Tree-based models don't care about feature scale at all. They make binary splits ("is sqft > 1,000?"), and that decision is invariant to whether sqft is in square feet or square meters. Don't waste time scaling for random forests or XGBoost.

Taming Skewed Distributions

Real-world numeric data loves long right tails. Rent prices, income, website traffic, word frequencies — they all pile up on the left and stretch far to the right. A few luxury apartments with $8,000 rents pull the mean away from where most of the data lives. Linear models struggle with this: a handful of extreme values dominates the squared-error loss, and the fitted line ends up chasing the tail instead of the bulk of the data.

The log transform compresses the right tail and stretches the left. After applying it, a distribution that was heavily skewed often looks close to Gaussian. The intuition: the difference between $1,000 and $2,000 rent matters more than the difference between $7,000 and $8,000. The log captures this — it cares about ratios rather than absolute differences.

# log1p handles zeros safely: log1p(x) = log(1 + x)
df["rent_log"] = np.log1p(df["rent"])

Use log1p instead of log when your data might contain zeros — log(0) is negative infinity, which tends to upset everything downstream. For data with negative values, the log is off the table entirely.

When you're not sure which power transform to use, Yeo-Johnson will find one automatically. It searches for the transformation parameter that makes the result closest to a normal distribution, and it handles zeros and negatives.

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method="yeo-johnson")
df[["rent_transformed"]] = pt.fit_transform(df[["rent"]])

I still occasionally get tripped up by where transforms belong in a pipeline. The key question: does the transform get fit before or after the train/test split? After. Always after. A plain log1p has nothing to learn, so its placement is harmless either way, but Yeo-Johnson learns its parameter from the data; like scaling, it should be fit on the training set only and then applied to both sets.
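
A pipeline makes the ordering a non-issue, because everything downstream of the split is fit on training data by construction. Here's a minimal sketch that chains a log1p step with a scaler, reusing the X_train/X_test names from the scaling example (assumed to hold numeric features). The log step has nothing to learn; the scaler is the part that genuinely needs train-only fitting:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

log_pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),   # parameter-free transform
    ("scale", StandardScaler())               # learns mean/std from training data only
])
X_train_t = log_pipe.fit_transform(X_train)
X_test_t = log_pipe.transform(X_test)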

Rest Stop

Congratulations on making it this far. If you want, you can stop here.

You now have a solid mental model for data cleaning: diagnose why data is missing before imputing, handle outliers based on whether they're errors or signal, scale features so models treat them fairly, and transform skewed distributions so linear models can work with them. That's enough to clean most datasets you'll encounter.

But cleaning is half the story. The data that comes out the other side is ready for a model, but it's not yet optimized for one. The next half — encoding categories and engineering new features — is where the real performance gains live. It's also where the most insidious bugs hide.

If the discomfort of not knowing what's underneath is nagging at you, read on.

Categorical Encoding — Translating Words Into Numbers Without Lying

Our neighborhood column has a problem we haven't addressed yet: "Midtown," "midtown," and "MIDTOWN" are three versions of the same thing. Before encoding, normalize your strings.

df["neighborhood"] = df["neighborhood"].str.strip().str.lower()
df["has_parking"] = df["has_parking"].astype(str).str.strip().str.lower()
df["has_parking"] = df["has_parking"].map({"yes": 1, "1": 1, "no": 0})

That's the boring-but-essential cleanup. Now for the encoding itself. Models eat numbers. Encoding is how we translate categories into numbers. And the encoding we choose is a modeling decision — a translation, and like any translation, it can preserve meaning or destroy it.

One-Hot Encoding creates a new binary column for each category. "Midtown" becomes [1, 0, 0], "Eastside" becomes [0, 1, 0], "Westpark" becomes [0, 0, 1]. No category is implied to be "closer" to another. It's the safest encoding for nominal data — categories with no inherent order.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
encoded = ohe.fit_transform(df[["neighborhood"]])

The drop='first' parameter drops one column to avoid multicollinearity — when all the one-hot columns sum to 1, any one of them is perfectly predictable from the others, which confuses linear models. I'll be honest — I still have to pause and think about whether I need drop='first' in a given situation. The rule: use it for linear/logistic regression, skip it for tree-based models (they don't care).

The problem with one-hot encoding is that it explodes with high-cardinality features. If your dataset has 50,000 unique zip codes, one-hot encoding creates 50,000 new columns. That's not encoding — that's a denial-of-service attack on your model.

Ordinal Encoding assigns integers: Midtown = 0, Eastside = 1, Westpark = 2. This works when categories have a genuine order — education levels (high school, bachelor's, master's, PhD), shirt sizes (S, M, L, XL). For our neighborhoods, it would imply that Westpark is "more" than Midtown, which is nonsense. The translation would be a lie.
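
For a genuinely ordered category, scikit-learn's OrdinalEncoder lets you spell out the order explicitly rather than trusting alphabetical luck. A small sketch with a hypothetical education column (not part of our apartment data):

from sklearn.preprocessing import OrdinalEncoder

edu = pd.DataFrame({"education": ["master's", "high school", "phd", "bachelor's"]})
order = [["high school", "bachelor's", "master's", "phd"]]   # one list per encoded column
enc = OrdinalEncoder(categories=order)
edu["education_enc"] = enc.fit_transform(edu[["education"]]).ravel()   # 2.0, 0.0, 3.0, 1.0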

Target Encoding replaces each category with the mean target value for that category. If Midtown apartments average $2,050/month, "Midtown" becomes 2050. It handles high cardinality elegantly and packs a lot of signal into a single column. But here's the trap: if you compute those means using the same data you're training on, you're leaking the target into the features. The model memorizes category-to-target mappings instead of learning generalizable patterns.

import category_encoders as ce

# Fit the encoder on the training split only; smoothing tempers rare-category means
# (out-of-fold encoding is stricter still)
te = ce.TargetEncoder(cols=["neighborhood"], smoothing=1.0)
X_train["neighborhood_enc"] = te.fit_transform(X_train["neighborhood"], y_train)
X_test["neighborhood_enc"] = te.transform(X_test["neighborhood"])

The smoothing parameter blends the category mean with the global mean. For rare categories — a neighborhood with only two listings — the category mean is unreliable, so smoothing pulls it toward the overall average. Without smoothing, rare categories get wildly overfit encodings.
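
To make the blend concrete, here's a sketch of one common smoothing formula: weight each category's mean by its count and the global mean by a pseudo-count m. Treat it as the idea rather than the exact implementation; category_encoders uses a slightly different sigmoid-based weighting internally, and m=10 here is just an illustrative value.

m = 10                                            # smoothing strength, chosen for illustration
global_mean = y_train.mean()
grouped = y_train.groupby(X_train["neighborhood"])
counts, means = grouped.count(), grouped.mean()
smoothed_encoding = (counts * means + m * global_mean) / (counts + m)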

Two more options for extreme cardinality: frequency encoding (replace each category with its count — works when frequency itself is informative, but two categories with the same count become indistinguishable) and the hashing trick (hash categories into a fixed number of buckets — irreversible and collision-prone, but handles streaming data and doesn't require knowing all categories upfront).
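
Frequency encoding needs no library at all; it's a value_counts and a map. (In a real pipeline, compute the counts on the training split only.)

freq = df["neighborhood"].value_counts()          # listings per neighborhood
df["neighborhood_freq"] = df["neighborhood"].map(freq)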

The translation analogy holds up: one-hot is a word-by-word literal translation — faithful but verbose. Target encoding is a contextual translation — compact but risks putting words in the author's mouth. Ordinal encoding is a translation that imposes grammatical rules from a different language. Choose based on what your data is actually saying.

Creating Features — Where the Real Gains Live

Here's an uncomfortable truth I resisted for a while: feature engineering often matters more than model choice. A logistic regression with well-crafted features will beat a neural network eating raw columns. I used to think "domain knowledge" was a handwavy excuse people used when they didn't want to learn fancier models. I was wrong. Domain knowledge, translated into features, is the single highest-leverage thing you can do in tabular ML.

Let's build some features for our apartment dataset.

Ratios and Interactions

Our dataset has square footage and rent. The model could learn the relationship between them on its own, but why make it dig for it? We can create price per square foot, which makes the pattern immediately obvious.

df["price_per_sqft"] = df["rent"] / df["sqft"]

This is the essence of feature engineering: take two columns that a model has to discover a relationship between, and hand it that relationship on a plate. Ratios, differences, products — these are the building blocks. In credit scoring, balance / credit_limit (utilization ratio) is one of the most predictive features. In e-commerce, revenue / visits (conversion value) often matters more than either component alone. One honest caveat about our example: rent is the target here, so price_per_sqft couldn't actually be fed to a model that predicts rent, since you wouldn't know the rent at prediction time. Treat it as an illustration of the mechanics; in a related task, say predicting how long a listing sits on the market, it would be fair game.

For automated interaction generation, scikit-learn offers PolynomialFeatures, but use it with caution. Fifty features at degree 2 creates 1,275 new quadratic columns (every pairwise product plus every square) on top of the originals. At degree 3, tens of thousands. It's a blunt instrument — pair it with strong regularization or feature selection, or it will drown your model in noise.
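
Here's what the expansion looks like in miniature, using two of our numeric columns purely to show the naming scheme:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["sqft", "price_per_sqft"]])
print(poly.get_feature_names_out())
# ['sqft' 'price_per_sqft' 'sqft^2' 'sqft price_per_sqft' 'price_per_sqft^2']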

Datetime Features

Our listed_date column is a string. In its current form, a model has no idea what to do with "2024-03-22." We need to extract the signals buried inside it.

df["listed_date"] = pd.to_datetime(df["listed_date"])
df["list_month"] = df["listed_date"].dt.month
df["list_dayofweek"] = df["listed_date"].dt.dayofweek
df["is_weekend_listing"] = df["list_dayofweek"].isin([5, 6]).astype(int)
df["days_on_market"] = (pd.Timestamp.now() - df["listed_date"]).dt.days

Month captures seasonality (rent spikes in summer). Day of week might capture listing strategy (do serious landlords list on weekdays?). Days on market captures freshness — stale listings might be overpriced.

But there's a subtlety with month and hour features that caught me off guard the first time I encountered it. December (month 12) and January (month 1) are one month apart, but numerically they're 11 apart. A model using raw month numbers thinks they're distant. Cyclical encoding fixes this by mapping the value onto a circle using sine and cosine.

df["month_sin"] = np.sin(2 * np.pi * df["list_month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["list_month"] / 12)

You need both sine and cosine — with only one, different months can map to the same value (sin is 0 for both month 6 and month 12). Together, they create a unique position on the circle for every month. The same trick works for hour-of-day, day-of-week, or anything else that wraps around.

Aggregation Features

Some of the most powerful features in tabular ML come from group-level statistics. What's the average rent in this neighborhood? How many listings does this neighborhood have? These inject context that individual rows can't provide on their own.

neighborhood_stats = df.groupby("neighborhood").agg(
    avg_rent=("rent", "mean"),
    listing_count=("rent", "count"),
    median_sqft=("sqft", "median")
)
df = df.merge(neighborhood_stats, on="neighborhood", how="left")

Now every apartment row carries information about its neighborhood's overall market. A 900 sqft apartment in a neighborhood where the median is 600 sqft looks very different from a 900 sqft apartment where the median is 2,000 sqft. The model no longer needs to discover this from scratch.

Kaggle winners use aggregation features relentlessly. In any problem with a grouping variable — user IDs, store IDs, product categories — computing per-group means, counts, standard deviations, and recency features is almost always worth trying.
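
One caution before copying the pattern: avg_rent is built from the target, so computing it over the full dataset before splitting quietly leaks test-set rents into the training features, the same trap as unguarded target encoding. A leakage-safe sketch, assuming df_train and df_test splits already exist:

# Compute group statistics on the training rows only, then attach them to both splits
train_stats = (
    df_train.groupby("neighborhood")
            .agg(avg_rent=("rent", "mean"), listing_count=("rent", "count"))
            .reset_index()
)
df_train = df_train.merge(train_stats, on="neighborhood", how="left")
df_test = df_test.merge(train_stats, on="neighborhood", how="left")   # unseen neighborhoods become NaN; impute or fall back to global stats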

Binning — Deliberate Information Loss

Sometimes the exact number matters less than the range it falls in. Domain experts don't think "this apartment is 847 square feet" — they think "it's a one-bedroom." Binning captures that mental model.

from sklearn.preprocessing import KBinsDiscretizer

binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df["sqft_bin"] = binner.fit_transform(df[["sqft"]])

Binning throws away information. That's the point, and it's also the danger. Use it when the relationship between feature and target is step-wise rather than smooth — when there are real thresholds ("apartments under 500 sqft are studios, 500–1000 are one-bedrooms"). Don't use it as a default. I've seen it applied mindlessly, destroying perfectly good continuous signal.

Data Leakage — The Silent Model Killer

I once spent three days debugging a model that had 99% accuracy on the test set and failed catastrophically in production. The culprit was data leakage — information from the future or the test set had seeped into the training process. The model wasn't learning patterns; it was cheating.

Leakage comes in three flavors, and they're all insidious because your metrics look great right up until deployment.

Train-test contamination is the most common. It happens when you preprocess — scale, encode, impute — using the full dataset before splitting. If you compute the mean for imputation using all the data, test set information leaks into training. The fix is mechanical: split first, then preprocess. Fit every transformer on training data only.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# The pipeline ensures transformers see only training data during fit
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

X_train_clean = pipe.fit_transform(X_train)
X_test_clean = pipe.transform(X_test)
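
Where the pipeline really earns its keep is inside cross-validation: handed to cross_val_score, the imputer and scaler are refit on each fold's training portion, so no fold ever peeks at its own validation rows. A minimal sketch, with Ridge standing in for whatever model you'd actually use:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", Ridge())
])
# Every fold refits the imputer and scaler on its own training portion
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")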

Target leakage is subtler. A feature that is derived from or strongly correlated with the target — something that wouldn't be available at prediction time — sneaks into the feature set. In our rental example, imagine we had a column "tenant_satisfaction_score" that's only recorded after someone rents the apartment. Using it to predict rent would produce a brilliant model that's useless in production, because you'd never have that score before the rent is set.

The test: for every feature, ask "would I have this value at the moment I need to make a prediction?" If the answer is no, it's leakage.

Temporal leakage plagues time-series data. If you randomly shuffle apartment listings across time and split into train/test, the model can see future listings during training. For any problem with a time dimension, split chronologically: train on the past, test on the future. Rolling statistics must look backward only — a 7-day rolling average that includes tomorrow's value is cheating.
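
Both rules are short in code. A sketch on our listings, with a made-up cutoff date standing in for whatever boundary your problem dictates:

df = df.sort_values("listed_date")

# Chronological split: the past trains, the future tests
cutoff = pd.Timestamp("2024-06-01")               # hypothetical cutoff
train = df[df["listed_date"] < cutoff]
test = df[df["listed_date"] >= cutoff]

# Backward-looking rolling average: shift(1) drops the current row, so each value
# summarizes only listings that came strictly before it
df["rent_rolling_mean_3"] = df["rent"].shift(1).rolling(window=3, min_periods=1).mean()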

Leakage is the scariest bug in machine learning because it rewards you with amazing numbers. Your validation accuracy soars. You celebrate. You deploy. And then the model falls apart on live data, and you spend days wondering what changed. What changed is nothing — the model was never actually good. It was memorizing information it should never have seen.

Going back to the kitchen one last time: leakage is like tasting the dish and rating it 10/10, but you forgot that you already knew the recipe. The taste test wasn't a test — it was a memory exercise. The only honest evaluation is one where the taster has never seen the recipe before.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with six messy apartment listings — missing values, inconsistent strings, a $15 outlier, and raw dates. We learned why data goes missing and how to fix it without introducing bias. We confronted outliers and decided when they're errors versus signal. We scaled features so models don't get fooled by units of measurement. We tamed skewed distributions. We translated categories into numbers — carefully, because the wrong translation is a lie. We built new features that make patterns obvious. And we looked data leakage in the eye, understanding why it makes models look brilliant until the moment they're actually tested.

My hope is that the next time you receive a raw dataset — the kind with NaN values scattered through it, categories spelled three different ways, and a column of dates stored as strings — you won't feel overwhelmed or skip straight to model selection. You'll know exactly what to do with it, because you'll have a solid mental model of what each cleaning and engineering decision means for the model downstream.

What You Should Now Be Able To Do