Preprocessing Pipelines

TL;DR

A preprocessing pipeline chains every transformation — scaling, encoding, imputation — into a single object that you fit once on training data and replay identically on test data and in production. Scikit-learn's Pipeline and ColumnTransformer eliminate the entire class of bugs where training and inference disagree on how data was prepared. This section builds pipelines from scratch, walks through how each piece works under the hood, and confronts the subtle leakage bugs that pipelines exist to prevent.

I avoided really understanding preprocessing pipelines for longer than I'd like to admit. For months, I wrote my scaling code in one cell, my encoding code in another, serialized the model separately from the scaler, and then prayed that my production server would apply the same transformations in the same order. Spoiler: it didn't. I shipped a loan-approval model that was confidently approving everyone, because the scaler in production was freshly initialized — never fit on the training data. The model was seeing raw, unscaled features and interpreting them as if they'd already been standardized. That was my wake-up call, and this section is the deep dive I should have done back then.

Scikit-learn's Pipeline was introduced to solve exactly this kind of gap. It bundles every preprocessing step and the final model into a single object. You call .fit() once on your training data, and the entire chain — imputation, scaling, encoding, modeling — gets locked in. When you later call .predict() on new data, every step replays with the exact parameters it learned during training. No mismatches. No gaps.

Before we start, a heads-up. We're going to be writing custom transformer classes, nesting pipelines inside other pipelines, and confronting some surprisingly subtle data leakage bugs. But you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The Gap Problem
Your First Pipeline — The Assembly Line
What Happens Under the Hood
ColumnTransformer — Routing Mixed Data
Custom Transformers — When sklearn Isn't Enough
Rest Stop
The Leakage Trap
Tuning Across the Entire Chain
Serialization and Production
Beyond sklearn — Feature Stores
Wrap-Up
Resources

The Gap Problem

Let's start with a concrete scenario. We're building a loan approval model. Our dataset has three columns: age (numeric), income (numeric), and occupation (categorical). Even for this tiny dataset, the preprocessing involves at least three steps: impute missing values, scale the numbers, and encode the categories. Here's what that looks like without a pipeline.

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np

# Step 1: Impute
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train[["age", "income"]])

# Step 2: Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)

# Step 3: Encode
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train[["occupation"]])

# Step 4: Combine... somehow (the encoder output is sparse, so plain np.hstack won't cut it)
# Step 5: Train model
# Step 6: Save scaler, imputer, encoder, model separately
# Step 7: In production, load all four, apply in correct order
# Step 8: Pray

Every single gap between those steps is a place where a bug can hide. Did you fit the scaler on test data by accident? Did you forget to apply the imputer in production? Did the encoding categories shift between training and serving? The bug isn't in any single step. The bug lives in the spaces between the steps.

Think of it like a factory floor. If you have four workers sitting at separate tables, each doing their part of the assembly, you need someone to carry the half-finished product between tables — and that someone is you, writing fragile glue code. A pipeline is the conveyor belt. It connects the stations, parts flow from one to the next automatically, and nothing gets dropped in between.

Your First Pipeline — The Assembly Line

Here's our loan approval preprocessing rewritten as a pipeline, starting with the numeric columns only.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

Each entry in that list is a tuple: a name (your choice, for later reference) and a transformer or estimator object. There's one hard rule: every step except the last must be a transformer, implementing both .fit() and .transform(). The last step can be a transformer too, but more commonly it's a model — a classifier or regressor — which implements .fit() and .predict().

Now watch what happens when we call .fit() on this pipeline.

pipe.fit(X_train, y_train)

That single line does three things. It fits the imputer on the training data and transforms it. It fits the scaler on the imputed training data and transforms it. It fits the logistic regression on the scaled, imputed data. Three fits, in order, each feeding its output to the next. The conveyor belt is running.

When we later call .predict() on new data, the pipeline replays the chain — but this time, each step only transforms, using the parameters it learned during .fit(). The imputer fills in the same median values. The scaler subtracts the same mean and divides by the same standard deviation. The model predicts with the same weights. No re-fitting. No gaps.

pipe.predict(X_test)    # imputer.transform → scaler.transform → model.predict
pipe.score(X_test, y_test)

That's the whole idea. One object. One .fit(). One .predict().
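
Every fitted step is reachable through the pipeline's named_steps attribute, so you can check that the learned parameters really are locked in. A quick inspection sketch, reusing the pipe we just fit:

print(pipe.named_steps["imputer"].statistics_)   # the medians it learned
print(pipe.named_steps["scaler"].mean_)          # the means it learned
print(pipe.named_steps["scaler"].scale_)         # the standard deviations it learned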

What Happens Under the Hood

It's worth understanding the machinery inside Pipeline.fit(), because it reveals a design decision that trips people up. Here's the pseudocode for what sklearn does internally.

# Pseudocode for Pipeline.fit(X, y)
Xt = X
for name, step in self.steps[:-1]:           # every step except the last
    if hasattr(step, "fit_transform"):
        Xt = step.fit_transform(Xt, y)       # fused fit + transform
    else:
        Xt = step.fit(Xt, y).transform(Xt)   # fit, then transform separately

self.steps[-1][1].fit(Xt, y)                 # finally, fit the model itself

Notice the fit_transform check. If a step has a dedicated .fit_transform() method, sklearn calls that instead of calling .fit() and .transform() separately. Why? Because for some transformers, doing both at once is faster. PCA, for instance, computes the SVD once during fit_transform rather than computing it in fit and then reapplying it in transform. This is an optimization, not a semantic difference — but it's the kind of detail a senior interviewer might probe.

The other thing to notice: y gets passed through to every step. Most transformers ignore it (their .fit() accepts y=None), but some — like target encoders or feature selectors based on mutual information — actually use the target variable during fitting. That's why it's threaded through.
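
If you've never seen a y-aware step, here's a minimal sketch using SelectKBest with mutual information, a feature selector whose .fit() genuinely needs the labels:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif, k=5)),  # k=5 is arbitrary; keep it below your feature count
    ("model", LogisticRegression())
])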

There's a shorthand worth knowing. make_pipeline builds a pipeline without you needing to name each step. It auto-generates names from the class names.

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(SimpleImputer(strategy="median"),
                     StandardScaler(),
                     LogisticRegression())

The step names become "simpleimputer", "standardscaler", "logisticregression". This is fine for quick prototyping, but for anything you'll tune or debug later, explicit names are worth the extra keystrokes.
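
The auto-generated names are still real step names, though, and they work anywhere a name is required. A quick sketch:

pipe.set_params(logisticregression__C=0.5)    # reach into a step by its auto-generated name
print(pipe.named_steps["standardscaler"])     # pull a step out for inspection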

ColumnTransformer — Routing Mixed Data

Our loan approval pipeline so far handles only numeric columns. But occupation is categorical. We can't scale a string. We need different transformations for different columns, and that's what ColumnTransformer was built for.

Think of it as a sorting station on our factory floor. Raw data arrives on the conveyor belt. The ColumnTransformer peels off the numeric columns and sends them down one sub-belt (impute, then scale). It peels off the categorical columns and sends them down a different sub-belt (impute, then one-hot encode). At the end, it glues the results back together into a single array and sends it onward.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

numeric_cols = ["age", "income"]
categorical_cols = ["occupation"]

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols)
])

full_pipe = Pipeline([
    ("prep", preprocessor),
    ("model", LogisticRegression())
])

full_pipe.fit(X_train, y_train)
full_pipe.score(X_test, y_test)

Each tuple inside ColumnTransformer has three parts: a name, a transformer (or sub-pipeline), and which columns to route to it. The output columns are concatenated in the order you listed the transformers.
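
If you ever want to confirm that order, a fitted ColumnTransformer will report its output names (sklearn 1.0+). A quick check, using the preprocessor defined above; the exact category names depend on your data:

preprocessor.fit(X_train)
print(preprocessor.get_feature_names_out())
# e.g. ['num__age', 'num__income', 'cat__occupation_engineer', ...]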

Now here's something that confused me the first time. The remainder parameter. By default, remainder="drop". That means any column you don't explicitly mention gets silently dropped. If your dataset has a column called loan_amount and you forgot to include it, it's gone. No error. No warning. Gone.

# If you want unmentioned columns to pass through untouched:
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols)
], remainder="passthrough")

I still occasionally get bitten by this. When my model's accuracy mysteriously drops after adding a new feature to the dataset, the first thing I check is whether the ColumnTransformer is actually including it. Setting remainder="passthrough" is the safe default for exploration. Switch to explicit column lists when you want strict control in production.

One more thing. If you want your pipeline to output a pandas DataFrame (with column names intact) instead of a numpy array, sklearn 1.2+ lets you do this.

full_pipe.set_output(transform="pandas")

That single line makes every transformer in the chain return DataFrames. (One catch: sparse matrices can't become DataFrames, so you'd need to set sparse_output=False on the OneHotEncoder for this to work with our full_pipe.) Debugging becomes dramatically easier when your intermediate outputs have column names instead of anonymous indices.
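
One debugging pattern this enables: slice the pipeline to run only the preprocessing steps, then eyeball the intermediate DataFrame. A sketch, assuming full_pipe has already been fit:

# full_pipe[:-1] is every step except the final model, as a sub-pipeline
print(full_pipe[:-1].transform(X_test).head())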

Custom Transformers — When sklearn Isn't Enough

Sklearn's built-in transformers cover the common cases — scaling, encoding, imputation. But real data demands real creativity. Maybe you need to log-transform skewed income values. Maybe you need to extract the day-of-week from a timestamp. Maybe you need to compute the ratio of two columns. None of these exist as built-in transformers, so you build your own.

There are two ways. For quick, stateless transformations — things that don't need to learn anything from the data — FunctionTransformer is your friend.

from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transformer = FunctionTransformer(np.log1p)

pipe = Pipeline([
    ("log", log_transformer),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

FunctionTransformer wraps any callable into an sklearn-compatible transformer. The wrapped function gets called during .transform(), and .fit() does nothing. Quick. Clean. No class definition needed.
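
One caveat before you reach for a lambda: pickle can't serialize lambdas, so a FunctionTransformer built on one will fail the moment you try to save the pipeline. Define a module-level function instead. Here's a hedged sketch of the two-column-ratio idea from earlier; the column indices are purely illustrative:

def income_to_age_ratio(X):
    # Assumes column 0 is income and column 1 is age -- adjust for your data
    return X[:, [0]] / X[:, [1]]

ratio_transformer = FunctionTransformer(income_to_age_ratio)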

But what if your transformation needs to learn something from the training data? Say you want to clip outliers at the 1st and 99th percentiles — percentiles you compute from training data and then apply consistently to test data. That's a stateful transformation. FunctionTransformer can't do that. You need a custom class.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PercentileClipper(BaseEstimator, TransformerMixin):
    def __init__(self, lower=1, upper=99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        self.lower_bound_ = np.percentile(X, self.lower, axis=0)
        self.upper_bound_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lower_bound_, self.upper_bound_)

A few things to unpack here. Inheriting from BaseEstimator gives you .get_params() and .set_params() for free, which means your transformer works with GridSearchCV and pipeline cloning. Inheriting from TransformerMixin gives you a default .fit_transform() that calls .fit() then .transform().

The __init__ method must only assign parameters. No computation. No data access. This is a contract with sklearn's parameter inspection system — violate it and GridSearchCV will silently break in confusing ways.

Learned attributes get a trailing underscore: self.lower_bound_, not self.lower_bound. This is sklearn convention. It signals "this was learned from data during .fit()" versus "this was set by the user in __init__." It matters because sklearn uses this distinction internally.
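
Using the clipper is no different from using a built-in. A quick usage sketch, dropping it into the numeric pipeline from earlier:

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("clipper", PercentileClipper(lower=1, upper=99)),
    ("scaler", StandardScaler())
])

And thanks to the BaseEstimator inheritance, clipper__lower and clipper__upper are now tunable with the same machinery as any built-in transformer's parameters (more on that shortly).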

I'll be honest — I'm still developing my intuition for when FunctionTransformer stops being enough and a custom class is warranted. My rough heuristic: if you find yourself storing anything in a variable outside the function and referencing it later, you've outgrown FunctionTransformer. Go write the class.

Rest Stop

If you've made it this far, you can build real preprocessing pipelines. You know how to chain transformers, route different column types through ColumnTransformer, and write custom transformers when the built-ins aren't enough. That's genuinely useful. You could stop here and have a solid toolkit for every sklearn-based project.

What we haven't covered is why pipelines exist in the first place — not as a convenience, but as a defense against a specific, insidious class of bugs. And we haven't talked about what happens when your pipeline needs to leave your laptop and survive in production.

The short version: pipelines prevent data leakage during cross-validation, and they serialize into a single file that guarantees identical preprocessing at serving time. There. That's the 80% summary.

But if the discomfort of not knowing how leakage happens, and why it's so hard to catch, is nagging at you — read on.

The Leakage Trap

Here's a line of code that looks completely innocent.

from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)        # <-- the trap: fit on ALL the data

X_train, X_test = train_test_split(X_scaled, test_size=0.2)

The scaler was fit on all the data, including the test set. That means the mean and standard deviation it computed contain information from samples the model will later be evaluated on. The test set is no longer independent. Your accuracy estimate is optimistic. You have data leakage.

This is one of the most common mistakes in machine learning, and it's frighteningly easy to make. The code runs. No errors. The model trains. The accuracy looks great — maybe suspiciously great. You only discover the problem months later when production performance doesn't match your offline numbers.

The fix: put every preprocessing step inside the pipeline, and let cross-validation handle the fitting.

from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

scores = cross_val_score(pipe, X, y, cv=5)

Now, on each of the five folds, the scaler is fit only on the training portion of that fold. The validation portion is scaled using training-fold statistics. No information leaks across the boundary. The factory floor analogy returns: the conveyor belt runs inside a sealed room for each fold. Nothing from the validation set touches the assembly line.

This gets more dangerous with fancier preprocessing. Target encoding — where you replace categories with the mean of the target variable — is especially prone to leakage. If the encoder sees the test labels during fitting, it will encode test-set categories using their own outcomes, which is circular. The pipeline enforces discipline: .fit() sees only training data, period.

For time-series data, this discipline extends to the split strategy itself. Standard KFold ignores temporal order, so training folds can contain data from after the validation period; the model effectively learns from the future. Use TimeSeriesSplit instead — it respects the temporal order, ensuring you always train on the past and evaluate on the future.
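
A minimal sketch of the same evaluation with temporal order respected, swapping only the cv argument:

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on an earlier window and validates on a later one
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5))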

Tuning Across the Entire Chain

Here's where the pipeline really flexes. You want to tune not only the model's hyperparameters, but also the preprocessor's. Should the imputer use the mean or the median? Should the one-hot encoder drop the first category or keep them all? With a pipeline, you can search over all of it in one go.

The trick is sklearn's double-underscore naming convention. Each step in the pipeline has a name. You reach into nested parameters by chaining names with __.

from sklearn.model_selection import GridSearchCV

param_grid = {
    "prep__num__imputer__strategy": ["mean", "median"],
    "prep__cat__encoder__drop": [None, "first"],
    "model__C": [0.1, 1.0, 10.0]
}

search = GridSearchCV(full_pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_)

Let's trace one of those parameter names. "prep__num__imputer__strategy" means: in the pipeline, find the step named prep (our ColumnTransformer). Inside it, find the sub-transformer named num (our numeric pipeline). Inside that, find the step named imputer. Set its strategy parameter. Four levels deep, reached through a flat string. This is why explicit step names matter — you need them to build these parameter paths.

And because the pipeline is inside GridSearchCV, every candidate combination is evaluated with proper cross-validation. The scaler is refit on each fold's training data. No leakage, even while tuning.
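
If you're unsure which parameter paths exist, ask the pipeline itself. get_params() returns every path in the tree, ready to copy into a param_grid:

for name in sorted(full_pipe.get_params()):
    print(name)    # prep__num__imputer__strategy, model__C, and so on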

Serialization and Production

A fitted pipeline contains everything: the imputer's learned medians, the scaler's learned means and standard deviations, the encoder's learned categories, and the model's learned weights. All of it serializes into a single file.

import joblib

joblib.dump(full_pipe, "loan_approval_pipeline.joblib")

In production, you load one file and call .predict(). The preprocessing that happens at inference time is byte-for-byte identical to what happened during training. That's the guarantee.

loaded_pipe = joblib.load("loan_approval_pipeline.joblib")
predictions = loaded_pipe.predict(new_applications)

One file. One load. Zero chance of train-serve skew — at least from the preprocessing side.

Back to our factory analogy: instead of shipping four separate machines (imputer, scaler, encoder, model) to the production facility and hoping someone assembles them correctly, you ship the entire assembly line as one sealed unit. Plug it in. Feed it raw materials. Finished product comes out the other end.

There is a real-world limitation here, though. joblib serializes Python objects, which means your production server needs the same Python version, the same sklearn version, and the same custom transformer classes available for import. If you upgrade sklearn and the internal representation of StandardScaler changes, your saved pipeline may not load. For long-lived production systems, consider exporting to an interchange format like ONNX, or version-locking your dependencies.
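
One lightweight guard, sketched here as a pattern rather than a standard sklearn feature: store the library versions next to the pipeline and refuse to serve on a mismatch.

import sys
import joblib
import sklearn

artifact = {
    "pipeline": full_pipe,
    "sklearn_version": sklearn.__version__,
    "python_version": sys.version,
}
joblib.dump(artifact, "loan_approval_pipeline.joblib")

# At load time, check before trusting the pipeline
loaded = joblib.load("loan_approval_pipeline.joblib")
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("sklearn version mismatch: retrain or pin dependencies")
pipeline = loaded["pipeline"]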

Beyond sklearn — Feature Stores

Sklearn pipelines solve the offline problem beautifully: training and batch inference with consistent preprocessing. But they don't solve the online problem. If your model serves real-time predictions via an API, you need features computed and served in milliseconds — and you need those features to be identical to what the model saw during training.

This is where feature stores like Feast and Tecton enter the picture. A feature store is a centralized system that computes, stores, versions, and serves features for both training and inference. It handles things that sklearn pipelines never will: point-in-time joins to prevent temporal leakage, real-time feature computation from streaming data, feature versioning across experiments, and a shared registry so that team members can discover and reuse features instead of rebuilding them.

The mental model: an sklearn pipeline is a recipe card you carry in your pocket. A feature store is a full commercial kitchen where recipes are versioned, ingredients are prepped by specialists, and every dish that leaves the kitchen is logged and traceable.

You don't need a feature store for every project. For offline experiments, Kaggle competitions, or small-scale batch inference, a serialized sklearn pipeline is all you need. But once you have multiple models sharing features, real-time latency requirements, or a team that's stepping on each other's preprocessing code — you'll be glad feature stores exist.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a pain — scattered preprocessing code and the ghost bugs that live in the spaces between steps. We built our first pipeline as a conveyor belt, traced what happens inside .fit() under the hood, and then confronted the mixed-type reality of real datasets with ColumnTransformer. We wrote custom transformers — both the quick FunctionTransformer and the full-class version. We saw how pipelines prevent data leakage during cross-validation, how the double-underscore convention lets us tune the entire chain, and how a single joblib.dump() can ship everything to production. Finally, we looked past sklearn toward the feature stores that production systems eventually demand.

My hope is that the next time you find yourself writing preprocessing code in scattered notebook cells — fitting the scaler here, encoding the categories there, passing arrays between steps and hoping nothing drifts — you'll instead reach for a Pipeline, bundle the whole thing up, and never wonder whether training and production agree on how the data was prepared.

Resources