Data Exploration
I avoided data exploration for an embarrassingly long time. Every time I got a new dataset, I'd load it, glance at .head(), confirm nothing looked obviously broken, and jump straight to modeling. And every time — every single time — I'd end up three days deep in debugging, only to discover the real problem was something I could have caught in the first five minutes: a column full of sentinel values masquerading as real data, or a feature that was secretly encoding the target variable. Finally the discomfort of being burned by my own laziness grew too great. This is the deep dive I should have done from the start.
Exploratory Data Analysis — EDA — is the practice of systematically examining a dataset before you do anything else with it. The term was popularized by John Tukey in his 1977 book of the same name, and it remains the single most underrated skill in applied machine learning. It's how you develop intuition about data that no automated pipeline can replace.
Before we start, a heads-up. We're going to talk about distributions, correlations, missing data mechanisms, and statistical profiling. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Why exploration matters — the Anscombe warning
The thirty-second scan
Data types that change modeling decisions
File formats — Parquet or regret
Summary statistics — and why they lie
Distributions — reading the shape of your data
Missing data — the three mechanisms
Rest stop
Relationships and the correlation trap
The target variable — leakage, imbalance, and the "too good" model
Data quality and integrity
Automated profiling in production
Wrap-up
Why Exploration Matters — The Anscombe Warning
Let me start with a story that changed how I think about data. In 1973, a statistician named Francis Anscombe constructed four tiny datasets — each with eleven points. All four had the same mean for both x and y. The same variance. The same correlation coefficient (0.816). The same regression line. By every summary statistic, they were identical twins.
Then he plotted them.
One was a clean linear relationship. One was a perfect curve. One was a perfect line with a single outlier yanking the regression away. And one had all its x-values stacked at the same point except for one lonely outlier creating the illusion of a trend. Four datasets. Same numbers. Completely different stories. This is Anscombe's Quartet, and it's the reason EDA exists.
In 2017, researchers at Autodesk pushed the idea further with the Datasaurus Dozen — thirteen datasets with nearly identical summary statistics that, when plotted, form shapes including a dinosaur, a star, and an X. Same mean, same standard deviation, same correlation. Dinosaur.
I'll be honest — when I first saw the Datasaurus, I didn't fully believe it. I downloaded the data and computed the stats myself. They matched. The discomfort of that moment is what made me take exploration seriously. If summary statistics can hide a dinosaur, what are they hiding in your production dataset?
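If you want to run that check yourself, seaborn ships Anscombe's quartet as a built-in dataset (the Datasaurus needs a separate download). A minimal sketch:
import seaborn as sns
import matplotlib.pyplot as plt
anscombe = sns.load_dataset("anscombe")  # columns: dataset ("I" through "IV"), x, y
print(anscombe.groupby("dataset").agg(["mean", "var"]))  # nearly identical means and variances
print(anscombe.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))  # roughly 0.816 for all four
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)  # the plots tell the real story
plt.show()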
Think of EDA like a medical checkup before surgery. The surgeon doesn't skip the blood work because the patient "looks fine." The blood work reveals problems invisible to the naked eye — elevated markers, unexpected deficiencies, conditions that would make the planned procedure dangerous. Your model is the surgery. Your data is the patient. The checkup is not optional.
The Thirty-Second Scan
Every dataset exploration starts the same way. You open the data and run the equivalent of a vitals check — how big is this thing, what types does it contain, how much of it is actually present. This takes thirty seconds and catches problems that would otherwise waste hours.
Imagine we've been handed a small e-commerce dataset. Eight customers, six columns: user_id, age, city, purchase_amount, signup_date, and churned. In a real dataset, you'd have hundreds of thousands of rows and dozens of columns. The workflow is the same — you scale it, you don't skip it.
import pandas as pd
import numpy as np
df = pd.read_csv("customers.csv")
print(df.shape) # (8, 6) — rows, columns
print(df.dtypes) # the single most revealing line
df.info() # non-null counts + dtypes + memory usage
print(df.sample(3)) # random rows, not head() — head hides patterns
The dtypes output is where most surprises live. A column called purchase_amount showing object instead of float64 means something non-numeric snuck in — a dollar sign, a comma separator, the string "N/A" instead of an actual null. A column called churned showing float64 instead of bool or int means it probably contains NaN values, because pandas upcasts integer columns to float when nulls are present.
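If you want to see exactly what snuck in, coerce the column to numbers and inspect whatever refused to convert. A quick sketch, assuming purchase_amount is the offender:
raw = df["purchase_amount"]
numeric = pd.to_numeric(raw, errors="coerce")  # anything non-numeric becomes NaN
bad = raw[numeric.isnull() & raw.notnull()]  # present in the file, but not parseable as a number
print(bad.unique())  # e.g. "$1,200", "N/A" — the culprits behind the object dtype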
Use sample() instead of head(). Sorted or ordered data at the top of the file will fool you into thinking the data is cleaner than it is. Random sampling shows you what the data actually looks like.
That's the thirty-second scan. It tells you the dimensions of the problem and whether your types are trustworthy. The limitation is obvious: it tells you nothing about what's inside those columns.
Data Types That Change Modeling Decisions
The structured-versus-unstructured distinction is real but it's not what catches you in practice. What catches you are the subtler type distinctions within structured data — distinctions that directly determine how you encode features and which models are appropriate.
Continuous features take any value in a range. Age, temperature, revenue. Linear regression expects continuous targets. Feed it discrete counts and you'll get predictions like 3.7 purchases — which might be fine, or might be nonsensical, depending on your problem.
Discrete features come in countable steps. Number of items purchased, number of clicks. Poisson regression and negative binomial models are built for these. Treating counts as continuous works until it doesn't — and it usually doesn't when counts cluster near zero.
Nominal categorical features have no natural order. City names, product categories, color codes. One-hot encode these. If you label-encode "red=1, blue=2, green=3" you've told the model that green is three times as much as red and that blue is between them. That's an invented number line that doesn't exist, and the model will happily learn from it anyway — learning garbage.
Ordinal categorical features have a meaningful ranking but unknown spacing. Education level (high school, bachelor's, master's, PhD) has a clear order. But the "distance" between high school and bachelor's is not the same as between master's and PhD. Encoding these as 1, 2, 3, 4 implies uniform gaps. Sometimes that's acceptable. Often it isn't. Target encoding or treating them as nominal are safer defaults until you have a reason to believe the spacing is meaningful.
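Here's what those two encoding choices look like in pandas. The education column is hypothetical (it isn't in our toy dataset), and the exact ordinal mapping is a decision you make with domain knowledge, not something the data hands you:
# Nominal: one-hot encoding, no invented number line
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
# Ordinal: an explicit mapping, used only if you accept the implied spacing
education_order = {"high school": 1, "bachelors": 2, "masters": 3, "phd": 4}  # hypothetical column
encoded["education_level"] = df["education"].map(education_order)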
Temporal features are ordered, and that order is sacred. Shuffling rows of a time series destroys the signal. Never random-split time data — always split by time, because your model can't peek into the future in production, so it shouldn't during training either.
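A time-ordered split is only a few lines. A sketch, assuming signup_date is the timestamp that matters in our toy dataset:
df["signup_date"] = pd.to_datetime(df["signup_date"])
df = df.sort_values("signup_date")
cutoff = df["signup_date"].quantile(0.8)  # hold out the most recent 20% of the timeline
train = df[df["signup_date"] <= cutoff]
test = df[df["signup_date"] > cutoff]  # the model never sees the future it will be judged on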
The limitation of this type-level thinking: it tells you how to encode, but not whether a feature is actually useful. For that, we need to look at distributions.
File Formats — Parquet or Regret
Before we get deeper into exploration, a practical detour. The format your data lives in affects how fast you can explore it, and whether types survive the round trip from storage to memory.
CSV is the format everyone knows and the format that causes the most pain. It has no type information — everything is a string until your reader guesses otherwise. It has no compression. It reads every column even if you only need two. A 10 GB CSV file will take minutes to load, consume 10+ GB of memory, and silently turn your zip code "02134" into the integer 2134.
Parquet is a columnar format that solves all of these problems. It stores type metadata, so "02134" stays a string. It compresses data (that 10 GB CSV becomes 1-2 GB). It supports column pruning — reading only the columns you ask for — and predicate pushdown — filtering rows at the storage layer before they ever touch Python memory. Switching from CSV to Parquet is the single highest-impact, lowest-effort improvement most teams can make.
# The swap that changes your life
df.to_parquet("data.parquet", engine="pyarrow")
df = pd.read_parquet("data.parquet", columns=["age", "churned"]) # reads ONLY those two columns
A quick reference for when someone asks you about other formats:
| Format | Reach For It When | Walk Away When |
|---|---|---|
| CSV | Sharing with non-technical humans, quick one-off exploration | Anything in a pipeline — slow, typeless, uncompressed |
| Parquet | Default for ML data — columnar, compressed, type-safe | You need a file a human can open in Notepad |
| Feather/Arrow IPC | Intermediate pipeline files where read speed matters more than compression | Long-term storage or cross-system interchange |
| HDF5 | Large numerical arrays, scientific computing, model weight checkpoints | Tabular data — Parquet is better in every way |
| JSON / JSONL | API responses, nested semi-structured data, config files | Large tabular datasets — verbose and slow |
For truly large datasets — tens of gigabytes or more — consider Polars or Dask instead of Pandas. Pandas is single-threaded and in-memory. That's the right default until your data outgrows it, at which point the medical checkup analogy applies again: you wouldn't run the same blood test on a patient who weighs 50 kg and one who weighs 500 kg using the same equipment.
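For a sense of what that looks like, here's the same kind of read in Polars: lazy by default, multi-threaded, and it only materializes what the query plan actually needs. A sketch, assuming the Parquet file from above and that polars is installed:
import polars as pl
lazy = pl.scan_parquet("data.parquet")  # nothing is read yet
result = (
    lazy.filter(pl.col("purchase_amount") > 0)  # predicate pushdown
        .select(["user_id", "purchase_amount"])  # column pruning
        .collect()  # the whole plan executes in one optimized pass
)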
# Force types on load — prevents silent type coercion
df = pd.read_csv("data.csv", dtype={"zip_code": str, "quantity": "Int32"})
# Category dtype for low-cardinality strings — up to 90% memory savings
df["city"] = df["city"].astype("category")
# SQL — always push the WHERE clause to the database
from sqlalchemy import create_engine
engine = create_engine("sqlite:///shop.db")
df = pd.read_sql("SELECT user_id, revenue FROM orders WHERE revenue > 0", engine)
Summary Statistics — And Why They Lie
Now we look inside the columns. The instinct is to run describe() and move on. After the Anscombe warning, you know why that instinct is dangerous. But describe() is still a useful starting point — as long as you know exactly what to look for and exactly where it can deceive you.
print(df.describe()) # count, mean, std, min, 25%, 50%, 75%, max
print(df.describe(include="all")) # adds categorical columns: unique, top, freq
Back to our e-commerce customers. The describe() output for purchase_amount shows a mean of $142 and a median (the 50th percentile) of $87. That gap — $142 vs $87 — is your first clue. When the mean is substantially higher than the median, the distribution is right-skewed. A few high-spending customers are pulling the average up while most customers spend modestly. If you were to use the mean as a "typical" customer's spending, you'd overestimate most customers and underestimate the whales.
Here's what to hunt for in every describe() output:
The min and max are your sanity check. Negative ages? Prices of zero in a dataset where free items don't exist? A signup_date in the year 2097? These are data entry errors, sentinel values, or upstream pipeline bugs. Every impossible value is a conversation you need to have with whoever generated the data.
The standard deviation relative to the mean tells you about spread. A feature with near-zero std is nearly constant — it carries almost no information for prediction. For a non-negative feature, a std larger than the mean usually signals extreme outliers or a heavy-tailed distribution.
The count value is supposed to equal the number of rows. When it doesn't, you have missing data in that column. How much is missing and why — that's a whole topic we'll get to shortly.
The summary statistics told us something useful. They also hid the shape, the clusters, and the outliers. To see those, we need to look at distributions directly.
Distributions — Reading the Shape of Your Data
A distribution is the answer to the question: if I pick a random value from this column, what am I likely to get? Histograms are the workhorse visualization for this. Each bar represents a range (a "bin"), and the height shows how many values fall in that range.
import matplotlib.pyplot as plt
import seaborn as sns
# Histograms for every numeric column — the fastest first pass
df.select_dtypes(include=[np.number]).hist(bins=50, figsize=(14, 10))
plt.tight_layout()
plt.show()
When you stare at a histogram, you're looking for four things.
Skewness — a long tail in one direction. Income data is almost always right-skewed: most people earn within a range, a few earn enormously more. Skewness is a number that quantifies this asymmetry. Positive skewness means a right tail. Negative means a left tail. Close to zero means roughly symmetric. When skewness is large (say, above 1), log transforms often pull the tail back toward the center, making the feature friendlier for models that assume normality.
Multimodality — multiple peaks. A histogram of purchase_amount with two humps might indicate two distinct customer segments mixed together: casual browsers making small purchases and power users making large ones. Multimodality is a signal that your data might benefit from segmentation — either explicitly (separate models per segment) or implicitly (adding a segment feature).
Hard boundaries — values capped at a suspiciously round number. If every house in your dataset costs at most $500,000 exactly, the data was clipped. The real distribution continues beyond that boundary, and your model will never learn what happens past it. This is called censoring, and it requires specialized handling.
Spikes — an unexpected concentration at a single value. If 40% of your customers have a purchase_amount of exactly zero, that spike isn't a statistical anomaly — it's likely a sentinel value (someone encoded "no purchase" as 0 rather than null) or a genuinely interesting subgroup (people who signed up but never bought). Either way, modeling it as "a regular number that happens to be zero" is a mistake. Separate it out.
There's a companion number to skewness called kurtosis, which measures tail heaviness rather than asymmetry. High kurtosis means more extreme outliers than a normal distribution would predict. I'm still developing my intuition for when kurtosis is actionable versus when it's a curiosity, but the practical rule I've landed on: if kurtosis is above 7, you almost certainly have outliers that will dominate any squared-error metric, and robust scaling or Winsorization deserves consideration.
# Skewness and kurtosis for all numeric columns
# (pandas reports excess kurtosis: 0 means normal-like tails, large positive means heavy tails)
print(df.select_dtypes(include=[np.number]).skew())
print(df.select_dtypes(include=[np.number]).kurtosis())
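And when skew or kurtosis is extreme, two common responses. This is a sketch, not a prescription; the right move depends on your model and how much you're willing to distort the original scale:
# Log transform: pulls a heavy right tail back toward the center
df["log_purchase"] = np.log1p(df["purchase_amount"])
print(f"skew before: {df['purchase_amount'].skew():.2f}  after: {df['log_purchase'].skew():.2f}")
# Winsorization via clipping at the 1st/99th percentiles: blunts outliers without dropping rows
lower, upper = df["purchase_amount"].quantile([0.01, 0.99])
df["purchase_clipped"] = df["purchase_amount"].clip(lower, upper)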
For categorical features, the equivalent of a histogram is a value count:
print(df["city"].value_counts())
print(df["city"].value_counts(normalize=True)) # proportions
A category holding 95% of the data gives almost no signal — it's nearly constant. A column with as many unique values as rows is an ID column pretending to be a feature. Categories appearing in less than 1-2% of rows should probably be bucketed into "Other" to prevent overfitting on rare groups the model will never see enough of to learn from.
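The bucketing itself is a few lines once you have the frequencies. A sketch using a 1% threshold, which is a starting point rather than a rule:
freq = df["city"].value_counts(normalize=True)
rare = freq[freq < 0.01].index
df["city_bucketed"] = np.where(df["city"].isin(rare), "Other", df["city"])
print(df["city_bucketed"].value_counts())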
Distributions told us what individual columns look like. The limitation: they told us nothing about how columns relate to each other. But before we go there, there's a topic we've been avoiding.
Missing Data — The Three Mechanisms
Every production dataset has missing values. The question isn't whether data is missing — it's why it's missing. And the "why" determines everything about how you should handle it.
There are three mechanisms of missingness, formalized by statisticians Rubin and Little. I'll walk through each using our e-commerce dataset.
Missing Completely At Random (MCAR) means the probability of a value being missing has nothing to do with any data — observed or unobserved. A server crash randomly dropped 3% of rows from the purchase_amount column. The missingness is pure accident. You can verify MCAR (roughly) by checking whether the distribution of other columns looks the same for rows where purchase_amount is present versus missing. If they match, MCAR is plausible. With MCAR, even simple imputation (mean, median) is unbiased — you lose some statistical power, but you don't introduce systematic error.
Missing At Random (MAR) is trickier. The missingness depends on observed variables but not on the missing value itself. In our dataset, maybe younger customers are less likely to fill in their age — but within each age group, the missingness is random. You can detect MAR by building a quick logistic regression: create a binary indicator (1 = missing, 0 = observed) for the suspect column and model it as a function of other features. If other features significantly predict missingness, you're likely in MAR territory. The fix: use model-based imputation (KNN, iterative imputation) that leverages those observed relationships.
Missing Not At Random (MNAR) is the hardest case. The probability of missing depends on the missing value itself. High-income customers don't report their income. Severely ill patients drop out of clinical trials. The data that's missing is missing because of what it would have been. You generally can't detect MNAR from data alone — it requires domain knowledge. And no standard imputation method can fully correct for it, because the information needed to estimate the missing values is… the missing values. I still get tripped up distinguishing MAR from MNAR in practice. The honest answer is that in ambiguous cases, you document your assumption, use the best imputation available, and test whether your model's conclusions change under different assumptions. That's called a sensitivity analysis, and it's what experienced practitioners do when certainty isn't available.
# Step 1: How much is missing?
print(df.isnull().sum().sort_values(ascending=False))
print(f"\n% missing:\n{(df.isnull().mean() * 100).sort_values(ascending=False)}")
# Step 2: Visualize missingness patterns
import missingno as msno
msno.matrix(df, figsize=(10, 4), fontsize=8)
# Columns that go missing TOGETHER suggest correlated missingness — investigate
# Step 3: Test whether missingness is related to other columns (distinguishes MCAR from MAR)
from sklearn.linear_model import LogisticRegression
missing_indicator = df["age"].isnull().astype(int)  # 1 = age missing, 0 = age observed
predictors = df[["purchase_amount", "churned"]].fillna(df[["purchase_amount", "churned"]].median())
model = LogisticRegression().fit(predictors, missing_indicator)
print(model.score(predictors, missing_indicator))  # well above the majority-class baseline => missingness is NOT completely at random
The medical checkup analogy applies here too. A missing lab result is different from a patient who refused the test. The first is a clerical error (MCAR). The second might be because they already know the result is bad (MNAR). The treatment plan — your imputation strategy — depends on knowing the difference.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a mental model for exploring any dataset: scan the shape and types, check summary statistics while remembering they can lie, plot distributions to see what the numbers hide, and investigate missing data with an understanding of the three mechanisms. That's a genuinely useful toolkit — you could run this workflow on any new dataset and catch 80% of the problems that would otherwise blindside you during modeling.
But it doesn't tell the complete story. We haven't looked at how columns relate to each other, where correlation analysis goes wrong, how to examine the target variable for leakage, or how to automate all of this in a production pipeline.
The short version: always plot relationships (don't rely on correlation coefficients alone), always suspect a model that's "too good," and automate your EDA checks so they run on every new batch of data. There. You're 80% of the way.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
Relationships and the Correlation Trap
We've been looking at columns in isolation. The real power of EDA comes from examining how columns interact. The standard tool for this is the correlation matrix — and it's the tool that deceives more practitioners than any other.
Pearson correlation measures the strength of a linear relationship between two variables. It ranges from -1 (perfect negative linear) to +1 (perfect positive linear), with 0 meaning no linear relationship. The critical word is "linear." Two variables can be perfectly related — say, y = x² — and show a Pearson correlation near zero. The relationship is there, clear as day in a scatter plot. Pearson doesn't see it because it only looks for straight lines.
Spearman correlation is the fix for monotonic relationships. It doesn't care about linearity — only about whether one variable tends to increase (or decrease) as the other increases. If your relationship is curved but still one-directional, Spearman catches it where Pearson misses it. Use Spearman as your default unless you have a specific reason to need Pearson.
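To make the difference concrete, here's a tiny synthetic example (made-up data, purely illustrative) where the relationship is curved but strictly increasing:
x = np.linspace(0, 10, 200)
y = np.exp(x)  # perfectly determined by x, just not linearly
demo = pd.DataFrame({"x": x, "y": y})
print(demo.corr(method="pearson").loc["x", "y"])   # roughly 0.7 — Pearson understates the relationship
print(demo.corr(method="spearman").loc["x", "y"])  # exactly 1.0 — the relationship is perfectly monotonic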
# Both correlation methods side by side
pearson_corr = df.select_dtypes(include=[np.number]).corr(method="pearson")
spearman_corr = df.select_dtypes(include=[np.number]).corr(method="spearman")
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pearson_corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=axes[0])
axes[0].set_title("Pearson (linear only)")
sns.heatmap(spearman_corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=axes[1])
axes[1].set_title("Spearman (monotonic)")
plt.tight_layout()
plt.show()
Now for the trap. Correlation does not imply causation — everyone knows that phrase. Fewer people know the specific ways correlation actively misleads.
Simpson's Paradox is the most unsettling. A trend that appears in several subgroups of your data can reverse when you combine them. The classic example: UC Berkeley's 1973 graduate admissions figures appeared to show gender bias because the overall acceptance rate was higher for men. When researchers broke the data down by department, they found that women were actually admitted at a higher rate in most departments. The overall trend reversed because women disproportionately applied to more competitive departments. The aggregated correlation was the opposite of the truth.
I'll be honest — the first time I encountered Simpson's Paradox in my own data, I spent an hour convinced my aggregation code had a bug. It didn't. The data was doing exactly what data does when a confounding variable hides in the background.
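You can reconstruct the effect yourself with invented numbers; this toy version exhibits the same reversal:
adm = pd.DataFrame({
    "dept": ["easy"] * 100 + ["hard"] * 100,
    "gender": ["M"] * 80 + ["F"] * 20 + ["M"] * 20 + ["F"] * 80,
    "admitted": [1] * 60 + [0] * 20 + [1] * 16 + [0] * 4
              + [1] * 4 + [0] * 16 + [1] * 20 + [0] * 60,
})
print(adm.groupby("gender")["admitted"].mean())            # overall: men look favored (0.64 vs 0.36)
print(adm.groupby(["dept", "gender"])["admitted"].mean())  # within each dept: women are admitted at higher rates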
Spurious correlation is the other hazard. Ice cream sales and drowning deaths are correlated — not because ice cream causes drowning, but because summer heat drives both. In high-dimensional datasets (many features), spurious correlations are guaranteed to appear. With 100 random features and only a few hundred rows, you'll find pairs with correlations of 0.3+ by pure chance. The more features you have, the more ghosts show up in your correlation matrix.
For categorical features, Pearson and Spearman don't apply. Use Cramér's V — a measure of association for categorical-categorical pairs based on the chi-squared statistic, normalized to fall between 0 and 1.
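Pandas doesn't compute Cramér's V for you, but it's a small helper on top of scipy's chi-squared test. A sketch, using two of our toy columns:
from scipy.stats import chi2_contingency
def cramers_v(a, b):
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(cramers_v(df["city"], df["churned"]))  # 0 = no association, 1 = perfect association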
The defense against all of this: never trust a correlation number alone. Always make the scatter plot. Always check subgroups. Always ask whether a third variable could explain the relationship.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Scatter plot — the correlation matrix can't show you clusters or outliers
axes[0].scatter(df["age"], df["purchase_amount"], alpha=0.4, s=15)
axes[0].set(xlabel="Age", ylabel="Purchase Amount", title="Age vs Spending")
# Boxplot — how does a numeric variable differ across groups?
sns.boxplot(x="city", y="purchase_amount", data=df, ax=axes[1])
axes[1].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
# Grouped statistics — sometimes more informative than any plot
print(df.groupby("city")["purchase_amount"].agg(["mean", "median", "std", "count"]))
If box plots look identical across groups, that categorical feature probably doesn't help predict the numeric one. If they differ dramatically, you've found a feature worth investigating further.
The Target Variable — Leakage, Imbalance, and the "Too Good" Model
The target variable deserves its own examination, separate from everything else. Two problems live here that can silently invalidate your entire project.
Class imbalance in classification tasks. If 98% of your dataset is one class, a model that predicts the majority class every single time achieves 98% accuracy. It has learned nothing. It is useless. Check the class distribution before building anything.
# Classification — check the balance
print(df["churned"].value_counts(normalize=True))
# 0.95 / 0.05 split means you need stratified splitting, weighted losses,
# and metrics that care about the minority class (precision, recall, F1, PR-AUC)
# Regression — check the shape
print(f"Target skewness: {df['purchase_amount'].skew():.2f}")
# Skew > 1: seriously consider log1p transform to normalize the target
Target leakage is the more dangerous problem because it hides behind great performance. I once spent a week debugging a churn model that had an AUC of 0.99. Every metric was spectacular. It turned out one of the features — "days since last login" — was computed using data from after the churn event. The model wasn't predicting churn; it was reading the answer key.
Red flags that suggest leakage: a single feature with extreme importance, an AUC that seems too high for the problem domain (real-world prediction problems rarely exceed 0.95), or a feature whose definition only makes sense if the outcome is already known. The remedy is simple in principle and difficult in practice: trace every feature's lineage and ask, "would this value be available at the moment I need to make this prediction in production?"
There's a brute-force detection method that works surprisingly well: shuffle the target column randomly and retrain. Out-of-sample performance on a shuffled target should collapse to chance. If it stays high, something in your pipeline is leaking: often the evaluation setup itself (preprocessing fit on the full dataset, duplicate rows straddling the train/test split) or features that effectively identify individual rows.
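A sketch of that test on a realistically sized dataset. The model and feature selection here are placeholders, and the crude fillna is only there to keep the diagnostic runnable:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = df.drop(columns=["user_id", "churned"]).select_dtypes(include=[np.number]).fillna(0)
y = df["churned"]
y_shuffled = y.sample(frac=1, random_state=0).reset_index(drop=True)  # breaks the row-to-target pairing
real = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring="roc_auc").mean()
fake = cross_val_score(RandomForestClassifier(random_state=0), X, y_shuffled, cv=5, scoring="roc_auc").mean()
print(f"real AUC: {real:.2f}   shuffled AUC: {fake:.2f}")  # shuffled should collapse toward 0.5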
Data Quality and Integrity
By this point you've profiled types, distributions, missing values, correlations, and the target. The final layer is integrity — the domain-specific checks that no automated tool can do for you because they require knowledge of what the data is supposed to represent.
# Exact duplicates — surprisingly common in production
print(f"Duplicate rows: {df.duplicated().sum()}")
print(f"Duplicate user_ids: {df['user_id'].duplicated().sum()}")
# Domain assertions (drop NaNs first; a missing value is not the same as an invalid one)
assert (df["age"].dropna() >= 0).all(), "Negative ages found"
assert (df["age"].dropna() <= 120).all(), "Unrealistic ages found"
assert df["purchase_amount"].dropna().between(0, 1e6).all(), "Purchases outside sane range"
# String inconsistencies — the silent killer
print(df["city"].nunique(), "unique cities")
print(df["city"].str.strip().str.lower().nunique(), "after normalization")
# "New York" vs "new york" vs " New York " — three entries, one city
In production systems, these checks belong in code, not in notebooks. Tools like Great Expectations and Pandera let you define expectations as schemas — "this column is never null, always positive, has at most 50 unique values" — and validate every incoming batch automatically. When a check fails, the pipeline halts before bad data reaches your model. This is the difference between a medical checkup performed by a doctor (manual EDA) and a continuous monitoring system that beeps when vitals go abnormal (automated validation). You need both.
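A taste of what that looks like in Pandera: the same assertions expressed as a schema. This is a sketch, so check the Pandera docs for the exact API in your installed version:
import pandera as pa
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, unique=True),
    "age": pa.Column(float, pa.Check.in_range(0, 120), nullable=True),
    "purchase_amount": pa.Column(float, pa.Check.in_range(0, 1e6)),
    "city": pa.Column(str),
})
validated = schema.validate(df)  # raises a SchemaError the moment an expectation is violated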
Automated Profiling in Production
Everything we've done so far — shape, types, distributions, correlations, missing patterns — can be generated in a single function call. That's worth knowing, with a caveat.
# ydata-profiling — the most comprehensive automated report
from ydata_profiling import ProfileReport
report = ProfileReport(df, title="Customer EDA", explorative=True)
report.to_file("customer_report.html")
# sweetviz — best for comparing train vs test distributions
import sweetviz as sv
# train and test here are your own DataFrame splits
sv.compare([train, "Train"], [test, "Test"]).show_html("comparison.html")
# whylogs — lightweight profiling designed for production pipelines
import whylogs as why
result = why.log(df)
result.writer("local").write() # generates a statistical profile you can diff over time
The honest tradeoff: automated tools show you everything but can't tell you what matters for your problem. They generate hundreds of plots and statistics with no sense of priority. They also struggle with datasets larger than about a million rows. Use them to get oriented on a new dataset. Then dig into the specifics manually — the specifics that matter to your modeling decision.
In a production pipeline, the real power of profiling tools isn't the one-time report. It's running them on every data refresh and diffing the profiles over time. When the mean of a feature shifts by 20% between this week and last week, that's data drift — and it's one of the most common reasons a deployed model's performance degrades. Tools like EvidentlyAI and whylogs are built for exactly this: continuous monitoring that catches drift before your users notice.
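You don't need a framework to get the idea across. A library-free version of the weekly diff, with a made-up threshold you'd tune for your own data:
def drift_report(current, previous, threshold=0.2):
    cur = current.select_dtypes(include=[np.number]).mean()
    prev = previous.select_dtypes(include=[np.number]).mean()
    shift = ((cur - prev) / prev.abs()).abs()  # relative change in each feature's mean
    return pd.DataFrame({"previous_mean": prev, "current_mean": cur, "drifted": shift > threshold})
# drift_report(this_week_df, last_week_df)  # hypothetical weekly batches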
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a warning from Anscombe — that identical statistics can hide wildly different data — and used it to motivate exploring every dataset before modeling. We built a workflow: scan the shape and types, compute summary statistics while remembering their blindspots, plot distributions to catch skew and spikes and sentinel values, investigate missing data through the lens of three mechanisms, examine relationships while watching for Simpson's Paradox and spurious correlations, scrutinize the target for imbalance and leakage, and enforce domain integrity checks. Then we automated what could be automated and acknowledged what couldn't.
My hope is that the next time you receive a new dataset, instead of jumping straight into model.fit() and debugging mysterious failures three days later, you'll spend those first thirty minutes with the data — running the medical checkup, reading the distributions, checking the types — and come away with a pretty good mental model of what you're working with before you ask any algorithm to learn from it.
Resources and Credits
Exploratory Data Analysis by John W. Tukey (1977) — the O.G. The book that invented the field. Dense but endlessly quotable.
The Datasaurus Dozen by Matejka & Fitzmaurice (2017) — the paper that made Anscombe's point unforgettable by hiding a dinosaur in summary statistics.
Statistical Analysis with Missing Data by Little & Rubin — the definitive reference on MCAR/MAR/MNAR. Rigorous but worth the effort for the missing-data chapters alone.
EvidentlyAI documentation — wildly helpful for understanding production data monitoring and drift detection.
Great Expectations documentation — insightful if you want to treat data validation like software testing. The "expectations" paradigm changes how you think about data quality.
Spurious Correlations by Tyler Vigen (tylervigen.com) — a website and book cataloging absurd real correlations (US spending on science vs suicides by hanging). Equal parts hilarious and sobering.