Cross-Validation
I avoided thinking deeply about cross-validation for an embarrassingly long time. I knew what it was — split, train, average the scores, move on. Every time a colleague mentioned "nested CV" or debated k=5 versus k=10, I'd nod along and quietly change the subject. I had this nagging feeling that I was treating it as a magic incantation rather than understanding what it was actually doing for me. Finally the discomfort of not knowing what's really happening under the hood grew too great. Here is that dive.
Cross-validation is a family of techniques for estimating how well a model will perform on data it hasn't seen. The core idea dates back to the 1930s (Larson, 1931), but the version most of us use — K-Fold — was formalized by Geisser and Stone in the 1970s. It has since become the standard way to evaluate models, compare algorithms, and tune hyperparameters in both research and industry.
Before we start, a heads-up. We're going to build cross-validation from scratch, work through the mechanics by hand on tiny examples, and eventually get into subtleties like temporal leakage, group contamination, and the nested CV trick. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
The trust problem with a single split
Building K-Fold by hand
Choosing k — the bias-variance tradeoff of the estimate itself
Stratified K-Fold — why class balance matters
A rest stop
Time Series Split — when shuffling is a crime
Group K-Fold — when your data has families
The preprocessing trap
Nested CV — honest scores when you're also tuning
Comparing models on the same folds
Reading the tea leaves — what fold scores actually tell you
The Trust Problem with a Single Split
Let's start with something concrete. Imagine we work at a small company and we've collected 2,000 customer support emails. 300 are spam. 1,700 are legitimate. Our job is to build a spam classifier.
The textbook move is to split the data — say 80% for training, 20% for testing — train a model, and measure accuracy on the test set. We do that and get 94.5%. Feels good.
Then a colleague shuffles the data differently and runs the same experiment. She gets 91.2%. Same model, same algorithm, same hyperparameters. A different shuffle, and the accuracy dropped by 3.3 percentage points. We shuffle a third time: 96.1%.
The model didn't change. The data partition changed. And that five-point swing between the worst and best split is not telling us anything about the model. It's telling us about the luck of the draw — which spam emails happened to land in the test set, which legitimate ones happened to land in training.
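You can watch this instability directly. A quick sketch, using scikit-learn's breast cancer dataset as a stand-in for our imaginary emails: same model, same data, five different shuffles.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
for seed in range(5):
    # only the shuffle changes between iterations
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    print(f"Split {seed}: accuracy = {model.score(X_te, y_te):.4f}")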
Think of it this way. If you wanted to know whether a chef is any good, you wouldn't judge them on a single dish served on a single night. Maybe the oven was acting up. Maybe the ingredients that evening were off. You'd want to see them cook multiple meals, on different nights, with different ingredients. That's the intuition behind cross-validation: don't judge your model on one random slice. Judge it on several.
A single test score is an anecdote. A distribution of test scores is evidence.
Building K-Fold by Hand
Let's make this tangible. Forget the 2,000 emails for a moment. Imagine we have ten. Ten emails. I'll label them A through J. Three are spam (C, F, I), seven are legitimate.
We want to evaluate our classifier, but we don't want to waste data, and we don't want to depend on a single lucky split. So we divide the ten emails into five groups — we'll call them folds — with two emails each.
Fold 1: [A, B]. Fold 2: [C, D]. Fold 3: [E, F]. Fold 4: [G, H]. Fold 5: [I, J].
Now we run five rounds. In Round 1, we hold out Fold 1 as our test set and train on Folds 2 through 5 (eight emails). We measure accuracy on Fold 1. In Round 2, we hold out Fold 2 and train on the rest. We keep going until every fold has had its turn as the test set.
At the end, we have five accuracy scores — one per round. We average them. That average is our estimate of how this model would perform on emails it has never seen.
Two things to notice. First, every email gets used for testing exactly once and for training four times. Nothing is wasted. Second, the final estimate doesn't depend on which emails happened to land in which fold, because every fold gets its turn. That's a much better deal than a single 80/20 split, especially when data is scarce.
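To make the rotation concrete, here's a by-hand sketch of those five rounds. The "classifier" is a deliberately silly stand-in that just predicts the majority label of whatever it was trained on; the point is the fold rotation, not the model.
emails = list("ABCDEFGHIJ")
labels = {e: ("spam" if e in "CFI" else "ham") for e in emails}
folds = [emails[i:i + 2] for i in range(0, 10, 2)]  # [A,B], [C,D], [E,F], [G,H], [I,J]

scores = []
for k, test_fold in enumerate(folds, start=1):
    train_emails = [e for e in emails if e not in test_fold]
    # "train": find the majority label among the eight training emails
    majority = max(("spam", "ham"), key=lambda c: sum(labels[e] == c for e in train_emails))
    # "test": accuracy of predicting that label on the two held-out emails
    accuracy = sum(labels[e] == majority for e in test_fold) / len(test_fold)
    scores.append(accuracy)
    print(f"Round {k}: hold out {test_fold}, accuracy {accuracy:.1f}")
print(f"Average across folds: {sum(scores) / len(scores):.2f}")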
This procedure is called K-Fold Cross-Validation, and it's the workhorse of model evaluation. In our toy example, k was 5. In practice, the most common choices are k=5 and k=10. But why those numbers?
Choosing k — The Estimate's Own Bias-Variance Tradeoff
Here's something that tripped me up for a while. When people talk about the bias-variance tradeoff of cross-validation, they're not talking about the model's bias and variance. They're talking about the bias and variance of the performance estimate itself. The CV score is an estimator, and like any estimator, it has its own accuracy and stability.
When k is small — say k=2 — each fold uses only 50% of the data for training. Models trained on half the data are worse than models trained on most of the data. So the per-fold scores are pessimistic. The estimate has high bias — it systematically underestimates how well a model trained on all the data would perform. On the upside, the two training sets are quite different from each other, so the two scores bring genuinely independent information. That keeps the variance of the estimate low.
When k is large — say k = n, which is Leave-One-Out Cross-Validation (LOOCV) — each fold trains on n−1 examples, almost the entire dataset. The bias is tiny. But here's the surprise: the variance of the estimate actually increases. I'll be honest, this felt counterintuitive the first time I encountered it. Training on nearly all the data sounds like it should give you a very stable estimate. But the training sets overlap almost completely — the difference between any two is a single data point — so the models are nearly identical, and the per-fold scores are highly correlated. Averaging a bunch of highly correlated numbers doesn't reduce variance much. And if one data point happens to be influential — an outlier, a mislabeled example — omitting it swings the score hard.
The sweet spot sits in between. Kohavi's 1995 empirical study — still one of the most cited papers on this — found that k=10 tends to give the best tradeoff: low enough bias (training on 90% of the data per fold) and enough diversity between folds to keep variance manageable. k=5 is nearly as good and takes half the time. That's why 5 and 10 have become the community defaults.
In practice, I use 5-fold for exploratory work and early experiments. For final performance reporting — the number that goes in a paper or a stakeholder presentation — I bump it to 10, or use Repeated K-Fold (more on that later).
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
# Fold scores: [0.9649 0.9561 0.9474 0.9649 0.9558]
# Mean accuracy: 0.9578 ± 0.0068
That shuffle=True matters. Without it, folds follow the original row order — which is catastrophic if the data was sorted by class label. The first fold could be all negatives, the last fold all positives. Always shuffle unless your data has a time dimension (which we'll get to).
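If you want to see how bad it gets, sort the rows by label and look at what un-shuffled folds contain (a quick check reusing the breast cancer data from above):
order = np.argsort(y)
X_sorted, y_sorted = X[order], y[order]
for i, (_, test_idx) in enumerate(KFold(n_splits=5).split(X_sorted)):
    # with shuffle=False, each fold is just a contiguous slice of the sorted rows
    print(f"Fold {i}: {y_sorted[test_idx].mean():.0%} positive in the test fold")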
But standard K-Fold has a blind spot. It doesn't care about class balance. If 15% of your emails are spam, one fold might end up with 25% spam and another with 5%. The model trained without the spam-heavy fold barely knows spam exists. We need something smarter.
Stratified K-Fold — Why Class Balance Matters
Back to our spam classifier. Of our 2,000 emails, 300 are spam — that's 15%. In a standard 5-fold split, each fold has 400 emails. On average, each fold should contain about 60 spam emails. But "on average" hides a lot of mischief. By random chance, one fold might get 90 spam emails and another might get 30.
When the fold with 90 spam emails becomes the test set, the model was trained on fewer spam examples and is judged on a test set packed with them, so its score suffers. When the fold with 30 spam emails is the test set, the model looks great — but only because the test set underrepresents the hard cases. The fold scores bounce around not because of the model, but because of the imbalanced splits.
Stratified K-Fold fixes this by ensuring each fold has approximately the same proportion of each class as the full dataset. If the full dataset is 15% spam, each fold will be close to 15% spam. The mechanics are identical to regular K-Fold — five rounds, hold one out, average the scores — but the initial assignment of examples to folds is constrained to preserve class ratios.
For classification tasks, stratified K-Fold should be your default. Not "a nice option." The default.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
# Stratified scores: [0.9561 0.9737 0.9561 0.9649 0.9469]
# Mean accuracy: 0.9596 ± 0.0090
A useful bit of scikit-learn trivia: when you pass a plain integer like cv=5 for a classifier, it silently uses Stratified K-Fold under the hood. For regressors, it uses plain K-Fold. The library is trying to be helpful, and it usually is. But I prefer being explicit — future-me reading this code in six months will thank present-me for not relying on silent magic.
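If you want to check this yourself, the two spellings below should produce identical folds for a classifier (reusing the model and data from above); I just find the second one easier to trust at a glance.
# an integer cv means "stratified, no shuffle" for classifiers
implicit = cross_val_score(model, X, y, cv=5, scoring='accuracy')
explicit = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5), scoring='accuracy')
print(np.allclose(implicit, explicit))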
Stratification solves the class balance problem. But it still assumes something fundamental about the data: that any email can sit next to any other email, that the order doesn't matter, that shuffling is fine. For a lot of real-world problems, that assumption is dead wrong.
If you've made it this far, congratulations. You can stop here if you want. You now have a solid mental model: cross-validation rotates through multiple train/test splits, averages the scores, and gives you a much more trustworthy estimate than a single split ever could. Stratified K-Fold preserves class balance. k=5 or k=10 for the fold count. Report mean and standard deviation.
That mental model will serve you well for 80% of practical situations. The short version of everything that follows is: temporal data needs special splits that respect time, grouped data needs splits that respect groups, hyperparameter tuning needs an extra layer of nesting, and preprocessing must happen inside the fold loop or you'll leak information. There. You're 80% of the way there.
But if the discomfort of not knowing what lies underneath is nagging at you, read on.
Time Series Split — When Shuffling Is a Crime
Let's change the scenario. Instead of a static pile of 2,000 emails, imagine we're receiving emails over time — a hundred per week for twenty weeks. We want to predict whether next week's emails are spam. The data has a timestamp, and that timestamp matters.
If we shuffle these emails and split them into folds, we'll have a situation where our model trains on emails from Week 18 and tests on emails from Week 3. It's using future information to predict the past. That's called temporal leakage, and it's one of the most expensive mistakes in applied ML. I've watched teams chase phantom accuracy for months because of this — their offline metrics looked spectacular, and the production model was useless. The gap was leakage.
Think back to our chef analogy. Temporal leakage is like letting the chef taste the dish after the judges have scored it, then pretending the chef cooked it that way on purpose. The performance looks amazing, but it's fictional.
Time Series Split respects the arrow of time. It uses an expanding training window: Fold 1 trains on the earliest data and tests on the next chunk. Fold 2 trains on everything from Fold 1's training and test data combined, and tests on the next chunk after that. The training set always precedes the test set in time. Always.
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
n_samples = 1000
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(np.arange(n_samples))):
    print(f"Fold {i}: Train [{train_idx[0]}..{train_idx[-1]}] "
          f"({len(train_idx)} samples) → "
          f"Test [{test_idx[0]}..{test_idx[-1]}] "
          f"({len(test_idx)} samples)")
# Fold 0: Train [0..169] (170 samples) → Test [170..335] (166 samples)
# Fold 1: Train [0..335] (336 samples) → Test [336..501] (166 samples)
# Fold 2: Train [0..501] (502 samples) → Test [502..667] (166 samples)
# Fold 3: Train [0..667] (668 samples) → Test [668..833] (166 samples)
# Fold 4: Train [0..833] (834 samples) → Test [834..999] (166 samples)
Notice how the training set grows with each fold. The first fold trains on very little data, so its score might be poor. That's not a problem — it reflects the reality that early in a time series, you don't have much history to learn from.
For financial data, the story gets even more subtle. Marcos López de Prado introduced purged cross-validation in his book Advances in Financial Machine Learning. The idea: even after splitting by time, some training samples might have label windows that overlap with the test period. Purging removes those contaminated training samples. An additional embargo period can be added — a buffer of time between training and test — to guard against delayed information leakage from slow market reactions. If you're doing anything with financial time series, this is required reading.
The rule is iron-clad: if your rows have timestamps, do not shuffle them. Use Time Series Split. If the domain has delayed effects, add purging. There are no exceptions.
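scikit-learn doesn't ship a purged splitter, but TimeSeriesSplit's gap parameter gives you the embargo half of the idea: a fixed buffer of samples dropped between the end of training and the start of testing. A minimal sketch; full purging, which removes training samples whose label windows overlap the test period, still needs a custom splitter.
tscv = TimeSeriesSplit(n_splits=5, gap=20)
for i, (train_idx, test_idx) in enumerate(tscv.split(np.arange(n_samples))):
    # gap=20 drops the 20 samples between training and test in every fold
    print(f"Fold {i}: train ends at {train_idx[-1]}, test runs [{test_idx[0]}..{test_idx[-1]}]")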
Group K-Fold — When Your Data Has Families
Back to our email classifier. A new wrinkle: it turns out that 50 of our users are power users who each contributed dozens of emails to our dataset. If Alice's emails appear in both training and testing, the model might learn to recognize Alice's writing style rather than learning what makes an email spam. Performance on Alice's test emails looks great. Performance on a brand-new user — someone the model has never seen — is much worse.
This is group leakage. The model learned to recognize the source rather than the pattern. It shows up everywhere: medical imaging where one patient contributes multiple scans, NLP datasets where one author writes multiple documents, sensor data where one device generates thousands of readings.
Group K-Fold keeps all data from the same group in the same fold. All of Alice's emails land together — either all in training or all in testing, never split across the two. You provide a groups array that tells the splitter which rows belong together.
from sklearn.model_selection import GroupKFold
import numpy as np
# 100 samples from 20 users (5 emails each)
groups = np.repeat(np.arange(20), 5)
gkf = GroupKFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(gkf.split(X[:100], y[:100], groups)):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    assert train_groups.isdisjoint(test_groups)
    print(f"Fold {i}: Train users={sorted(train_groups)}")
    print(f"        Test users={sorted(test_groups)}")
The question to ask yourself: "Could the model be recognizing the source rather than the pattern?" If the answer is yes — or even maybe — you need group-aware splitting. Multiple essays from the same student, multiple transactions from the same customer, multiple frames from the same video. If there's a natural grouping, respect it.
The Preprocessing Trap
This is the mistake that gets the most experienced people. I still catch myself almost making it from time to time.
Here's the scenario. Before training your model, you need to scale your features — subtract the mean, divide by the standard deviation. Sensible. You do this:
# THE WRONG WAY — don't do this
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # fits on ALL data, including future test folds
scores = cross_val_score(model, X_scaled, y, cv=5)
The scaler learned the mean and standard deviation from all the data — including the examples that will later become the test fold. Every test fold has been subtly contaminated. The model's features were computed using information it shouldn't have had access to.
The magnitude of this leak depends on the dataset. Sometimes it barely matters. Sometimes it's the difference between a model that looks production-ready and one that falls apart on real data. A healthcare team building a sepsis prediction model discovered this the hard way — their cross-validated metrics looked excellent, their deployed model performed significantly worse, and the root cause was imputation computed on the full dataset before splitting.
The fix is elegant: use a Pipeline. It wraps preprocessing and modeling into a single object. When cross-validation holds out a fold, the pipeline fits the scaler only on the training folds and transforms the test fold using those training-only statistics. No leakage.
# THE RIGHT WAY
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
This applies to everything that learns from data: scalers, imputers, encoders, feature selectors, dimensionality reduction. If it has a fit() method, it needs to be inside the pipeline. No exceptions.
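The same pattern scales to longer chains. A sketch with steps chosen purely for illustration (imputation, scaling, univariate feature selection), each of which is re-fit from scratch on the training folds of every round:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    SimpleImputer(strategy='median'),   # learns medians from the training folds only
    StandardScaler(),                   # learns mean/std from the training folds only
    SelectKBest(f_classif, k=10),       # selects features using the training folds only
    LogisticRegression(max_iter=5000),
)
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')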
Nested CV — Honest Scores When You're Also Tuning
Here's a subtle trap that catches a lot of people, and I'll admit it caught me too.
You use 5-fold CV to try different hyperparameter combinations — tree counts, max depth, regularization strength. For each combination, you record the average CV score. You pick the winner. Then you report that winning score as your model's expected performance.
Is that honest? No. And here's why.
You searched through, say, 12 hyperparameter combinations and picked the one that happened to score highest on those particular folds. Even if each individual CV was fair, the act of selecting the best introduces optimistic bias. It's the same logic as testing on your training set — but at one level of abstraction higher. The hyperparameters were fit to the CV folds, and now you're evaluating on those same folds.
Back to the chef analogy. It's like asking the chef to cook five different dishes, then reporting the score of the best one as their average skill level. The best dish is an optimistic representation.
Nested cross-validation fixes this with two loops. The outer loop splits data into train/test folds for unbiased evaluation. The inner loop runs inside each outer training set to tune hyperparameters. The outer test fold is never touched during tuning. So the outer scores are honest — they reflect true out-of-sample performance, including the cost of having to choose hyperparameters.
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Inner loop: tune hyperparameters
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None]
}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=inner_cv,
scoring='accuracy',
n_jobs=-1
)
# Outer loop: evaluate generalization honestly
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV scores: {nested_scores}")
print(f"Unbiased estimate: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
When do you need this? Whenever you're both tuning hyperparameters and reporting final performance in the same experiment. If you have a completely separate held-out test set that you never use for any decisions, single-level CV for tuning is fine. But if your paper or dashboard quotes the CV score as "expected performance," that number must come from the outer loop. The cost is computational — 5 outer folds × 3 inner folds × 12 hyperparameter combos = 180 model fits. Worth it for honesty.
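One addition I find worth the extra lines: cross_validate with return_estimator=True hands back the fitted GridSearchCV from each outer fold, so you can see whether the inner loop keeps choosing the same hyperparameters or jumps around.
from sklearn.model_selection import cross_validate

results = cross_validate(grid_search, X, y, cv=outer_cv, scoring='accuracy',
                         return_estimator=True)
for i, fitted in enumerate(results['estimator']):
    # each entry is the GridSearchCV fitted on that outer fold's training data
    print(f"Outer fold {i}: best params = {fitted.best_params_}, "
          f"score = {results['test_score'][i]:.4f}")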
Comparing Models on the Same Folds
If you're comparing two models — say Random Forest versus Logistic Regression — running them on different random folds is comparing apples to oranges. One model might get lucky folds. The other might get hard ones. The difference you see comes from the splits, not the algorithms.
The fix is dead simple: use the same fold assignments for both models. Then the difference between their scores on each fold is a genuine head-to-head comparison on identical data.
Repeated K-Fold makes this comparison even sharper. It runs the entire K-Fold procedure multiple times with different shuffles — 10 repeats of 5-fold gives you 50 paired scores. More data points means more statistical power to detect small differences.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
from scipy import stats
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, random_state=42))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
rf_scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')
lr_scores = cross_val_score(lr, X, y, cv=cv, scoring='accuracy')
print(f"Random Forest: {rf_scores.mean():.4f} ± {rf_scores.std():.4f}")
print(f"Logistic Regression: {lr_scores.mean():.4f} ± {lr_scores.std():.4f}")
diff = rf_scores - lr_scores
print(f"\nPer-fold diff (RF - LR): {diff.mean():.4f} ± {diff.std():.4f}")
t_stat, p_value = stats.ttest_rel(rf_scores, lr_scores)
print(f"Paired t-test: t={t_stat:.3f}, p={p_value:.4f}")
The per-fold difference is the most informative number. If Random Forest wins on 45 out of 50 folds, that's convincing even if the average gap is small. If it wins on 28 and loses on 22, the models are equivalent in practice — pick the simpler one.
A word of caution. With 50 repeated folds, you have a lot of statistical power. A 0.2% accuracy difference can register as "statistically significant" at p < 0.05. That doesn't mean it matters. A 0.2% accuracy difference is invisible to users, invisible in production, and will vanish the moment the data distribution shifts. Always pair the p-value with the actual magnitude. If two models are within noise of each other, prefer the simpler one.
One more subtlety: the standard paired t-test assumes fold scores are independent. They're not — folds share training data. For repeated K-Fold, the Nadeau–Bengio corrected test adjusts for this dependence. In practice, the uncorrected test gives directionally correct answers, but if you're writing a paper, use the correction.
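As far as I know there's no built-in for the corrected test in scipy or scikit-learn, but it's a few lines on top of the paired differences. A sketch of my reading of it: inflate the variance of the mean difference by a factor of (1/J + n_test/n_train), where J is the number of paired scores.
def corrected_ttest(diff, n_train, n_test):
    # Nadeau-Bengio corrected resampled t-test (sketch)
    J = len(diff)
    var_d = diff.var(ddof=1)
    t = diff.mean() / np.sqrt((1 / J + n_test / n_train) * var_d)
    p = 2 * stats.t.sf(abs(t), df=J - 1)
    return t, p

# 569 samples in 5 folds: roughly 455 train / 114 test per split
t_corr, p_corr = corrected_ttest(diff, n_train=455, n_test=114)
print(f"Corrected paired t-test: t={t_corr:.3f}, p={p_corr:.4f}")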
Reading the Tea Leaves — What Fold Scores Tell You
Most people compute the mean CV score, maybe the standard deviation, and move on. That's throwing away information.
Always print the individual fold scores. If four folds score 0.95 and one scores 0.82, that one bad fold is more interesting than the mean. Something about that slice of data is different — a subpopulation the model struggles with, a cluster of noisy labels, a distribution shift. Understanding why a fold is hard teaches you more about your problem than the average ever will.
Here's what different patterns of fold scores tend to mean:
If fold scores have high variance on small data, each fold is a small, noisy sample. More data is the real fix. Repeated K-Fold is the band-aid — it averages over more random shuffles and gives you a tighter confidence interval.
If one or two folds are dramatically worse than the rest, there's likely a subpopulation or context shift in the data. Some folds contain a cluster of unusual examples — a different region, a different time period, a different user segment. This is valuable signal. It tells you your model's performance varies by context, and that's exactly what you'll see in production.
If fold scores are suspiciously uniform and the mean is suspiciously high, check for data leakage. When information leaks from test to train, the model performs consistently well everywhere — too well. This is the scariest pattern because it looks like success.
I've developed a habit of always looking at the worst fold first. If the worst fold is acceptable for my use case, the model is ready. If it's not, the mean doesn't matter.
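In code, that habit is three extra lines (reusing the stratified splitter and pipeline from earlier):
scores = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')
print(f"All folds:  {np.round(scores, 4)}")
print(f"Mean ± std: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Worst fold: {scores.min():.4f}")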
The Decision Guide
| Technique | When to Reach for It | What It Gives You | What It Costs |
|---|---|---|---|
| Hold-Out | Large datasets (100k+ rows), quick sanity checks | Speed, simplicity | Unstable estimates on small data; wastes test data |
| K-Fold | General-purpose default for regression | Stable estimate; every sample tested once | Ignores class balance, time, and groups |
| Stratified K-Fold | Classification — this is your actual default | Preserves class ratios; stable across folds | Classification only; doesn't handle time or groups |
| LOOCV | Tiny datasets (<100 samples) | Maximum training data; deterministic | High variance as an estimator; computationally brutal |
| Time Series Split | Any data with temporal ordering | Prevents future→past leakage | Earlier folds train on less data |
| Group K-Fold | Data with natural groups (patients, users, devices) | Prevents group-level leakage | Uneven fold sizes; need a meaningful group label |
| Nested CV | Tuning + reporting performance in same experiment | Unbiased performance despite hyperparameter search | Expensive (outer × inner × param combos) |
| Repeated K-Fold | Final model comparisons; tight confidence intervals | Lower variance estimate; more statistical power | n_repeats × k models to train |
The decision tree in your head: Does the data have a time component? Time Series Split. Does the data have groups? Group K-Fold. Classification? Stratified K-Fold. Regression? K-Fold. Also tuning hyperparameters? Wrap it in nested CV. Need to compare two models? Use the same folds and look at paired differences.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a simple observation — that a single train/test split gives you a number that depends more on the luck of the shuffle than on the quality of the model. We built K-Fold from scratch on ten emails. We explored why k=5 and k=10 are the standard choices, saw why stratification matters for classification, and then ventured into the places where standard CV breaks down: temporal data that demands respect for the arrow of time, grouped data where sources must not be split, preprocessing that must happen inside the fold loop, and the nested CV trick for honest reporting when you're also tuning.
My hope is that the next time you see a cross-validation score, instead of treating it as a black-box number to be compared against a threshold, you'll pause and ask the questions that matter: Was the split appropriate for this data? Did preprocessing happen inside or outside the fold? Is this score from the inner loop or the outer loop? And what does the worst fold look like? Those questions — not the mean accuracy — are what separate a practitioner who understands model evaluation from one who is going through the motions.
What You Should Now Be Able To Do
- Explain why a single train/test split produces unreliable estimates and what cross-validation does about it
- Walk through K-Fold mechanics on a toy example and explain why each data point gets used for both training and testing
- Articulate the bias-variance tradeoff of the CV estimator itself — why k=2 has high bias, why LOOCV has high variance, and why k=5 or k=10 sits in the sweet spot
- Use Stratified K-Fold as the default for classification and explain why class balance in folds matters
- Apply TimeSeriesSplit for temporal data, explain temporal leakage, and know when purged CV is needed
- Recognize when Group K-Fold is necessary and identify group leakage in real-world datasets
- Put all preprocessing inside a Pipeline to prevent fit/transform leakage during cross-validation
- Set up nested CV to get unbiased performance estimates while tuning hyperparameters
- Compare two models on the same folds, interpret paired differences, and distinguish statistical from practical significance
- Read individual fold scores diagnostically — high variance, one bad fold, suspiciously uniform scores — and know what each pattern means