Decision Trees
The Game of Twenty Questions
I avoided taking decision trees seriously for longer than I'd like to admit. They looked too simple. A flowchart? That's a machine learning model? Every time someone mentioned them in the same breath as neural networks or SVMs, I'd nod politely and move on. Eventually the discomfort of not understanding why trees form the backbone of every Kaggle-winning tabular model — Random Forests, XGBoost, LightGBM — grew too strong to ignore, and I finally took the dive. This post is that dive.
Decision trees were formalized as a learning algorithm by Leo Breiman and colleagues in 1984 with their CART book (Classification and Regression Trees), though CART wasn't the only lineage: Ross Quinlan, working in parallel, published ID3 in 1986 and later C4.5 in 1993. Today, trees are everywhere: credit scoring, medical diagnosis, fraud detection, recommendation systems. Not because they're the most accurate model on their own, but because they're the atom that ensemble methods are built from.
Before we start, a heads-up. We're going to work through how trees choose splits, walk through the math of Gini impurity and entropy, and get into pruning and feature importance. But you don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Contents
The Game of Twenty Questions
A Tiny Fruit Stand
Measuring Mess: Gini Impurity
The Other Yardstick: Entropy and Information Gain
The CART Algorithm — Greedy and Proud of It
Rest Stop
Pruning: The Art of Knowing When to Stop
Feature Importance — and Its Trap
The Variance Problem: Why One Tree Is Never Enough
What Trees Can't Do
Regression Trees: Same Idea, Different Score
The History We Inherited
Wrap-Up
Resources
A Tiny Fruit Stand
Imagine you run a fruit stand. Customers dump fruit into a bin and you need to sort it: apple, orange, or lemon. You have three measurements for each piece of fruit — its color (on a redness scale from 0 to 10), its diameter in centimeters, and its surface texture (smooth = 0, bumpy = 1). Here are six fruits:
Fruit | Color (redness, 0–10) | Diameter (cm) | Texture (0 = smooth, 1 = bumpy) | Label
------|-----------------------|---------------|--------------------------------|-------
1     | 8                     | 7.0           | 0                              | Apple
2     | 3                     | 6.5           | 1                              | Orange
3     | 7                     | 7.2           | 0                              | Apple
4     | 2                     | 5.0           | 0                              | Lemon
5     | 4                     | 6.8           | 1                              | Orange
6     | 1                     | 4.5           | 0                              | Lemon
A decision tree approaches this the way a child plays Twenty Questions. It asks a single yes-or-no question about one feature, splits the fruit into two piles based on the answer, and then asks another question about each pile. Each question narrows the possibilities. Each answer guides you down a different branch.
For our fruit, the tree might first ask: "Is color > 5?" The fruits with high redness (fruits 1 and 3, both apples) go left. Everything else goes right. Then for the right pile, it might ask: "Is texture == 1?" Bumpy fruits (2 and 5, both oranges) go one way. Smooth, low-color fruits (4 and 6, both lemons) go the other.
Is color > 5?
├── Yes → Apple (fruits 1, 3)
└── No → Is texture == 1?
├── Yes → Orange (fruits 2, 5)
└── No → Lemon (fruits 4, 6)
That's a decision tree. Two questions, perfectly sorted fruit. Each internal node asks a question about one feature. Each branch is a yes-or-no answer. Each leaf is a prediction. You could draw this on a napkin and anyone — engineer or not — could follow the logic. Try doing that with a neural network.
But here's the question that makes trees interesting: how did the tree know to ask about color first, and not diameter? How did it pick the threshold of 5 instead of 4 or 6? The tree didn't guess. It tried every possible question and picked the one that produced the cleanest split. To understand what "cleanest" means, we need a way to measure mess.
Measuring Mess: Gini Impurity
Before the tree asks any questions, all six fruits sit in one pile: 2 apples, 2 oranges, 2 lemons. That pile is maximally mixed. If you randomly grabbed a fruit and randomly guessed its label based on the proportions in the pile, you'd be wrong a lot. We need a number that captures that messiness.
Gini impurity does exactly this. It measures the probability that a randomly chosen sample would be misclassified if you labeled it randomly according to the class proportions at that node. The formula is:
Gini = 1 - Σ(pᵢ²)
where pᵢ is the proportion of class i. Let's work through it for our starting pile. We have three classes, each with proportion 2/6 = 1/3.
Gini = 1 - ((1/3)² + (1/3)² + (1/3)²)
= 1 - (1/9 + 1/9 + 1/9)
= 1 - 3/9
= 1 - 0.333
= 0.667
That's the maximum possible mess for three equally likely classes. Now suppose we split on "color > 5." The left child gets fruits 1 and 3 — both apples. Its Gini is:
Gini_left = 1 - (1.0²) = 0.0
Zero. A pure node. All one class. The right child gets fruits 2, 4, 5, 6 — two oranges and two lemons.
Gini_right = 1 - ((2/4)² + (2/4)²)
= 1 - (0.25 + 0.25)
= 0.5
To evaluate this split overall, we take the weighted average of the children's impurity, weighted by how many samples went each way. Left got 2 out of 6 samples, right got 4 out of 6.
Gini_split = (2/6) × 0.0 + (4/6) × 0.5
= 0 + 0.333
= 0.333
We started at 0.667 and dropped to 0.333. That's a Gini reduction of 0.334. The tree tries every possible split on every feature, computes this reduction for each one, and picks the split with the biggest drop. That's the split that creates the "cleanest" child nodes — where the fruit piles are closest to being all one type.
A Gini of 0 means a perfectly pure node. A Gini of 0.5 (for binary) or higher (for multi-class) means maximum confusion. The tree is chasing zeros.
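If you want to check these numbers yourself, here's a minimal NumPy sketch (gini and the pile variables are just illustrative names, not sklearn API):

import numpy as np

def gini(labels):
    # 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pile = ["apple", "apple", "orange", "orange", "lemon", "lemon"]
left = ["apple", "apple"]                       # color > 5
right = ["orange", "orange", "lemon", "lemon"]  # color <= 5

weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pile)
print(gini(pile))             # ~0.667: the mixed starting pile
print(weighted)               # ~0.333: weighted impurity after the split
print(gini(pile) - weighted)  # ~0.333: the reduction the tree chases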
The Other Yardstick: Entropy and Information Gain
I'll be honest — I spent way too long early on trying to figure out when to use Gini versus entropy, convinced there was some deep secret about when each was better. There isn't. But entropy is worth understanding because it comes from information theory and gives a different angle on the same idea.
Entropy measures uncertainty. If you're about to pick a fruit from a pile and you have no idea what it'll be, that's high entropy. If the pile is all apples, you know exactly what you'll get — zero entropy. The formula:
Entropy = -Σ(pᵢ × log₂(pᵢ))
For our starting pile of 2 apples, 2 oranges, 2 lemons:
Entropy = -(1/3 × log₂(1/3)) × 3
= -(1/3 × (-1.585)) × 3
= 1.585
For a pure node (all one class), entropy is 0. For binary classification with a 50/50 split, entropy is 1.0. The key difference from Gini is the scale — entropy can go above 1.0 for multi-class problems — but the shape is similar. Both are concave curves that peak when classes are evenly distributed and hit zero when a node is pure.
Information gain is the parent's entropy minus the weighted average of the children's entropy. It measures how much uncertainty the split eliminates. The split that eliminates the most uncertainty wins.
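Here's the same fruit-pile arithmetic with entropy, continuing the sketch above (again, entropy is an illustrative helper, not library code):

import numpy as np

def entropy(labels):
    # -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ["apple", "apple", "orange", "orange", "lemon", "lemon"]
left, right = ["apple", "apple"], ["orange", "orange", "lemon", "lemon"]

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent))             # ~1.585 bits of uncertainty at the root
print(entropy(parent) - weighted)  # information gain of the color split, ~0.92 bits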
Here's the thing that took me too long to internalize: in practice, Gini and entropy produce nearly identical trees. Research has shown they disagree on the chosen split in less than 2% of cases. Gini is marginally faster because it skips the logarithm. Scikit-learn defaults to Gini. That default is fine. If someone in an interview asks you when to prefer one over the other, the honest answer is that it almost never matters — and that honesty will land better than a made-up rule of thumb.
There's a historical wrinkle worth knowing. Quinlan's ID3 algorithm (1986) used information gain, but it had a flaw: information gain is biased toward features with many distinct values. A feature like "customer_id" with thousands of unique values could produce incredible information gain by trivially splitting into pure nodes that each contain one sample. Quinlan fixed this in C4.5 (1993) with gain ratio — information gain divided by the feature's own entropy (called split information), which penalizes features that scatter data into many small groups. Breiman's CART, meanwhile, used Gini impurity and sidestepped the problem entirely by always making binary splits.
The CART Algorithm — Greedy and Proud of It
With a way to measure mess in hand, we can describe the algorithm that actually builds the tree. CART — Classification and Regression Trees — was published by Breiman, Friedman, Olshen, and Stone in 1984, and it remains the foundation of scikit-learn's DecisionTreeClassifier today.
Here's what CART does at every node. It looks at each feature in turn. For a continuous feature, it sorts the values and considers every midpoint between adjacent values as a candidate threshold. For our fruit's color feature with values [1, 2, 3, 4, 7, 8], the candidate thresholds would be 1.5, 2.5, 3.5, 5.5, and 7.5. For each candidate, it splits the data left and right, computes the weighted Gini impurity of the two children, and records the reduction from the parent. After trying every threshold on every feature, it picks the (feature, threshold) pair that produced the biggest impurity reduction. Then it splits the data and repeats the whole process on each child node, stopping when a node is pure, or when some limit is hit.
Let's trace through one step with our fruit stand. At the root, we have all six fruits. CART considers every possible split:
Feature: Color
threshold 1.5 → left: {Lemon} right: {Apple,Orange,Apple,Lemon,Orange}
threshold 2.5 → left: {Lemon,Lemon} right: {Apple,Orange,Apple,Orange}
threshold 3.5 → left: {Orange,Lemon,Lemon} right: {Apple,Apple,Orange}
threshold 5.5 → left: {Orange,Lemon,Orange,Lemon} right: {Apple,Apple}
...
Feature: Diameter
(similarly, tries every midpoint)
Feature: Texture
threshold 0.5 → left: {Apple,Apple,Lemon,Lemon} right: {Orange,Orange}
For each of these, CART computes the Gini reduction. The split "color > 5.5" (the midpoint form of our earlier "Is color > 5?" question) puts the two apples in one pure child (Gini = 0) and the two oranges and two lemons in the other (Gini = 0.5), for a weighted reduction of 0.334. On a dataset this tiny, several other candidates happen to achieve exactly the same reduction; CART breaks such ties arbitrarily, so say it picks color.
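The whole search is small enough to write out. Below is a bare-bones sketch of the idea (best_split is a hypothetical helper, not sklearn's API; labels are encoded 0 = apple, 1 = orange, 2 = lemon):

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every feature and every midpoint threshold; keep the biggest reduction
    parent, best = gini(y), (None, None, 0.0)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2:  # midpoints between sorted values
            mask = X[:, f] <= t
            left, right = y[mask], y[~mask]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent - weighted > best[2]:
                best = (f, t, parent - weighted)
    return best

X = np.array([[8, 7.0, 0], [3, 6.5, 1], [7, 7.2, 0],
              [2, 5.0, 0], [4, 6.8, 1], [1, 4.5, 0]])
y = np.array([0, 1, 0, 2, 1, 2])
print(best_split(X, y))  # several splits tie at ~0.333; this keeps the first one found

Real implementations are smarter about the bookkeeping (sort once per feature, update class counts incrementally instead of recomputing impurity from scratch), but the logic is the same exhaustive loop.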
This process is greedy. At each node, CART picks the locally best split without any consideration of what splits might come later. A split that looks mediocre right now might enable a brilliant split two levels down — but CART will never find it. Finding the globally optimal tree is NP-hard (proven by Hyafil and Rivest in 1976), so greedy is what we have.
The fact that finding the optimal tree is NP-hard still bothers me, if I'm being honest. It means every decision tree you've ever trained is an approximation. A good approximation — good enough that trees and their ensembles dominate tabular machine learning. But an approximation nonetheless.
The greedy nature has a practical consequence that matters a lot: the first split casts a long shadow. If one extra data point changes the root split from "color > 5.5" to "diameter > 6.0," every subsequent split in the tree changes too. The entire structure cascades from that one decision. We'll come back to this instability later — it's going to turn from a weakness into the most important property of trees.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
At this point you have a working mental model of what a decision tree is and how it's built: a series of yes-or-no questions about features, chosen greedily to maximize purity at each step, using Gini impurity (or entropy) to measure how "mixed" a node is. CART tries every feature and every threshold, picks the best split, and recurses. That mental model is genuinely useful — it'll let you explain trees in an interview, understand what sklearn is doing under the hood, and reason about why trees behave the way they do.
But it doesn't tell the complete story. An unchecked tree will memorize your training data. Feature importance scores can lie to you. And the most important property of decision trees — their instability — is something we haven't confronted yet. That instability is what makes ensemble methods work, and you can't really understand Random Forests or XGBoost without understanding why a single tree is so fragile.
If the discomfort of not knowing what's underneath is nagging at you, read on.
Pruning: The Art of Knowing When to Stop
Left unchecked, our tree algorithm will keep splitting until every leaf contains exactly one class — or even exactly one sample. On our tiny fruit stand this gives perfect accuracy. On a real dataset with ten thousand samples and a hundred features, it gives a nightmarish tree with thousands of leaves that has memorized every quirk and bit of noise in the training data. Training accuracy: 100%. Test accuracy: garbage.
Think of it like a gardener who never prunes a rose bush. The bush grows wild, with branches reaching in every direction, wasting energy on growth that produces no flowers. Pruning cuts back the excess to focus energy where it counts. Decision tree pruning works the same way — it cuts back branches that capture noise instead of signal.
There are two strategies, and they correspond to two different philosophies.
Pre-pruning says: don't let the tree grow wild in the first place. Set limits before training begins. The main controls are max_depth (how many layers deep the tree can go), min_samples_split (the minimum number of samples a node must have to be split further), and min_samples_leaf (the minimum number of samples that must end up in each leaf). Of these, max_depth is the single most important. A depth-5 tree has at most 2⁵ = 32 leaves. A depth-20 tree has over a million. The difference between those two trees isn't a matter of degree — it's the difference between an interpretable model and a lookup table.
The risk with pre-pruning is stopping too early. A split that looks useless at depth 3 might enable a powerful split at depth 4. By refusing to grow past depth 3, you'll never discover it. It's like telling a gardener to stop pruning at exactly 3 feet, regardless of where the best buds are.
Post-pruning takes the opposite approach: let the tree grow to its full wild glory, then prune it back. The most common variant is cost-complexity pruning, implemented in scikit-learn through the ccp_alpha parameter. The idea is captured in a single formula:
R_α(T) = R(T) + α × |T|
where R(T) is the tree's misclassification rate, |T| is the number of leaves, and α (alpha) is a complexity penalty. Larger α means you penalize complexity more — more pruning, simpler tree. When α = 0, there's no penalty and you get the full unpruned tree. As α increases, subtrees get collapsed back into leaves whenever the accuracy gained by keeping them isn't worth the complexity.
The elegant part is how you find the right α: you don't guess. Scikit-learn's cost_complexity_pruning_path computes a sequence of α values where the tree changes structure — each value corresponds to a different tree size. You cross-validate across these candidates and pick the α that gives the best validation performance. It's like the tree presenting a menu of its own possible sizes and asking the data to choose.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
# Grow full tree to compute the pruning path
full_tree = DecisionTreeClassifier(random_state=42)
path = full_tree.cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1] # last alpha gives a single-node tree
# Cross-validate each candidate alpha
cv_scores = [
cross_val_score(
DecisionTreeClassifier(ccp_alpha=a, random_state=42),
X, y, cv=5, scoring='accuracy'
).mean()
for a in alphas
]
best_alpha = alphas[np.argmax(cv_scores)]
print(f"Best alpha: {best_alpha:.4f}, CV accuracy: {max(cv_scores):.4f}")
# Train the final pruned tree
tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
tree.fit(X, y)
print(f"Depth: {tree.get_depth()}, Leaves: {tree.get_n_leaves()}")
This produces a tree that's typically 4–8 levels deep instead of 20+. The pruning path hands you a menu. Cross-validation picks the right item. The gardener analogy comes back: post-pruning is like letting the bush grow freely, then stepping back, looking at the whole thing, and cutting the branches that aren't producing flowers. It can see the full structure before deciding what to cut, which is why it often produces better results than pre-pruning alone.
In practice, many experienced practitioners combine both: set a reasonable max_depth (say, 10–15) to keep training fast, then apply ccp_alpha to fine-tune the result.
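Concretely, assuming the best_alpha found by the cross-validation above, combining the two is just both knobs at once:

# Pre-prune for speed, post-prune for quality (max_depth=12 is an arbitrary cap)
tree = DecisionTreeClassifier(max_depth=12, ccp_alpha=best_alpha, random_state=42)
tree.fit(X, y)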
Feature Importance — and Its Trap
Every time the tree makes a split, it records how much that split reduced impurity. After training, you can sum up the total impurity reduction contributed by each feature, normalize so they add up to 1.0, and get a ranked list of which features mattered most. This is called impurity-based feature importance, and it comes free with every trained tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# Total impurity reduction credited to each feature, largest first
for name, importance in sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda x: -x[1]
)[:5]:
    print(f"{name:30s} {importance:.4f}")
This is genuinely useful for getting a quick sense of what's driving your model. But there's a trap, and it's one that catches experienced practitioners too.
Impurity-based importance is biased toward features with many unique values. Think about why. A feature with 1,000 unique values gives the tree 999 candidate split points. A binary feature gives it one. More candidates means more chances for at least one of them to produce a large impurity reduction, even by luck. A completely random column with unique values for every row — something like a customer ID — can appear "important" because it trivially splits the data into pure single-sample nodes.
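You can watch this bias happen with pure noise. In the sketch below (synthetic data, nothing to do with our fruit), the labels are random, so no feature is genuinely predictive; the high-cardinality column still claims almost all the importance simply because it has enough thresholds to memorize the noise:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),  # binary feature: one possible split
    rng.random(n),          # ~unique values: ~199 possible splits
])
y = rng.integers(0, 2, n)   # random labels: nothing is truly predictive

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)  # expect the second column to dominate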
The fix is permutation importance. The idea is simple and elegant: after training the tree, take one feature at a time and shuffle its values across samples, breaking the relationship between that feature and the target. Then measure how much accuracy drops. If shuffling a feature tanks performance, it was genuinely important. If accuracy barely budges, it wasn't — regardless of how many split points it had.
from sklearn.inspection import permutation_importance

result = permutation_importance(tree, X_test, y_test,
                                n_repeats=10, random_state=42)
for name, importance in sorted(
    zip(feature_names, result.importances_mean),
    key=lambda x: -x[1]
)[:5]:
    print(f"{name:30s} {importance:.4f}")
Permutation importance measures actual predictive power, not the statistical accident of having many split candidates. It's what you should report when someone asks "which features matter?" — especially in a production context where decisions are being made based on those rankings.
There's a deeper approach too: SHAP values (SHapley Additive exPlanations) come from game theory and give per-sample, per-feature attribution. For tree models specifically, TreeSHAP computes these in polynomial time, making it practical even on large datasets. But permutation importance is a solid default when you need quick, reliable rankings.
The Variance Problem: Why One Tree Is Never Enough
This is the most important thing to understand about decision trees. Everything else — the ensembles, the boosting, the billion-dollar Kaggle-winning pipelines — traces back to this single property.
A decision tree is a high-variance, low-bias learner. It can carve out arbitrarily complex decision boundaries (low bias), but it's exquisitely sensitive to the particular training data it sees (high variance). Remove five samples, add five new ones, and you might get a completely different tree — different root feature, different thresholds, different structure, different predictions.
I still get surprised by this sometimes. I'll retrain a tree on a dataset that differs by a handful of rows and the resulting tree looks nothing like the original. Not "a little different." Unrecognizably different.
The reason traces back to the greedy algorithm. Remember: the root split casts a long shadow. Imagine two features are nearly tied at the root — feature A gives a Gini reduction of 0.334 and feature B gives 0.331. A few data points changing hands can flip which one wins. Once the root is different, the two child nodes contain different subsets. The splits at the next level are chosen from different pools of data. The cascade continues downward. By the time you reach the leaves, the two trees share almost nothing.
Think of our Twenty Questions analogy again. If the first question changes from "Is it a mammal?" to "Does it live in water?", every follow-up question has to change too. The whole game takes a different path from a single altered starting point.
Why does this matter? Because a model that changes dramatically with small data perturbations can't be trusted on its own. Two trees trained on different random samples of the same population will agree on broad trends but disagree wildly on specific predictions. That disagreement is variance, and variance means unreliable predictions.
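You can see this directly by fitting the same model to bootstrap resamples of the same dataset. A sketch, reusing the breast cancer data from earlier (tree_.feature[0] and tree_.threshold[0] describe the root split): the depth and leaf count differ on essentially every resample, and the root split itself often flips between near-tied features.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(7)

for trial in range(3):
    idx = rng.integers(0, len(X), len(X))  # bootstrap: sample rows with replacement
    t = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(f"root feature {t.tree_.feature[0]:2d}, "
          f"threshold {t.tree_.threshold[0]:8.3f}, "
          f"depth {t.get_depth()}, leaves {t.get_n_leaves()}")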
Here's the insight that changed tree-based machine learning forever: if individual trees have low bias but high variance, what if you averaged many of them? Averaging independent estimates reduces variance without increasing bias — that's a statistical fact that predates machine learning by centuries. That idea became Random Forests (Breiman, 2001): train hundreds of trees on different bootstrap samples of the data, with random subsets of features at each split, and average their predictions. The individual trees are wild and unstable. The average is stable and powerful.
And what if, instead of averaging independent trees, you built each new tree to specifically fix the mistakes of the previous ones? That's Gradient Boosting (Friedman, 1999), and it's what powers XGBoost, LightGBM, and CatBoost — the models that dominate structured data in industry and competitions alike.
The instability of a single decision tree isn't a defect. It's the raw material that ensemble methods need. Without variance, there would be nothing to average out. Without diversity, there would be nothing to boost. The weakness of the single tree is what makes the forest strong.
What Trees Can't Do
Decision trees have real limitations, and understanding them matters more than memorizing strengths.
Trees cannot extrapolate. This is fundamental and often overlooked. If your training data has house prices ranging from $100k to $500k, a regression tree will never predict $600k. Every leaf prediction is the mean of the training samples that ended up in that leaf, and no mean of numbers between 100k and 500k will ever exceed 500k. A linear model, by contrast, can extend its line beyond the training range. For time series data or any domain where future values can exceed historical ranges, a standalone tree will fail. This limitation carries over to Random Forests too — averaging a bunch of trees that can't extrapolate produces an ensemble that can't extrapolate.
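Here's a small demonstration with a synthetic trend (toy data, not real prices): the tree tracks y = 2x inside the training range and flatlines beyond it.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)  # training inputs span [0, 10]
y = 2 * X.ravel()                          # a clean upward trend

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[5.0], [10.0], [20.0], [100.0]]))
# the last three are identical: predictions never leave the training range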
Trees draw staircase boundaries. Every split is perpendicular to a single feature axis — "Is color > 5?" draws a vertical line in feature space. A tree can never draw a diagonal boundary directly. If the true separation between apples and oranges runs at 45 degrees through the space of color and diameter, the tree approximates it with a staircase: a series of horizontal and vertical cuts that zigzag along the true boundary. More depth gives a finer staircase, but also more overfitting. This is the axis-aligned split limitation. Oblique decision trees, which split on linear combinations of features (like "0.3 × color + 0.7 × diameter > 4"), can draw diagonal boundaries with fewer nodes, but they're harder to train, harder to interpret, and rarely used in practice — ensembles of axis-aligned trees tend to work better.
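A quick way to feel the staircase, using a synthetic diagonal boundary (exact accuracies will vary with the seed): the tree needs ever more depth to chase a line that a linear model draws immediately.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, Xt = rng.random((1000, 2)), rng.random((1000, 2))
y, yt = (X[:, 0] > X[:, 1]).astype(int), (Xt[:, 0] > Xt[:, 1]).astype(int)

for depth in (2, 4, 8):  # finer staircase with depth, but never a true diagonal
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y).score(Xt, yt)
    print(f"tree depth {depth}: {acc:.3f}")

print("logistic:", LogisticRegression().fit(X, y).score(Xt, yt))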
Categorical features require encoding in scikit-learn. The DecisionTreeClassifier in sklearn can't handle string categories natively. You have to one-hot encode or ordinal encode them first. One-hot encoding a feature with 50 categories creates 50 binary columns, which fragments the tree's splitting decisions and dilutes importance. Ordinal encoding imposes an arbitrary ordering that the tree will treat as meaningful. LightGBM and CatBoost handle categorical features natively by finding optimal category groupings at each split. This is one of their genuine advantages.
Missing values need workarounds. Sklearn trees historically couldn't handle NaN values — before scikit-learn 1.3 added native support, you had to impute before training. Breiman's original CART algorithm had a clever solution called surrogate splits: during training, for each split, CART identifies other features whose splits mimic the primary split most closely. When a sample has a missing value for the primary split feature, the tree uses the best available surrogate to route it. XGBoost and LightGBM take a different approach — they learn a default direction for missing values at each split, trying both left and right during training and keeping whichever reduces loss more (modern scikit-learn trees do something similar). This is elegant and often better than any manual imputation.
No feature scaling needed. This is a genuine practical advantage. Trees care about the ordering of feature values, not their magnitude. Whether income is measured in dollars or millions of dollars, the same optimal threshold gets found. No StandardScaler, no MinMaxScaler, no worrying about different units. Feed raw features in.
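If you want to convince yourself, multiply every feature by 1,000 and compare; the stored thresholds change scale, the predictions shouldn't change at all (reusing the breast cancer data):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

a = DecisionTreeClassifier(random_state=0).fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X * 1000, y)  # same order, new units

# Same splits in rescaled units, so identical predictions
print(np.array_equal(a.predict(X), b.predict(X * 1000)))  # expect True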
Regression Trees: Same Idea, Different Score
Everything we've discussed has used classification as the running example, but CART handles regression too. The logic is identical — recursive greedy splitting — but the scoring function changes. Instead of Gini impurity or entropy, a regression tree minimizes mean squared error (MSE) at each split. It picks the split that maximizes the reduction in variance of the target variable: the child nodes should have less spread in their target values than the parent.
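The scoring swap is small enough to show inline (mse is an illustrative helper, mirroring the earlier Gini sketch):

import numpy as np

def mse(y):
    # Mean squared distance from the node's own mean: the node's "impurity"
    return float(np.mean((y - y.mean()) ** 2))

y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])  # targets at a node
left, right = y[:3], y[3:]                    # a candidate split

weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
print(mse(y), weighted)  # large parent spread collapses to near zero in the children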
Each leaf predicts the mean of the training samples that ended up there. This is why regression trees can't extrapolate — the prediction is always a mean of training values, and no mean will ever exceed the maximum or fall below the minimum of the values being averaged.
Regression trees are the building blocks for gradient boosted regression, which is arguably the most powerful technique for tabular regression problems. The individual regression trees in a gradient boosting ensemble are typically very shallow — depth 3 to 6 — and each one fits the residual errors of the ensemble so far. But we'll save that story for the ensemble methods section.
The History We Inherited
It's worth knowing where the algorithms you use came from, because the design choices echo into every tree you train today.
In 1984, Leo Breiman and three co-authors published the CART book. CART introduced binary splits, Gini impurity, cost-complexity pruning, and surrogate splits for missing values. It handled both classification and regression. It was the first comprehensive, principled tree-building algorithm.
Two years later, Ross Quinlan published ID3 — a simpler algorithm that used information gain (entropy) and could handle multi-way splits for categorical features. ID3 was easier to implement but had weaknesses: it couldn't handle continuous features, couldn't handle missing values, and its splitting criterion was biased toward features with many categories.
Quinlan fixed all of these in C4.5 (1993), which introduced gain ratio to debias the splitting criterion, added support for continuous features by finding thresholds, handled missing values through probability-based routing, and included post-pruning. C4.5 became one of the most cited algorithms in machine learning history.
Today, scikit-learn's implementation follows the CART approach (binary splits, Gini or entropy, cost-complexity pruning). The gradient boosting libraries — XGBoost, LightGBM, CatBoost — all build on CART's foundation with their own innovations. The tree you train in 2024 is a direct descendant of Breiman's 1984 book.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a bin of six fruits and the game of Twenty Questions. We learned how Gini impurity and entropy measure the messiness of a node, how CART greedily searches through every feature and every threshold to find the split that cleans things up the most, and why that greedy search — despite producing only an approximation — works remarkably well. We pruned trees like a gardener to prevent memorization. We saw how feature importance can lie to you if you're not careful, and how permutation importance tells the truth. And we confronted the variance problem — the instability that makes a single tree unreliable but makes a forest of them powerful.
My hope is that the next time you see a DecisionTreeClassifier in a codebase, or get asked "explain how XGBoost works" in an interview, instead of that old vague sense of "it's a flowchart, I think," you'll be able to trace the logic all the way down — from Gini impurity to greedy splits to pruning to the variance that makes ensembles possible — with a solid mental model of what's happening under the hood.
Resources
Breiman, Friedman, Olshen, Stone — Classification and Regression Trees (1984). The O.G. book. Dense but foundational. If you want to understand CART's original intent, including surrogate splits and cost-complexity pruning as Breiman envisioned them, this is the source.
Quinlan — C4.5: Programs for Machine Learning (1993). The book that made decision trees accessible to a generation of researchers. Introduces gain ratio and explains why information gain alone falls short.
Scikit-learn's Decision Trees User Guide. Surprisingly deep for library documentation. The section on cost-complexity pruning is particularly well-written.
Hastie, Tibshirani, Friedman — The Elements of Statistical Learning, Chapter 9. Covers trees, bagging, random forests, and boosting in a unified framework. Free PDF available from the authors. Wildly useful as a reference once you have the intuition.
Christoph Molnar — Interpretable Machine Learning. The chapters on feature importance, SHAP, and permutation importance are insightful and practical. Pairs well with tree-based models.