Simple Classifiers

Chapter 5: Supervised Learning · k-NN · Naive Bayes

I put off really understanding k-Nearest Neighbors and Naive Bayes for an embarrassingly long time. Both sounded simple enough that I kept skipping past them to reach the "real" algorithms. k-NN felt like cheating — you look at your neighbors and copy their answers? Naive Bayes felt like a dusty stats exercise. It took me longer than I should admit to realize that these two algorithms teach you more about why machine learning works than most of the fancier ones do. So let me sit down with you and share what I found when I finally went down the rabbit hole.

Here's what they are. k-Nearest Neighbors (KNN) is a classifier that makes predictions by finding the training examples most similar to a new data point and letting them vote on the answer. Naive Bayes is a classifier that uses probability theory to combine evidence from multiple features independently. Both are old, both are fast, both have crisp failure modes, and both will save you hours of work as baseline models on real problems. They also illuminate ideas — distance, probability, independence, dimensionality — that come up everywhere else in machine learning.

A few things to know before we dive in. We're going to build both algorithms from scratch using a single running example: sorting fruit at a farmers' market. Three types of fruit — apples, oranges, lemons — each described by two measurements: weight in grams and a color-redness score from 0 to 10. We'll use those exact numbers all the way through the page. The math is real, and we'll work through every calculation. By the time we reach the code, nothing should feel mysterious.

Ready? Let's go.

What we'll cover
  • Ask your neighbors — the KNN algorithm built from scratch
    • What "close" means — Euclidean and Manhattan distances on real fruit data
    • The Voronoi map — k=1 decision boundaries as territorial ownership
    • Turning the k dial — bias-variance made visible
    • Feature scaling — why unscaled features silently break everything
    • The curse of empty rooms — dimensionality and its consequences
    • Making it fast — KD-trees, ball trees, FAISS, HNSW
  • ⚑ Rest stop — you can stop here if KNN is all you need
  • Counting evidence instead — the shift to Naive Bayes
    • Bayes' theorem from scratch — built on the fruit example
    • The naive trick — why independence works for ranking
    • Three flavors (plus a variant) — Gaussian, Multinomial, Bernoulli, Complement
    • The zero-probability catastrophe — and Laplace smoothing
    • Surviving the numbers — the log-space trick
    • Text classification: the killer app — NB + TF-IDF
  • When simple wins
  • Wrap-up

Ask Your Neighbors

Imagine you're running a booth at a farmers' market. You've been sorting fruit by hand for hours — apples into one bin, oranges into another, lemons into a third — and you've gotten good at it. Now your friend hands you something unfamiliar and asks what it is. What do you do? You hold it next to the fruit you've already sorted. You look for the ones that seem most similar. You check what those are. If the three most similar pieces of fruit in your sorted pile are all oranges, you say "orange." That is k-Nearest Neighbors.

The whole algorithm is that neighborhood poll. To classify a new point, find the k training examples most similar to it, tally up their labels, and return the majority class. For regression — predicting a continuous number instead of a category — you average the k neighbors' values instead of voting. There is no training phase in the traditional sense. No weights to learn, no optimization to run. The model memorizes the entire training set and does all its work at prediction time. Computer scientists call this a lazy learner, which is one of my favorite terms in all of machine learning. It earns the name.

Let's get some actual numbers on the table. Here are three fruits already sorted:

Fruit     Weight (g)   Redness (0–10)   Label
Fruit A   182          8                apple
Fruit B   130          6                orange
Fruit C   115          2                lemon

A mystery fruit arrives: weight 150 g, redness score 7. We want to know what it is. With k=1, the answer is whichever of the three training fruits is closest. To find that, we need a way to measure "close."

What "Close" Means

The most natural measure of closeness is the straight-line distance you remember from geometry — Euclidean distance. Between two points (a₁, a₂) and (b₁, b₂) in two dimensions, it's the length of the hypotenuse: sqrt((a₁ − b₁)² + (a₂ − b₂)²). Let's compute the distance from the mystery fruit to each of the three known fruits.

Distance to Fruit A (apple): sqrt((182 − 150)² + (8 − 7)²) = sqrt(32² + 1²) = sqrt(1025) ≈ 32.0

Distance to Fruit B (orange): sqrt((130 − 150)² + (6 − 7)²) = sqrt(20² + 1²) = sqrt(401) ≈ 20.0

Distance to Fruit C (lemon): sqrt((115 − 150)² + (2 − 7)²) = sqrt(35² + 5²) = sqrt(1250) ≈ 35.4

Fruit B is closest. With k=1, we predict orange. The mystery fruit is most geometrically similar to the orange in weight-redness space.

There's another common option worth knowing: Manhattan distance (also called L1 distance), named for navigating a city grid where you can't cut through buildings. You add up the absolute difference along each axis rather than taking the hypotenuse: |a₁ − b₁| + |a₂ − b₂|. For our mystery fruit, the Manhattan distances work out to 33 (apple), 21 (orange), and 40 (lemon) — same ranking as Euclidean. Both are special cases of the Minkowski distance family: Euclidean uses exponent p=2, Manhattan uses p=1. Sklearn exposes this as the p parameter. For most problems Euclidean is the right default; Manhattan can be more robust when individual features have occasional extreme outliers, because squaring amplifies those outliers while Manhattan treats them linearly.
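
Here's the same arithmetic as a minimal from-scratch sketch, using only NumPy and the numbers from the table:

import numpy as np

# Training fruits from the table: [weight_g, redness], with labels
X_train = np.array([[182, 8], [130, 6], [115, 2]])
y_train = ['apple', 'orange', 'lemon']
mystery = np.array([150, 7])

# Euclidean (L2): square the per-feature gaps, sum, take the root
euclid = np.sqrt(((X_train - mystery) ** 2).sum(axis=1))
# Manhattan (L1): sum the absolute per-feature gaps
manhattan = np.abs(X_train - mystery).sum(axis=1)

print(np.round(euclid, 1))          # [32.0, 20.0, 35.4] -- matches the hand math
print(manhattan)                    # [33, 21, 40] -- same ranking
print(y_train[np.argmin(euclid)])   # orange -- the k=1 prediction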

The Voronoi Map

With k=1 — one nearest neighbor — something geometrically beautiful happens. Every training point gets to "own" a region of the feature space: every point in that region is closer to this training example than to any other. Drawn on a two-dimensional plot, these territories form a tessellation called a Voronoi diagram. The decision boundary between two neighboring territories is the set of points equidistant from the two training examples that own them.

The first time I saw a Voronoi diagram of a trained k-NN model, something clicked that years of staring at equations hadn't achieved. You can see exactly what the model will predict for every possible input by looking at the picture. Drop a new mystery fruit into the map — it lands in exactly one territory, and that territory's label is the prediction. The whole classifier is visible at once.

The catch is that with k=1, those boundaries are jagged. They snake around every single training point individually. If one apple was accidentally mislabeled "orange" during data collection, its Voronoi territory carves a small orange island inside the apple region. Every future point that falls in that island gets misclassified. The model has memorized the noise — a sign it has too much variance, too much sensitivity to the specific quirks of the training set it happened to see.

Turning the k Dial

Increasing k is the remedy. As you include more neighbors in the poll, decision boundaries smooth out. Each prediction is averaged over a larger neighborhood, so a single mislabeled point carries less influence. The boundaries stop snaking around individual training examples and start tracing broader, more reasonable territories.

But you can take this too far. When k equals the total number of training examples, every single point is polled for every single prediction — and the prediction is always the majority class in the entire dataset. That model has no memory of local structure whatsoever. Maximum bias, zero variance, completely useless for anything but predicting the most common class.

The neighborhood poll analogy earns its keep here. With k=1, you knock on one door and take their word for it. With k=100, you survey a large neighborhood. The single-person poll is highly variable — your one chosen neighbor might be an outlier. The hundred-person survey is stable but might miss the fact that you're standing right on the boundary between two very different neighborhoods. Both extremes fail. You want enough neighbors to smooth out noise without smoothing away signal.

A useful refinement on the basic majority vote is weighted KNN: give closer neighbors more influence, weighted by the inverse of their distance. A neighbor at distance 2 gets weight 0.5. A neighbor at distance 0.1 gets weight 10. The class with the highest total weight wins rather than the class with the most votes. This softens the arbitrary cutoff of majority voting and often improves accuracy on real data. In sklearn, set weights='distance'. For choosing k itself: try 3, 5, 7, 11, and 15, choose via cross-validation, and use odd values for binary classification to avoid ties.
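
Here's a sketch of that selection loop. The fruit table is far too small to cross-validate, so make_classification stands in for a real dataset; swap in your own X and y:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Synthetic stand-in dataset -- replace with your own X, y
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for k in [3, 5, 7, 11, 15]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k, weights='distance')),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)   # 5-fold accuracy
    print(f"k={k:2d}  mean accuracy = {scores.mean():.3f}")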

Feature Scaling

Here's a silent bug that would completely ruin our fruit classifier if we hadn't built it from scratch and watched what was happening. Weight ranges from about 100 to 250 grams. Redness ranges from 0 to 10. In the Euclidean distance formula, a one-unit difference in weight and a one-unit difference in redness score are treated identically — but they represent wildly different things. One gram is almost nothing; one point on a ten-point scale is a tenth of the entire range.

Look at the actual numbers above. The weight term contributed (32)² = 1024 to the squared distance between the mystery fruit and Fruit A. The redness term contributed (1)² = 1. Weight dominated by a factor of a thousand. The redness feature was functionally invisible. We didn't build a "nearest fruit" classifier — we built a "nearest weight" classifier that happens to accept redness as input and then ignore it.

The fix is standardization: transform each feature to have mean 0 and standard deviation 1 across the training set. After scaling, a one-unit shift in any feature represents one standard deviation. All features contribute fairly to the distance calculation. The formula is (value − mean) / std, applied per feature. This is not optional for any distance-based method. KNN, SVMs, PCA — they all require scaled features. One critical implementation detail: fit the scaler on the training data only, then apply it to test data. If you compute statistics across both sets, test information leaks into your preprocessing. Sklearn's Pipeline handles this correctly when you use cross-validation — it fits the scaler inside each fold.
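
Done by hand on the fruit numbers, the fit-on-train-only rule is just a few lines:

import numpy as np

X_train = np.array([[182, 8], [130, 6], [115, 2]], dtype=float)
mystery = np.array([[150, 7]], dtype=float)

# Statistics come from the training set ONLY
mu = X_train.mean(axis=0)      # [142.33, 5.33]
sigma = X_train.std(axis=0)    # [28.71, 2.49] -- population std, like StandardScaler

X_scaled = (X_train - mu) / sigma        # standardize training data
mystery_scaled = (mystery - mu) / sigma  # apply the SAME stats to new data

print(np.round(mystery_scaled, 2))       # [[0.27 0.67]]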

The Curse of Empty Rooms

The neighborhood poll works because nearby neighbors are similar. That assumption starts to break down as the number of features grows, and the breakdown is dramatic enough to have a name: the curse of dimensionality.

Picture searching for your nearest neighbor in a studio apartment. Everyone is close. Now move to a ten-room mansion and scatter the same number of people randomly throughout. Still manageable. Now imagine a mansion with a thousand rooms, all the same size, and people are scattered randomly. The rooms are mostly empty. To find your nearest neighbor, you'd have to search most of the house. Here's the part that haunts me: your "nearest" neighbor might be almost as far as the most distant one. The very concept of "near" has become nearly meaningless.

The math makes this concrete. Suppose you want a neighborhood large enough to contain 10% of the training data. In one dimension, you cover 10% of the range. In two dimensions, you need each axis to cover 0.10^(1/2) ≈ 32% of its range. In ten dimensions, each axis must cover 0.10^(1/10) ≈ 79% of its range. By ten features, your "local" neighborhood already spans most of the feature space. The locality assumption that makes KNN meaningful has largely dissolved. In practice, KNN starts struggling around 20 features and is usually hopeless past 50 unless you apply dimensionality reduction — PCA, UMAP, feature selection — beforehand.
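
That edge-length arithmetic is one line of Python, and running it makes the dissolution of locality hard to unsee:

# Per-axis edge length needed to capture 10% of uniformly spread data in d dims
for d in [1, 2, 3, 10, 20, 50]:
    print(f"d={d:2d}: each axis must span {0.10 ** (1 / d):.0%} of its range")
# d= 1: 10%, d= 2: 32%, d= 3: 46%, d=10: 79%, d=20: 89%, d=50: 95%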

Making It Fast

Every KNN prediction scans every training point to find the k nearest ones. That's O(n × d) per prediction — n training examples, each with d features. For a hundred training examples, fine. For a million, every single query requires a million distance calculations.

Three approaches exist for speeding this up. Brute force is the baseline: compute all distances, sort, return the top k. Works fine for small datasets or when d is large enough that spatial indexing doesn't help. KD-trees partition feature space recursively with axis-aligned splits — think binary search for neighbors. They reduce average query time to O(d log n), but degrade toward brute force in high dimensions because the bounding boxes stop pruning enough branches above roughly 20 features. Ball trees use hyperspheres instead of hyperrectangles, handling moderate dimensionality somewhat better and supporting non-Euclidean distance metrics. For very large-scale production systems, approximate nearest neighbor libraries like FAISS (from Meta) and HNSW trade a small, tunable fraction of accuracy for enormous speedups — these are what power billion-scale similarity search in recommendation engines. Sklearn handles the brute/KD/ball choice automatically via algorithm='auto'; for most learning-phase datasets, the default is fine.
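
A quick sketch of the sklearn knob. On low-dimensional data like this, all three index structures return identical neighbors; only the search cost differs:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3))        # 100k points in 3 dimensions

for algo in ['brute', 'kd_tree', 'ball_tree']:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    dist, idx = nn.kneighbors(X[:1])     # 5 nearest neighbors of point 0
    print(algo, idx[0])                  # same neighbors, different search cost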

⚑ Rest Stop — You Can Stop Here

Pause for a moment. You now understand k-Nearest Neighbors completely. You know the core algorithm — poll the k nearest neighbors, take the majority vote. You know the distance metrics — Euclidean and Manhattan, both computed by hand on real fruit data. You understand the decision boundaries — Voronoi territories, jagged at k=1, smoothed as k grows. You've seen the bias-variance tradeoff made visible. You know that feature scaling is mandatory and exactly why. You've met the curse of dimensionality and understood why KNN fails above ~20 features. You know the speed tricks — KD-trees for moderate d, ball trees for non-Euclidean metrics, FAISS and HNSW for billion-scale production.

The code below shows the sklearn implementation. If you came for KNN, that's all you need.

If you keep reading, the second half covers Naive Bayes — a completely different algorithm that solves a problem KNN can't touch: tens of thousands of text features, sparse data, and the need to train in a single pass. It's worth the trip.

Code: KNN with Scaling

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Three fruits: [weight_g, redness_0_to_10]
X_train = np.array([[182, 8], [130, 6], [115, 2]])
y_train = ['apple', 'orange', 'lemon']

pipe = Pipeline([
    ('scaler', StandardScaler()),               # mandatory for KNN
    ('knn', KNeighborsClassifier(
        n_neighbors=1,
        weights='distance'))                    # closer neighbors count more
])

pipe.fit(X_train, y_train)

mystery = np.array([[150, 7]])
print(pipe.predict(mystery))                    # ['orange']

The Pipeline ensures the scaler is fitted on training data only — it cannot peek at whatever you later pass to predict. On a larger dataset, swap n_neighbors and re-run to watch the bias-variance tradeoff play out in the accuracy numbers. Try weights='distance' against weights='uniform' on noisy data to see when distance-weighting helps.

Counting Evidence Instead

Let's be direct about why KNN fails on text. A single email might be described by 50,000 features — one per word in the vocabulary, with the feature value representing how many times that word appears. Now try to find the "nearest neighbor" in 50,000-dimensional space. The curse of dimensionality devours it. Every document is roughly equally distant from every other document. The neighborhood poll returns noise.

We need a completely different idea. Instead of asking "which training examples look most like this one?", we ask "given what we observe in this example, which class is most probable?" That shift — from similarity-based to probability-based reasoning — leads us to Naive Bayes.

Bayes' Theorem from Scratch

Bayes' theorem describes how to update a belief given evidence. The version that matters for classification is:

P(class | features) ∝ P(features | class) × P(class)

Read it aloud: the probability that an example belongs to a class, given its observed features, is proportional to how likely those features would be if the example really were that class, multiplied by how common that class is in the first place. The symbol ∝ means "proportional to." We drop the denominator P(features) because it's identical for every class — it can't change which class wins, so we don't need it.

Let's walk through this with our fruit example. Say we've collected many fruits at the market. In the training data, 40% are apples, 35% are oranges, 25% are lemons. Those are our priors: P(apple) = 0.40, P(orange) = 0.35, P(lemon) = 0.25. Before seeing any measurements, we'd lean toward apple.

Now our mystery fruit arrives: weight 150 g, redness 7. We need P(features | class) — how likely are these measurements for each class? For continuous features, we model this with a Gaussian (bell curve). For each class, we track the mean and standard deviation of weight and redness across all training examples of that class. Suppose among training apples, weight averages 180 g with standard deviation 15 g, and redness averages 8.0 with standard deviation 1.0. A weight of 150 g sits two standard deviations below the apple mean — the Gaussian probability density there is low. A redness of 7 sits one standard deviation below — moderate probability. We multiply those two per-feature likelihoods together to get P(features | apple), then multiply by the prior 0.40. We repeat for orange and lemon. The class with the largest product wins. That's Gaussian Naive Bayes.

I'll be direct: the per-feature probability calculations are tedious to show in full for continuous features. The concept is the point — we're multiplying per-feature likelihoods by a class prior, then picking the winner. The code handles the arithmetic.
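
That said, the skeleton fits in a dozen lines. In this sketch, the apple statistics and the three priors are the ones quoted above; the orange and lemon statistics are invented placeholders, just to complete the comparison:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Probability density of a Gaussian at x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Per-class stats: (weight mean, weight std, redness mean, redness std, prior).
# Apple's stats and all priors are from the text; orange and lemon stats
# are invented placeholders.
stats = {
    'apple':  (180, 15, 8.0, 1.0, 0.40),
    'orange': (145, 12, 6.5, 1.0, 0.35),
    'lemon':  (110, 10, 2.0, 1.0, 0.25),
}

weight, redness = 150, 7   # the mystery fruit
for label, (wm, ws, rm, rs, prior) in stats.items():
    score = gaussian_pdf(weight, wm, ws) * gaussian_pdf(redness, rm, rs) * prior
    print(f"{label:6s} score = {score:.6f}")
# Largest score wins; with these placeholder stats, orange.

Estimated from real training data instead of placeholders, this is essentially the computation GaussianNB performs, just carried out in log space.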

The Naive Trick

That step where we multiplied the per-feature likelihoods — P(weight | apple) × P(redness | apple) — is the independence assumption. We treated weight and redness as if knowing the weight tells us nothing about the redness, given that we know it's an apple. This is the "naive" part. It's almost certainly wrong. Heavier apples might tend to be redder. The features are correlated. And yet the classifier often works remarkably well.

The reason is subtle: for classification, we don't need probability values to be correct. We need the ranking to be correct. As long as the true class ends up with a higher score than the other classes, we win — even if that score is a wildly overconfident estimate. And when each feature independently points in the right direction, multiplying those independent leans together still points in the right direction, even if the magnitude is off.

I'm still developing my intuition for exactly why multiplying a bunch of slightly wrong probabilities together so often preserves the correct ranking. The best mental model I have: think of each feature as an independent witness in a courtroom. Witness A says "the weight suggests apple." Witness B says "the color suggests apple." We multiply their testimonies. The witnesses aren't truly independent — they've talked in the hallway, their stories share a common source. But if both witnesses lean toward apple, the combined testimony still points toward apple. The verdict is right even if the confidence stated in the verdict is inflated.

This is also why Naive Bayes gives famously miscalibrated probability outputs. It often returns 0.9999 for the winning class because it has multiplied many moderate likelihoods together, each slightly favoring that class. Don't treat these numbers as genuine confidence estimates. Use them for ranking and for picking the top class. My favorite thing about this algorithm is that beyond high-level explanations like this one, no one is completely certain why it works as well as it does on real-world data. That mystery genuinely delights me.

Three Flavors

GaussianNB, MultinomialNB, and BernoulliNB differ in one thing: the probability model used for each feature. (ComplementNB, in the table below, is a variant of MultinomialNB.) Match the model to your data type and you get good results. Mismatch them and you get something that runs but gives confused answers.

Variant         Assumes features are…                       Best for
GaussianNB      Continuous, approximately normal            Real-valued tabular data — temperatures, heights, scores
MultinomialNB   Non-negative integer counts                 Text (word counts), document classification
BernoulliNB     Binary presence/absence                     Short text, binary feature vectors
ComplementNB    Non-negative counts (same as Multinomial)   Imbalanced text datasets

GaussianNB estimates a mean and variance per feature per class, then evaluates the Gaussian probability density at the observed value. No hyperparameters to tune, and often good enough even when features aren't perfectly Gaussian. Our fruit example above used this variant.

MultinomialNB models each class as a distribution over feature counts. This is the classic text classifier: each feature is a word, each value is how often it appears. The model learns which words appear frequently in which classes — "free," "offer," and "congratulations" cluster in spam; "meeting," "attached," and "quarterly" cluster in work email. Multiply enough of those word-frequency likelihoods together and the correct class wins. ComplementNB is a variant that learns what's uncharacteristic of each class rather than what's characteristic, which often outperforms standard MultinomialNB on imbalanced datasets where one class dominates the training data.

BernoulliNB cares only about whether a feature is present or absent, discarding frequency information entirely. For text, this means "did this word appear at all?" rather than "how many times?" There's a key difference from MultinomialNB: Bernoulli explicitly penalizes the absence of features that are common in a class. If "money" appears in 80% of spam emails, a non-spam email that lacks "money" gets a likelihood boost under BernoulliNB. MultinomialNB ignores absent words entirely. For short documents where word presence is more meaningful than word frequency, BernoulliNB often wins.
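
The presence-versus-frequency distinction is easy to poke at directly. A sketch with invented three-word count vectors:

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Toy counts, columns = ['free', 'money', 'meeting'] -- invented data
X = np.array([[3, 2, 0],    # spam
              [1, 4, 0],    # spam
              [0, 0, 2],    # ham
              [0, 1, 3]])   # ham
y = ['spam', 'spam', 'ham', 'ham']

doc = np.array([[1, 0, 1]])  # one 'free', one 'meeting', no 'money'
print(MultinomialNB().fit(X, y).predict(doc))  # scores only the words present
print(BernoulliNB().fit(X, y).predict(doc))    # also rewards the absent 'money'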

The Zero-Probability Catastrophe

Here's a failure mode that silently destroys a Naive Bayes model if you don't know about it. Suppose your spam filter's training data contains no legitimate (ham) emails with the word "blockchain." So P("blockchain" | ham) = 0. Now a completely legitimate email about cryptocurrency investment arrives and contains the word "blockchain." The product for the ham class becomes zero — zero times anything is zero — regardless of every other word in the email. A single word the model never saw in training poisons the entire classification.

The fix is Laplace smoothing (also called add-one smoothing), controlled by the parameter alpha. Instead of using raw counts, you add alpha to every word's count before computing probabilities:

P(word | class) = (count(word, class) + α) / (total_words_in_class + α × |vocabulary|)

With α = 1, every word in the vocabulary gets a count of at least 1 in every class, even words that never appeared in training. The denominator adjusts to keep everything normalized. Larger α values smooth more aggressively, pushing all word probabilities toward uniform and trusting the training data less. Smaller α values stay closer to the raw counts. The default α = 1 works well in almost every case; occasionally tuning it slightly via cross-validation helps on unusual datasets.
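
In code, the smoothed estimate is a one-liner. The counts here (100,000 ham words, a 10,000-word vocabulary) are invented for illustration:

def smoothed_prob(word_count, total_words, vocab_size, alpha=1.0):
    """Laplace-smoothed P(word | class)."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# 'blockchain' never seen in ham: the raw estimate would be exactly 0
print(smoothed_prob(0, 100_000, 10_000))       # ~9.1e-06 -- tiny, never zero
# A word seen 5,000 times barely moves from its raw 5% estimate
print(smoothed_prob(5_000, 100_000, 10_000))   # ~0.0455, close to 0.05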

Surviving the Numbers

There's one more practical pitfall. When you multiply 50,000 probabilities together — one per word in the vocabulary — each might be something like 0.0002. Multiplying fifty thousand such numbers produces a result so close to zero that standard floating-point arithmetic rounds it to exactly 0.0. This is called numerical underflow, and it makes the classifier completely nonfunctional.

I'll be honest — the first time someone showed me the fix, I felt like I'd been doing math wrong my whole life. The solution is to work in log space. Instead of multiplying probabilities, take the logarithm of each and sum them:

log P(class | features) = log P(class) + Σᵢ log P(xᵢ | class) + constant

(The constant is −log P(features), identical for every class, so it can never change the winner.)

Since the logarithm is monotonically increasing, the class with the highest sum of log-probabilities is exactly the same class that would have had the highest product of probabilities. The ranking is preserved. But the numbers stay representable — log(0.0002) ≈ −8.5. Summing fifty thousand numbers around −8.5 gives you something around −425,000, a perfectly ordinary floating-point number. Sklearn handles this internally, so you don't need to write it yourself. But understanding it explains why predict_proba() returns sensible numbers even when the raw products would have underflowed to zero.
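
You can watch both the failure and the fix in four lines:

import numpy as np

probs = np.full(50_000, 0.0002)      # 50,000 per-word likelihoods

print(np.prod(probs))                # 0.0 -- underflows to exactly zero
print(np.sum(np.log(probs)))         # about -425,860 -- perfectly representable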

Text Classification: The Killer App

Combine MultinomialNB with TF-IDF vectorization and you have one of the most cost-effective classifiers ever built. TF-IDF — term frequency–inverse document frequency — transforms a document into a vector where each dimension is a word, weighted by how often it appears in this document relative to how often it appears across all documents. Common words like "the" and "is" get low weights. Distinctive, domain-specific words get high weights.

The courtroom analogy from the previous section comes back around here in a satisfying way. Each word is a witness. TF-IDF is the process of deciding how much credibility each witness deserves: a witness who shows up at every trial (common word) gets less weight than one who's distinctive to this specific case (rare word that strongly signals a class). Multiply the adjusted testimonies. Pick a verdict.

This combination is the classic spam filter. It's a strong baseline for sentiment analysis, topic labeling, language detection, and document routing. On small text datasets — hundreds to a few thousand documents — it frequently matches or outperforms neural networks and gradient-boosted trees. That's not because it's smarter; it's because with limited data, the simpler model generalizes better. The practical rule: always start a text classification project with TF-IDF plus MultinomialNB. If it gives you 88% accuracy in five minutes, you now have a baseline that tells you exactly how much improvement a more complex model needs to justify its training cost and deployment overhead.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score

# Two categories to keep the example focused
cats = ['rec.sport.hockey', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=cats)
test  = fetch_20newsgroups(subset='test',  categories=cats)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.5)),
    ('nb',    MultinomialNB(alpha=1.0))     # alpha=1.0 is Laplace smoothing
])

pipe.fit(train.data, train.target)
preds = pipe.predict(test.data)
print(f"Accuracy: {accuracy_score(test.target, preds):.3f}")
# Accuracy: ~0.970 — two categories, under one second of training

Roughly 1,200 training documents, tens of thousands of features, classified in under a second. The model itself is a few kilobytes. Try matching that training time with a fine-tuned transformer.

When Simple Wins

Both KNN and Naive Bayes beat more complex models in specific, predictable conditions. Knowing when is part of being a good practitioner.

KNN wins when your dataset is small and your feature space is low-dimensional — say, under a few thousand examples and under twenty features. It wins when you need locally interpretable predictions: "this loan application is similar to these five applications that all defaulted" is a compelling story that a gradient-boosted tree can't easily tell. It wins as a quick baseline that requires almost no distributional assumptions, and it wins when decision boundaries are genuinely irregular in ways that no parametric model would assume.

Naive Bayes wins on text — reliably, repeatedly, cheaply. It wins on any high-dimensional sparse data. It wins on tiny datasets where a complex model would overfit badly, because estimating a mean and variance per feature per class requires very few data points per class. It wins when training time is constrained — a single pass through the data is all it needs. It wins on streaming data where you need to update the model one example at a time; Naive Bayes updates are trivially incremental. And it wins when you need a production model that's a few kilobytes rather than a few gigabytes.

Both algorithms win as baselines. Before you train any complex model on a new problem, train KNN and Naive Bayes first. They set the floor — any model that doesn't substantially beat them is not worth the added complexity, training time, or deployment cost. I've seen teams skip this step and spend weeks tuning a neural network to achieve accuracy that a properly set-up Naive Bayes matched in twenty seconds. Don't be those teams.

Wrap-Up

Thank you for making the trip with me.

We started at a farmers' market with three fruits and two measurements, and we used those numbers all the way through. KNN grew from a simple neighborhood poll into a full picture: distance metrics computed by hand, Voronoi territories drawn in the mind's eye, the k dial mapped onto the bias-variance tradeoff, the silent importance of feature scaling revealed by actual calculation, the haunting geometry of high-dimensional empty rooms, and the engineering tricks that make it fast enough for production. Then we pivoted entirely — from "find similar examples" to "accumulate probabilistic evidence" — and built Naive Bayes from Bayes' theorem up: made it tractable with the independence assumption, matched it to three data types, patched the zero-probability catastrophe with Laplace smoothing, solved the underflow problem with a move to log space, and ended at the TF-IDF plus MultinomialNB combination that remains one of the most elegant and cost-effective tools in the whole field.

What I hope you're taking away isn't a list of API calls. I hope it's the sense that these algorithms make sense — that there was a problem, someone had a reasonable idea, the idea had a flaw, someone patched the flaw, and the whole thing cohered into something useful. That chain of reasoning is what lets you know when to reach for these tools, when to set them down, and when to look for something else entirely.

The next section covers Decision Trees — a different approach altogether, one that builds explicit if-then rules for classification rather than polling neighbors or multiplying probabilities. The intuition you've built here about bias, variance, and what it means to smooth a decision boundary will transfer directly. See you there.