Chapter 5: Supervised Learning
Supervised learning is the workhorse of applied ML — you have inputs, you have labels, and you want to learn the mapping. This chapter builds from the simplest linear models through decision trees and ensembles to SVMs and time-series models, giving you a practical toolkit for both regression and classification. The progression: start simple, add complexity only when the data demands it.
Which Algorithm When — Master Comparison
This is the single most useful reference in the chapter. Bookmark it. When you're starting a new problem, scan this table to pick your first model.
| Algorithm | Best For | Non-linearity? | Scaling Needed? | Interpretable? | Speed (Train / Predict) | Try First When… |
|---|---|---|---|---|---|---|
| Linear Regression | Regression with roughly linear relationships | No | Yes (for regularized) | ★★★ High | Fast / Fast | You need a regression baseline or interpretability is paramount |
| Logistic Regression | Binary/multiclass classification, probability estimates | No | Yes | ★★★ High | Fast / Fast | You need a classification baseline or want to understand feature effects |
| KNN | Small datasets, prototyping, non-parametric baselines | Yes (implicitly) | Yes (critical) | ★★☆ Medium | None / Slow | Dataset is small, you want a quick sanity check, or local patterns dominate |
| Naive Bayes | Text classification, high-dimensional sparse data | No | No | ★★☆ Medium | Very Fast / Very Fast | You have text data, need a fast baseline, or have very little training data |
| Decision Tree | Exploratory analysis, understanding feature interactions | Yes | No | ★★★ High | Fast / Fast | You need a fully interpretable model or want to visualize the decision logic |
| Random Forest | General-purpose tabular data, robust default | Yes | No | ★★☆ Medium | Medium / Medium | You want a strong model with minimal tuning — the "hard to mess up" choice |
| XGBoost / LightGBM | Tabular data competitions, maximum predictive performance | Yes | No | ★☆☆ Low | Medium / Fast | You need top accuracy on structured data and can tune hyperparameters |
| SVM | Small-to-medium datasets, clear margins, text classification | Yes (with kernels) | Yes (critical) | ★☆☆ Low | Slow / Medium | Dataset is small with clean separation, or you're doing text/image classification with kernel tricks |
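To put the table to work, a quick screening harness like the sketch below is often enough to pick a first model: scale-sensitive candidates get a `StandardScaler` in front (per the "Scaling Needed?" column), and everything is compared under the same cross-validation. This is a minimal scikit-learn sketch; the bundled breast-cancer dataset and all hyperparameter values are illustrative stand-ins for your own data and settings.

```python
# Quick screen of the table's candidates under identical 5-fold CV.
# Dataset and hyperparameters are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Scale-sensitive models (see "Scaling Needed?") get a scaler in front.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    # Tree ensembles and Naive Bayes don't need scaling.
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:>20}: {scores.mean():.3f} +/- {scores.std():.3f}")
```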
The Practical Decision Flowchart
When you're staring at a new supervised learning problem, here's the order of operations:
- Start with logistic regression (classification) or linear regression (regression) as your baseline. Seriously. It's fast, interpretable, and tells you how far a simple model gets. If it hits 90% of your target metric, you might not need anything fancier. (Steps 1 and 2 are sketched in code after this list.)
- If the baseline isn't enough, try Random Forest. It handles non-linearity, requires no feature scaling, is robust to outliers, and has good defaults out of the box. It's the "hard to mess up" model.
- If you need more performance on tabular data, graduate to LightGBM or XGBoost with early stopping. This is where Kaggle winners live. Expect roughly a 1–5% improvement over Random Forest with proper tuning, but budget time for hyperparameter search. (An early-stopping sketch follows the list.)
- SVMs for small datasets with clear margins or text classification. When you have fewer than ~10k samples and the data has nice geometric structure, SVMs with RBF or polynomial kernels can outperform tree methods. (A scaled SVM pipeline is sketched below.)
- KNN and Naive Bayes for specific niches. KNN for quick prototyping and sanity checks. Naive Bayes when you have text data and need something fast with minimal training data. Neither is likely your final production model, but both are valuable diagnostic tools. (A minimal Naive Bayes text baseline rounds out the sketches below.)
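Steps 1 and 2 translate directly to code. The sketch below (scikit-learn, with a bundled dataset standing in for your own data) fits a scaled logistic-regression baseline, reads off the largest coefficients as a rough feature-effect summary, and then fits a Random Forest to see what non-linearity buys. The 500-tree setting is an illustrative default, not a tuned value.

```python
# Steps 1 and 2: logistic-regression baseline first, Random Forest second.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print(f"baseline accuracy: {baseline.score(X_test, y_test):.3f}")

# With standardized inputs, coefficient magnitude is a rough guide to
# each feature's influence on the log-odds.
coefs = baseline.named_steps["logisticregression"].coef_[0]
top = sorted(zip(data.feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)
for name, coef in top[:5]:
    print(f"{name:>25}: {coef:+.2f}")

# Step 2: worth trying only if the baseline falls short of your target.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)
print(f"random forest accuracy: {forest.score(X_test, y_test):.3f}")
```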
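Step 3 is sketched here with LightGBM's scikit-learn wrapper, assuming a recent LightGBM (3.3+) where early stopping is passed as a callback. The learning rate, round budget, and patience are illustrative starting points rather than tuned values; XGBoost offers an analogous mechanism.

```python
# Step 3: gradient boosting with early stopping on a held-out validation set.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Generous round budget; early stopping picks the effective number of trees.
model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    # Stop when the validation metric hasn't improved for 50 rounds.
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print(f"best iteration: {model.best_iteration_}")
print(f"validation accuracy: {model.score(X_valid, y_valid):.3f}")
```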
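Step 4 in code. Because scaling is critical for SVMs (see the table), the scaler belongs inside the pipeline so it is refit on each training fold during the search. The `C` and `gamma` grid below is a generic starting grid, not a recommendation for your data.

```python
# Step 4: RBF-kernel SVM with scaling and a small hyperparameter search.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaler inside the pipeline: no leakage from validation folds.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X_train, y_train)
print(f"best params: {grid.best_params_}")
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```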
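Finally, the Naive Bayes niche: a bag-of-words text baseline that trains almost instantly. The four-document corpus is a toy placeholder to keep the sketch self-contained; in practice you'd point the pipeline at your own documents and labels.

```python
# Step 5: a fast Naive Bayes text-classification baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "free prize claim now", "meeting at noon tomorrow",
    "win cash instantly", "project status update attached",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(docs, labels)
print(text_clf.predict(["claim your free cash"]))  # likely [1] (spam)
```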
Algorithm choice typically accounts for the last 5–10% of performance. Feature engineering, data quality, and proper validation strategy matter far more. A well-engineered logistic regression frequently beats a poorly tuned XGBoost.