Feature Selection & Data Splitting
I avoided thinking carefully about feature selection and data splitting for longer than I should have. For a while I treated them as the boring plumbing between the exciting parts — get your features, throw them at a model, split the data 80/20, move on. Then I shipped a fraud detection model that looked brilliant in validation and was no better than a coin flip in production. The culprit wasn't the algorithm. It was leakage from a careless split, and features that were smuggling the answer into the input. That experience rewired how I think about these two topics. This chapter is the deep dive that came out of it.
Feature selection is the practice of choosing which input variables your model gets to see. Data splitting is how you partition your dataset into the pieces used for training, tuning, and final evaluation. Together, they determine whether your model's reported performance means anything at all. Get either one wrong, and you're building on sand.
Before we start, a heads-up. We'll touch on correlation, mutual information, regularization geometry, and cross-validation strategies, but you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
The kitchen pantry problem
Variance threshold — throwing out the dead weight
Correlation — catching the copycats
Mutual information — the nonlinear detective
Wrapper methods — search with feedback
Embedded methods — selection baked into training
The L1 geometry that makes Lasso special
Rest stop
The sealed envelope — why splitting matters
Train, validation, test — three distinct jobs
Data leakage — contaminating the crime scene
Stratified splits — preserving the balance
Time-based splits — never peek at the future
Group-based splits — keeping entities together
Cross-validation — when one split isn't enough
Handling class imbalance
Wrap-up
The Kitchen Pantry Problem
Imagine you're cooking a dish and someone dumps the entire contents of a grocery store onto your counter. Every spice, every grain, every obscure sauce you've never heard of. Having more ingredients doesn't make you a better cook. It makes you slower, confused, and more likely to ruin dinner by tossing in something that doesn't belong.
Features work the same way. You've engineered 200 of them — transaction amount, time since last purchase, merchant category, IP geolocation, user age, and 195 others. Many are noise. Some are near-duplicates of each other. A few are genuinely predictive. The job of feature selection is to find which ones actually help and throw the rest away.
This matters more than most people realize. There's a phenomenon called the curse of dimensionality: as the number of features grows, the volume of the feature space explodes exponentially. Your data points, which felt dense and informative in five dimensions, become isolated specks floating in a vast emptiness when you have 500 dimensions. Distance metrics start to break down — the "nearest" and "farthest" neighbors end up almost equally far away. A closely related result, the Hughes phenomenon, shows that for a fixed number of training examples, predictive power rises with added features only up to a point — and then falls. Past that peak, adding more features actively hurts your model instead of helping it.
I'll be honest — the first time I read that more features could make things worse, I didn't believe it. It felt like saying more information is bad. But the math is unforgiving. If you have 20 data points and 200 features, the model doesn't have enough examples to learn any generalizable pattern. It memorizes. It overfits. And then it fails on new data.
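You can watch the distance breakdown happen in a few lines of numpy. This is a toy demonstration on made-up uniform random points, nothing more — but notice how the ratio between the nearest and farthest pair creeps toward 1 as dimensions grow:
import numpy as np
rng = np.random.default_rng(42)
for dims in (5, 500):
    points = rng.uniform(size=(20, dims))          # 20 random points
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # pairwise Euclidean distances
    pair_dists = dists[np.triu_indices(20, k=1)]   # each pair once, no self-distances
    print(f"{dims} dims: nearest/farthest ratio = {pair_dists.min() / pair_dists.max():.2f}")
In 5 dimensions the nearest pair is a small fraction of the farthest. In 500 dimensions, nearly everything is equally far from everything else — which is exactly why distance-based reasoning degrades.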
Our kitchen pantry analogy will follow us through this section. We start by throwing away ingredients that have gone stale, then we remove duplicates, and finally we taste-test the rest to find the ones that actually matter.
Variance Threshold — Throwing Out the Dead Weight
The first question to ask about any feature is the most basic one: does it actually vary? A feature that has the same value for 99.99% of your rows is carrying almost no information. It's the equivalent of a spice jar filled with air — takes up space on the counter but contributes nothing to the dish.
Consider a tiny dataset of six customers for our fraud detection system. One of the columns is "country" and every single row says "US." That feature has zero variance. It can't possibly help distinguish fraud from legitimate transactions because it's the same for everyone.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)
kept = selector.get_support().sum()
print(f"Kept {kept} of {X.shape[1]} features")
That threshold of 0.01 removes any feature whose variance falls below 0.01. For a binary feature, the variance is p(1 − p), where p is the frequency of the rarer value — so a variance of 0.01 corresponds to a feature that's one value roughly 99% of the time and the other value roughly 1% of the time. Effectively dead.
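A quick sanity check on that arithmetic, using nothing but the Bernoulli variance formula:
# Variance of a binary feature is p * (1 - p), where p is the rarer value's frequency
for p in (0.5, 0.05, 0.01, 0.005):
    print(f"p = {p}: variance = {p * (1 - p):.4f}")
# p = 0.01 gives 0.0099, just under the 0.01 cutoff — such a feature gets dropped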
This is fast — milliseconds, even on large datasets. It's the pantry equivalent of opening jars and throwing out anything that's empty or expired. You won't make great selections this way, but you'll clear the counter for the real work ahead.
The limitation is obvious: variance says nothing about whether a feature is useful for prediction. A feature with high variance could be pure random noise. We need something smarter.
Correlation — Catching the Copycats
Two features that move in lockstep are carrying the same information. If "transaction_amount_usd" and "transaction_amount_cents" are both in your dataset — one being 100× the other — keeping both is redundant. The model gains nothing from the second copy, and in some cases (like linear regression) multicollinearity causes real numerical problems.
Pearson correlation measures the linear relationship between two variables, on a scale from −1 to +1. A correlation of 0.95 between two features means they're nearly identical in terms of the linear signal they carry.
import numpy as np
import pandas as pd
corr_matrix = pd.DataFrame(X_filtered).corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal), so each pair
# is examined exactly once and never a feature against itself
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
redundant = [col for col in upper_triangle.columns
             if any(upper_triangle[col] > 0.90)]
We scan the upper triangle of the correlation matrix (to avoid comparing a feature with itself or double-counting pairs) and flag anything above 0.90. From each highly correlated pair, we drop one — typically the one with lower individual correlation to the target.
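The snippet above only flags candidates; here's one way to make the actual drop decision. This is a sketch, not the only approach — it assumes X_filtered is array-like and y is the target, and it keeps whichever member of a correlated pair relates more strongly to the target:
X_df = pd.DataFrame(X_filtered)
target_corr = X_df.corrwith(pd.Series(y)).abs()   # each feature's own link to the target
to_drop = set()
for col in upper_triangle.columns:
    for partner in upper_triangle.index[upper_triangle[col] > 0.90]:
        # From each highly correlated pair, sacrifice the weaker predictor
        to_drop.add(col if target_corr[col] < target_corr[partner] else partner)
X_decorrelated = X_df.drop(columns=sorted(to_drop))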
Back in our pantry: this is like noticing you have three bottles of olive oil — regular, extra virgin, and "premium" that tastes identical to extra virgin. Keep one, toss the duplicates.
But here's where Pearson fails us, and this took me a while to fully internalize. Pearson correlation only measures linear relationships. If feature X and the target Y are related in a curved, U-shaped, or any other nonlinear way, Pearson can return a value near zero — as if there's no relationship at all. For a feature like "distance from city center" predicting house prices, where prices might be high both in the city core and in wealthy suburbs but low in between, Pearson would shrug and say "no correlation." That's a lie by omission.
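You can see the blind spot directly. Here's a made-up U-shaped relationship — the target depends almost perfectly on x, yet Pearson reports next to nothing:
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=5_000)
y_curved = x ** 2 + rng.normal(scale=0.1, size=5_000)   # a clean U-shape plus noise
print(f"Pearson: {np.corrcoef(x, y_curved)[0, 1]:.3f}") # ≈ 0, despite the strong relationship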
Mutual Information — The Nonlinear Detective
Mutual information doesn't care about the shape of the relationship. It measures something more fundamental: how much knowing the value of one variable reduces your uncertainty about the other. If knowing a customer's "average transaction amount" tells you a lot about whether they'll commit fraud — regardless of whether that relationship is linear, quadratic, or some bizarre staircase shape — mutual information will catch it.
The value ranges from 0 (completely independent — knowing one tells you nothing about the other) up toward infinity (knowing one perfectly determines the other). Unlike Pearson's tidy −1 to +1 range, MI values aren't directly comparable across datasets, but they're excellent for ranking features within a single dataset.
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y, random_state=42)
ranking = np.argsort(mi_scores)[::-1]  # highest MI first
for i in ranking[:10]:
    print(f"Feature {i}: MI = {mi_scores[i]:.3f}")
I'm still developing my intuition for exactly how MI estimation works under the hood — it involves estimating probability densities (scikit-learn uses a nearest-neighbors-based estimator), which gets computationally hairy with many features. But as a ranking tool, it's remarkably reliable. Think of it as upgrading from a metal detector that only finds iron (Pearson) to one that finds any metal (MI).
These three techniques — variance threshold, correlation filtering, and mutual information ranking — are all filter methods. They're called that because they filter features using statistics alone, without ever training a model. They're fast, model-agnostic, and a solid first pass. But they evaluate each feature in isolation (or in pairs, for correlation), which means they can miss features that are weak alone but powerful in combination. For that, we need a different approach.
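The classic demonstration of that blind spot is an XOR-style target. Synthetic data, purely illustrative: each feature alone carries essentially zero mutual information with the label, yet together they determine it completely.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(1)
A = rng.integers(0, 2, size=2_000)
B = rng.integers(0, 2, size=2_000)
y_xor = A ^ B                       # the label is the XOR of the two features
X_xor = np.column_stack([A, B])
print(mutual_info_classif(X_xor, y_xor, discrete_features=True, random_state=0))
# ≈ [0. 0.] — each feature alone looks useless to a filter...
print(cross_val_score(RandomForestClassifier(random_state=0), X_xor, y_xor, cv=3).mean())
# ≈ 1.0 — ...yet together they determine the label exactly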
Wrapper Methods — Search with Feedback
What if, instead of evaluating features with statistics, we trained a model, checked how it performed, then asked: "What happens if I remove this feature?" That's the core idea behind wrapper methods. They wrap a model inside a search loop — try different subsets of features, measure performance, and converge on the best set.
Recursive Feature Elimination (RFE) is the most commonly used wrapper. It works by training a model on all features, ranking them by importance (using whatever the model provides — coefficients for linear models, impurity reduction for trees), removing the least important one, and repeating. It's like a cooking competition elimination round: every iteration, the weakest ingredient gets sent home.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
rfe = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,  # remove one feature per iteration
    cv=5, scoring="f1", min_features_to_select=5,
)
rfe.fit(X_train, y_train)
print(f"Optimal feature count: {rfe.n_features_}")
RFECV adds cross-validation to the mix, so you get a reliable estimate of performance at each feature count. The optimal number is wherever the CV score peaks.
The cost is real. With 200 features and 5-fold CV, you're training roughly 1,000 models. If each takes a minute, that's 16 hours of compute. For a production pipeline that runs daily, that's a non-starter. For a one-time feature audit on a new dataset, it can be worth every minute.
There's also a subtle instability issue. RFE can give you different feature sets from slightly different data samples, especially when features are correlated. If features A and B carry similar information, one run might keep A and drop B, while another run does the opposite. Stability selection addresses this by running the selection process on many random subsamples and keeping only features that survive consistently — but that multiplies the compute cost further.
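Stability selection isn't built into scikit-learn's core, but a minimal version is easy to sketch: run a cheap selector on repeated subsamples and count how often each feature survives. This assumes X_train and y_train are numpy arrays, and the 50% subsample size, 10-feature target, and 80% survival threshold are arbitrary knobs, not canon:
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
n_rounds = 30
n_samples = X_train.shape[0]
survival = np.zeros(X_train.shape[1])
rng = np.random.default_rng(42)
for _ in range(n_rounds):
    idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
    selector.fit(X_train[idx], y_train[idx])
    survival += selector.support_                  # count each feature's survivals
stable = np.where(survival / n_rounds >= 0.8)[0]
print(f"{len(stable)} features survived at least 80% of the rounds")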
The pantry version: instead of eyeballing ingredients, you actually cook 1,000 versions of the dish, each time leaving out one ingredient, and see which removals make no difference. Thorough, but exhausting.
Embedded Methods — Selection Baked Into Training
What if the model itself could decide which features matter as part of learning? That's what embedded methods do. Feature selection isn't a separate step — it's woven into the training process.
Tree-based models like Random Forest and XGBoost naturally produce feature importance scores. Every time a tree splits on a feature, it records how much that split reduced prediction error. Features that appear in many splits, high up in the tree, get high importance. Features the trees rarely bother with get low importance.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_10 = np.argsort(importances)[-10:][::-1]  # ten largest, descending
for idx in top_10:
    print(f"Feature {idx}: importance = {importances[idx]:.4f}")
This is computationally free — you were going to train the model anyway, and the importance scores are a byproduct. The limitation is that these scores are specific to the model that produced them. Features that a Random Forest considers unimportant might matter a great deal to a neural network with different inductive biases.
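One way to reduce (not eliminate) that model-specificity is permutation importance, which scikit-learn provides in sklearn.inspection: shuffle one feature's values on held-out data and measure how much the score drops. A sketch, assuming the rf model from above and a held-out validation set X_val, y_val — we'll build one properly in the splitting half:
import numpy as np
from sklearn.inspection import permutation_importance
# Shuffle each feature in turn on held-out data; big score drops = important features
result = permutation_importance(
    rf, X_val, y_val, scoring="f1", n_repeats=10, random_state=42
)
for idx in np.argsort(result.importances_mean)[-5:][::-1]:
    print(f"Feature {idx}: mean F1 drop = {result.importances_mean[idx]:.4f}")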
The L1 Geometry That Makes Lasso Special
L1 regularization, also called Lasso, does something that L2 (Ridge) cannot: it drives feature coefficients to exactly zero. Not close to zero. Exactly zero. The feature is gone, as if it never existed. This makes Lasso a genuine feature selector.
The reason comes down to geometry, and I'll be honest — I find it hard to visualize beyond two dimensions, but the 2D picture is illuminating enough. Imagine you're trying to find the best coefficients for a linear model. Without regularization, you'd land at whatever point minimizes your loss — call it the unconstrained optimum. Now add a constraint: the sum of the absolute values of your coefficients must be below some budget.
In 2D, L1's constraint region is a diamond — a square rotated 45 degrees. L2's constraint region is a circle. The loss function creates elliptical contours radiating out from the unconstrained optimum. As you expand these contours, they'll eventually touch the constraint region. With L2's circle, the contact point can be anywhere along the smooth surface — almost never on an axis, meaning both coefficients stay nonzero. With L1's diamond, the contours almost always hit a corner first, and corners sit on the axes where one coefficient is exactly zero.
That's the whole trick. The sharp corners of the L1 diamond act like a feature guillotine.
import numpy as np
from sklearn.linear_model import LassoCV
# LassoCV is the regression form; for a classifier like our fraud model,
# LogisticRegression(penalty="l1", solver="liblinear") is the L1 analogue
lasso = LassoCV(cv=5, random_state=42).fit(X_train, y_train)
surviving = np.where(lasso.coef_ != 0)[0]
eliminated = np.where(lasso.coef_ == 0)[0]
print(f"Lasso kept {len(surviving)} features, eliminated {len(eliminated)}")
The strength of the regularization — the lambda parameter in textbooks, alpha in scikit-learn, which LassoCV tunes via cross-validation — determines how tight the diamond is. Tighter diamond means fewer surviving features, higher bias, lower variance. It's a dial you can turn.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model for feature selection: filter methods for fast cleanup (variance → correlation → mutual information), wrapper methods for thorough search (RFE), and embedded methods that select during training (tree importance, Lasso). The pantry is organized. You know which ingredients to keep.
What we haven't addressed yet is the other half of the equation: once you've selected your features, how do you partition your data so that the performance numbers you compute actually mean something? A beautifully selected feature set means nothing if your evaluation is corrupted by leakage.
If the discomfort of not knowing how splitting can silently destroy your results is nagging at you, read on.
The Sealed Envelope — Why Splitting Matters
Think of your test set as a sealed envelope containing the final exam. You can study all you want (training), you can quiz yourself with practice tests (validation), but the envelope stays sealed until you're ready to take the real exam. If you peek at the exam questions while studying, your score is meaningless — it reflects memorization, not understanding.
This analogy isn't perfect, but it captures the essential discipline: the test set exists to give you an honest, unbiased estimate of how your model will perform on data it has never seen. Every time you look at test results and then go back to adjust your model, you're tearing open the envelope a little more. Eventually it's not a test at all — it's a second practice exam. And your reported "test accuracy" is a lie you're telling yourself.
I spent the early part of my career treating data splitting as a formality. Split, train, evaluate, done. It took a spectacularly failed deployment — where a model's 97% validation accuracy became 52% in production — to understand that how you split is as important as what you build.
Train, Validation, Test — Three Distinct Jobs
Let's make this concrete with our fraud detection system. We have 100,000 transactions. We need to partition them into three sets, each with a different job.
The training set is what the model learns from. It sees these examples, adjusts its parameters, builds its internal representation of what fraud looks like. The validation set is your tuning workbench. You use it to compare hyperparameter configurations, try different feature sets, decide whether to add regularization. You'll look at validation scores dozens of times during development, and that's fine — that's its purpose. The test set is your final, one-time scorecard. You touch it once, report the number, and live with the result.
Common split ratios are 70/15/15 or 80/10/10. But these are conventions, not laws. With millions of rows, you might use 98/1/1 — even 1% of a million is 10,000 examples, more than enough for a reliable estimate. With a few hundred rows, a fixed validation set is too noisy to be useful, and you're better off using cross-validation instead.
from sklearn.model_selection import train_test_split
# Always carve out the test set first — then split the remainder
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp
)
Notice the two-step process. We extract the test set first, sealing the envelope. Then we split the remainder into train and validation. The 0.176 looks odd — it's because 17.6% of the remaining 85% gives us roughly 15% of the total. And we pass stratify=y to preserve the fraud/legitimate ratio in each set. More on that in a moment.
Data Leakage — Contaminating the Crime Scene
If splitting is the discipline of keeping your evaluation honest, leakage is the most common way that discipline breaks down. Data leakage happens when information that wouldn't be available at prediction time sneaks into the training process.
The analogy I keep coming back to is a crime scene. Detectives seal the scene with tape, wear gloves, log every piece of evidence. If an investigator walks through the scene in street shoes, tracks mud everywhere, and touches things with bare hands — the evidence is contaminated. The prosecution's case falls apart in court, no matter how guilty the suspect actually is. Leakage contaminates your model's evidence in the same way. Your metrics look convincing, but they won't hold up in production.
There are several common sources, and I'll walk through each because they're sneaky.
Preprocessing leakage is the most common one I see in practice. You compute the mean and standard deviation of your features across the entire dataset, then split. The scaler now carries information from the test set — the test data's statistics have leaked into the training process. The fix is rigid: split first, then fit all preprocessors (scalers, imputers, encoders) on the training set only, and use those fitted transformers on validation and test data. Scikit-learn's Pipeline enforces this automatically, which is why experienced practitioners never scale outside a pipeline.
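Here's what that discipline looks like in code — a minimal sketch where the scaler lives inside the pipeline, so it is only ever fitted on training data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("scaler", StandardScaler()),           # fitted on training data only
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)     # scaler statistics come from X_train alone
print(pipe.score(X_val, y_val))  # validation data is transformed, never fitted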
Target leakage is subtler. A feature that's derived from the target — or only available after the target is determined — gives the model the answer directly. In a loan default prediction model, a feature like "total_payments_made" is the answer in disguise: if the customer defaulted, payments stopped. At prediction time (when the loan is first issued), this number doesn't exist yet. A hospital mortality model that uses "total blood transfused" has the same problem — the total is only finalized after the patient's outcome is determined.
Temporal leakage occurs when time-ordered data is shuffled randomly. A stock price prediction model that's seen tomorrow's data in its training set will look spectacularly accurate in validation and be worthless in practice. We'll address this with time-based splitting below.
Group leakage happens when the same entity appears in both training and test sets. If a patient has 20 medical records and 15 end up in training while 5 land in test, the model can learn patient-specific patterns — recognizing the patient, not learning generalizable medicine.
The telltale sign of leakage is always the same: performance that looks too good. If your model achieves 99.8% accuracy on a problem where domain experts expect 85%, don't celebrate. Investigate. In my experience, suspiciously good results are caused by leakage far more often than by algorithmic brilliance.
Stratified Splits — Preserving the Balance
Returning to our fraud detection dataset: suppose 2% of transactions are fraudulent. In a random 80/20 split of 1,000 examples, the test set gets 200 examples. On average, only 4 of those would be fraud cases. Random variation could easily give you a test set with 0 or 1 fraud examples, making any evaluation of fraud detection performance meaningless.
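That "could easily" is quantifiable. Under a purely random split, the number of fraud cases in the test set follows a binomial distribution, and the bad-luck scenario isn't rare:
from scipy.stats import binom
# P(0 or 1 fraud cases among 200 random draws at a 2% fraud rate)
print(f"{binom.cdf(1, 200, 0.02):.3f}")  # ≈ 0.089 — nearly a one-in-ten chance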
Stratified splitting ensures that every partition preserves the class distribution from the original dataset. If the full dataset is 2% fraud, then the training set is 2% fraud, the validation set is 2% fraud, and the test set is 2% fraud. No accidents.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    train_fraud_pct = y[train_idx].mean() * 100
    val_fraud_pct = y[val_idx].mean() * 100
    print(f"Fold {fold}: train {train_fraud_pct:.1f}% fraud, val {val_fraud_pct:.1f}% fraud")
For classification problems, always use stratify=y in your splits. There is essentially no downside and the protection against bad luck is significant.
For regression, where the target is continuous, you can't stratify directly. The workaround is to bin the target into quantile-based buckets and stratify on those bins. It's not perfect, but it prevents the situation where all the extreme values end up in one split.
import pandas as pd
# Bin the continuous target into quintiles, then stratify on the bins
y_binned = pd.qcut(y, q=5, labels=False, duplicates="drop")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y_binned, random_state=42
)
Time-Based Splits — Never Peek at the Future
Everything we've discussed so far assumes that your data points are independent — that the order doesn't matter. For time series data, that assumption is catastrophically wrong.
If you're predicting tomorrow's stock price and your training set contains data from both yesterday and next week, the model has peeked at the future. Random shuffling destroys the temporal structure that defines the problem. The only valid approach is to split on time: train on the past, validate on the near future, test on the further future.
df = df.sort_values("timestamp")
train = df[df["timestamp"] < "2023-01-01"]
# Half-open intervals, so no intraday timestamp falls into a gap between sets
val = df[(df["timestamp"] >= "2023-01-01") & (df["timestamp"] < "2023-07-01")]
test = df[df["timestamp"] >= "2023-07-01"]
I still find this hard to accept emotionally. You're throwing away recent data — the most relevant data — by putting it in the test set instead of training on it. Every instinct says to use your freshest data for training. But the alternative is self-deception. A time series model evaluated with shuffled splits will show you beautiful metrics that mean absolutely nothing.
For cross-validation on temporal data, sklearn provides TimeSeriesSplit, which implements an expanding window approach. Each successive fold uses all data up to a cutoff for training and the data immediately after for validation. No fold ever uses future data for training.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train {len(train_idx)} samples, val {len(val_idx)} samples")
In finance, practitioners go further with purging and embargo. Purging removes training samples that overlap with the validation window (when labels are computed over periods that cross the boundary). Embargo adds a gap between training and validation to prevent any autocorrelation leakage. These aren't academic niceties — they're the difference between a backtest you can trust and one that's lying to you.
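Full purging takes label-aware bookkeeping (López de Prado's book walks through it), but TimeSeriesSplit does offer a simple embargo-like mechanism out of the box — a gap parameter that discards samples between each training window and its validation window:
# gap=10 leaves ten samples unused between train and validation in every fold,
# so autocorrelated neighbors can't straddle the boundary
tscv_embargo = TimeSeriesSplit(n_splits=5, gap=10)
for train_idx, val_idx in tscv_embargo.split(X):
    print(f"train ends at index {train_idx[-1]}, val starts at index {val_idx[0]}")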
Group-Based Splits — Keeping Entities Together
In medical studies, a single patient might contribute dozens of data points — lab results from different visits, multiple imaging scans, repeated measurements. If some of those records end up in training and others in test, the model can learn to recognize the patient rather than the disease. It's memorizing faces instead of learning medicine.
Group-based splitting ensures all records belonging to the same entity stay in the same partition. If patient 42 is in the training set, all of patient 42's records are in the training set.
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=df["patient_id"]):
    train_patients = df["patient_id"].iloc[train_idx].nunique()
    val_patients = df["patient_id"].iloc[val_idx].nunique()
    print(f"Train: {train_patients} patients, Val: {val_patients} patients")
This applies far beyond medicine. Multiple reviews from the same user. Multiple images from the same camera. Multiple transactions from the same account. Whenever your data has a grouping structure, random splitting violates independence and inflates your metrics. Group splits enforce the integrity your evaluation needs.
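And when you need both properties at once — each entity kept together and the class ratio preserved — scikit-learn (1.0 and later) combines them in StratifiedGroupKFold. A sketch for our fraud setting, where the hypothetical account_id column identifies the entity:
from sklearn.model_selection import StratifiedGroupKFold
# Keeps each account's transactions in one fold while balancing fraud rates
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in sgkf.split(X, y, groups=df["account_id"]):
    print(f"Validation fraud rate: {y[val_idx].mean():.3f}")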
Cross-Validation — When One Split Isn't Enough
With a small dataset, any single train/validation split is noisy. Move a few examples around and your accuracy swings by 5%. That's not telling you anything useful about your model — it's telling you about the luck of the draw.
K-fold cross-validation addresses this by making K different splits, each time using a different chunk as the validation set, and averaging the results. Every example gets to be in the validation set exactly once. The variance of your estimate drops, and you get a much more stable picture of model performance.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)  # any estimator works here
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
Five-fold is the standard default. Ten-fold gives a slightly more stable estimate but costs twice as much. Leave-one-out (where K equals the number of samples) is theoretically elegant but impractically slow for anything larger than a few hundred rows, and it produces high-variance estimates despite its reputation.
The real power of cross-validation is that it lets you use all your data for both training and evaluation, which matters enormously when data is scarce. But it doesn't replace the test set — cross-validation is for model development and hyperparameter tuning. The test set remains your sealed envelope for the final evaluation.
Handling Class Imbalance
When 99% of your transactions are legitimate and 1% are fraud, a model that always predicts "legitimate" gets 99% accuracy while catching zero fraud. The accuracy metric is meaningless. The model is useless. This is the class imbalance problem, and it intersects with both feature selection (some features only matter for the minority class) and splitting (you need stratification to keep the minority class represented).
Three approaches handle this, and in practice you often combine them.
Resampling changes what the model sees during training. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority examples by interpolation — picking a fraud case and one of its nearest fraud neighbors, then generating a new point on the line segment between them in feature space. This is better than duplicating existing minority examples, which would cause overfitting. A critical rule: resampling is applied only to the training set. Never to validation or test data. Resampled evaluation sets give you inflated metrics.
from imblearn.over_sampling import SMOTE
# Oversample the minority class in the training data only — never the eval sets
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
Class weights tell the model to care more about minority class errors. Instead of treating every misclassification equally, a weighted model penalizes missing a fraud case 99× more than falsely flagging a legitimate one. Most sklearn estimators support this through the class_weight="balanced" parameter, which automatically sets weights inversely proportional to class frequencies.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)
Better metrics sidestep the accuracy trap entirely. Precision, recall, F1, and AUC-ROC each tell a different part of the story. With imbalanced data, these are the numbers that matter. Chapter 4 covers them in depth.
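As a preview, here's how those numbers come out of scikit-learn — a sketch assuming the fitted rf classifier from earlier and the held-out test set:
from sklearn.metrics import classification_report, roc_auc_score
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))  # precision, recall, F1 per class
print(f"AUC-ROC: {roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]):.3f}")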
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started at the kitchen pantry — the overwhelming pile of features dumped on the counter — and worked through three families of selection methods: fast statistical filters that throw away the dead weight, wrapper methods that search with model feedback, and embedded methods where the model itself decides what matters. We watched Lasso's diamond-shaped constraint region guillotine irrelevant features to zero. Then we crossed into splitting territory, where we sealed the test set in an envelope, learned to preserve class distributions with stratification, refused to let time-series models peek at the future, and kept grouped entities together. We saw how leakage can contaminate everything like a careless investigator at a crime scene.
My hope is that the next time you see suspiciously good validation numbers, instead of celebrating, you'll pause and ask: "Where's the leak?" And the next time someone hands you a dataset with 500 features, instead of feeding them all to a model and hoping for the best, you'll reach for the variance threshold, check for copycats, and let mutual information find the signal hiding in the noise. You'll have a pretty darn good mental model of what's going on under the hood.
Further Reading
scikit-learn Feature Selection Guide — The official docs are surprisingly readable and cover every method we discussed, with working code.
"An Introduction to Statistical Learning" (James, Witten, Hastie, Tibshirani) — Chapter 6 on regularization and feature selection is wildly clear for a stats textbook. The geometric explanation of L1 vs L2 is where my understanding clicked.
"Advances in Financial Machine Learning" (Marcos López de Prado) — The definitive treatment of purging, embargo, and why naive cross-validation destroys financial backtests. Dense but unforgettable.
Kaggle's "Data Leakage" micro-course — Short, practical, and full of examples that make you go "oh no, I've done that." Highly recommended.
"Feature Engineering and Selection" (Kuhn & Johnson) — The most thorough practical guide to both creating and selecting features. The chapter on near-zero variance predictors alone is worth the read.