Time Series Forecasting
I avoided time series for longer than I'd like to admit. Every time I saw ARIMA(1,1,2) or someone casually dropped "differencing to achieve stationarity," I'd nod along and quietly change the subject. Regression? Fine. Classification? Comfortable. But the moment data had a time axis, something about the whole discipline felt like it operated under different physics. Eventually the discomfort of not knowing what was actually happening grew too great to ignore. Here is that dive.
Time series forecasting is the problem of predicting future values from past observations, where the order of those observations matters. It shows up everywhere — predicting tomorrow's sales, next hour's server load, next quarter's revenue. The field has roots stretching back to the 1970s with Box and Jenkins' ARIMA methodology, and it's been transformed in recent years by gradient boosted trees, transformer architectures, and now foundation models that can forecast on data they've never seen before.
Before we start, a heads-up. We're going to talk about autocorrelation, stationarity, differencing, Fourier features, and walk-forward validation. But you don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Contents
Why Time Series Is a Different Animal
Autocorrelation — The Property That Changes Everything
Stationarity — Taming the Wandering Series
ARIMA — Building Prediction from Three Ideas
The ML Reframe — Turning Forecasting into Supervised Learning
Feature Engineering — Where Time Series ML Lives or Dies
Walk-Forward Validation — The Only Honest Evaluation
The Tree Extrapolation Trap
Rest Stop
Prophet — What's Actually Inside
Deep Learning — From N-BEATS to Foundation Models
Production — Where Time Series Models Go to Die
The Full Pattern — LightGBM from Start to Finish
Resources
Why Time Series Is a Different Animal
Imagine you run a small coffee shop. You've been tracking daily cup sales for a year, and now you want to predict next week. Here are the first few days of your data:
Day 1 (Mon): 142 cups
Day 2 (Tue): 137 cups
Day 3 (Wed): 145 cups
Day 4 (Thu): 139 cups
Day 5 (Fri): 168 cups
Day 6 (Sat): 210 cups
Day 7 (Sun): 195 cups
Most of the ML you've learned so far treats data points as independent and identically distributed — i.i.d. for short. Each row in your dataset is assumed to be a random draw from the same underlying distribution, unrelated to every other row. You can shuffle them, split them randomly into train and test, and nothing breaks.
Time series violates that assumption in the most fundamental way possible: each observation depends on the ones before it. Monday's 142 cups tells you something about Tuesday. Friday's spike hints at Saturday's. Last December's holiday rush predicts this December's. The data has memory, and if you ignore that memory — if you shuffle these rows and treat them like independent customer records — you'll build a model that looks brilliant in evaluation and fails the moment it touches reality.
This single fact — temporal dependence — changes everything. How you validate. How you engineer features. How you split data. How you scale. How you deploy. It's the reason time series forecasting feels like a different discipline. Because it is.
Autocorrelation — The Property That Changes Everything
Back at our coffee shop. You notice something when you plot the sales: busy days tend to follow busy days. Slow days cluster with slow days. This isn't coincidence — it's the defining property of time series data, and it has a name.
Autocorrelation is the correlation of a signal with a delayed version of itself. Take your daily cup sales and compare each day with the day before it. If Monday is high, Tuesday tends to be high too. That's autocorrelation at lag 1. Now compare each day with the same day last week. Monday tends to resemble last Monday. That's autocorrelation at lag 7.
We can measure this precisely. The autocorrelation function (ACF) gives you the correlation between the series and itself at each lag. Plot it, and the significant spikes tell you which past time steps carry useful information. A spike at lag 7 in daily data screams "weekly pattern." A spike at lag 365 screams "yearly seasonality." The partial autocorrelation function (PACF) does something subtler — it shows the correlation at lag k after removing the effects of all shorter lags, which helps isolate direct dependencies from inherited ones.
I'll be honest — I found the distinction between ACF and PACF confusing for an embarrassingly long time. The way it finally clicked: ACF is like asking "how similar is today to 7 days ago?" while PACF is asking "how similar is today to 7 days ago, above and beyond what days 1 through 6 already told us?" PACF strips away the chain of correlations to show what each lag contributes uniquely.
These two plots — ACF and PACF — are your first diagnostic tool for any time series. They tell you how deep the memory goes, whether there's seasonality, and how many lag features your model might need. We'll use them repeatedly.
Stationarity — Taming the Wandering Series
Our coffee shop is growing. In year one, we averaged 150 cups a day. In year two, 200. In year three, 260. The mean is drifting upward. The variance might be growing too — our best days are getting wilder as the shop gets more popular. A model trained on year-one data would be hopelessly wrong for year three. The statistical ground is shifting under our feet.
A stationary series has statistical properties — mean, variance, autocorrelation structure — that don't change over time. If you took any random chunk of the series and compared it to any other chunk, they'd look like they came from the same distribution. Most classical forecasting models were designed for stationary data. Feed them a series with a rising trend, and they'll produce forecasts that are mathematically valid but practically useless.
How do you know if your series is stationary? Plot it — that's always step one. If you see a trend, expanding variance, or a mean that's drifting, it's not stationary. But eyeballing only goes so far, so we have formal tests.
The Augmented Dickey-Fuller (ADF) test is the workhorse. Its null hypothesis is that the series has a unit root — a technical way of saying "it's non-stationary." If the p-value is below 0.05, you reject the null, and the series is stationary. The KPSS test flips the logic: its null hypothesis is that the series IS stationary. Rejecting it means non-stationarity. Using both tests together gives you the clearest picture. If ADF says stationary and KPSS agrees, you're on solid ground. If they disagree, you likely have a trend-stationary series that needs more investigation.
from statsmodels.tsa.stattools import adfuller, kpss
# ADF: null = non-stationary. Small p → stationary.
adf_stat, adf_p, _, _, _, _ = adfuller(series)
# KPSS: null = stationary. Small p → NOT stationary.
kpss_stat, kpss_p, _, _ = kpss(series, regression="c")
# Both agree the series is stationary? Great. Proceed.
# They disagree? Investigate trend-stationarity.
If the series isn't stationary, the most common fix is differencing — subtract each value from the one before it: y'(t) = y(t) − y(t−1). Instead of modeling the raw cup count (which drifts upward), you model the change in cup count (which might bounce around a stable zero). If one round of differencing isn't enough, difference again. If the variance is growing with the level — bigger values have bigger swings — apply a log transform first to stabilize it.
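As a quick sketch of what differencing does — with made-up cup counts — pandas makes each round a one-liner:

```python
import numpy as np
import pandas as pd

# Hypothetical drifting cup counts: the level rises, the changes don't
raw = pd.Series([150.0, 155.0, 153.0, 160.0, 158.0, 166.0, 171.0])

diff1 = raw.diff()        # y'(t) = y(t) - y(t-1); the first value is NaN
diff2 = diff1.diff()      # difference the differences if one round isn't enough

# If variance grows with the level, stabilize it with a log first
log_diff = np.log(raw).diff()

print(diff1.dropna().tolist())
```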
There's a subtlety that tripped me up early on. Stationarity isn't a binary you either care about or don't. Even ML models that don't formally require it — like gradient boosted trees — tend to learn more stable patterns from stationary inputs. The features become more consistent across time. The model doesn't have to waste capacity figuring out that "200 cups" meant something different in year one than in year three.
ARIMA — Building Prediction from Three Ideas
With autocorrelation and stationarity under our belt, we can now build the most celebrated classical forecasting model from the ground up. ARIMA stands for AutoRegressive Integrated Moving Average, and despite the imposing name, it's built from three ideas you already understand.
Let's return to our coffee shop. Suppose, after differencing, our stationary series of daily changes looks like this for a short stretch: +5, −3, +8, −2, +6. We want to predict tomorrow's change.
The AutoRegressive part says: tomorrow's value is a weighted sum of recent past values. If we use two lags (p=2), that's: ŷ(t) = φ₁·y(t−1) + φ₂·y(t−2). In our coffee shop, maybe φ₁ is 0.6 and φ₂ is 0.2. Tomorrow's predicted change is 0.6 times today's change plus 0.2 times yesterday's change. The model is saying "recent momentum matters." The number of lags (p) is typically chosen by looking at the PACF plot — significant spikes tell you how many direct dependencies exist.
But autoregression alone misses something. What about the mistakes we made? If we predicted +4 yesterday but the actual was +6, we were off by +2. That error might carry information — maybe we systematically undershoot on busy days. The Moving Average part incorporates past forecast errors: ŷ(t) = θ₁·ε(t−1) + θ₂·ε(t−2), where ε represents the errors from previous predictions. This is a self-correcting mechanism. The number of error lags (q) comes from the ACF plot of the residuals.
The Integrated part is the differencing we already discussed — the "I" in ARIMA. The parameter d tells you how many times you differenced the series to make it stationary. d=1 means you differenced once. d=2 means you differenced the differences.
Put them together and ARIMA(p, d, q) gives you a model with p autoregressive lags, d differencing steps, and q moving average terms. The seasonal variant, SARIMA, adds a second set of parameters (P, D, Q, m) to handle repeating patterns — like our coffee shop's weekly cycle where m=7.
# Let pmdarima figure out the best parameters
from pmdarima import auto_arima

model = auto_arima(
    series,
    seasonal=True, m=7,  # weekly seasonality
    stepwise=True,
    suppress_warnings=True
)
forecast = model.predict(n_periods=14)  # next two weeks
I want to be direct about something: auto_arima is doing you a massive favor here. Manually selecting p, d, and q by staring at ACF/PACF plots and running information criterion comparisons is educational, but in practice, the automated search gets you there faster and more reliably. The one thing to watch is the seasonal period — you need to tell it that m=7 for weekly data or m=12 for monthly data. It can't figure that out on its own.
ARIMA is not a relic. On short, clean, univariate series — where you're forecasting one variable with a few hundred data points and no external features — it remains remarkably hard to beat. It's your baseline. If your fancy ML model can't outperform ARIMA, the extra complexity isn't earning its keep.
The limitation is also clear. ARIMA is linear. It handles one variable. It doesn't accept external features like "is there a holiday this week" or "what's the weather forecast." For that, we need a different approach entirely.
The ML Reframe — Turning Forecasting into Supervised Learning
Here's the insight that unlocks the entire modern approach to time series: you can rewrite any forecasting problem as a supervised learning problem by engineering the right features. Once you do that, you can use any tabular model you want — LightGBM, random forests, even linear regression.
Think about what our coffee shop owner does intuitively. When she predicts tomorrow's cups, she considers: how many cups did we sell today? How many last Monday? Is tomorrow a holiday? Has there been a trend lately? She's mentally constructing features from the past and using them to predict the future.
We do the same thing mechanically. Take the historical time series and, for each time step, create columns that encode what we'd know at that point: yesterday's sales, last week's sales, the rolling average of the past 7 days, the day of the week, the month. Each row becomes a standard tabular data point with features and a target. The target is the value we're trying to predict. The features are derived entirely from the past.
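Mechanically, the reframe is just a few `shift` calls. A minimal sketch with hypothetical cup counts:

```python
import pandas as pd

# Ten hypothetical days of cup sales
cups = pd.Series([142, 137, 145, 139, 168, 210, 195, 148, 141, 150])

# Each row pairs features derived purely from the past with a target
supervised = pd.DataFrame({
    "lag_1": cups.shift(1),   # yesterday's sales
    "lag_7": cups.shift(7),   # same day last week
    "target": cups,           # the value we're predicting
}).dropna()

print(supervised)
```

Rows without enough history get dropped by `dropna()`; any tabular model can fit what remains.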
This reframe is powerful because it lets us bring the entire toolkit of supervised learning to bear — gradient boosted trees, feature importance, hyperparameter tuning, all of it. But it comes with a catch that's easy to underestimate: the features are everything. A mediocre model with brilliant temporal features will crush a brilliant model with mediocre features. In time series ML, feature engineering is the ball game.
Feature Engineering — Where Time Series ML Lives or Dies
Let's build our coffee shop's feature set from scratch, one category at a time.
Lag Features
The most fundamental temporal features. If we're predicting cups sold on day t, we create columns for day t−1, day t−2, day t−7 (same day last week), day t−14, day t−28. Each lag captures a different kind of memory. Lag 1 captures momentum. Lag 7 captures the weekly cycle. Lag 28 captures the monthly rhythm.
How many lags? Look at the ACF plot we discussed earlier. If there's a significant spike at lag 7, include lag 7. If the autocorrelation dies off after lag 3, you probably don't need lag 30. Let the data tell you.
for lag in [1, 2, 3, 7, 14, 28]:
    df[f"lag_{lag}"] = df["cups"].shift(lag)
Rolling Statistics
Lags capture specific past values. Rolling statistics capture recent behavior. The rolling mean of the past 7 days tells the model "we've been in a busy stretch." The rolling standard deviation tells it "sales have been volatile lately." Rolling min and max capture recent extremes.
There's a trap here that I still occasionally catch myself falling into. When you compute a rolling mean, you have to make sure it doesn't include the current day's value — because the current day's value IS the target. That .shift(1) is not optional. Without it, you're leaking the answer into the features, and your model will look suspiciously accurate during development and fail in production when the current day's value isn't available yet.
for window in [7, 14, 30]:
    # shift(1) ensures we only use PAST values, not the current day
    df[f"roll_mean_{window}"] = df["cups"].shift(1).rolling(window).mean()
    df[f"roll_std_{window}"] = df["cups"].shift(1).rolling(window).std()
Calendar Features
Time itself carries signal, and it's often the strongest signal available. Our coffee shop sells more on weekends. More in December. More around holidays. These patterns are trivially easy to encode and often do more work than any fancy technique.
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
Fourier Features for Seasonality
Calendar features like "month = 12" treat December as a category. That's fine, but the model doesn't know that December is close to January, or that the seasonal curve rises smoothly through fall. Fourier features fix this by encoding time as smooth, periodic sine and cosine waves.
The idea: take the time index t and the period P (say, 365.25 for yearly seasonality), and create pairs of features: sin(2πkt/P) and cos(2πkt/P) for k = 1, 2, …, K. With K=1, you get one big smooth wave per year. With higher K values, you get more wiggly patterns that can capture finer seasonal detail. This is exactly what Prophet uses under the hood for its seasonality model.
import numpy as np
period = 365.25
for k in range(1, 4):  # 3 harmonics
    df[f"sin_year_{k}"] = np.sin(2 * np.pi * k * df["day_of_year"] / period)
    df[f"cos_year_{k}"] = np.cos(2 * np.pi * k * df["day_of_year"] / period)
The beauty of Fourier features is that they place December 31 and January 1 right next to each other in feature space, which raw month integers never do. The downside: too many harmonics and you overfit the seasonality. Start with 3–5 and adjust.
The Golden Rule
Every feature must be computed using only information available at prediction time. No centered rolling windows. No features from tomorrow. No statistics computed on the full dataset before splitting. At every point in time, ask yourself: "If this were production and I were making this prediction right now, would I actually have this number?" If the answer is no, it's leakage — the most insidious kind, because your offline metrics will look great and your production model will quietly underperform.
Walk-Forward Validation — The Only Honest Evaluation
You've engineered your features. You've built your model. Now you need to know: how well does it actually work? And here's where most people's first instinct leads them astray.
Standard K-fold cross-validation randomly shuffles your data into folds. Fold 3 might contain data from March, while fold 1 contains data from June. When fold 3 is held out for testing, the model trains on fold 1's June data to predict March — it's literally learning the future to predict the past. The resulting metrics are meaningless. They don't measure forecasting ability. They measure memorization.
Walk-forward validation is the fix. It mirrors exactly what happens in production: train on the past, predict the next time window, then slide forward.
For our coffee shop with two years of data, it might look like this: train on months 1–12, test on month 13. Then train on months 1–13, test on month 14. Then months 1–14, test on 15. Keep going. At every step, the model has never seen any future data. The average of all those test scores tells you what to actually expect in production.
This is called expanding window validation because the training set grows each fold. The alternative is sliding window, where you keep the training window fixed — train on the most recent 12 months, always. Use sliding when you believe older data has become irrelevant. Consumer behavior shifts. Markets evolve. Sometimes last year's data hurts more than it helps.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # train is always BEFORE test in time. Always.
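The sliding-window variant doesn't need custom code either — TimeSeriesSplit takes a `max_train_size` argument that caps the training window so it slides instead of expanding:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in for 100 days of data

# Cap training at the most recent 30 observations: sliding, not expanding
tscv = TimeSeriesSplit(n_splits=5, max_train_size=30)
for train_idx, test_idx in tscv.split(X):
    print(f"train {train_idx[0]}-{train_idx[-1]}  test {test_idx[0]}-{test_idx[-1]}")
```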
For financial data, even walk-forward isn't paranoid enough. Marcos López de Prado introduced purged cross-validation, which adds two more protections. First, purging: if your features use a rolling window of, say, 10 days, then the last 10 days of the training set overlap with the test set through their features. Remove those overlapping samples. Second, an embargo gap: don't use the few observations right after the test set for training either, because information can leak backward through rolling calculations. This matters in finance where you're working with overlapping returns and rolling features. For most other domains, standard walk-forward is sufficient.
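A minimal sketch of those two protections — note that `purged_splits` is my own illustrative helper, not López de Prado's actual implementation, and real purged K-fold has more machinery:

```python
import numpy as np

def purged_splits(n_samples, n_splits=5, purge=10, embargo=5):
    """Yield (train, test) index arrays where `purge` samples before each
    test window and `embargo` samples after it are dropped from training.
    Hypothetical helper -- a sketch, not a library implementation."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        test_start, test_end = k * fold, (k + 1) * fold
        train = np.concatenate([
            np.arange(0, max(0, test_start - purge)),                  # purged gap
            np.arange(min(n_samples, test_end + embargo), n_samples),  # embargo gap
        ])
        yield train, np.arange(test_start, test_end)

for train, test in purged_splits(120):
    # no training index falls inside the purge/embargo buffer around the test window
    assert not np.any((train >= test[0] - 10) & (train < test[-1] + 1 + 5))
```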
The Tree Extrapolation Trap
Here's something that catches almost everyone the first time they apply gradient boosted trees to time series data, and it's important enough to warrant its own section.
Our coffee shop has been growing. Year one averaged 150 cups/day. Year two averaged 200. Year three, 260. You train LightGBM on all three years, and it makes predictions for year four that... plateau around 260. It refuses to predict 300 or 350, even though the upward trend is obvious to your eyes.
This isn't a bug. It's a fundamental property of how decision trees work. A tree partitions the feature space into rectangular regions and assigns each region a constant value — the average of the training targets that fell into that region. It's a piecewise constant function. It can only output values it saw during training. If the maximum cup count in the training data was 280, no combination of splits will ever predict 290. The model is incapable of extrapolating beyond the training range.
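You can watch this happen in a few lines of scikit-learn — train a tree on a clean upward trend (made-up numbers mimicking the coffee shop's growth), then ask it about the future:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic steadily rising series
X_train = np.arange(100).reshape(-1, 1)
y_train = 150 + 1.1 * np.arange(100)          # climbs from 150 to ~259

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Ask for predictions beyond the training range
preds = tree.predict(np.arange(100, 130).reshape(-1, 1))

# The forecast flatlines: every future input falls into the rightmost leaf,
# so no prediction can ever exceed the training maximum
print(f"max prediction: {preds.max():.1f}, training max: {y_train.max():.1f}")
```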
I've seen senior engineers lose days to this before realizing what was happening. The fix is to separate the problem into two parts. Let a simple model handle the trend — linear regression is perfect for this. Remove the trend from the data (subtract the linear fit), and then let LightGBM handle everything else: seasonality, day-of-week effects, holiday spikes, the complex nonlinear patterns it excels at. At prediction time, add the trend back.
# Separate trend from everything else
from sklearn.linear_model import LinearRegression

time_idx = np.arange(len(df)).reshape(-1, 1)
# train_size = number of rows in the training split; fit the trend on train only
trend_model = LinearRegression().fit(time_idx[:train_size], df["cups"][:train_size])
df["trend"] = trend_model.predict(time_idx)
df["detrended"] = df["cups"] - df["trend"]
# Now train LightGBM on the detrended residuals
# At prediction time: final_pred = trend_model.predict(future_idx) + lgbm.predict(features)
Another approach: instead of predicting the raw cup count, predict the difference from the previous day. Differences bounce around zero regardless of the overall level, so trees handle them naturally. The tradeoff is that errors compound over multi-step horizons.
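Reconstructing a level forecast from predicted differences is just a cumulative sum, and the sketch below (with invented model outputs) also shows why errors compound:

```python
import numpy as np

last_known = 260.0                                   # today's actual cup count
predicted_diffs = np.array([2.0, 1.5, -0.5, 3.0])    # hypothetical model output

# Each level forecast builds on the previous one
level_forecast = last_known + np.cumsum(predicted_diffs)
print(level_forecast)

# If every diff prediction is biased by +1, the 4-step forecast is off by +4:
# the bias accumulates linearly with the horizon
biased = last_known + np.cumsum(predicted_diffs + 1.0)
print(biased - level_forecast)
```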
I still don't think the community has fully settled on the best way to handle this. Some teams use hybrid models (linear trend + tree residuals). Others predict differences. A few use specialized architectures like N-BEATS that handle trend natively. No one is completely certain which approach wins in general, and the answer likely depends on how strong and stable the trend is.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a mental model that covers the essentials of time series forecasting: why temporal data is different from cross-sectional data, how autocorrelation gives time series its memory, why stationarity matters and how to achieve it, how ARIMA works from the ground up, the feature engineering approach that turns forecasting into supervised learning, why walk-forward validation is the only honest evaluation, and why trees can't see beyond what they've been trained on.
That's a solid foundation. With lag features, calendar features, LightGBM, and walk-forward validation, you can build a respectable time series model for most business problems. The short version: engineer features from the past, validate by always training before testing, and never let the future leak in. That alone gets you 80% of the way.
But there's more to the story. Prophet takes a fundamentally different approach by decomposing time series into interpretable components. Deep learning architectures like N-BEATS and Temporal Fusion Transformers are pushing accuracy boundaries on complex multivariate problems. And foundation models are starting to forecast on data they've never trained on. These aren't academic curiosities — they show up in production systems and interview questions alike.
If the discomfort of not knowing what's underneath is nagging at you, read on.
Prophet — What's Actually Inside
Facebook (now Meta) released Prophet in 2017, and it became wildly popular because it made time series forecasting accessible to analysts who'd never heard of ARIMA. But most people use it as a black box. Let's open it up.
Prophet is, at its core, a generalized additive model. It says that any time series can be decomposed into a sum of components: y(t) = g(t) + s(t) + h(t) + ε(t). The trend g(t) captures long-term growth. The seasonality s(t) captures repeating patterns. The holidays h(t) capture specific dates like Christmas or Black Friday. And ε(t) is the noise that no model can predict.
The trend component is a piecewise linear function with automatically detected changepoints. Imagine drawing a line through your coffee shop's growth, but allowing the line to change slope at certain moments — maybe when you opened a second register, or when a competitor moved in next door. Prophet places potential changepoints at evenly spaced intervals through the first 80% of the history, then uses regularization to decide which changepoints are real (significant slope changes) and which are noise. The result is a trend line that can bend at meaningful moments without overfitting to every wiggle.
The seasonality component is built from Fourier series — the same sine and cosine features we discussed earlier. For yearly seasonality, Prophet fits a sum of sine and cosine terms at different frequencies. More Fourier terms means a more flexible seasonal curve. Ten terms is the default for yearly seasonality, and three for weekly. This is Prophet's clever trick: it doesn't need to learn "Fridays are busier" as a categorical fact. The Fourier basis captures that smooth weekly wave automatically.
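To make the structure tangible, here's a numpy sketch of Prophet's two main components — a piecewise linear trend that stays continuous across changepoints, plus Fourier seasonality. This illustrates the model form only; it is not Prophet's actual code, and every parameter value below is invented:

```python
import numpy as np

def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Base slope k, offset m; the slope changes by delta at each changepoint,
    with offsets adjusted so the trend stays continuous (Prophet-style)."""
    slope = k + sum(d * (t >= cp) for cp, d in zip(changepoints, deltas))
    offset = m + sum(-d * cp * (t >= cp) for cp, d in zip(changepoints, deltas))
    return slope * t + offset

def fourier_seasonality(t, period, betas):
    """Sum of sin/cos pairs; betas holds two coefficients per harmonic."""
    s = np.zeros_like(t, dtype=float)
    for i in range(len(betas) // 2):
        arg = 2 * np.pi * (i + 1) * t / period
        s += betas[2 * i] * np.sin(arg) + betas[2 * i + 1] * np.cos(arg)
    return s

t = np.arange(730, dtype=float)   # two years of daily data
g = piecewise_linear_trend(t, k=0.15, m=150.0,
                           changepoints=[365.0], deltas=[0.05])
s = fourier_seasonality(t, period=7.0, betas=[12.0, 4.0, 3.0, 1.0])
y = g + s   # g(t) + s(t); holidays h(t) would add indicator-style terms
```

The offset correction inside `piecewise_linear_trend` is the detail that matters: without it, a slope change at the changepoint would create a discontinuous jump in the trend line.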
Under the hood, Prophet fits these parameters using Stan, a probabilistic programming language. By default it uses MAP (maximum a posteriori) estimation, which is fast. Switch to MCMC sampling and you get full posterior distributions — uncertainty intervals that actually mean something probabilistically.
Where Prophet shines: business time series with daily data, strong seasonality, holiday effects, and missing values. It handles gaps gracefully. It gives you interpretable components you can show to stakeholders. Where it struggles: high-frequency data (minute-level), multivariate problems (it forecasts one series at a time with no input features), and datasets where a well-engineered LightGBM model would find nonlinear patterns that Prophet's additive structure can't capture.
Deep Learning — From N-BEATS to Foundation Models
LSTMs had their moment as the go-to neural network for sequences, and here's the uncomfortable truth that took the field years to fully internalize: on most tabular time series datasets — the kind with rows and columns of engineered features — gradient boosted trees beat LSTMs. The deep learning overhead, the architecture decisions, the GPU requirements, the training instability — rarely worth it when your data fits in a spreadsheet.
But there are problems where deep learning genuinely earns its complexity.
N-BEATS (Neural Basis Expansion Analysis for interpretable Time Series forecasting) was published in 2019 and outperformed the winning entry of the M4 forecasting competition on the M4 dataset. What makes it remarkable: it's a pure feedforward network. No recurrence. No convolutions. It's built from stacked blocks, each containing a few fully connected layers. Each block takes the residual from previous blocks — what they couldn't explain — and produces two outputs: a backcast (its best reconstruction of the input) and a forecast (its prediction of the future). The final forecast is the sum of all blocks' forecasts. The interpretable variant constrains the blocks to produce explicit trend and seasonality components, which means you can see what the network thinks is trend and what it thinks is seasonal. N-BEATS showed that you don't need specialized architectures to beat classical methods — you need depth and the right inductive bias.
The Temporal Fusion Transformer (TFT), published by Google researchers in 2021, tackles a harder problem: multivariate forecasting where you have static metadata (like store ID or product category), known future inputs (holidays, promotions), and observed historical features that won't be available in the future. TFT handles all three through variable selection networks that learn which features matter — and can tell you. Its attention mechanism operates over time steps, so you can visualize which past moments the model focused on for each prediction. For complex, real-world forecasting with dozens of input signals, TFT remains one of the most capable architectures available.
And then, in 2024, the ground shifted again. Foundation models for time series arrived. Google's TimesFM, trained on over 100 billion time points, can forecast on series it has never seen before — zero-shot, no training required. Amazon's Chronos does the same and is open-source. Lag-Llama adapts the transformer architecture for probabilistic time series prediction. The implications are profound: instead of training a model for each new forecasting task, you might hand your raw data to a pre-trained model and get competitive forecasts immediately.
I'll be candid — I'm still building my intuition for when these foundation models fail. They're trained on massive, diverse datasets, and they generalize surprisingly well. But they're also expensive at inference time, they lack the interpretability of a well-understood LightGBM model, and we don't yet have enough production experience to know their failure modes. My current rule of thumb: for a new project, try a foundation model as a baseline alongside LightGBM. If the foundation model wins without any feature engineering, the problem might not warrant a custom model at all.
Production — Where Time Series Models Go to Die
A model that forecasts well in your notebook can degrade silently in production. Time series production is uniquely treacherous because the world keeps changing and your model, frozen at deployment time, doesn't.
Concept Drift
Our coffee shop opened a drive-through. Overnight, daily volumes jumped 40% and the weekday/weekend ratio shifted. The model trained on sit-in-only data is now predicting a world that no longer exists. This is concept drift — the relationship between features and target changes over time. COVID was the most dramatic recent example: every model trained on pre-2020 data became instantly useless for anything behavior-related.
The fix: retrain regularly. Monitor prediction error over time. When the error starts climbing — not if, when — retrain on recent data. Many production teams retrain daily or weekly as a matter of course, treating the model as a living thing rather than a one-time artifact.
Horizon Degradation
Predicting tomorrow's cup count is much easier than predicting next month's. Model accuracy degrades as the forecast horizon extends, and it degrades non-linearly. A model that's 95% accurate at 1-day ahead might drop to 70% at 7-day ahead and 50% at 30-day ahead. Always report performance per horizon, not as a single averaged number. Stakeholders need to know that "the model predicts well" has an asterisk: well, for how far out?
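Concretely: keep each forecast tagged with its horizon, and aggregate error per horizon rather than overall. The numbers below are invented for illustration:

```python
import pandas as pd

# Hypothetical forecast log: one row per (forecast origin, horizon) pair
results = pd.DataFrame({
    "horizon": [1, 1, 1, 7, 7, 7, 30, 30, 30],
    "actual":  [150, 160, 155, 150, 160, 155, 150, 160, 155],
    "pred":    [152, 158, 154, 158, 150, 148, 170, 135, 180],
})
results["abs_err"] = (results["actual"] - results["pred"]).abs()

# One MAE per horizon, not a single averaged number
per_horizon = results.groupby("horizon")["abs_err"].mean()
print(per_horizon)
```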
The Leakage Hall of Shame
Time series leakage is insidious because it comes from so many directions. Computing rolling features with a centered window instead of a backward-looking one. Scaling the entire dataset before splitting — the test set's mean and variance sneak into training. Using a feature that exists in your historical data but won't be available at prediction time, like "end-of-month total sales" when you're predicting mid-month. Fitting an imputer on the full dataset. Random train/test split instead of temporal split.
Every one of these mistakes produces the same symptom: suspiciously good offline metrics and mysteriously poor production performance. The rule is simple enough to tattoo: fit everything on training data only. Scale on training data. Impute on training data. Compute statistics on training data. The test set and production data receive only transforms, never fits.
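The pattern, sketched with a StandardScaler on stand-in data: split temporally first, fit on the training window only, and only ever transform the test window:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

series = np.arange(100, dtype=float).reshape(-1, 1)  # stand-in data
train, test = series[:80], series[80:]               # temporal split FIRST

scaler = StandardScaler().fit(train)   # statistics come from training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # test receives the transform, never a fit

print(scaler.mean_)  # the mean of the training window, not the full series
```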
The Full Pattern — LightGBM from Start to Finish
Let's bring everything together into the complete workflow. This is the pattern that wins a surprising number of forecasting competitions, and it's the one I reach for first on any new time series problem.
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

def create_features(df):
    df = df.copy()
    # Lag features: yesterday, last week, two weeks, four weeks
    for lag in [1, 7, 14, 28]:
        df[f"lag_{lag}"] = df["cups"].shift(lag)
    # Rolling statistics: backward-looking only (shift prevents leakage)
    for w in [7, 14, 30]:
        df[f"roll_mean_{w}"] = df["cups"].shift(1).rolling(w).mean()
        df[f"roll_std_{w}"] = df["cups"].shift(1).rolling(w).std()
    # Calendar features
    df["dow"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    df["is_weekend"] = df["dow"].isin([5, 6]).astype(int)
    # Fourier features for yearly seasonality
    doy = df["date"].dt.dayofyear
    for k in range(1, 4):
        df[f"sin_{k}"] = np.sin(2 * np.pi * k * doy / 365.25)
        df[f"cos_{k}"] = np.cos(2 * np.pi * k * doy / 365.25)
    return df.dropna()

df = create_features(df)
features = [c for c in df.columns if c not in ["date", "cups"]]

# Walk-forward validation: always train on past, test on future
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, test_idx in tscv.split(df):
    tr, te = df.iloc[train_idx], df.iloc[test_idx]
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=31)
    model.fit(
        tr[features], tr["cups"],
        eval_set=[(te[features], te["cups"])],
        callbacks=[lgb.early_stopping(50, verbose=False)]
    )
    preds = model.predict(te[features])
    scores.append(mean_absolute_error(te["cups"], preds))

print(f"Walk-forward MAE: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
Notice the small details that make this correct rather than subtly broken. The .shift(1) inside the rolling calculations ensures we never peek at the current day's value. TimeSeriesSplit ensures each fold trains strictly before it tests. Early stopping monitors the test fold, which mirrors how you'd monitor a held-out validation window in production — though, strictly speaking, choosing the stopping point on the same fold you score makes that fold's metric slightly optimistic. If you need a fully untouched test set, carve a separate validation window out of the end of the training portion and stop on that instead.
After training, inspect model.feature_importances_ and look at what the model actually used. If lag_7 dominates, your series has strong weekly seasonality. If the rolling mean features rank high, recent momentum matters. If calendar features dominate, the problem might be better served by Prophet or even a simple seasonal decomposition. The features that matter tell you something about the data.
Wrapping Up
If you're still with me, thank you. I hope it was worth it.
We started with a simple question — why is time series different? — and built our way from autocorrelation to stationarity, from ARIMA's three-part architecture to the ML reframe that turns forecasting into supervised learning. We walked through the feature engineering that makes or breaks a time series model, learned why the only honest validation walks forward through time, discovered why trees can't see beyond their training data, opened up Prophet's internals, surveyed the deep learning landscape from N-BEATS to foundation models, and confronted the production realities that make time series uniquely treacherous.
My hope is that the next time someone mentions ARIMA or asks you to forecast next quarter's revenue, instead of nodding along and quietly changing the subject, you'll know exactly what's happening under the hood — the autocorrelation that gives the data memory, the stationarity that keeps models well-behaved, the features engineered from the past, and the walk-forward validation that keeps your evaluation honest. Having a pretty darn good mental model of that machinery changes everything.
Resources
Forecasting: Principles and Practice by Rob Hyndman — The single best freely available textbook on time series. Covers everything from exponential smoothing to dynamic regression, with R code throughout. Wildly well-written for a statistics book.
"Forecasting at Scale" (Taylor & Letham, 2018) — The Prophet paper. Shorter than you'd expect and remarkably clear about what the model is and isn't. Worth reading for the engineering philosophy alone.
"N-BEATS: Neural basis expansion analysis for interpretable time series forecasting" (Oreshkin et al., 2019) — The paper that proved you don't need recurrence for time series. The architecture is elegant enough that you can implement it from the paper alone.
"Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (Lim et al., 2021) — The TFT paper. Dense but insightful, especially the variable selection networks and how they separate static, known future, and observed past inputs.
Advances in Financial Machine Learning by Marcos López de Prado — Chapter 7 on cross-validation in finance is unforgettable. The purged CV + embargo approach should be standard practice for anyone working with financial time series.
Kaggle Time Series competitions — The M5 competition (Walmart sales forecasting) and Store Sales competitions have public solutions that show exactly how winning teams engineer features, validate, and combine models. More practical than any textbook.