Object-Oriented Programming
I avoided taking OOP seriously for longer than I'd like to admit. For years, I wrote Python the way most data scientists do — functions, global variables, scripts that grew to 800 lines, and a vague feeling that classes were for "software engineers," not people who train models. Every time someone mentioned design patterns, my eyes glazed over. Then I tried to add a second model variant to a training pipeline I'd built, and the whole thing collapsed like a house of cards. Not because the math was wrong, but because I had no structure. Functions called other functions that mutated shared state. One change broke three things. This article is the dive into OOP I should have taken back then.
Object-oriented programming has been around since the 1960s, popularized by Smalltalk and later C++ and Java. In the Python ML world, it's the skeleton beneath every framework you touch — scikit-learn, PyTorch, HuggingFace Transformers, LangChain, even the OpenAI SDK. You interact with it every time you write model.fit(X, y) or model(input). But the real reason OOP matters in production ML isn't about syntax. It's about what happens when ten engineers need to extend the same system without breaking each other's work.
Before we start, a heads-up. We're going to be building a tiny training framework from scratch, and along the way we'll encounter inheritance, composition, design patterns, and some Python internals. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
The Spaghetti Script
The First Class — Encapsulation as Self-Defense
The Contract — Why Interfaces Matter
Inheritance: When the Framework Says So
Composition: When You're Building Your Own Thing
Rest Stop
The Patterns You'll Actually See in Production
The Dunder Methods That Earn Their Keep
The Modern Toolkit — Dataclasses and Protocols
The Anti-Patterns That Kill Codebases
Resources
The Spaghetti Script
Let's start where most ML projects start — a single Python script that trains a model. We'll call our running example minitrainer. Imagine we're building a small system that loads data, trains a model, and evaluates it. Here's the kind of code I used to write, and I suspect you've written something like it too.
import numpy as np

# globals everywhere
train_data = None
targets = None
model_weights = None
learning_rate = 0.01
history = []

def load_data(path):
    global train_data, targets
    data = np.load(path)
    train_data = data[:, :-1]   # features
    targets = data[:, -1]       # labels

def train(epochs):
    global model_weights, history
    model_weights = np.random.randn(train_data.shape[1])
    for epoch in range(epochs):
        preds = train_data @ model_weights
        loss = np.mean((preds - targets) ** 2)
        grad = 2 * train_data.T @ (preds - targets) / len(targets)
        model_weights -= learning_rate * grad
        history.append(loss)

def evaluate():
    preds = train_data @ model_weights
    return np.mean((preds - targets) ** 2)
This works. For a single experiment, on a single machine, by a single person, it works fine. The pain begins when you try to do anything beyond that. Want to train two models with different learning rates at the same time? You can't — they share the same globals. Want a colleague to add a different optimizer? They have to read every function to understand what state exists. Want to test the training logic without loading real data? Good luck — everything is wired to everything.
I'll be honest — I wrote code like this for a couple of years. It felt productive. And then it didn't.
The First Class — Encapsulation as Self-Defense
The first problem with our spaghetti script is that all the state — data, weights, history, learning rate — floats around in module-level variables. Any function can read or modify any of them. When something goes wrong, you have to trace through the entire file to find which function mutated what.
A class is, at its core, a way to bundle related state and the functions that operate on it into one unit. Think of it like giving each experiment its own sealed envelope. The data, weights, and history for experiment A live inside one envelope. Experiment B gets its own. They can't accidentally reach into each other's pockets.
class MiniTrainer:
def __init__(self, learning_rate=0.01):
self.learning_rate = learning_rate
self.weights = None
self.history = []
def fit(self, X, y, epochs=100):
self.weights = np.random.randn(X.shape[1])
for epoch in range(epochs):
preds = X @ self.weights
loss = np.mean((preds - y) ** 2)
grad = 2 * X.T @ (preds - y) / len(y)
self.weights -= self.learning_rate * grad
self.history.append(loss)
return self
def predict(self, X):
return X @ self.weights
Now two experiments can run simultaneously without interference.
trainer_a = MiniTrainer(learning_rate=0.01)
trainer_b = MiniTrainer(learning_rate=0.001)
trainer_a.fit(X_train, y_train)
trainer_b.fit(X_train, y_train)
Each trainer carries its own weights, its own history, its own learning rate. That sealed envelope idea — in OOP, it's called encapsulation. The data and the methods that touch it live in the same place. If a bug shows up in trainer_a's history, you know the problem is in MiniTrainer's methods, not in some random function three files away.
This is the entire argument for classes, stripped to the bone. Not inheritance, not polymorphism, not any of the big words. The argument is: when state and behavior travel together, bugs have fewer places to hide.
One more trap before we move on: class attributes. Writing class Model: layers = [] means every instance shares the same list. Append to it in one instance, and it shows up in all of them. Always initialize mutable state in __init__ as self.layers = []. The same trap appears with mutable default arguments — def __init__(self, items=[]) reuses the same list object across calls. The fix: items=None, then self.items = items or []. I once spent an entire afternoon debugging a shared list between two model instances. It's the kind of bug that makes you question your career choices.
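Here's a minimal sketch of both failure modes side by side. BadModel and GoodModel are hypothetical names, just for illustration.

class BadModel:
    layers = []                      # class attribute: shared by ALL instances

class GoodModel:
    def __init__(self, layers=None):
        self.layers = layers or []   # fresh list per instance

a, b = BadModel(), BadModel()
a.layers.append("dense")
print(b.layers)                      # ['dense']  (b sees a's change)

c, d = GoodModel(), GoodModel()
c.layers.append("dense")
print(d.layers)                      # []  (each instance owns its own list)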
Our MiniTrainer works, but it has a limitation. The optimization logic is hardcoded. What if we want to try Adam instead of vanilla gradient descent? We'd have to go into the fit method and rewrite it. And if someone else wants to try RMSprop, they'd have to copy the entire class and change a few lines. That's not scalable.
The Contract — Why Interfaces Matter
Here's a scenario that plays out at every ML team I've seen. You build a trainer. A colleague builds a different trainer for a different algorithm. A third person writes an evaluation script. The evaluation script needs to call .predict() on whatever model it receives. But your colleague called their method .run_inference(). The evaluation script breaks.
The pain isn't the naming inconsistency — it's that there was no agreement about what a "model" should look like. No contract. In Python, we have two ways to express such a contract.
The first is an Abstract Base Class (ABC). An ABC says: "If you want to be a valid model, you must implement these methods. If you forget one, Python will refuse to even create an instance of your class." It fails loudly, at construction time — not silently, two hours into a training run when .predict() is finally called.
from abc import ABC, abstractmethod
class BaseModel(ABC):
@abstractmethod
def fit(self, X, y):
...
@abstractmethod
def predict(self, X):
...
class BrokenModel(BaseModel):
def fit(self, X, y):
pass
# forgot predict!
# BrokenModel()
# → TypeError: Can't instantiate abstract class BrokenModel
# with abstract method predict
# Fails NOW, not when predict() is called hours later.
The second is a Protocol, introduced in Python 3.8 through PEP 544. A Protocol expresses the same idea — "a valid model must have fit and predict" — but it doesn't require inheritance. It works through what's called structural subtyping. If your class has the right methods with the right signatures, it satisfies the protocol. The type checker (mypy, pyright) verifies this at analysis time, not runtime.
from typing import Protocol
import numpy as np
class Predictor(Protocol):
def fit(self, X: np.ndarray, y: np.ndarray) -> "Predictor": ...
def predict(self, X: np.ndarray) -> np.ndarray: ...
class LinearModel:
def fit(self, X, y):
self.w = np.linalg.lstsq(X, y, rcond=None)[0]
return self
def predict(self, X):
return X @ self.w
def evaluate(model: Predictor, X_test, y_test):
preds = model.predict(X_test)
return np.mean((preds - y_test) ** 2)
# LinearModel satisfies Predictor without inheriting from it.
# mypy will catch it if LinearModel is missing predict().
I'm still developing my intuition for when ABC vs Protocol is the right call. The rule of thumb I've landed on: use ABC when you need runtime isinstance() checks or when your base class provides default method implementations. Use Protocol when you want maximum flexibility, especially with third-party classes you can't modify. In newer codebases, I find myself reaching for Protocol more often.
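One concrete example of the first case: the BaseModel above can ship a shared helper alongside the contract, which a structural Protocol can't hand to its implementers. A minimal sketch follows; the MSE-based score() is illustrative, not from any library.

from abc import ABC, abstractmethod
import numpy as np

class BaseModel(ABC):
    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...

    # concrete default implementation: every subclass gets score() for free
    def score(self, X, y):
        return np.mean((self.predict(X) - y) ** 2)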
Inheritance: When the Framework Says So
Now let's bring our minitrainer into the real world. If we want it to work with scikit-learn's Pipeline, GridSearchCV, and cross_val_score, we need to follow sklearn's contract. That means inheriting from BaseEstimator and a mixin.
Think of inheritance like this. A framework architect built a kitchen and said: "I've wired the gas, the plumbing, and the ventilation. You bring the recipe." When you subclass nn.Module or BaseEstimator, you're bringing the recipe. The framework has already built the kitchen — parameter registration, device movement, gradient tracking, hyperparameter introspection, grid search integration. You get all of that by inheriting and filling in the blanks.
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_is_fitted
import numpy as np
class MiniRegressor(BaseEstimator, RegressorMixin):
def __init__(self, learning_rate=0.01, epochs=100):
self.learning_rate = learning_rate
self.epochs = epochs
def fit(self, X, y):
self.coef_ = np.random.randn(X.shape[1])
for _ in range(self.epochs):
preds = X @ self.coef_
grad = 2 * X.T @ (preds - y) / len(y)
self.coef_ -= self.learning_rate * grad
return self
def predict(self, X):
check_is_fitted(self)
return X @ self.coef_
Three conventions make this work. First, __init__ stores hyperparameters and does nothing else — BaseEstimator introspects your __init__ signature to power get_params() and set_params(), which is how GridSearchCV clones your estimator with different hyperparameters. Second, fit() returns self — this enables method chaining in pipelines. Third, fitted attributes end with an underscore (coef_) — check_is_fitted() detects these to verify that fit() was called before predict().
RegressorMixin gives you a score() method for free — it calls predict() and computes R² against the true labels. That's a mixin: a small class that adds a single slice of behavior without defining __init__. Sklearn uses mixins everywhere — ClassifierMixin, TransformerMixin, ClusterMixin. Each assumes the host class has certain methods, and each gives something back in return. This is a fair trade, and it's the right use of inheritance.
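Here's what those conventions buy you in practice. This usage sketch assumes the MiniRegressor above plus some made-up in-memory arrays.

from sklearn.model_selection import GridSearchCV
import numpy as np

X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(200)

reg = MiniRegressor()
print(reg.get_params())    # {'epochs': 100, 'learning_rate': 0.01}, read from __init__

# GridSearchCV clones the estimator via get_params()/set_params()
# and scores each candidate with the score() that RegressorMixin provides.
search = GridSearchCV(reg, {"learning_rate": [0.001, 0.01, 0.1]}, cv=3)
search.fit(X, y)
print(search.best_params_)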
The same kitchen analogy applies to PyTorch. When you subclass nn.Module, you implement forward(). The framework's __call__ method wraps your forward() with pre-hooks, post-hooks, and related bookkeeping. That's why you write model(x) and never model.forward(x) directly — autograd still tracks the tensor operations either way, but calling forward() skips every registered hook, and anything built on those hooks silently stops running. The kitchen handles the gas lines; you bring the recipe.
import torch
import torch.nn as nn
class TinyNet(nn.Module):
def __init__(self, in_dim, hidden, out_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_dim, hidden),
nn.ReLU(),
nn.Linear(hidden, out_dim),
)
def forward(self, x):
return self.net(x)
model = TinyNet(784, 256, 10)
# model(x) → runs pre-forward hooks, forward(), then post-forward hooks
# model.parameters() → auto-discovered from self.net
# model.to('cuda') → recursively moves all parameters
Behind the scenes, nn.Module uses a custom __setattr__ to intercept every attribute assignment in __init__. When you write self.net = nn.Sequential(...), it detects that nn.Sequential is a Module and registers it as a child. That's how model.parameters() can recurse into nested modules and find every learnable weight without you listing them manually. It's one of the most elegant uses of OOP I've seen in any framework.
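Here's a toy version of that mechanism. It's nowhere near PyTorch's real implementation, but it shows how intercepting attribute assignment enables recursive parameter discovery; all names are hypothetical.

class Parameter:
    def __init__(self, value):
        self.value = value

class Module:
    def __init__(self):
        # bypass our own __setattr__ while creating the registries themselves
        object.__setattr__(self, "_params", {})
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name, value):
        # intercept every assignment and register parameters and submodules
        if isinstance(value, Parameter):
            self._params[name] = value
        elif isinstance(value, Module):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._params.values()
        for child in self._children.values():
            yield from child.parameters()

class TinyLayer(Module):
    def __init__(self):
        super().__init__()
        self.weight = Parameter([0.0, 0.0])   # auto-registered on assignment

print(len(list(TinyLayer().parameters())))    # 1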
The Diamond Problem and MRO
When a class inherits from multiple parents, things get interesting. Consider class D(B, C) where both B and C inherit from A. When D calls super().__init__(), which path does it take? Python resolves this with the Method Resolution Order (MRO), computed via an algorithm called C3 linearization.
The key insight that tripped me up for a while: super() doesn't mean "call my parent." It means "call the next class in the MRO chain."
class A:
def __init__(self):
print("A")
class B(A):
def __init__(self):
print("B")
super().__init__() # calls C, NOT A
class C(A):
def __init__(self):
print("C")
super().__init__() # calls A
class D(B, C):
def __init__(self):
print("D")
super().__init__() # calls B
D() # prints: D → B → C → A
print(D.__mro__)
# (D, B, C, A, object)
Every class appears exactly once in the MRO, and a class always appears before its parents. When you see class SGDClassifier(BaseSGD, ClassifierMixin) in sklearn's source, this is the machinery that makes it work without ambiguity. I still occasionally sketch the MRO on paper for complex hierarchies. It's one of those things where the algorithm is well-defined but the intuition takes a while to internalize.
Composition: When You're Building Your Own Thing
Inheritance is the right tool when a framework gives you a kitchen. But when you're building your own system — a training pipeline, an experiment manager, a model evaluation suite — composition wins almost every time.
Let's revisit our minitrainer. Right now the optimization logic is baked into fit(). What if we want to try different optimizers? The inheritance approach would be to create MiniTrainerSGD, MiniTrainerAdam, MiniTrainerRMSprop — a new subclass for each optimizer. That's three classes that are identical except for a few lines in the training loop. Now imagine adding a new data loading strategy. Do we create MiniTrainerSGDWithCSV? The class hierarchy explodes.
Composition takes the opposite approach. Instead of "the trainer is a kind of SGD trainer," we say "the trainer has an optimizer." The optimizer is a separate object, passed in from outside.
class SGDOptimizer:
def __init__(self, lr=0.01):
self.lr = lr
def step(self, weights, grad):
return weights - self.lr * grad
class AdamOptimizer:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
self.lr = lr
self.beta1, self.beta2, self.eps = beta1, beta2, eps
self.m = None
self.v = None
self.t = 0
def step(self, weights, grad):
if self.m is None:
self.m = np.zeros_like(weights)
self.v = np.zeros_like(weights)
self.t += 1
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
return weights - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
class Trainer:
def __init__(self, optimizer):
self.optimizer = optimizer
self.history = []
def fit(self, X, y, epochs=100):
weights = np.random.randn(X.shape[1])
for _ in range(epochs):
preds = X @ weights
loss = np.mean((preds - y) ** 2)
grad = 2 * X.T @ (preds - y) / len(y)
weights = self.optimizer.step(weights, grad)
self.history.append(loss)
self.weights = weights
return self
Now swapping optimizers is a one-line change.
trainer_sgd = Trainer(SGDOptimizer(lr=0.01))
trainer_adam = Trainer(AdamOptimizer(lr=0.001))
No new subclass. No copy-pasted code. The trainer doesn't know or care which optimizer it's using — it calls self.optimizer.step() and trusts the result. This is the "has-a" relationship: the trainer has an optimizer. It isn't an optimizer.
Sklearn's Pipeline is a masterclass in this pattern. Pipeline([('scaler', StandardScaler()), ('clf', SVC())]) composes a scaler and a classifier into one unit. It doesn't inherit from either. It has them. That's why you can swap SVC() for RandomForestClassifier() without changing the pipeline class itself.
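A quick sketch of that flexibility, using standard sklearn imports:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Swapping the composed classifier needs no new class, just a new object.
pipe.set_params(clf=RandomForestClassifier(n_estimators=100))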
Here's the rule of thumb I use: inherit when a framework gives you a base class with real functionality behind it. Compose when you're building your own system and need flexibility. In four years of production ML work, the ratio has been roughly 20% inheritance, 80% composition.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
At this point, you have a mental model that covers the foundation: classes as sealed envelopes for state, interfaces as contracts, inheritance for extending frameworks, and composition for building your own systems. That's genuinely enough for most day-to-day ML work. If someone asks you in an interview why you'd use composition over inheritance, you can point to our minitrainer example and explain the combinatorial explosion that inheritance creates.
The short version of what's left: design patterns are the recurring shapes you'll see when reading framework source code; dunder methods are the hooks Python gives you to make objects behave like built-in types; dataclasses and protocols are the modern tools that reduce boilerplate; and anti-patterns are the traps that turn codebases into unmaintainable messes. There — you're 60% of the way through.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
The Patterns You'll Actually See in Production
I used to think design patterns were an enterprise Java thing — abstract factories and visitor patterns and twenty classes to print "hello world." Then I started reading PyTorch and HuggingFace source code, and I realized I'd been using these patterns all along. I was living in the design patterns house without knowing the names of the rooms.
Template Method
This is the pattern behind every framework that says "implement this one method and we'll handle the rest." PyTorch's nn.Module is the clearest example. The framework defines __call__, which runs pre-forward hooks, calls your forward(), runs post-forward hooks, and manages autograd. You only write forward(). The template — the skeleton of the algorithm — is fixed. You fill in the variable part.
Sklearn does the same thing. TransformerMixin.fit_transform() calls your fit() then your transform(). You implement the pieces; the framework orchestrates them.
The pattern's value is this: the framework author can guarantee that the surrounding machinery (hook dispatch, bookkeeping, validation) always runs, regardless of what you put in forward() or fit(). You can't accidentally skip the safety rails.
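Stripped of framework detail, the shape looks roughly like this: a fixed skeleton that always runs its safety rails around one overridable step. The names are illustrative.

from abc import ABC, abstractmethod

class PipelineStep(ABC):
    def __call__(self, x):
        # the fixed template: logging always runs, no matter what run() does
        self._log(f"input: {x!r}")
        out = self.run(x)           # the one part subclasses fill in
        self._log(f"output: {out!r}")
        return out

    @abstractmethod
    def run(self, x):
        ...

    def _log(self, msg):
        print(msg)

class Doubler(PipelineStep):
    def run(self, x):
        return x * 2

Doubler()(21)   # logs input and output around run(), returns 42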
Strategy
We already built this one in the composition section. When our Trainer accepts an optimizer object and calls self.optimizer.step(), the optimizer is the strategy. Different strategies (SGD, Adam, RMSprop) are interchangeable because they all have the same interface. The trainer doesn't know which one it's running. It doesn't need to.
PyTorch uses this everywhere. Loss functions, activation functions, learning rate schedulers — they're all strategies. You pass them in, the training loop calls them, and the specific behavior is determined by which object you injected.
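Here's that shape in a plain training loop. The loop below has no idea which loss or optimizer it was handed, only that they expose the usual interface; the commented calls at the bottom are illustrative.

import torch
import torch.nn as nn

def train_one_epoch(model, loader, loss_fn, optimizer):
    # loss_fn and optimizer are strategies: interchangeable objects injected from outside
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Swap strategies without touching the loop:
# train_one_epoch(model, loader, nn.MSELoss(), torch.optim.SGD(model.parameters(), lr=0.01))
# train_one_epoch(model, loader, nn.HuberLoss(), torch.optim.Adam(model.parameters()))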
Observer (Hooks)
The observer pattern is how a system broadcasts events to interested listeners without knowing who they are. In PyTorch, this shows up as hooks — functions you register on a module that fire automatically during forward or backward passes.
def log_activations(module, input, output):
print(f"{module.__class__.__name__}: "
f"output shape {output.shape}, "
f"mean {output.mean():.4f}")
model = TinyNet(784, 256, 10)
hook = model.net[0].register_forward_hook(log_activations)
# now every forward pass through the first Linear layer
# triggers our logging function
# hook.remove() when done
Gradient clipping, activation logging, feature extraction from intermediate layers — all of these work through hooks. The model itself doesn't know it's being observed. This is what makes it possible to add monitoring, debugging, and profiling to any model without modifying its code.
Registry / Factory
This is the pattern that HuggingFace built its empire on. When you call AutoModel.from_pretrained("bert-base-uncased"), here's what happens under the hood. The system loads a config file. It reads the model_type field — in this case, "bert". It looks up "bert" in a dictionary that maps type strings to classes: {"bert": BertModel, "gpt2": GPT2Model, "t5": T5Model, ...}. It instantiates the correct class and loads the weights.
That dictionary is the registry. The from_pretrained() classmethod is the factory. Together, they let you load any of hundreds of model architectures through a single uniform interface. The same pattern powers AutoTokenizer, AutoConfig, and AutoModelForSequenceClassification. When you see "Auto" in HuggingFace, think "registry that maps config strings to concrete classes."
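A stripped-down sketch of that machinery. This is the idea, not HuggingFace's actual code, and the toy class names are made up.

MODEL_REGISTRY = {}

def register(name):
    # decorator that maps a config string to a concrete class
    def wrapper(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

@register("bert")
class ToyBert:
    def __init__(self, config):
        self.config = config

@register("gpt2")
class ToyGPT2:
    def __init__(self, config):
        self.config = config

class AutoToyModel:
    @classmethod
    def from_config(cls, config):
        # the factory: look up the concrete class, then instantiate it
        model_cls = MODEL_REGISTRY[config["model_type"]]
        return model_cls(config)

model = AutoToyModel.from_config({"model_type": "gpt2"})
print(type(model).__name__)   # ToyGPT2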
OpenAI's Python SDK uses the same idea. Every API resource — completions, files, models — is a class inheriting from a base APIResource. ChatCompletion.create() is a classmethod (a factory) that handles HTTP, auth, serialization, and error handling behind a clean interface. You call one method; an entire orchestra of networking code plays behind the curtain.
The limitation of all these patterns is that they have to be recognized to be useful. If you don't know the template method pattern exists, PyTorch's __call__-wraps-forward() trick looks like arbitrary framework magic. Once you see the pattern, the magic dissolves into a sensible design decision. That's the whole point of naming patterns — not to create jargon, but to make the architecture legible.
The Dunder Methods That Earn Their Keep
Python has over a hundred dunder (double-underscore) methods. Most of them, you'll never touch. But there's a handful that shows up in ML code constantly, and understanding them is the difference between reading framework source code fluently and staring at it in confusion.
__call__ is the one that makes an instance callable like a function. When you write model(x), Python translates that to model.__call__(x). PyTorch's nn.Module defines __call__ to run hooks and related bookkeeping around your forward(). This is why model(x) and model.forward(x) are not interchangeable — forward() skips the hook machinery entirely. I've seen this bug in production code more than once. It trains, it runs, and the hooks you thought were firing (logging, gradient clipping, feature extraction) never do.
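A toy illustration of the mechanism, not PyTorch's real __call__ (which does far more), but the same shape: bookkeeping wrapped around the function you actually wrote.

class Wrapper:
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1               # bookkeeping before the real work
        result = self.fn(*args, **kwargs)
        print(f"call #{self.calls} -> {result!r}")
        return result

double = Wrapper(lambda x: x * 2)
double(21)       # call #1 -> 42
double.fn(21)    # bypasses __call__: no bookkeeping runs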
__len__ and __getitem__ are the dataset pattern. PyTorch's DataLoader needs __len__ to know how many samples exist and __getitem__ to fetch one by index. Implement these two methods on any class and it works with shuffling, batching, and multi-worker loading — all from two method definitions.
class TextDataset:
def __init__(self, texts, labels):
self.texts = texts
self.labels = labels
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
return self.texts[idx], self.labels[idx]
# DataLoader(TextDataset(texts, labels), batch_size=32, shuffle=True)
# Batching, shuffling, multiprocess prefetch — all from two methods.
__enter__ and __exit__ make context managers. You use them every time you write with torch.no_grad(): or with open(path) as f:. The pattern: __enter__ sets up a resource or state change, __exit__ tears it down — even if an exception fires in between. For anything that needs guaranteed cleanup — timers, temporary gradient disabling, database connections — this is the mechanism.
import time
class Timer:
def __enter__(self):
self.start = time.perf_counter()
return self
def __exit__(self, *exc):
self.elapsed = time.perf_counter() - self.start
print(f"{self.elapsed:.3f}s")
return False # don't suppress exceptions
# with Timer():
# model.fit(X, y)
# prints elapsed time even if fit() crashes
__repr__ is for debugging. It controls what you see when you type an object's name in the REPL or print it. Sklearn auto-generates these from get_params(), which is why printing a model shows RandomForestClassifier(n_estimators=100, ...) instead of <sklearn.ensemble._forest.RandomForestClassifier object at 0x...>. If you define only one string method on your class, make it __repr__.
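A tiny before-and-after, with a hypothetical Experiment class:

class Experiment:
    def __init__(self, lr, epochs):
        self.lr = lr
        self.epochs = epochs

    def __repr__(self):
        return f"Experiment(lr={self.lr}, epochs={self.epochs})"

print(Experiment(0.01, 100))
# with __repr__:    Experiment(lr=0.01, epochs=100)
# without __repr__: <__main__.Experiment object at 0x...>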
That's the core set. There are others — __eq__ and __hash__ for making objects work as dict keys, __iter__ for making objects iterable — but you can look those up when you need them. The ones above are the ones that show up in framework source code daily.
The Modern Toolkit — Dataclasses and Protocols
If you've ever written a config class that's nothing but __init__ storing a dozen parameters, you know the boilerplate pain. @dataclass kills that dead.
from dataclasses import dataclass, field
@dataclass
class TrainConfig:
lr: float = 0.001
epochs: int = 100
batch_size: int = 32
tags: list = field(default_factory=list)
cfg = TrainConfig(lr=0.01)
print(cfg) # TrainConfig(lr=0.01, epochs=100, batch_size=32, tags=[])
cfg2 = TrainConfig(lr=0.01)
cfg == cfg2 # True — __eq__ compares all fields automatically
@dataclass auto-generates __init__, __repr__, and __eq__ from your field annotations. Add frozen=True and you get an immutable config object — attempts to modify fields raise FrozenInstanceError, and __hash__ is generated automatically, so frozen dataclasses work as dict keys and set members out of the box.
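For example, a frozen config can key a dictionary of results. The sketch below uses a hypothetical RunConfig without list fields, since hashing a frozen dataclass that contains a list would fail at lookup time.

from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    lr: float = 0.001
    batch_size: int = 32

cfg = RunConfig(lr=0.01)
# cfg.lr = 0.1             # would raise FrozenInstanceError

results = {cfg: 0.92}                  # frozen + eq → usable as a dict key
print(results[RunConfig(lr=0.01)])     # 0.92 (equal fields, equal hash)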
The field(default_factory=list) syntax solves the mutable default problem we talked about earlier. Instead of sharing one list across all instances, it creates a fresh list for each one. This is the dataclass equivalent of the items=None; self.items = items or [] pattern, but cleaner.
HuggingFace uses dataclasses extensively for training arguments. TrainingArguments is a dataclass with dozens of fields — learning rate, batch size, logging steps, gradient accumulation — and it all works with serialization, comparison, and printing for free.
Protocols, which we covered earlier, fit into this modern toolkit as the typing-era answer to duck typing. Where Python traditionally said "if it has a fit method, it's a model" and hoped for the best at runtime, Protocol lets you express that contract in code and have mypy verify it statically. No inheritance required. No runtime overhead. The combination of dataclasses for data-holding classes and Protocols for behavioral contracts covers most of what you need to write clean, modern Python for ML.
The Anti-Patterns That Kill Codebases
I've spent more time debugging bad OOP than writing good OOP. These are the patterns I've seen destroy codebases — and the ones that senior interviewers love to probe.
The God Class
A God class does everything: loads data, defines the model, runs training, computes metrics, saves checkpoints, sends Slack notifications. It starts as a convenient 200-line class and grows to 2,000 lines because everything is interconnected and splitting it feels too risky. I've written God classes. Everyone has. The fix is composition: extract a DataLoader, a Model, a Trainer, and a MetricsLogger as separate objects, then wire them together in a thin orchestrator.
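The shape of that fix, sketched with hypothetical component names; the orchestrator owns no logic of its own, only the wiring:

class ExperimentRunner:
    # thin orchestrator: each responsibility lives in its own object
    def __init__(self, data_source, model, trainer, logger):
        self.data_source = data_source
        self.model = model
        self.trainer = trainer
        self.logger = logger

    def run(self):
        X, y = self.data_source.load()
        self.trainer.fit(self.model, X, y)
        self.logger.log(self.trainer.history)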
Deep Inheritance Hierarchies
Three levels of inheritance is a warning sign. Four is a code smell. Five means someone mistook the class hierarchy for an organizational chart. The problem isn't inheritance itself — it's that behavior is scattered across multiple files and multiple levels, and tracing which method actually runs requires walking up and down the tree. I've seen codebases where CustomResNet inherits from ResNetBase inherits from CNNModel inherits from BaseModel inherits from nn.Module, and nobody can tell you which forward() is actually called without checking four files.
The fix: flatten to one or two levels of inheritance (usually into a framework base class), and use composition for the rest.
Global Mutable State
Singleton config objects, module-level dictionaries that get mutated during training, global random seeds set in one file and depended on in another. Global state is the spaghetti script wearing a class costume. Every piece of shared mutable state is a potential race condition, a test isolation failure, and a debugging nightmare. Pass dependencies explicitly. If a function needs a config, it should take it as an argument, not reach into a global namespace.
Premature Abstraction
This is the subtlest anti-pattern, and the one I still struggle with. You build one model, and you immediately abstract it into a BaseModel with three abstract methods and a factory pattern — before you know whether you'll ever build a second model. The abstraction adds complexity with no payoff. The rule I try to follow: wait until you have two or three concrete implementations before extracting a shared interface. Let the pattern emerge from the code, rather than imposing it upfront.
No one fully agrees on the right level of abstraction, by the way. I've seen teams with too little structure (everything is functions) and teams with too much (seven layers of interfaces before any real logic). Finding the middle ground is a judgment call that gets better with experience but never becomes automatic.
Wrapping Up
If you're still with me, thank you. I hope it was worth it.
We started with a spaghetti script full of global variables, felt the pain of shared mutable state, and discovered that classes are sealed envelopes — a way to keep state and behavior together. We built contracts with ABCs and Protocols, learned that inheritance is for extending framework kitchens while composition is for building our own systems, and recognized the design patterns hiding in plain sight inside PyTorch and HuggingFace. We looked at the dunder methods that make Python objects behave like built-in types, the modern tools that eliminate boilerplate, and the anti-patterns that turn codebases into unmaintainable tangles.
My hope is that the next time you open PyTorch's nn.Module source or HuggingFace's AutoModel registry, instead of feeling like you're reading someone else's inscrutable magic, you'll recognize the template methods, the strategies, the hooks, and the factories — and you'll have a pretty good mental model of what's going on under the hood. And the next time you're designing a system that ten people will work on, you'll reach for composition first, inherit when the framework asks you to, and resist the urge to build a five-level class hierarchy that felt clean at the time.
Resources
PyTorch nn.Module source code — Reading the actual __call__, __setattr__, and parameter registration logic is worth more than any tutorial. The template method pattern is right there in the code. GitHub link.
Scikit-learn developer guide on estimators — The official guide on writing sklearn-compatible estimators. If you're building custom transformers or estimators for pipelines, this is the definitive reference. Docs link.
HuggingFace Transformers philosophy page — Explains why they chose the class hierarchy they did, and how AutoModel's registry works. Insightful for understanding large-scale library design decisions. Docs link.
PEP 544 — Protocols: Structural subtyping — The proposal that introduced Protocol classes to Python. If you want to understand why structural subtyping matters and how it differs from ABC-based nominal typing, this is the source. PEP link.
Refactoring Guru — Design Patterns in Python — Wildly helpful visual explanations of every major design pattern with Python code examples. Not ML-specific, but the patterns (Strategy, Observer, Template Method, Factory) map directly to what we covered. Website link.
Google's Rules of Machine Learning — Not about OOP directly, but about the engineering discipline that keeps ML systems maintainable. Rule #1 is "Don't be afraid to launch a machine learning system without machine learning" — that kind of practical wisdom. Guide link.