Nice to Know

Chapter 9: CNNs & Computer Vision

I avoided the "other" areas of computer vision for an embarrassingly long time. I knew classification. I knew detection. I could talk about ResNets and YOLO at dinner parties (to the extent anyone wants to hear about YOLO at dinner parties). But every time someone mentioned optical flow, or pose estimation, or NeRFs, I'd nod thoughtfully and change the subject. The discomfort of knowing that computer vision is a vast landscape — and I'd been camping in one corner of it — finally became unbearable. Here is what I found when I explored the rest of the map.

This section covers the topics that didn't fit neatly into CNN architectures, detection, or transfer learning — but that show up constantly in papers, job descriptions, and production systems. Video understanding, tracking, pose estimation, 3D reconstruction, visual search, super-resolution, and a couple of architectural building blocks that quietly power many of the above.

Before we start: you don't need deep familiarity with any of these to understand what follows. If you know what a convolutional neural network does and have a rough sense of how object detection works, you have everything you need. We'll build each concept from the ground up.

This isn't a short journey — there are nine stops — but each one is self-contained. You can read them in order or jump to whatever caught your eye in the table of contents.

Contents
  • Optical Flow
  • Multi-Object Tracking
  • Pose Estimation
  • Video Understanding
  • Rest Stop
  • Visual Search & Retrieval
  • Image Super-Resolution
  • Neural Radiance Fields & Gaussian Splatting
  • Deformable Convolutions
  • Squeeze-and-Excitation Blocks

To keep things concrete, we'll use a running scenario throughout: imagine you're building a wildlife monitoring system for a forest reserve. Cameras are mounted on trees, recording video of animals 24/7. Your job is to build the computer vision pipeline that turns this raw footage into useful intelligence — what species are present, how many individuals, what they're doing, where they go. Every topic in this section solves a real piece of that puzzle.

Optical Flow

Our wildlife cameras record video, and the first question we need to answer is deceptively fundamental: what moved? Not what objects are in the frame — we'll get to that — but which pixels shifted between one frame and the next, and where did they go.

That's optical flow. Given two consecutive frames, it produces a dense motion map — a vector (dx, dy) for every single pixel, describing where that pixel moved. If nothing moved, all vectors are zero. If a deer walked left across the frame, the pixels on the deer all have vectors pointing left. If the camera jiggled slightly, everything has a small uniform shift.

I'll be honest — I used to think optical flow was a relic from classical computer vision that deep learning had made obsolete. I was wrong. Optical flow is alive and thriving, precisely because it captures something that single-frame analysis cannot: motion itself, separated from object identity. A detection model can tell you there's a deer in frame 12. Optical flow can tell you that something in that region moved 4 pixels to the left between frames 11 and 12, without needing to know what it is.

The classical approaches — Lucas-Kanade (1981) and Farneback (2003) — work by assuming that nearby pixels move together and solving for the motion that best explains brightness changes between frames. They work in simple settings but struggle with large motions, occlusion, and textureless regions.

The modern standard is RAFT (Recurrent All-Pairs Field Transforms, Teed & Deng, 2020). Here's how it works, at a level you could explain in an interview. Both frames pass through a shared CNN to produce feature maps. RAFT then builds what's called an all-pairs correlation volume — for every pixel in frame 1, it computes the similarity with every pixel in frame 2. This creates a massive 4D tensor. If the images are 128×128, the volume is 128×128×128×128. It's the brute-force answer to "where might each pixel have gone?"
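
To make the correlation volume concrete, here's a minimal sketch in PyTorch. The shapes and names are illustrative, not RAFT's actual implementation (which also pools the volume into a multi-scale pyramid):

```python
# Sketch: all-pairs correlation volume from two feature maps.
import torch

D, H, W = 256, 32, 32
f1 = torch.randn(D, H, W)   # features of frame 1 (from a shared encoder)
f2 = torch.randn(D, H, W)   # features of frame 2

# Flatten spatial dims: each column is one pixel's feature vector.
f1_flat = f1.reshape(D, H * W)          # (D, H*W)
f2_flat = f2.reshape(D, H * W)          # (D, H*W)

# Dot-product similarity between every pixel in frame 1 and frame 2.
corr = f1_flat.T @ f2_flat              # (H*W, H*W)

# Reshape into the 4D volume: corr_volume[i, j, k, l] is the similarity
# between pixel (i, j) of frame 1 and pixel (k, l) of frame 2.
corr_volume = corr.reshape(H, W, H, W)
print(corr_volume.shape)                # torch.Size([32, 32, 32, 32])
```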

But RAFT doesn't stop at a single lookup. It maintains a flow estimate and iteratively refines it through a recurrent unit (a ConvGRU). Each iteration looks up the current estimate in the correlation volume, considers the local context, and nudges the flow toward a better answer. Think of it like focusing a camera — each iteration brings the picture slightly more into focus. After 12–32 iterations, the flow field converges.

The key insight that makes RAFT so good: rather than trying to predict the perfect flow in one shot, it gives itself permission to be wrong and correct iteratively. That iterative refinement is what lets it handle large motions that would throw off single-pass methods.

For our wildlife system, optical flow is the foundation layer. It tells us "something moved over there" before we even run a detector. It feeds into video stabilization (compensating for camera shake), frame interpolation (creating smooth slow-motion footage from low-framerate cameras), and — as we'll see next — tracking.

The limitation of optical flow is that it knows nothing about objects. It's pixel-level bookkeeping. Two deer standing side by side with overlapping motion vectors? Optical flow has no idea there are two animals. For that, we need a different tool entirely.

Multi-Object Tracking

Our wildlife cameras have detected animals in individual frames using an object detector like YOLO. Frame 14 has three bounding boxes: deer, deer, fox. Frame 15 also has three: deer, deer, fox. But which deer in frame 15 is which deer from frame 14? Did the one on the left stay on the left, or did they swap positions?

This is the multi-object tracking (MOT) problem. Detection answers "where are things?" in a single frame. Tracking answers "which thing is which?" across frames. The difference matters enormously. Detection gives you a headcount per frame. Tracking gives you individual animal trajectories — deer #7 walked from the waterhole to the tree line over 40 seconds.

I still find it conceptually satisfying how the dominant approach, called tracking-by-detection, decomposes this into two clean steps. First, run a detector independently on every frame. Second, link the detections across frames into consistent identities. The elegance is in that decoupling — you can upgrade your detector or your linker independently.

The linking part is where things get interesting. Imagine you're a customs officer at a border crossing. Cars come through every minute (that's your "detection per frame"). Your job is to recognize which cars you've seen before and which are new. You could use the car's position — if a blue car was at gate 3 last minute and there's a blue car at gate 4 now, it probably drove one gate over. Or you could photograph each car and compare license plates for a definitive match.

SORT (Simple Online and Realtime Tracking, Bewley et al., 2016) takes the position-only approach. For each tracked object, SORT maintains a Kalman filter — a mathematical model that predicts where the object will be in the next frame based on its recent velocity. When new detections arrive, SORT uses the Hungarian algorithm to find the optimal matching between predicted positions and actual detections, minimizing the total distance. Unmatched predictions become "lost tracks." Unmatched detections become new tracks.
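
Here's a minimal sketch of that assignment step. The IoU matcher and threshold are simplifications, and the Kalman prediction is assumed to have already produced the predicted boxes:

```python
# Sketch: SORT-style association between predicted boxes and detections.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detections, iou_threshold=0.3):
    """Match Kalman-predicted boxes to detections; return matched index pairs."""
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Reject matches whose overlap is too weak to be the same object.
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]
```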

SORT is fast and works remarkably well when objects move predictably. Where it falls apart is occlusion — if a deer walks behind a tree for three frames, its track dies. When the deer reappears, SORT creates a new identity. Deer #7 becomes deer #12 for no good reason.

DeepSORT (Wojke et al., 2017) fixes this by adding "license plate matching" — a visual appearance model. Each detected bounding box gets passed through a small CNN that produces an appearance embedding, a compact vector that captures what the object looks like. Now, when matching detections to tracks, DeepSORT considers both where the object should be (Kalman prediction) and what it looks like (embedding similarity). That deer reappearing from behind the tree? Its appearance embedding still matches deer #7's stored embedding, so the identity persists through the occlusion.
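
In code, the change from SORT amounts to blending two cost matrices before running the same assignment. A hedged sketch, assuming the motion cost and L2-normalized embeddings are already computed (the blend weight is illustrative):

```python
# Sketch: DeepSORT-style combined cost (motion + appearance).
import numpy as np

def combined_cost(motion_cost, track_embs, det_embs, lam=0.5):
    """motion_cost: (T, D) matrix; embedding rows are L2-normalized."""
    appearance_cost = 1.0 - track_embs @ det_embs.T   # cosine distance
    return lam * motion_cost + (1.0 - lam) * appearance_cost
```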

The tradeoff is speed — running a CNN for every detection in every frame adds real latency, especially in crowded scenes.

ByteTrack (Zhang et al., 2022) takes a cleverer approach. Most trackers only consider high-confidence detections — bounding boxes where the detector is quite sure something is there. ByteTrack's insight is that the low-confidence detections, the ones other trackers throw away, are gold mines for tracking occluded objects. A partially hidden deer might produce a detection with 30% confidence — too low for most trackers, but ByteTrack matches it anyway. It runs association in two rounds: first match tracks with high-confidence detections, then match remaining tracks with the low-confidence ones. No appearance CNN required, and it achieves state-of-the-art results.
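
The two-round logic is simple enough to sketch. Here, `match` stands for any assignment routine (like the IoU matcher sketched earlier) that returns matched pairs plus the unmatched leftovers on each side; the thresholds and the `.score` attribute on detections are illustrative:

```python
# Sketch: ByteTrack-style two-round association.
def bytetrack_step(tracks, detections, match, high_thresh=0.6, low_thresh=0.1):
    high = [d for d in detections if d.score >= high_thresh]
    low = [d for d in detections if low_thresh <= d.score < high_thresh]

    # Round 1: match all tracks against confident detections.
    matched, unmatched_tracks, unmatched_high = match(tracks, high)

    # Round 2: give leftover tracks a second chance against the
    # low-confidence detections other trackers would have discarded --
    # a partially occluded animal often still produces a weak box.
    matched_low, still_unmatched, _ = match(unmatched_tracks, low)

    new_tracks = unmatched_high    # confident leftovers start new tracks
    lost_tracks = still_unmatched  # truly unmatched tracks go dormant
    return matched + matched_low, new_tracks, lost_tracks
```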

I find it humbling that ByteTrack's key innovation — "don't throw away uncertain detections" — is something any human tracker would do instinctively. We don't ignore a blurry glimpse of something; we use it to maintain our mental model. It took the field five years after DeepSORT to formalize that intuition.

For our wildlife system, tracking turns a stream of anonymous detections into named trajectories. Deer #7 visited the waterhole at dawn, spent 12 minutes, and left heading northeast. That's conservation data. That's what researchers need.

Tracking gives us where each animal goes. But what are they doing? For that, we need to look at their bodies more carefully.

Pose Estimation

Knowing that deer #7 is at coordinates (340, 220) in the frame is useful. Knowing that deer #7 has its head down, front legs bent, and is in a drinking posture — that's understanding behavior. Pose estimation detects specific body keypoints — joints like elbows, knees, wrists, ankles, nose, and ears — and connects them to form a skeleton.

There are two fundamentally different strategies for finding these keypoints, and the distinction matters in practice.

The top-down approach works like a spotlight. First, detect each individual with a bounding box (using any detector — YOLO, Faster R-CNN). Then, for each box, crop that region and run a specialized keypoint network on the crop. Each animal gets its own close-up analysis. HRNet (High-Resolution Network, Sun et al., 2019) is the gold standard here. Unlike most networks that downsample aggressively and then try to recover spatial detail, HRNet maintains high-resolution feature maps throughout the entire forward pass. This makes it exceptional at precise keypoint localization — it never loses the fine spatial information that tells you exactly where a joint is.

The bottom-up approach works like a floodlight. Instead of finding individuals first, it detects all keypoints in the entire image simultaneously — every knee, every elbow, every nose, from every animal, all at once. Then it groups them: "This knee belongs with that hip and that shoulder — they form one skeleton." OpenPose (Cao et al., 2017) pioneered this using Part Affinity Fields (PAFs) — vector fields that encode the direction and strength of connection between adjacent body parts. If a knee and a hip have a strong PAF connecting them, they belong to the same individual.

The spotlight versus floodlight tradeoff is real. Top-down methods (HRNet) are more accurate per individual but get slower as the scene gets more crowded — each detected individual requires a separate forward pass. Bottom-up methods (OpenPose) run once regardless of crowd size but can struggle with the grouping step when bodies overlap closely. For our wildlife cameras, where we might have 2–5 animals in frame, top-down works well. For a stadium crowd, bottom-up would be the better choice.

Then there's MediaPipe (Google, 2019), which deserves special mention not because it's the most accurate but because it runs in real-time on a phone. MediaPipe is a top-down pipeline optimized ruthlessly for speed — it uses a lightweight detector followed by a small regression network instead of heatmaps. It trades some precision for deployment practicality, and for prototyping or mobile applications, that trade is usually worth it.

For our wildlife system, pose estimation transforms tracking data ("deer #7 moved here") into behavioral data ("deer #7 is drinking," "deer #7 is running"). Conservation biologists can then ask questions like "how much time do deer spend at the waterhole?" or "do animals show stress behaviors near hiking trails?" without watching thousands of hours of footage.

Pose estimation gives us a snapshot of posture in a single frame. But animal behavior unfolds over time — a drinking motion is a sequence of poses, not a single pose. To understand actions that span multiple frames, we need models that can reason about video.

Video Understanding

Everything we've discussed so far processes one or two frames at a time. Optical flow looks at pairs. Detection and pose estimation look at single frames. But understanding what's happening in a video — recognizing that an animal is grazing versus running versus fighting — requires looking at many frames together. The question becomes: how do you feed a temporal sequence of images into a model?

The most intuitive approach is to extend our familiar 2D convolutions into the time dimension. A standard 2D kernel slides across height and width. A 3D convolution adds a third axis — it slides across height, width, and time (frames). Where a 2D kernel might be 3×3 pixels, a 3D kernel might be 3×3×3 — three pixels in each spatial direction and three frames deep. This lets the kernel detect patterns that exist in both space and time simultaneously, like a paw moving downward across three consecutive frames.
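
In PyTorch this is nearly a one-line change — `nn.Conv3d` instead of `nn.Conv2d`, with the input clip shaped (batch, channels, frames, height, width):

```python
# Sketch: a single 3D convolution over a 16-frame video clip.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)  # 3 frames deep, 3x3 pixels

features = conv3d(clip)
print(features.shape)   # torch.Size([1, 64, 16, 112, 112])
```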

C3D (Tran et al., 2015) was the first to demonstrate that straightforward 3D ConvNets, stacked into a VGG-like architecture, could learn powerful spatiotemporal features for action recognition. I3D (Carreira & Zisserman, 2017) improved on this by taking existing 2D architectures pretrained on ImageNet and "inflating" them to 3D — literally copying the 2D kernel weights along the time axis to bootstrap the 3D network. This leveraged years of work on image classification to get a running start on video.

The limitation of 3D convolutions is the same as 2D convolutions, only worse: they're local. A 3×3×3 kernel can only see 3 frames of context. To capture a 5-second action at 30fps, you'd need 150 frames of context, which requires either very deep networks or very large kernels. This is the same frustration that motivated attention mechanisms in NLP — and the same solution appeared in video.

Video Transformers like TimeSformer (Bertasius et al., 2021) and ViViT (Arnab et al., 2021) replace 3D convolutions with self-attention over space and time. They divide video into patches (like Vision Transformers do with images), but now each patch has a spatial position and a temporal position. The attention mechanism can then directly relate a patch in frame 1 to a distant patch in frame 90, without the signal needing to pass through 89 intermediate layers. TimeSformer factors this attention — first attend across space within each frame, then attend across time — which makes the computation tractable.
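
Here's a minimal sketch of that factored ("divided") attention — the class name and shapes are illustrative, not TimeSformer's actual code, and details like layer norms and residual connections are omitted:

```python
# Sketch: divided space-time attention over video tokens.
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, N, D)
        B, T, N, D = x.shape
        # Temporal attention: each patch position attends across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.time_attn(t, t, t)
        x = t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame's patches attend to each other.
        s = x.reshape(B * T, N, D)
        s, _ = self.space_attn(s, s, s)
        return s.reshape(B, T, N, D)

attn = DividedSpaceTimeAttention(dim=768)
tokens = torch.randn(2, 8, 196, 768)           # 8 frames of 14x14 patches
print(attn(tokens).shape)                      # torch.Size([2, 8, 196, 768])
```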

More recently, VideoMAE (Tong et al., 2022) brought self-supervised pretraining to video. It masks out 90% of the video patches (much more aggressive than image masking) and trains the model to reconstruct them. The high masking ratio works because video has enormous temporal redundancy — neighboring frames are nearly identical. This lets VideoMAE learn strong video representations from unlabeled footage, which is particularly relevant for our wildlife system where we have terabytes of raw video but limited labeled data.

The two-stream approach is worth knowing as a historical landmark. Two-stream networks (Simonyan & Zisserman, 2014) process RGB frames through one CNN and precomputed optical flow maps through a second CNN, then fuse the outputs. The appearance stream knows what things look like; the motion stream knows how they move. This elegant decomposition dominated video understanding for several years before 3D convolutions and transformers subsumed both functions into a single model.

For our wildlife system, video understanding is what turns "there's a deer in the frame" into "deer #7 is engaged in territorial display behavior." That semantic leap — from pixels to animal behavior — is what makes the system genuinely useful to researchers.

🛋️ Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a solid mental model of how a vision system handles the temporal dimension — tracking motion with optical flow, maintaining identities with tracking, understanding posture with pose estimation, and recognizing actions with video models. That covers a large chunk of what shows up in interview questions about "beyond detection and classification."

It doesn't cover the full picture, though. We still have visual search (how to find similar images at scale), super-resolution (what to do when your camera resolution is poor), 3D scene reconstruction (turning photos into navigable 3D worlds), and a couple of architectural tricks that power many modern systems.

If you're feeling good with the temporal tools, the short version of what's ahead: visual search uses learned embeddings plus vector databases, super-resolution invents plausible detail with GANs, NeRFs and Gaussian Splatting reconstruct 3D from 2D photos, and deformable convolutions and SE blocks are plug-in upgrades for any CNN. There. You're 70% of the way there.

But if the discomfort of skipping those topics is nagging at you, read on.

Visual Search & Retrieval

Our wildlife system has accumulated millions of animal detections over months of operation. A researcher photographs an animal at a different site and asks: "Have you seen this individual before?" We need to search through millions of stored images and find the ones that look most similar. This is visual search, and the engine that powers it is an embedding space.

The core idea is this: take any image, pass it through a CNN or Vision Transformer backbone, and extract the feature vector from one of the final layers. This vector — typically 128 to 2048 numbers — is the image's embedding. Think of it as a fingerprint for the image. Two photos of the same deer from different angles will produce similar fingerprints. A deer and a fox will produce very different ones.

The raw features from an ImageNet-pretrained backbone are decent embeddings, but they weren't trained to optimize for similarity. They were trained to classify. To get a truly good embedding space — where similar images cluster tightly and dissimilar ones are pushed far apart — you need metric learning, which means training with a loss function designed specifically for distance relationships.

Contrastive loss works with pairs. Take two images. If they're similar (same species, same individual), the loss penalizes large distances between their embeddings. If they're dissimilar, it penalizes small distances, but only up to a margin — once they're far enough apart, the loss stops caring. You're sculpting the embedding space: pulling similar things together, pushing different things apart, but not infinitely far apart.

Triplet loss works with triplets: an anchor image, a positive (similar to anchor), and a negative (different from anchor). The loss says: the distance from anchor to positive should be smaller than the distance from anchor to negative, by at least some margin. The formula is: L = max(d(anchor, positive) - d(anchor, negative) + margin, 0). When the positive is already closer than the negative by more than the margin, the loss is zero — no gradient, nothing to fix. The hard part in practice is mining good triplets. Random triplets are mostly "too easy" — the negative is already obviously far away, and the model learns nothing. Hard negative mining — deliberately selecting negatives that are close to the anchor — is what makes triplet loss training work.
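
The loss itself is a few lines of PyTorch (`torch.nn.TripletMarginLoss` implements the same objective):

```python
# Sketch: triplet loss matching the formula above.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Zero loss (and zero gradient) once the positive is closer than
    # the negative by more than the margin -- the triplet is "solved".
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (torch.randn(32, 128) for _ in range(3))  # a batch of embeddings
print(triplet_loss(a, p, n))
```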

ArcFace (Deng et al., 2019) takes a different, more elegant approach that dominates face recognition and is spreading to other domains. Instead of working with pairs or triplets, it modifies the classification loss itself. It normalizes both the feature embeddings and the classification weights to unit length, so their dot product becomes the cosine of the angle between them. Then it adds a fixed angular margin to the correct class's angle before computing softmax. This forces the model to learn embeddings where same-class samples are separated by a wider angular gap from decision boundaries. In the embedding space, classes form tighter, more compact clusters.
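
A minimal sketch of the ArcFace logit computation, assuming a batch of embeddings and a weight matrix with one column per class. The margin and scale follow the paper's typical settings; details like the easy-margin variant are omitted:

```python
# Sketch: ArcFace-style angular-margin logits, fed into cross-entropy.
import torch
import torch.nn.functional as F

def arcface_logits(emb, W, labels, margin=0.5, scale=64.0):
    # Normalize features and class weights so dot products are cosines.
    emb = F.normalize(emb, dim=1)
    W = F.normalize(W, dim=0)
    cos = emb @ W                                   # (batch, num_classes)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    # Add the angular margin only to each sample's true-class angle.
    one_hot = F.one_hot(labels, num_classes=W.shape[1]).float()
    return scale * torch.cos(theta + one_hot * margin)

emb = torch.randn(8, 512)
W = torch.randn(512, 100)                           # 100 identities
labels = torch.randint(0, 100, (8,))
loss = F.cross_entropy(arcface_logits(emb, W, labels), labels)
```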

I'm still developing my intuition for why angular margin works so much better than Euclidean margins in practice. The best explanation I've found is that in high-dimensional spaces, cosine similarity is a more stable measure than Euclidean distance — distances tend to concentrate in high dimensions, making absolute distances less meaningful, while angles remain discriminative.

Once you have good embeddings, retrieval is a nearest-neighbor search. Store all your embeddings in a vector database — FAISS (Facebook's library for fast approximate nearest neighbor search), Milvus, or Pinecone for managed cloud solutions — and query by cosine or L2 similarity. For our wildlife system, this means the researcher's photo gets embedded, and we find the 10 most similar embeddings in the database, which correspond to the 10 most visually similar animal sightings. If the same individual has been photographed before, it should appear near the top.
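
A minimal retrieval sketch with FAISS — the database here is random noise standing in for our stored sighting embeddings, and vectors are L2-normalized so inner product equals cosine similarity:

```python
# Sketch: embedding storage and nearest-neighbor query with FAISS.
import numpy as np
import faiss

dim = 512
db_embeddings = np.random.randn(100_000, dim).astype("float32")
faiss.normalize_L2(db_embeddings)        # normalize in place

index = faiss.IndexFlatIP(dim)           # exact inner-product search
index.add(db_embeddings)

query = np.random.randn(1, dim).astype("float32")  # the researcher's photo
faiss.normalize_L2(query)

scores, ids = index.search(query, 10)    # 10 most similar stored sightings
```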

This is the same machinery behind reverse image search, "find similar products" on e-commerce sites, and face verification in security systems. The backbone, the loss function, and the vector database — that's the whole pipeline.

The limitation is that embedding quality is everything. A mediocre embedding space makes the entire system unreliable. And training good embeddings requires carefully curated pairs or triplets, which can be expensive to label.

Image Super-Resolution

Some of our wildlife cameras are cheap. They capture grainy, low-resolution images, especially at night when the infrared sensor kicks in. Researchers want to zoom in on a distant animal and see enough detail to identify the species, maybe even the individual. Image super-resolution trains a model to upscale a low-resolution image to a higher resolution version with plausible fine detail.

The standard approach is GAN-based. ESRGAN (Enhanced Super-Resolution GAN, Wang et al., 2018) uses a generator network to upscale the image and a discriminator network that judges whether the output looks like a real high-resolution photo. The generator learns to add texture, sharpen edges, and synthesize detail that fools the discriminator. Real-ESRGAN (Wang et al., 2021) extends this to handle real-world degradation — not the clean downsampling of academic benchmarks, but the messy blur, noise, and compression artifacts of actual cameras.

The results can be stunning. Blurry text becomes readable. Faces gain recognizable features. Fur texture appears where there was only a smudge. But here's where honesty matters: the model is hallucinating that detail. It is generating plausible-looking pixels based on patterns it learned during training. It is not recovering information that was lost during image capture. If a pixel was genuinely blurred beyond recognition, no neural network can determine what the original value was — it can only guess what would look realistic.

This distinction is critical in applications where accuracy matters. If you're enhancing a wildlife photo for a blog post, hallucinated fur texture is fine. If you're trying to identify an individual animal by its unique spot pattern, hallucinated spots would be actively misleading. In medical imaging, hallucinated detail could lead to false diagnoses. In forensic imaging, it could be fabricating evidence.

I've seen too many demos where super-resolution is presented as "enhancing" an image, as if the model is recovering hidden ground truth. It's not. It's making an educated guess. A very good educated guess, often indistinguishable from reality — but a guess nonetheless. Knowing this doesn't diminish the technology's usefulness; it calibrates your trust in its outputs.

For our wildlife system, super-resolution is genuinely helpful for making distant animal captures more visually interpretable. The researcher gets a cleaner image to look at. But any species identification or individual re-identification should be done on the raw (or minimally processed) images, not the hallucinated ones.

Neural Radiance Fields & Gaussian Splatting

Here's a different kind of challenge for our wildlife system: the research team wants to map the 3D structure of the forest habitat. They've walked through the area taking photos from different angles. Can we turn those 2D photos into a navigable 3D model?

This is the problem that Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) solved with startling quality. A NeRF represents a 3D scene not as a mesh or a point cloud, but as a neural network — a small MLP that takes a 5D input and produces a 4D output. The input is a position in space (x, y, z) plus a viewing direction (θ, φ). The output is a color (r, g, b) and a volume density (σ) — how opaque the point is. Solid surfaces have high density. Empty air has zero density. Semi-transparent things like smoke or glass have intermediate values.

To render a pixel in a new view, you shoot a ray from the camera through that pixel into the scene, sample dozens of points along the ray, query the neural network for each point's color and density, and composite them front-to-back using differentiable volume rendering. Dense points block the ray; the first dense thing the ray hits dominates the pixel color. Because the entire rendering process is differentiable, you can train the network by backpropagating through the rendering — comparing rendered images to real photos and adjusting the network to reduce the difference.
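
The compositing step is compact enough to sketch. Assuming the network has already returned per-sample colors and densities for one ray (variable names are illustrative):

```python
# Sketch: differentiable volume rendering along a single ray.
import torch

def composite(colors, sigmas, deltas):
    """colors: (S, 3), sigmas: (S,), deltas: (S,) spacing between samples."""
    alpha = 1.0 - torch.exp(-sigmas * deltas)        # opacity of each sample
    # Transmittance: how much light survives to reach each sample.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])   # shift: first sample sees all
    weights = alpha * trans                          # contribution per sample
    return (weights[:, None] * colors).sum(dim=0)    # final pixel color

colors = torch.rand(64, 3)            # 64 samples along the ray
sigmas = torch.rand(64) * 5.0
deltas = torch.full((64,), 0.02)
pixel = composite(colors, sigmas, deltas)
```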

Give a NeRF 50–100 photos with known camera positions, and it learns to render photorealistic novel views from any angle. The results were jaw-dropping when the paper came out. I'm still developing my intuition for why a small MLP can encode such rich 3D detail — the explanation involves positional encoding (mapping coordinates through sinusoidal functions to help the network learn high-frequency details) and the implicit smoothness of neural networks, but I won't pretend to fully grok it yet.

The catch is speed. Rendering a single pixel requires hundreds of network evaluations along the ray. Rendering a full image takes seconds. That's fine for offline applications but useless for real-time exploration.

3D Gaussian Splatting (Kerbl et al., 2023) solves the speed problem by abandoning the neural network representation entirely. Instead of an implicit function, it represents the scene as millions of tiny, fuzzy 3D ellipsoids — 3D Gaussians. Each Gaussian has a position, shape (orientation and extent), color, and opacity. To render a view, you project all the Gaussians onto the camera plane (this is the "splatting") and alpha-composite them. No neural network evaluation per ray. No volume sampling. The result: real-time rendering at 60+ frames per second.

The quality in 2024 rivals and sometimes surpasses NeRFs, while being orders of magnitude faster to render. Training is also faster — instead of backpropagating through a deep network, you optimize each Gaussian's parameters directly (still by gradient descent) to match the training views. The tradeoff is storage: millions of Gaussians take more memory than a compact neural network.

For our wildlife system, 3D reconstruction lets researchers build navigable models of animal habitats, measure terrain features, or plan camera placements from their desks. If you're starting a new 3D reconstruction project today, Gaussian Splatting is where you'd begin.

Deformable Convolutions

Let's shift from applications to architectural building blocks — components you can plug into existing networks to make them better.

A standard convolution samples pixels on a fixed rectangular grid. A 3×3 kernel always samples the 9 pixels at the same relative positions: top-left, top-center, top-right, and so on. This rigidity is both the strength and the weakness of convolutions. The strength: it's efficient and translation-equivariant. The weakness: not everything in nature is rectangular.

Imagine our wildlife camera captures a snake curving through the grass. The standard 3×3 kernel sees a small square patch — it might catch part of the body, but the rectangular sampling grid doesn't follow the snake's curved shape. It would need many layers of context to "understand" the full curve.

Deformable convolutions (Dai et al., 2017) fix this by making the sampling grid learnable. For each output location, the network predicts a set of 2D offsets — one (Δx, Δy) per sampling point. The kernel's 9 sampling positions get shifted by these offsets before extracting features. The offsets are real-valued (not integer pixels), so the network uses bilinear interpolation to sample at fractional positions.

Here's the crucial detail: the offsets are predicted by an additional convolutional layer that sits alongside the main convolution. It takes the same input features and produces an offset map with 2·k·k channels — one (Δx, Δy) pair for each of the k×k sampling points. During training, backpropagation flows through both the convolution weights and the offset-prediction layer, so the network simultaneously learns what features to extract and where to look for them.
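
Here's a minimal sketch using `torchvision.ops.deform_conv2d`. Initializing the offset branch to zero — so training starts from a regular grid — is common practice, but a choice I'm making here:

```python
# Sketch: a deformable convolution layer with a learned offset branch.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Side branch: predicts (dx, dy) for each of the k*k sampling points.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.padding = padding

    def forward(self, x):
        offsets = self.offset_conv(x)            # (N, 2*k*k, H, W)
        return deform_conv2d(x, offsets, self.weight, padding=self.padding)

layer = DeformableConv2d(64, 128)
x = torch.randn(1, 64, 56, 56)
print(layer(x).shape)                            # torch.Size([1, 128, 56, 56])
```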

If you visualize the sampling positions of a deformable convolution on an image of that snake, you'd see the 9 points stretching and curving along the body, rather than sitting in a rigid square. The kernel has learned to deform itself to match the geometry of the feature it's looking at.

Deformable DETR (Zhu et al., 2021) applies this same idea to attention — instead of attending to all positions uniformly, it learns to attend to a small set of offset positions around each query. This makes the attention mechanism both more efficient (fewer positions to attend to) and more effective (it focuses on relevant locations).

In our wildlife system, deformable convolutions would help detect animals of any shape and orientation — a coiled snake, a bird with spread wings, a deer partially occluded by branches — all without needing specialized augmentation for each geometric case.

Squeeze-and-Excitation Blocks

A convolutional layer produces many channels — think of each channel as a feature detector for a different pattern. One channel might respond to horizontal edges, another to green textures, another to circular shapes. In a typical convolution, all channels are treated equally. But for any given image, some channels carry much more useful information than others. An image of a forest scene doesn't need the "brick texture" channel at full volume.

Squeeze-and-Excitation (SE) blocks (Hu et al., 2018) add a simple mechanism that lets the network learn which channels matter for each input. It's a form of channel attention.

The process has two steps, and the names are apt. First, squeeze: each channel's entire spatial feature map gets reduced to a single number via global average pooling. If you had a 7×7×512 feature map, you now have a 512-dimensional vector. Each number summarizes one channel's overall activation. This is the "what's present?" summary.

Second, excitation: that summary vector passes through a small two-layer fully connected bottleneck network — first down to, say, 512/16 = 32 neurons (with ReLU), then back up to 512 neurons (with sigmoid). The output is 512 values between 0 and 1 — one per channel. These are learned importance weights. Multiply each channel by its weight, and you've selectively amplified the channels that matter for this particular input and suppressed the ones that don't.

The bottleneck matters — it's what keeps the SE block lightweight. Going from 512 to 32 and back forces the network to learn a compressed, generalizable channel relationship, rather than memorizing arbitrary per-channel weights. The total parameter count added by an SE block is negligible compared to the layers it modifies.
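
The whole block is about a dozen lines of PyTorch. A minimal sketch with the paper's reduction ratio of 16:

```python
# Sketch: a Squeeze-and-Excitation block.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # back up
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x):                    # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))               # squeeze: global average pool
        w = self.fc(s).view(n, c, 1, 1)      # excitation: per-channel weights
        return x * w                         # reweight the channels

se = SEBlock(512)
x = torch.randn(1, 512, 7, 7)
print(se(x).shape)                           # torch.Size([1, 512, 7, 7])
```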

SENet won the 2017 ImageNet competition, improving top-1 accuracy by about 1% across various architectures at minimal computational cost. Since then, SE blocks (or close variants) have been integrated into EfficientNet, RegNet, MobileNetV3, and many other architectures. They're the quintessential "free lunch" in neural architecture design — drop them in, get a small but consistent improvement, move on.

For our wildlife system, SE blocks would help the classification backbone focus on species-relevant features while suppressing irrelevant background textures. In a forest scene full of green foliage, the block learns to dial down the "green texture" channels and dial up the "animal shape" channels, adapting its attention to what matters for the current input.

Stepping Back

If you're still with me, thank you. That was a lot of ground to cover.

We started at the pixel level with optical flow — measuring raw motion between frames. We built up to tracking, where detections gain persistent identities. We added pose estimation to understand body posture, and video understanding to recognize behavior over time. Then we shifted to visual search with learned embeddings, super-resolution for enhancing low-quality captures, 3D reconstruction with NeRFs and Gaussian Splatting, and finally two architectural building blocks — deformable convolutions and SE blocks — that quietly improve the networks powering all of the above.

My hope is that the next time you see a job posting requiring "experience with multi-object tracking" or read a paper citing "deformable attention," instead of the reflexive nod-and-change-subject that I used to default to, you'll have a concrete mental model of what's happening under the hood — and the confidence to dig into the implementation details when the time comes.

Resources