Neural Network Basics

Chapter 7: Deep Learning Foundations
Neurons · Perceptrons · Activations · Initialization
Section Overview

From a single neuron that can just barely manage AND and OR, all the way to deep networks that bend space and compose abstractions — built from scratch, one piece at a time.

I avoided neural networks for longer than I'd like to admit. Every time someone mentioned "perceptrons" or "hidden layers" or "activation functions," I'd nod along and quietly change the subject. I could use PyTorch. I could stack layers. But if someone had asked me why stacking layers works — why two layers can solve problems one layer provably cannot — I'd have mumbled something about "nonlinearity" and hoped no one pressed further. Finally the discomfort of not knowing what's really going on under the hood grew too great. Here is that dive.

Neural networks are function approximators built from simple computational units. The concept traces back to McCulloch and Pitts in 1943, became trainable with Rosenblatt's Perceptron in 1958, nearly died after the XOR crisis in 1969, and was resurrected by backpropagation in 1986. Today these same building blocks power everything from spam filters to GPT. The history matters because the failures tell you more about how these things actually work than the successes do.

Before we start, a heads-up. We're going to be doing some linear algebra (dot products, matrix multiplications) and touching on gradients, but you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The Artificial Neuron
The Perceptron — A Neuron That Learns
The XOR Wall
Stacking Layers — The Multilayer Perceptron
Rest Stop
Why Nonlinearity Is Everything
ReLU and the Dying Neuron Problem
The Smooth Gates — GELU and Swish
Sigmoid, Tanh, and Softmax — The Output Activations
Weight Initialization — Xavier and He
The Universal Approximation Theorem
Wrap-Up
Resources

The Artificial Neuron

Let's start with a scenario we'll carry through the entire section. Imagine we're building a tiny spam detector. Our inputs are two numbers: the word count of an email and whether the email contains a link (1 for yes, 0 for no). Our output is a single number — how "spammy" does this email look?

An artificial neuron is the smallest possible machine that can take those two numbers and produce a verdict. It does three things. First, it multiplies each input by a weight — a number that says how much this neuron cares about that particular input. Second, it adds a bias — a constant that shifts the decision boundary up or down, the way you might raise or lower the bar for what counts as suspicious. Third, it passes the result through a nonlinear function called an activation function, which squashes the output into a useful range.

That's the entire unit. Every neural network ever built is made of these.

import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

x = np.array([150, 1])    # 150-word email with a link
w = np.array([0.01, 0.9]) # cares a lot about links, a little about length
b = -1.0

print(neuron(x, w, b))    # 0.80 — looks fairly spammy

The weight on the link feature is 0.9 — this neuron is heavily suspicious of links. The weight on word count is 0.01 — it cares a little, but not much. The bias of −1.0 means the neuron starts skeptical; the inputs have to overcome that negative baseline before the output climbs above 0.5. Change the weights and bias and you get a completely different detector. The weights are what the neuron learns. The activation function is what makes the learning interesting.

I'll be honest — when I first saw this, I thought "that's a dot product and a squashing function, what's the big deal?" The big deal isn't one neuron. It's what happens when you start stacking them.

The Perceptron — A Neuron That Learns

Our spam neuron above had weights we chose by hand. That's fine for two features. It's absurd for two thousand. In 1958, Frank Rosenblatt took the McCulloch-Pitts neuron and added one critical ingredient: a learning rule. The result was the Perceptron — a single neuron that adjusts its own weights based on its mistakes.

The idea is almost embarrassingly simple. Show the perceptron an email. If it classifies correctly, do nothing. If it gets it wrong, nudge the weights toward the correct answer. The size of the nudge is controlled by a learning rate.

if prediction != true_label:
    w = w + learning_rate * true_label * x
    b = b + learning_rate * true_label

When the perceptron says "not spam" but the email is spam (+1), the update adds the input features to the weights — effectively saying "pay more attention to whatever this email looked like." When it says "spam" but the email is legitimate (−1), the update subtracts — "pay less attention to these features."

Rosenblatt proved something remarkable: if the data is linearly separable, the perceptron is guaranteed to converge to a perfect classifier in a finite number of steps. That's the Perceptron Convergence Theorem. It means this tiny learning rule isn't a heuristic — it has a mathematical guarantee. Here it is learning the AND gate:

import numpy as np

X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([-1, -1, -1, 1])   # AND: only (1,1) is positive

w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(20):
    errors = 0
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b >= 0 else -1
        if pred != yi:
            w += lr * yi * xi
            b += lr * yi
            errors += 1
    if errors == 0:
        break

print(f"Converged in {epoch+1} epochs, weights: {w}, bias: {b:.1f}")

Geometrically, what the perceptron learned is a line (in 2D) or a hyperplane (in higher dimensions) that separates the two classes. Everything on one side is +1, everything on the other is −1. For AND, that line passes between the point (1,1) and the other three points. This is elegant, and it works for any linearly separable problem — AND, OR, NAND, our two-feature spam detector.
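
To see that geometry concretely, we can score each corner of the input square against the learned line (continuing from the code above, so w and b are the values the loop converged to):

# Points with a non-negative score fall on the +1 side of the learned boundary.
for xi, yi in zip(X, y):
    score = np.dot(w, xi) + b
    side = 1 if score >= 0 else -1
    print(f"{xi} -> score {score:+.2f}, predicted {side:+d}, true {yi:+d}")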

But there's a catch. A devastating one.

The XOR Wall

Plot the XOR function on a 2D grid. (0,0) maps to 0. (0,1) maps to 1. (1,0) maps to 1. (1,1) maps to 0. The positives sit on opposite corners. Try to draw a single straight line that separates them.

You can't. No line works. XOR is not linearly separable.
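
You can watch the perceptron hit this wall by rerunning the training loop from the AND example with XOR labels (X and lr carried over from above); the error count never reaches zero, no matter how many epochs you allow:

y_xor = np.array([-1, 1, 1, -1])   # XOR: the diagonal corners are the positives
w, b = np.zeros(2), 0.0

for epoch in range(1000):
    errors = 0
    for xi, yi in zip(X, y_xor):
        pred = 1 if np.dot(w, xi) + b >= 0 else -1
        if pred != yi:
            w += lr * yi * xi
            b += lr * yi
            errors += 1
    if errors == 0:
        break

print(f"errors in the last epoch: {errors}")   # still nonzero after 1000 epochs; the weights just cycle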

In 1969, Marvin Minsky and Seymour Papert published a book called Perceptrons that proved this limitation with mathematical rigor. The reaction was severe — research funding dried up, labs pivoted to symbolic AI, and neural networks entered what's now called the first AI winter. For roughly a decade, the field was considered a dead end.

I'll be honest — the XOR thing confused me for years. Not the math, but the response to it. The fix is so obvious in hindsight: use more than one layer. A single neuron draws one line. Two neurons in a hidden layer draw two lines. A third neuron on top can combine those two boundaries in a way that carves out exactly the XOR pattern. The hidden layer transforms the input space — folding it, like origami, until the classes that were tangled together become separable.

Think of a sheet of paper with two blue dots and two red dots on opposite corners. You can't cut the paper with one straight cut and separate the colors. But fold the paper in half, and now both blue dots line up on one side and both red dots on the other. One cut does it. That fold is what a hidden layer does to the input space.
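
Here is that fold written out as code: a minimal hand-wired sketch (the weights are chosen by hand, not learned). The two hidden neurons compute OR and NAND, two different lines through the input space, and the output neuron ANDs them together, which is exactly XOR.

import numpy as np

def step(z):
    return (z >= 0).astype(int)

# Hidden layer: two lines through the input plane
W1 = np.array([[ 1.0,  1.0],    # neuron 1: OR   (fires when x1 + x2 - 0.5 >= 0)
               [-1.0, -1.0]])   # neuron 2: NAND (fires when -x1 - x2 + 1.5 >= 0)
b1 = np.array([-0.5, 1.5])

# Output neuron: AND of the two hidden units
W2 = np.array([1.0, 1.0])
b2 = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)
    print(x, "->", step(W2 @ h + b2))   # prints 0, 1, 1, 0: XOR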

The problem wasn't that nobody imagined stacking layers. It was that nobody knew how to train them. The perceptron learning rule works for one layer, but breaks for two. It took until 1986 — when Rumelhart, Hinton, and Williams published their paper on backpropagation — for training deep networks to become practical.

Stacking Layers — The Multilayer Perceptron

A Multilayer Perceptron (MLP) is what you get when you stack layers of neurons and connect every neuron in one layer to every neuron in the next. The first layer takes the raw inputs. The last layer produces the output. Everything in between is a hidden layer — hidden because we never directly observe its outputs; they're intermediate representations the network invents on its own.

Back to our spam detector. With one neuron and two features, we could draw one line through the "word count vs. has-link" space. With a hidden layer of, say, four neurons, we get four lines. Each hidden neuron detects a different pattern — maybe one fires for short emails with links, another for long emails without links, a third for emails in a specific word-count range. The output neuron then combines these four signals into a final spam score. The hidden layer has created a new representation of the email — richer than the raw features, tailored to the task.

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs → 4 hidden → 1 output (random, untrained weights, so the score varies run to run)
W1 = np.random.randn(4, 2) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(1, 4) * 0.5
b2 = np.zeros(1)

x = np.array([150, 1])            # our email
h = relu(W1 @ x + b1)             # hidden layer: 4 intermediate features
out = sigmoid(W2 @ h + b2)        # output layer: spam probability
print(f"Spam score: {out[0]:.3f}")

Each layer applies a linear transformation (matrix multiply plus bias) followed by a nonlinear activation. The linear part rotates and scales the space. The nonlinear part bends it. Stack enough of these bend-and-rotate operations and you can warp the input space into nearly any shape — separating classes that no single line could touch.

That's the depth in "deep learning." A single neuron computes one weighted sum. A deep network composes hundreds of them, each building on the representations created by the layer before it. The first layer of an image network might detect edges. The second combines edges into textures. The third composes textures into object parts. The fourth recognizes the object. Each layer is an assembly line station, and what comes off the line at the end is a representation purpose-built for the task. Our origami analogy works here too — each layer is another fold, and by the time you've made enough folds, the paper can take on remarkably complex shapes.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

What you have now: a mental model that takes you from a single neuron (weighted sum + bias + activation) through the perceptron (a neuron that learns) to the MLP (layers of neurons that compose increasingly complex representations). You understand why XOR breaks a single perceptron, what a hidden layer actually does to the input space, and why depth matters. That's a genuinely useful mental model — enough to understand most architecture diagrams and have informed conversations about neural network design.

It doesn't tell the complete story, though. We haven't talked about which activation function to use and why it matters so much, how to initialize the weights so training doesn't collapse before it starts, or the remarkable theorem that says these networks can approximate any function — and why that theorem is less helpful than it sounds.

The short version: use ReLU for hidden layers, sigmoid or softmax for outputs, He initialization for ReLU networks. There. You're 80% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Why Nonlinearity Is Everything

Here's an experiment that reveals why activation functions aren't optional decoration. Take two linear layers: y = W₂(W₁x). Expand it: y = (W₂W₁)x = Wx. Two layers collapsed into one matrix multiply. Stack a hundred linear layers and you still get a single matrix multiply. The proof is one line of algebra and the implication is devastating: without a nonlinear activation between layers, depth is an illusion.

Your 50-layer network would compute exactly the same family of functions as a single layer. All that depth, all those parameters, and you'd have an expensive way to do linear regression.
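
Here is that one line of algebra checked numerically, a quick sketch with arbitrary layer sizes:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((3, 8))
x = rng.standard_normal(4)

two_layers = W2 @ (W1 @ x)        # a "two-layer" linear network
one_matrix = (W2 @ W1) @ x        # a single precomputed matrix
print(np.allclose(two_layers, one_matrix))   # True: the depth collapsed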

Activation functions break this collapse. By inserting a nonlinear function after each linear transformation, you make each layer's output something the next layer cannot replicate with a matrix multiply alone. Each layer can now carve a new nonlinear boundary in representation space, and the composition of these boundaries — like successive folds in our origami — can approximate arbitrarily complex functions. The activation function is what turns a stack of matrix multiplies into a universal computing machine.

ReLU and the Dying Neuron Problem

For years, everyone used sigmoid as the activation function. Networks deeper than two or three layers wouldn't train — the gradients evaporated to nothing by the time they reached the early layers. The field was stuck. Then came the Rectified Linear Unit:

ReLU(x) = max(0, x)

That's it. If the input is positive, pass it through unchanged. If it's negative, output zero. It looks too simple to matter, but three properties made it the activation function that unlocked deep learning.

First, the gradient for positive inputs is exactly 1. Not 0.25 (sigmoid's maximum), not something that decays toward zero — a flat, constant 1. Gradients flow through ReLU layers like water through an open pipe, no matter how many layers deep. This is what killed the vanishing gradient problem for feedforward networks.

Second, sparse activation. For any given input, roughly half the neurons in a layer output zero. This sounds wasteful, but it's actually a feature — it's implicit regularization. The network is forced to distribute its computation across different neurons for different inputs, making the representations more interpretable and less prone to overfitting.

Third, computational cost. ReLU is a comparison and a branch. No exponentials, no divisions. On a GPU processing millions of activations per forward pass, this adds up.

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(F.relu(x))
# tensor([0., 0., 0., 0., 1., 2., 3.])  — negative inputs silenced

But ReLU has a failure mode that bit me more than once when I was starting out. If a neuron's weights drift into a region where the pre-activation is negative for every training example — which happens with bad initialization or an overly aggressive learning rate — the output is permanently zero. The gradient is also zero. The weights never update. That neuron is dead. It contributes nothing for the rest of training, and it's never coming back.

This is the dying ReLU problem, and it's not rare. With poor initialization, 20–40% of your neurons can die in the first few epochs. I've watched training runs where loss plateaued suspiciously early, and when I checked the hidden activations, a third of the neurons were flatlined at zero.
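
A quick way to check for this is to push a batch through a layer and count the neurons that never fire. The sketch below provokes the problem deliberately by giving every neuron a large negative bias, a stand-in for weights that have drifted into the dead zone:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(2, 64)
with torch.no_grad():
    layer.bias.fill_(-5.0)        # simulate a layer whose pre-activations went badly negative

h = torch.relu(layer(torch.randn(1000, 2)))
dead = (h == 0).all(dim=0).sum().item()
print(f"{dead} of 64 neurons never fired on this batch")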

Leaky ReLU is the standard insurance policy. For negative inputs, instead of outputting zero, it outputs a small fraction of the input — typically 0.01x. Dead neurons can now recover because they still have a nonzero gradient. The variant called PReLU makes that slope learnable, and ELU uses a smooth exponential curve for negatives, but Leaky ReLU with α=0.01 is the pragmatic default.

print(F.leaky_relu(x, negative_slope=0.01))
# tensor([-0.0300, -0.0200, -0.0100, 0.0000, 1.0000, 2.0000, 3.0000])
# negative side: tiny but nonzero — a lifeline for dead neurons

ReLU's limitation — that hard cutoff at zero — is exactly what motivated the next generation of activation functions.

The Smooth Gates — GELU and Swish

ReLU makes a binary decision at zero: pass or block. There's no in-between. For convolutional networks this works brilliantly. But when transformers came along — processing subtle, distributed patterns across long sequences — researchers found that a softer decision boundary worked better.

The Gaussian Error Linear Unit (GELU) handles this with a probabilistic twist:

GELU(x) = x · Φ(x)

Here Φ(x) is the cumulative distribution function of the standard normal distribution — the probability that a random sample from a bell curve falls below x. Large positive values pass through nearly unchanged (Φ(x) ≈ 1). Large negative values get zeroed out (Φ(x) ≈ 0). But — and this is the key — small negative values get a small, nonzero output weighted by their probability of being positive. GELU doesn't slam the gate shut at zero. It eases it closed, smoothly.

My favorite thing about GELU is the intuition: each input is gated by how likely it is to be "on" under a Gaussian distribution. It's as if every neuron asks "how confident am I that this signal matters?" and scales accordingly. This smooth gating plays well with the high-dimensional loss landscapes that transformers navigate, and with layer normalization that keeps activations roughly Gaussian.

x = torch.linspace(-3, 3, 7)
print(F.gelu(x))
# tensor([-0.0040, -0.0455, -0.1587, 0.0000, 0.8413, 1.9545, 2.9960])
# small negatives produce small negative outputs — the smooth gate in action

GELU is the default in BERT, GPT, and Vision Transformers. In PyTorch it's nn.GELU(), which also offers a faster tanh approximation via approximate='tanh'. Use it when building transformer architectures.

SiLU (also called Swish) takes a simpler path to the same destination: SiLU(x) = x · σ(x). The input gates itself through its own sigmoid. The shape is strikingly similar to GELU — the two curves differ by at most about 0.2, and only in a band around |x| ≈ 2; elsewhere they nearly coincide. Discovered via automated neural architecture search at Google Brain in 2017, Swish shows up in EfficientNet and some diffusion model backbones.
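
If you want to see how close they are, a two-line check (using the F alias imported earlier):

xs = torch.linspace(-6, 6, 1201)
gap = (F.gelu(xs) - F.silu(xs)).abs().max()
print(f"largest gap between GELU and SiLU: {gap:.3f}")   # about 0.19, near |x| = 2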

The practical difference between GELU and SiLU? Negligible in most experiments. Use whichever your architecture's reference implementation specifies. If you're designing from scratch, GELU has more empirical backing for transformers, SiLU for convolutional nets. But I'm still developing my intuition for when the difference actually matters — and I suspect in most cases, it doesn't.

Sigmoid, Tanh, and Softmax — The Output Activations

Some activation functions don't belong in hidden layers anymore, but they're far from dead. They've found permanent homes in output layers and inside specific architectures.

The sigmoid function, σ(x) = 1/(1+e⁻ˣ), squashes any real number into the range (0, 1). For hidden layers, this is a disaster — the maximum gradient is 0.25 (at x=0), and it drops to near-zero for |x| > 4. Chain five sigmoid layers and your gradient is 0.25⁵ ≈ 0.001. That's the vanishing gradient problem, and it's why deep sigmoid networks from the 1990s barely trained. Add that sigmoid outputs are always positive (never zero-centered), which forces weight gradients to zig-zag during optimization.

But sigmoid has one perfect job: the output layer of a binary classifier. You need a number between 0 and 1 that you can interpret as a probability, and sigmoid delivers exactly that. Paired with binary cross-entropy loss, the gradients simplify beautifully to (predicted − actual). That's where sigmoid lives now, and it's not going anywhere.
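
You can confirm that gradient with autograd, a small sketch on a single logit:

z = torch.tensor([1.5], requires_grad=True)   # a raw logit
y = torch.tensor([1.0])                       # the true label

loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

print(z.grad, torch.sigmoid(z).detach() - y)  # identical: the gradient is (predicted - actual)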

Tanh squashes inputs to (−1, 1). It's a rescaled sigmoid — tanh(x) = 2σ(2x) − 1 — but the zero-centered output eliminates the zig-zag gradient problem, and its maximum derivative is 1.0 instead of 0.25. Tanh still saturates at the extremes, so it shares sigmoid's vanishing gradient issues for deep stacking. You'll encounter tanh inside LSTM and GRU gates, where bounded outputs between −1 and 1 are a deliberate design choice. For new architectures, tanh in hidden layers is mostly historical.
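
That rescaling identity is easy to verify:

x = torch.linspace(-3, 3, 7)
print(torch.allclose(torch.tanh(x), 2 * torch.sigmoid(2 * x) - 1))   # True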

Softmax is different from the others — it's not a per-neuron activation. It operates across an entire vector of raw scores (called logits) and converts them into a probability distribution:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)

Every output lands in (0, 1), and they sum to exactly 1. This makes softmax the standard final activation for multi-class classification — if our spam detector had three categories (spam, promotional, legitimate), the final layer would have three neurons feeding into softmax, and the output would be a probability distribution over those categories.

A useful trick: divide the logits by a temperature parameter T before applying softmax. Low temperature (T<1) sharpens the distribution toward the highest logit. High temperature (T>1) flattens it toward uniform. This shows up in LLM text generation (controlling "creativity"), in knowledge distillation (soft targets at T=20), and in calibrating overconfident models.

logits = torch.tensor([2.0, 1.0, 0.1])

print(F.softmax(logits / 0.5, dim=0))  # sharp:  [0.864, 0.117, 0.019]
print(F.softmax(logits / 1.0, dim=0))  # normal: [0.659, 0.242, 0.099]
print(F.softmax(logits / 2.0, dim=0))  # flat:   [0.502, 0.304, 0.194]

Weight Initialization — Xavier and He

Here's something that confused me for an embarrassingly long time: why does the way you start the weights matter so much? The network is going to learn the right weights through training anyway, right?

Wrong. The starting point determines whether training even gets off the ground.

Initialize all weights to the same value and every neuron in a layer computes the same thing. They receive the same gradients. They update identically. They stay identical forever. You've paid for 256 neurons but you're getting the expressive power of one. This is the symmetry breaking problem, and it's why we initialize randomly.
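
Here is the symmetry problem in a few lines, a sketch where every parameter starts at the same constant:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)     # every weight and bias identical

loss = net(torch.randn(8, 2)).sum()
loss.backward()
print(net[0].weight.grad)         # all four rows identical: the hidden neurons are clones and stay clones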

But random isn't enough. Initialize too large and the activations explode — outputs saturate, gradients become enormous or NaN. Initialize too small and the signal shrinks to nothing as it passes through layers — the vanishing activation problem. Both scenarios make training either impossible or painfully slow. What you need is random initialization at the right scale.

Xavier initialization (Glorot, 2010) was designed for sigmoid and tanh networks. The idea: keep the variance of activations roughly constant across layers. If a layer has n_in inputs and n_out outputs, draw weights from a distribution with variance 2/(n_in + n_out). This ensures that signals neither amplify nor shrink as they flow forward, and gradients neither explode nor vanish as they flow backward.

He initialization (Kaiming, 2015) adapted the same principle for ReLU. Because ReLU zeros out roughly half its inputs, the surviving activations need to be twice as large to maintain the same variance. The fix is to use variance 2/n_in — twice the scale of Xavier. This one change made it practical to train networks with dozens or hundreds of ReLU layers.
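
A simulation makes the scale argument visceral. The sketch below (an illustration with made-up layer sizes, not taken from either paper) pushes a signal through 20 ReLU layers twice, once with an arbitrary small weight scale and once with the He scale √(2/n_in), and reports how big the activations are at the end:

import numpy as np

def final_activation_std(depth, weight_std, n=512, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std
        h = np.maximum(0, W @ h)          # linear layer followed by ReLU
    return h.std()

print(f"after 20 layers, naive 0.01 scale: {final_activation_std(20, 0.01):.2e}")              # vanishes toward zero
print(f"after 20 layers, He scale:         {final_activation_std(20, np.sqrt(2 / 512)):.2f}")  # stays on the order of 1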

import torch.nn as nn

# Xavier — use with sigmoid/tanh
layer_xavier = nn.Linear(256, 128)
nn.init.xavier_normal_(layer_xavier.weight)

# He — use with ReLU
layer_he = nn.Linear(256, 128)
nn.init.kaiming_normal_(layer_he.weight, nonlinearity='relu')

The rule is mercifully simple: He initialization for ReLU networks, Xavier for sigmoid/tanh. PyTorch's default is Kaiming uniform, which is reasonable for ReLU. If you're using GELU in a transformer, the convention is Xavier (and most transformer codebases handle this for you). I still occasionally get tripped up mixing these when porting architectures between frameworks — the mismatch is subtle enough that training doesn't fail, it's instead mysteriously slower than expected, which is the worst kind of bug.

The Universal Approximation Theorem

Now we can tackle the theorem that sounds too good to be true.

The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) says this: a neural network with a single hidden layer and a sufficient number of neurons can approximate any continuous function on a bounded domain to any desired accuracy. Any function. One hidden layer. As long as you have enough neurons and the right weights.

When I first read this, I thought "then why do we need deep networks at all? One fat hidden layer should be enough." And in theory, it is. In practice, "sufficient number of neurons" can mean an astronomically large number. Some functions that a 10-layer network represents with a few thousand parameters would require a single-layer network with millions or billions of neurons. The theorem is an existence proof, not a construction guide — it tells you a solution exists but says nothing about how to find the weights, how long training will take, or how many neurons you'll actually need.
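
To see the existence claim on a toy problem, here is a sketch that fits sin(2x) with a single wide hidden layer. (It uses PyTorch's Adam optimizer as a black box to find the weights; how training actually works is beyond this section.)

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(2 * x)                         # the function to approximate

net = nn.Sequential(nn.Linear(1, 256), nn.ReLU(), nn.Linear(256, 1))   # one wide hidden layer
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"mean squared error: {loss.item():.5f}")   # small: one wide layer really can fit this curve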

This is where depth becomes not a luxury but a necessity. Deep networks compose features hierarchically — the first layer detects edges, the second combines edges into textures, the third composes textures into parts. This compositional structure mirrors the hierarchical structure of real-world data, and it achieves the same expressive power with exponentially fewer parameters than a wide, shallow network.

My favorite thing about this theorem is that, aside from high-level explanations like the one above, no one is completely certain why depth provides such dramatic efficiency gains. We have theoretical results showing that certain functions require exponentially many neurons in a shallow network but only polynomially many in a deep one. We have empirical evidence from every successful deep learning system ever built. But the full picture of why depth + compositionality works so well on real-world problems — that's still an active area of research. The honest answer is: it works, we know some reasons why, and we're still figuring out the rest.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a single artificial neuron — three operations, five lines of code. We gave it the ability to learn from mistakes and got the perceptron. We hit the XOR wall, watched the field nearly die, and found the exit by stacking layers into MLPs. We discovered that without nonlinear activations, depth is an illusion — and traced the evolution from sigmoid (vanishing gradients, the 90s struggle) through ReLU (the breakthrough) to GELU and Swish (the smooth gates that power transformers). We learned why initialization isn't a detail but a prerequisite, and we confronted the universal approximation theorem — a guarantee that's simultaneously reassuring and frustratingly unhelpful.

My hope is that the next time you see an architecture diagram with layers and activation functions, instead of nodding along and changing the subject, you'll trace the signal through — from input to hidden layer to output — having a pretty darn good mental model of what's happening at each step, why each piece is there, and what would break if you removed it.

Resources