State Space Models & Mamba
I avoided state space models for longer than I'd like to admit. Every time I saw those equations — the A matrix, the discretization, the HiPPO operator — I'd nod along, close the tab, and go back to writing attention layers. Transformers worked. Why bother with something that felt like a control theory textbook? Eventually, the discomfort of watching Mamba benchmarks roll in, each one faster than the last, grew too great to ignore, and I finally took the dive. This section is what I found.
State space models (SSMs) are a family of sequence models borrowed from control theory — a branch of engineering that has modeled dynamic systems since the 1960s. They were brought into deep learning around 2020–2021, and reached the mainstream with the S4 paper (Gu et al., 2022) and then Mamba (Gu & Dao, 2023). Their promise: process sequences with linear complexity instead of the transformer's quadratic cost, while still capturing dependencies thousands of steps in the past.
Before we start, a heads-up. We're going to be building systems that involve differential equations, matrix exponentials, and polynomial projections — but you don't need to know any of that beforehand. We'll add the concepts we need one at a time, with explanation. If you've followed along with RNNs and attention in earlier sections, you have more than enough background.
This isn't a short journey, but I hope you'll be glad you came.
- The state space equations
- From continuous to discrete
- The convolution–recurrence trick
- HiPPO — teaching the state to remember
- S4 — making it practical
- Rest stop
- Mamba — the selective revolution
- The hardware story
- Mamba vs. Transformer
- Mamba-2 and the duality theorem
- Hybrid architectures
- Why this matters
A Thermostat That Learns
Imagine you're building a smart thermostat. Not the kind that follows a fixed schedule — one that actually understands the thermal dynamics of a house. It monitors the temperature every minute and tries to predict what happens next.
The house has a current temperature — let's say 68°F. The furnace is blasting at some intensity. Outside, it's 30°F and dropping. Given this information, what will the temperature be one minute from now?
If you've been following along with the earlier sections on RNNs, you might reach for a recurrent network. Feed in the current temperature, maintain a hidden state, predict the next value. That would work. But engineers who design actual thermostats have been solving this exact problem since the 1960s, and they have a different tool: state space equations. These equations describe how a system's internal state evolves over time, how inputs push it around, and how we observe its behavior.
Our thermostat is the running example that will thread through this entire section. It starts tiny — one room, one sensor, one input. By the end, it'll grow into something much more interesting.
The State Space Equations
Here's the core math. A continuous-time state space model is described by two equations:
# The state space equations (math, not code):
#
# x'(t) = A · x(t) + B · u(t) ← how the hidden state evolves
# y(t) = C · x(t) + D · u(t) ← what we observe
Let's unpack each symbol using our thermostat. The variable x(t) is the hidden state — everything the system "remembers" at time t. For our thermostat, this might be a small vector: [wall temperature, air temperature, residual heat in the furnace]. We can't directly measure all of these, but they influence what happens next.
The variable u(t) is the input signal — the furnace setting, maybe the outside temperature reading. And y(t) is the output — the temperature reading on the thermostat display.
Now the matrices. A governs how the state evolves on its own. In our thermostat, it captures physics: heat dissipates through walls at some rate, the furnace core cools when turned off, the air temperature drifts toward the wall temperature. B maps the input to the state — how much cranking up the furnace actually changes the internal heat. C maps the state to what we observe — the sensor reads air temperature, not wall temperature. D is a direct skip connection from input to output, and in practice it's often dropped.
The first equation says: the rate of change of the state (that's what the prime symbol x'(t) means — the derivative) depends on where the state currently is and what input is arriving. The second says: the output is a linear readout of the state.
I'll be honest — when I first saw these equations, I thought "this is a fancy linear system, how could it possibly model language?" We'll get there. The key realization is that the state x(t) is a compressed summary of the entire input history. Every input that has ever arrived left its fingerprint on x(t), decayed and mixed by the A matrix. That's the mechanism that makes SSMs relevant for sequence modeling.
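To make the symbols concrete, here's a tiny numeric instantiation of the thermostat. The numbers are made up for illustration — this is a sketch of the structure, not real thermal physics:

import numpy as np

# Toy thermostat (illustrative numbers, not real physics).
# State x: [wall temp, air temp, furnace core heat]
# Input u: [furnace setting, outside temp]
A = np.array([[-0.10,  0.05,  0.00],   # wall leaks heat, warmed by air
              [ 0.30, -0.40,  0.20],   # air drifts toward wall, heated by furnace
              [ 0.00,  0.00, -0.50]])  # furnace core cools on its own
B = np.array([[0.00, 0.02],            # outside temp leaks into the wall
              [0.00, 0.00],
              [1.00, 0.00]])           # furnace setting feeds the furnace core
C = np.array([[0.0, 1.0, 0.0]])        # sensor reads air temperature only (no D)

x = np.array([65.0, 68.0, 0.0])        # hidden state right now
u = np.array([0.8, 30.0])              # furnace at 80%, 30°F outside

x_dot = A @ x + B @ u                  # x'(t) = A·x(t) + B·u(t)
y = C @ x                              # y(t)  = C·x(t)  → 68.0°F on the display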
From Continuous to Discrete
Those equations describe a system evolving in continuous time — temperature changing smoothly, second by second. But our thermostat samples the temperature once per minute. Our language model processes one token at a time. We need to convert the continuous equations into discrete steps.
This process is called discretization, and the most common technique is the zero-order hold. The idea: assume the input u(t) stays constant between sample points. If the furnace is set to "high" when we sample at time k, assume it stays at "high" until the next sample at time k+1.
Under that assumption, we can solve the differential equation exactly and get discrete versions of A and B:
# Zero-order hold discretization:
#
# Ā = exp(A · Δ) ← matrix exponential
# B̄ = (Ā - I) · A⁻¹ · B ← integrated input effect
#
# Now the recurrence is:
# x_k = Ā · x_{k-1} + B̄ · u_k ← new state from old state + input
# y_k = C · x_k ← output readout
Look at that recurrence: x_k = Ā · x_{k-1} + B̄ · u_k. The new state is a linear combination of the previous state and the current input. If you squint, this looks exactly like a vanilla RNN — hidden state times a weight matrix plus input times another weight matrix.
And that's not a coincidence. An SSM is a kind of linear RNN. But there are two crucial differences that make it much more powerful than the RNNs we built in earlier sections. The first is that the A matrix isn't random — it comes from a carefully designed continuous-time system. The second difference is even more important, and it's what we'll explore next.
One thing about the step size Δ: it controls how much of the continuous dynamics happen between each discrete step. A large Δ means big jumps — the system evolves a lot between samples. A small Δ means tiny steps — the system barely changes. Think of it as the "playback speed" of the underlying continuous process. This knob will become very important when we get to Mamba.
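Here's a minimal sketch of zero-order hold in code. The dynamics matrix is an assumed toy, and I'm leaning on scipy for the matrix exponential:

import numpy as np
from scipy.linalg import expm  # matrix exponential

def discretize_zoh(A, B, delta):
    """Zero-order hold: exact under 'u is constant between samples' (A invertible)."""
    A_bar = expm(A * delta)                                 # Ā = exp(A·Δ)
    B_bar = np.linalg.solve(A, A_bar - np.eye(len(A))) @ B  # (Ā − I)·A⁻¹·B
    return A_bar, B_bar

# Toy 2-state dynamics (assumed numbers), sampled once per minute (Δ = 1).
A = np.array([[-0.3,  0.1],
              [ 0.2, -0.5]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
A_bar, B_bar = discretize_zoh(A, B, delta=1.0)

x = np.zeros((2, 1))
for u_k in [0.8, 0.8, 0.0, 0.0]:        # furnace on, on, off, off
    x = A_bar @ x + B_bar * u_k         # x_k = Ā·x_{k-1} + B̄·u_k
    y = C @ x                           # y_k = C·x_k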
The Convolution–Recurrence Trick
Here's the part that made me sit up in my chair.
We have a recurrence: x_k = Ā · x_{k-1} + B̄ · u_k. Let's trace what happens if we unroll it from the beginning, starting with x_0 = 0 (empty state). At step 1, the state is B̄ · u_1. At step 2, it's Ā · B̄ · u_1 + B̄ · u_2. At step 3, it's Ā² · B̄ · u_1 + Ā · B̄ · u_2 + B̄ · u_3.
The output at step k, after applying C, is:
# y_k = C · Ā^(k-1) · B̄ · u_1
# + C · Ā^(k-2) · B̄ · u_2
# + ...
# + C · B̄ · u_k
#
# This is a weighted sum of all past inputs!
# The weights form a fixed kernel:
#
# K = [C·B̄, C·Ā·B̄, C·Ā²·B̄, ..., C·Ā^(L-1)·B̄]
#
# And y = K ∗ u (where ∗ means convolution)
Let me make sure this lands. The output at every position is a weighted combination of all previous inputs, and the weights are determined entirely by the matrices A, B, C and the step size Δ. Those weights form a fixed pattern — a convolution kernel. And convolutions can be computed in parallel using the Fast Fourier Transform (FFT), in O(n log n) time.
This is the magic: the same model has two faces.
During training, when you have the entire sequence available, you compute the kernel K once, then convolve it with the full input. The FFT makes this parallel and fast — no sequential bottleneck, no vanishing gradients from unrolling through time.
During inference, when tokens arrive one at a time, you run the recurrence: take the previous state, multiply by Ā, add B̄ times the new input, read out with C. Each step costs O(1) in time and memory. The state has a fixed size regardless of how many tokens you've processed.
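You can verify that the two faces agree numerically. A minimal sketch with a random diagonal SSM — np.convolve stands in for the FFT version here (same result, simpler code):

import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 64                                  # state size, sequence length
A_bar = np.diag(rng.uniform(0.5, 0.95, N))    # stable diagonal Ā, for simplicity
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)

# Face 1: the recurrence (inference mode) — O(1) state per step.
x, y_rec = np.zeros((N, 1)), []
for k in range(L):
    x = A_bar @ x + B_bar * u[k]
    y_rec.append((C @ x).item())

# Face 2: the convolution (training mode) — one fixed kernel, applied in parallel.
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item()
              for k in range(L)])
y_conv = np.convolve(u, K)[:L]

assert np.allclose(y_rec, y_conv)             # two faces, same model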
Compare this to our two earlier sequence models. RNNs are recurrences — they can do O(1) per step at inference, but they're painfully sequential during training and suffer from vanishing gradients. Transformers are parallel during training (great!), but their attention mechanism costs O(n²) for a sequence of length n, and their KV cache grows linearly at inference.
SSMs get O(n log n) parallel training and O(1)-per-step inference. I remember reading this and thinking "there has to be a catch." There is. But it's not where I expected.
HiPPO — Teaching the State to Remember
The catch is the A matrix. If you pick a random A, the recurrence either explodes (eigenvalues larger than 1 amplify previous states exponentially) or decays (eigenvalues smaller than 1 shrink old inputs to zero). Either way, the model forgets the distant past. This is the same vanishing/exploding gradient problem that plagued vanilla RNNs.
Back to our thermostat. Imagine the system needs to remember that a window was opened 3 hours ago, because the room still hasn't fully recovered. A random A matrix would have lost that information within a few time steps. We need an A matrix that's specifically designed to preserve long-range history.
This is where HiPPO comes in — the High-order Polynomial Projection Operator. The name sounds intimidating, but the intuition isn't.
Think of it this way. You're watching a stock price over the course of a day. At any given moment, you want to be able to recall the general shape of the price curve over the entire day so far — not every tick, but the trend, the bumps, the overall trajectory. One way to do this is to fit a polynomial to the price history. A degree-0 polynomial gives you the average. Degree-1 gives you the average plus a slope. Degree-2 adds curvature. Higher degrees capture finer details.
HiPPO says: at every time step, maintain a set of coefficients that represent the best polynomial fit to the entire input history. Specifically, it uses Legendre polynomials — a family of orthogonal polynomials that mathematicians have studied for centuries. Orthogonal means each coefficient captures independent information about the history, with no redundancy.
The beautiful part: there exists a specific matrix A (the HiPPO matrix) such that if you run the recurrence x_k = Ā · x_{k-1} + B̄ · u_k with this particular A, the state x_k automatically encodes the optimal polynomial approximation of everything you've seen so far. The state is, in a provable mathematical sense, the best possible fixed-size compression of the input history.
I'm still developing my intuition for why this particular matrix has this property — the derivation involves measure theory and function approximation in ways that take a while to absorb. But the practical takeaway is clear: models initialized with HiPPO can capture dependencies thousands of steps in the past, where vanilla RNNs and even LSTMs fail completely.
The A matrix in HiPPO isn't a learned weight in the usual sense. It's more like an architectural choice — a carefully designed inductive bias that gives the model a head start on the problem of remembering. Think of it as the difference between trying to memorize a phone number by staring at it (random A) versus using a mnemonic system (HiPPO). Same brain, radically different retention.
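If you want to see what this matrix actually looks like, the construction is surprisingly short. This follows the HiPPO-LegS recipe as presented in The Annotated S4 (linked in the resources below):

import numpy as np

def make_hippo(N):
    p = np.sqrt(1 + 2 * np.arange(N))        # sqrt(2n + 1) for each row/column
    A = p[:, None] * p[None, :]              # sqrt((2n+1)(2k+1)) everywhere
    A = np.tril(A) - np.diag(np.arange(N))   # keep the lower triangle, adjust diagonal
    return -A                                # negated so states decay instead of exploding

print(make_hippo(4))                         # (values rounded)
# [[-1.     0.     0.     0.   ]
#  [-1.732 -2.     0.     0.   ]
#  [-2.236 -3.873 -3.     0.   ]
#  [-2.646 -4.583 -5.916 -4.   ]]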
S4 — Making It Practical
HiPPO gave us the right A matrix. The convolution–recurrence duality gave us two modes of computation. But there was an engineering problem: computing the convolution kernel K = [C·B̄, C·Ā·B̄, C·Ā²·B̄, ...] requires repeated matrix powers of Ā, and for large state dimensions and long sequences, this is expensive.
S4 — the Structured State Space sequence model (Gu et al., 2022) — solved this by exploiting the specific structure of the HiPPO matrix. The HiPPO A is not an arbitrary matrix; it can be decomposed into a diagonal-plus-low-rank (DPLR) form. S4 uses that form to reduce the kernel computation to Cauchy kernel evaluations from numerical linear algebra, cutting the cost dramatically. The details are dense, but the effect is practical: you can compute the SSM kernel for sequences of thousands of tokens in a reasonable amount of time on a GPU.
S4 was the proof that SSMs could compete with transformers on real benchmarks. The Long Range Arena — a suite of tasks designed to test long-range dependency modeling — became the proving ground. Tasks like classifying a document by its full content (sequence length 1K–4K), determining whether two points in an image are connected by a path (sequence length 1K–16K), and matching patterns in long byte-level sequences. Transformers struggled on several of these tasks. S4 crushed them, often by wide margins, while using a fraction of the compute.
But S4 had a fundamental limitation that I glossed over earlier, and it's time to confront it.
The matrices A, B, and C in S4 are fixed. They don't change based on what the model is looking at. Every token in the sequence gets processed with the exact same dynamics. The word "the" and the word "emergency" pass through identical state transitions. The model has no ability to selectively pay more attention to some inputs and less to others.
For our thermostat, this would be like processing the signal "window opened" with the same weight as "furnace humming at steady state." One of those signals demands immediate state changes. The other is background noise. A fixed-parameter SSM can't tell the difference.
🛑 Rest Stop
If you've made it this far, congratulations — you now have a solid mental model for state space models. You understand the core equations (state update + output readout), the discretization step that converts continuous math into token-by-token processing, the convolution–recurrence duality that gives SSMs their unique advantage over both RNNs and transformers, and the HiPPO framework that teaches the state to remember.
That's a genuinely useful understanding. If someone asks you "what's the deal with SSMs?" in an interview or a discussion, you can explain the entire concept ladder: control theory origins → discretization → dual computation → structured initialization. You'd be ahead of most practitioners.
But the story isn't complete. The limitation we ended on — fixed parameters that can't adapt to content — is a real problem. It means S4-style SSMs are powerful compressors but poor selectors. They remember everything equally, when what we really want is a model that decides what to remember.
The short version of what comes next: Mamba makes the SSM parameters depend on the input, giving it content-aware filtering. This breaks the convolution trick, so Mamba compensates with a clever hardware-aware algorithm. The result is transformer-quality modeling at linear cost. There. That's the 80% summary.
But if the discomfort of not knowing how that actually works is nagging at you, read on.
Mamba — The Selective Revolution
Mamba (Gu & Dao, December 2023) makes one change to the SSM framework. One change. And it transforms the entire field.
In S4, the matrices B, C, and the step size Δ are learned once during training and then applied identically to every token. In Mamba, B, C, and Δ become functions of the current input.
# S4 (fixed parameters):
# x_k = Ā · x_{k-1} + B̄ · u_k ← same Ā, B̄ for every token
# Mamba (selective parameters):
# B_k = Linear(u_k) ← B depends on this token
# C_k = Linear(u_k) ← C depends on this token
# Δ_k = softplus(Linear(u_k)) ← step size depends on this token
# Ā_k, B̄_k = discretize(A, B_k, Δ_k) ← discretize with token-specific params
# x_k = Ā_k · x_{k-1} + B̄_k · u_k ← dynamics change per token!
Let me connect this back to our thermostat. In S4, the system would process every temperature reading with the same dynamics — a reading of 68°F during normal operation would update the state the same way as a sudden spike to 95°F indicating a fire. In Mamba, the system looks at the incoming reading and adjusts its behavior on the fly. A boring, expected reading? Small Δ — barely update the state. An alarming anomaly? Large Δ — write this strongly into memory.
The step size Δ is the crucial lever here. When Δ is large, the discretized matrix Ā_k shrinks toward zero (A's eigenvalues are negative, so exp(A · Δ) decays as Δ grows), which means the model forgets more of the old state and writes the new input more forcefully. When Δ is small, the model preserves the existing state and barely notices the new input.
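A couple of lines of arithmetic make this concrete. With a scalar state and A = −1:

import numpy as np

# How much of the old state survives one step, as a function of Δ (scalar A = −1):
for delta in (0.01, 1.0, 5.0):
    print(f"Δ = {delta:>4}: keep {np.exp(-delta):.3f} of the old state")
# Δ = 0.01: keep 0.990 → barely notice this token
# Δ =  1.0: keep 0.368 → moderate update
# Δ =  5.0: keep 0.007 → overwrite: imprint this token strongly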
I'll be honest — this confused me at first. I expected "selective" to mean something like attention scores, where the model explicitly compares pairs of tokens. It's not that. It's more subtle. The selection happens at the write step: the model decides, for each token, how strongly to imprint it onto the running state. Tokens the model deems important get written in with large Δ. Tokens it deems irrelevant pass through with almost no effect. It's content-based filtering — but with O(n) complexity instead of O(n²).
A Mamba block also includes a short convolution (kernel size 3 or 4) before the selective scan. This handles very local patterns — adjacent-token interactions — cheaply, without needing the SSM machinery. Think of it as a tiny preprocessing step: the convolution captures "the cat" as a local bigram, and the SSM captures "the cat mentioned three paragraphs ago."
The full Mamba block also has a gated architecture. The input is split into two paths: one goes through the convolution and selective SSM, the other is held as a gate. They're multiplied together at the end, similar to the gated linear units used in many modern architectures. This gives the model a way to modulate the SSM output with a direct signal from the input.
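Putting the selective pieces together, here's a naive single-channel reference of the selective scan. The projection weights are made up, A is diagonal, and there's none of the fused-kernel machinery (that's the next section) — a sketch of the math, not the real implementation:

import numpy as np

def selective_scan_ref(u, A, Wb, Wc, Wd):
    """Naive per-token scan for one channel. A: (N,) fixed negative diagonal;
    Wb, Wc: (N,) and Wd: scalar make B, C, Δ functions of the current token."""
    x, ys = np.zeros(len(A)), []
    for u_k in u:
        delta = np.logaddexp(0.0, Wd * u_k)   # softplus → token-dependent step size
        B_k, C_k = Wb * u_k, Wc * u_k         # token-dependent input/output maps
        A_bar = np.exp(delta * A)             # ZOH is elementwise for diagonal A
        B_bar = (A_bar - 1.0) / A * B_k       # (Ā − I)·A⁻¹·B, elementwise
        x = A_bar * x + B_bar * u_k           # dynamics change per token
        ys.append(C_k @ x)
    return np.array(ys)

y = selective_scan_ref(u=np.array([0.1, 2.0, -0.3]),
                       A=-np.arange(1, 5, dtype=float),   # N = 4, all negative
                       Wb=np.ones(4), Wc=np.ones(4), Wd=1.0)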
The Hardware Story
Here's where things get interesting — and where Mamba becomes a systems paper, not just a math paper.
Remember the convolution–recurrence duality from S4? That duality relied on the kernel being fixed — the same for every token. Once you make B, C, and Δ depend on the input, the kernel changes at every position. You can no longer precompute a single kernel and use FFT. The convolution mode is gone.
Without the convolution shortcut, you're stuck with the recurrence: x_k = Ā_k · x_{k-1} + B̄_k · u_k. And recurrences are sequential — step k depends on step k-1. Exactly the problem that made RNN training so slow.
Mamba's solution is a hardware-aware parallel scan. The parallel scan (also called a parallel prefix sum) is a well-known algorithm from the 1980s — it can compute cumulative operations like running sums in O(log n) parallel steps instead of O(n) sequential steps. Mamba adapts this to the SSM recurrence.
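The reason a scan parallelizes at all is that the recurrence can be phrased as an associative operator. A scalar toy illustrates the idea (the real kernel runs this tree-style across GPU threads):

import numpy as np

# x_k = a_k·x_{k-1} + b_k, packaged as an associative "combine" of (a, b) pairs.
def combine(s1, s2):
    a1, b1 = s1
    a2, b2 = s2
    return (a1 * a2, a2 * b1 + b2)   # apply s1 first, then s2

a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, 2.0, 0.5, 1.5]

# Sequential fold, left to right (what a plain RNN does):
acc = (1.0, 0.0)                     # identity element: x stays put
for step in zip(a, b):
    acc = combine(acc, step)

# Associativity means any bracketing agrees — so pairs can combine in parallel:
left = combine((a[0], b[0]), (a[1], b[1]))
right = combine((a[2], b[2]), (a[3], b[3]))
assert np.isclose(combine(left, right)[1], acc[1])   # same x_4 either way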
But here's the thing that made the difference: knowing the algorithm isn't enough. Modern GPUs have a hierarchy of memory — registers (fastest, tiniest), SRAM on each streaming multiprocessor (fast, small, ~20MB total on an A100), and HBM or high-bandwidth memory (large, 40–80GB, but much slower to access). The naive implementation of a parallel scan would bounce data between HBM and SRAM repeatedly, wasting most of the time on memory transfers rather than computation.
Tri Dao (co-author of both FlashAttention and Mamba — not a coincidence) designed the selective scan kernel to keep the state in SRAM as much as possible. The sequence is split into chunks that fit in SRAM. Within each chunk, the scan runs entirely in fast on-chip memory. Between chunks, only the boundary states move to HBM. The computation is fused — discretization, the scan, the gating, and the output projection all happen in one kernel call, minimizing memory round-trips.
I haven't figured out a great way to visualize the memory hierarchy optimization, but here's a crude attempt at the mental model: imagine doing a long math problem. If all your scratch paper fits on your desk (SRAM), you work fast. If you have to keep walking to a filing cabinet across the room (HBM) to retrieve intermediate results, you slow to a crawl. Mamba's implementation keeps the scratch paper on the desk.
The result: Mamba's selective scan runs at about 3× the throughput of an equivalent transformer at sequence length 2K, and the gap widens as sequences get longer. At 64K tokens, the advantage is closer to 5×.
Mamba vs. Transformer
Let's lay them side by side. Not to declare a winner — the honest answer is "it depends" — but to understand the trade-offs.
Training cost. A transformer's self-attention computes an n × n score matrix. That's O(n²) in both time and memory. For 512 tokens, manageable. For 100K tokens — the length you'd need for a full codebase or a novel — it's brutal. Mamba's selective scan is O(n). Double the sequence, double the cost. That's the headline number.
Inference cost. When generating one token at a time, a transformer must consult its KV cache — every previously generated key and value. That cache grows linearly with context length. Generating the 50,000th token requires reading 50,000 key-value pairs. Mamba maintains a fixed-size state vector, regardless of how many tokens have been processed. The 50,000th token costs the same as the 5th.
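Some rough numbers, assuming a 7B-class model in fp16 — the dimensions here are my assumptions for illustration, not any published spec:

# Transformer: the KV cache stores K and V for every layer, head, and past token.
n_layers, n_kv_heads, head_dim, seq_len, bytes_fp16 = 32, 8, 128, 50_000, 2
kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16
print(f"KV cache at 50K tokens: {kv_cache / 1e9:.1f} GB")    # ≈ 6.6 GB, still growing

# Mamba: a fixed-size state per layer, whatever the context (small conv state ignored).
d_model, d_state, expand = 4096, 16, 2
ssm_state = n_layers * expand * d_model * d_state * bytes_fp16
print(f"SSM state at any length: {ssm_state / 1e6:.1f} MB")  # ≈ 8.4 MB, constant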
Long-range retrieval. This is where transformers shine. Attention gives every token a direct path to every other token. If the answer to a question is buried in token 47 of a 10,000-token document, attention can find it by looking right at it. Mamba must have compressed the information from token 47 into its fixed-size state — and whether that compression preserved the right details depends on the learned dynamics. In practice, Mamba does well on most long-range benchmarks, but there are tasks — particularly "needle in a haystack" retrieval — where attention has an edge.
In-context learning. Transformers are remarkably good at learning patterns from examples provided in the prompt. This ability seems connected to the attention mechanism's ability to do explicit retrieval and comparison. Mamba can do in-context learning too, but the results are less studied and, in some settings, less robust. This is an active area of research.
Ecosystem. Transformers have years of accumulated engineering wisdom. HuggingFace, vLLM, TensorRT, quantization tools, RLHF pipelines, distributed training recipes — all built around attention. Mamba's ecosystem is growing but young. If you're shipping to production tomorrow, the transformer ecosystem is a significant practical advantage.
Proven scale. GPT-4, Claude, Gemini, LLaMA 3 — every frontier model is a transformer. Mamba has been validated convincingly up to around 3B parameters, with some results at larger scales. The scaling laws are encouraging but not yet proven at the 100B+ scale where transformers have demonstrated emergent capabilities.
My favorite thing about this comparison is that neither architecture dominates on every axis. Transformers are better at precise retrieval; Mamba is better at long-sequence efficiency. Transformers have a battle-tested ecosystem; Mamba has a more favorable scaling trajectory. This is exactly the kind of situation that leads to hybrids — which is what happened.
Mamba-2 and the Duality Theorem
Six months after Mamba, Gu and Dao dropped another paper that changed how we think about the relationship between SSMs and attention. Mamba-2 introduces what they call Structured State Space Duality (SSD): a mathematical proof that selective SSMs and a certain form of attention are, under specific parameterizations, computing the same thing.
The intuition: attention computes a weighted sum of values, where the weights come from comparing queries and keys. The SSM recurrence computes a weighted sum of past inputs, where the weights come from the state transition matrices. Mamba-2 shows that you can write both computations in a common framework — they're dual views of the same operation.
Why does this matter? Because it lets you pick the best implementation for each setting. For short sequences, the attention-like view (materialized as a matrix product) is faster on GPUs. For long sequences, the SSM recurrence view is more efficient. Mamba-2 can dynamically choose between them based on the hardware and sequence length.
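Here's the flavor of the duality in miniature, for a scalar SSM with fixed parameters. (The actual SSD result extends this to the selective case via semiseparable matrices — this sketch only shows the two views coinciding.)

import numpy as np

rng = np.random.default_rng(1)
L, a, b, c = 6, 0.9, 0.7, 1.3
u = rng.standard_normal(L)

# Recurrent view: O(1) state, one step at a time.
x, y_rec = 0.0, []
for u_k in u:
    x = a * x + b * u_k
    y_rec.append(c * x)

# "Attention-like" view: materialize the lower-triangular mixing matrix
# M[i, j] = c · a^(i−j) · b — each output is a weighted look at every past input.
i, j = np.indices((L, L))
M = np.where(i >= j, c * a ** (i - j) * b, 0.0)
assert np.allclose(M @ u, y_rec)              # dual views, identical output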
The practical result: Mamba-2 is 2–8× faster than Mamba-1 on typical workloads, with the same or better model quality. The implementation is also cleaner — the connection to attention makes it easier to use existing transformer infrastructure (optimizers, parallelism strategies, quantization) with SSM-based models.
I'll be honest — the full SSD proof involves structured matrices and semi-separable forms that I'm still working through. But the high-level takeaway is profound: attention and state spaces are not rival paradigms. They're different views of the same underlying computation. The debate isn't "which is better?" but "which implementation is faster for this particular setting?"
Hybrid Architectures
If SSMs and attention are two views of the same thing, the obvious question is: why not use both?
Jamba (AI21 Labs, 2024) is the clearest answer. It's a 52-billion parameter model that interleaves Mamba layers with transformer layers, wrapped in a Mixture of Experts (MoE) framework. Only about 12 billion parameters are active for any given token, and the model supports a 256K-token context window.
The architecture is roughly 1 attention layer for every 7 Mamba layers. The Mamba layers handle the bulk of sequence processing — compressing information, maintaining the running state, doing the heavy lifting of "reading" the input. The occasional attention layer acts as a precision instrument — when the model needs to look up a specific fact buried somewhere in the context, attention's direct token-to-token path handles that retrieval.
Back to our thermostat analogy one last time. Imagine monitoring a building with 1,000 sensors over a full year. Most of the time, the readings follow predictable patterns — the SSM layers handle this efficiently, compressing months of routine data into a manageable state. But occasionally, you need to answer a specific question: "What was the temperature in room 307 at 2:47 PM on March 15th?" For that, you need the attention layer's ability to reach back and pull out an exact data point.
Jamba isn't the only hybrid. Google DeepMind's Griffin and Hawk models use recurrence-based layers that compete with transformers on standard benchmarks. RWKV takes a related approach with linear attention, scaling to 14B+ parameters. The trend is clear: the future isn't "transformers vs. SSMs." The future is architectures that use each mechanism where it's strongest.
# Conceptual hybrid: Jamba-style interleaving.
# AttentionLayer, MambaLayer, and embed are placeholders for real implementations.
class HybridModel:
    def __init__(self, d_model, n_layers, attn_every=8):
        self.layers = []
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                # Sparse attention layers for precise retrieval
                self.layers.append(AttentionLayer(d_model))
            else:
                # Mamba layers for efficient sequence compression
                self.layers.append(MambaLayer(d_model))

    def forward(self, tokens):
        x = embed(tokens)
        for layer in self.layers:
            x = layer(x)  # each layer: either fast SSM or precise attention
        return x

# The ratio matters — Jamba found ~1:7 (attention:mamba) captures most of
# attention's retrieval ability at a fraction of the compute.
Why This Matters
You might be wondering: if transformers work well enough, why should I care?
Because the sequences are getting longer. Codebases are hundreds of thousands of tokens. Genomic sequences — DNA strands where a model needs to understand regulatory elements — can be millions of base pairs. Audio waveforms at 16kHz produce 960,000 samples per minute. A full day of sensor data from our imaginary building thermostat, sampled every second, is 86,400 time steps.
At O(n²), attention on a million-token sequence would require roughly a trillion operations for a single layer. At O(n), the SSM handles it with a million. That's not an optimization — it's the difference between possible and impossible.
There are domains where SSMs already make more practical sense than transformers: long-document processing, time-series forecasting at high frequency, genomics and protein modeling, continuous audio processing, and any application where you need inference with constant memory — like running a language model on an edge device where the KV cache can't grow without bound.
If you're still with me, thank you. I hope it was worth the journey.
We started with a thermostat and the control theory equations that describe how systems evolve over time. We discretized those equations and discovered they look like a linear RNN — but one with a mathematical backbone that gives it two faces: a convolution for fast parallel training and a recurrence for efficient streaming inference. We learned that the HiPPO framework provides a principled initialization that lets the state remember the distant past, and that S4 made this practically computable. Then Mamba came along and added the ability to select — to decide, based on content, what's worth remembering and what's noise. And we saw the field converging toward hybrid architectures that use the best tool for each layer.
My hope is that the next time you see a paper comparing "linear complexity" sequence models to transformers, or someone mentions Mamba in a discussion about LLM efficiency, instead of that vague sense of "I should learn about this eventually," you'll have a clear mental model of what's happening under the hood — from the state equations all the way up to the hardware-aware scan.
Resources
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces — the O.G. Mamba paper by Gu & Dao. Dense but essential reading.
- Efficiently Modeling Long Sequences with Structured State Spaces (S4) — the paper that started it all for deep learning SSMs. Wildly influential.
- Transformers are SSMs (Mamba-2) — the duality paper that unifies SSMs and attention. Changes how you think about both.
- The Annotated S4 by Sasha Rush — an unforgettable walkthrough with code. If you want to implement an SSM from scratch, start here.
- Jamba: A Hybrid Transformer-Mamba Language Model — AI21's production hybrid. Proof that SSMs and attention can live together.
- A Visual Guide to Mamba and State Space Models by Maarten Grootendorst — insightful visual explanations if the math feels heavy.