State Space Models (SSM) Explained: How They Work, Why They Beat Transformers, and What’s Next
AI Architecture · Deep Dive · 2026

State Space Models: How They Actually Work — From the Math to Mamba

📅 June 9, 2026 ⏱ 16 min read 🧮 Equations Explained 2,800+ words

01Why State Space Models Are the Most Important Architecture You’re Not Using

Every AI system you interact with today — GPT, Claude, Gemini — runs on Transformer architecture. Transformers are brilliant for what they do: they can attend to any token in context at once, making them powerful reasoners. But they carry a fundamental physical cost. Every new token generated must compare itself against every previous token. Double the context window, quadruple the compute. This is why running a 100,000-token context window costs so much money — and why truly long-context inference remains a solved-in-theory, broken-in-practice problem.

State Space Models (SSMs) take a completely different approach. Instead of remembering the whole past, they compress it. At every step, an SSM updates a fixed-size hidden state — a compact summary of everything that came before — and uses that to produce the next output. The past is not forgotten; it is continuously distilled. This single architectural decision gives SSMs O(n) compute and O(1) memory at inference — properties that no Transformer can match.

This guide explains exactly how that works: the math, the matrices, the tricks that make it trainable at scale, why the selective variant (Mamba) was the breakthrough that made SSMs competitive with Transformers on real language tasks, and where the field stands in 2026 with Mamba 3.

State Space Model architecture diagram showing hidden state evolution
State Space Model: input x(t) → hidden state h(t) → output y(t). The hidden state is the compressed memory that makes SSMs efficient.
O(n)
Training complexity
O(1)
Inference memory
5+
Model generations (S4→Mamba3)
100k+
Sequence lengths handled

02The Core Equations — What an SSM Actually Computes

Definition — State Space Model: A mathematical system that models how hidden internal state h(t) evolves over time in response to an input x(t), and how that state produces an observable output y(t). Originated in control engineering (Kalman filtering, 1960s), adapted for deep learning sequences in 2021.

An SSM is defined by two equations. That’s it. Everything — the entire architecture, every variant from S4 to Mamba 3 — is ultimately a rethinking of these two lines of math:

h'(t) = A · h(t) + B · x(t) State equation — “how memory updates”
y(t) = C · h(t) + D · x(t) Output equation — “how we read the memory”

The first equation says: the new hidden state is the old hidden state transformed by A, plus the new input scaled by B. The second says: the output is the hidden state read through C, plus a direct skip connection through D. Every token processed updates h(t). The history of all previous tokens lives — compressed — inside h(t).

Live SSM Signal Processing — Watch the Hidden State Track Input
Input x(t) Hidden state h(t) Output y(t)

The hidden state (purple) is a smoothed, compressed version of the input history — it never grows in size, regardless of sequence length.

The key insight: h(t) has a fixed size N, regardless of how long the sequence is. A 10-token sequence and a 1,000,000-token sequence produce the same size hidden state. Compare this to a Transformer’s KV cache, which grows by one entry per token, forever. This is the root of every efficiency advantage SSMs have.

Short Extractable Answer: A State Space Model is defined by two equations: h'(t) = Ah(t) + Bx(t) (state update) and y(t) = Ch(t) + Dx(t) (output). Matrix A controls how memory decays, B controls how input enters the state, C controls how the state is read, and D is a skip connection. The hidden state h(t) has fixed size N regardless of sequence length — giving O(1) inference memory.

03The Four Matrices — A, B, C, D Decoded

Everything an SSM learns lives in four matrices. Understanding what each one does intuitively is essential for understanding why SSM variants make the specific changes they do.

Click any matrix to see its role — Interactive
Controls memory decay & oscillation frequency
A
State Transition
How the hidden state evolves on its own. Controls what is remembered and what is forgotten over time.
How much each input dimension matters
B
Input Projection
How incoming tokens are mixed into the hidden state. Determines how strongly the current input updates memory.
Maps hidden state back to output space
C
Output Projection
How the hidden state is “read” to produce output. Determines which parts of memory are relevant for the current prediction.
Direct input bypass — often set to 0 or 1
D
Skip Connection
Direct path from input to output, bypassing the state entirely. Equivalent to a residual connection in the SSM formulation.
👆 Click a matrix above to see its detailed role in the SSM computation.
MatrixShapeRoleWhat happens if it’s wrong?
A N × N State dynamics — how memory evolves Unstable training: exploding or vanishing gradients
B N × 1 Input→state projection New tokens fail to influence the hidden state
C 1 × N State→output projection Rich hidden state but meaningless output
D 1 × 1 Direct input skip connection Usually zero — low sensitivity, safe to remove

The most critical matrix is A. If its eigenvalues are too large, the hidden state explodes. If too small, it forgets everything immediately. Getting A right — initializing it to have the right eigenvalue structure — is the core problem that HiPPO and S4 solved. In Mamba, A is diagonal (only N values stored, not N²), and B, C become input-dependent — the key to selectivity.

04Discretization — Bridging Continuous Math and Discrete Tokens

Definition — Discretization: The process of converting a continuous-time differential equation (h'(t) = Ah + Bx) into a discrete-time recurrence (h[k] = Āh[k-1] + B̄x[k]) suitable for processing sequences of tokens. The parameter Δ (delta) controls the time step size and is a learned parameter in Mamba.

Tokens are discrete. The SSM equations are continuous. This gap is bridged by discretization — converting the differential equation into a difference equation that steps forward one token at a time. The most common method is the Zero-Order Hold (ZOH):

Ā = exp(A · Δ) Discretized A — matrix exponential
= (A)⁻¹ · (ĀI) · B Discretized B
h[k] = Ā · h[k-1] + · x[k] Discrete state update — runs once per token
Token-by-Token State Update — Live Step Visualizer
Current state h[k]:
h[k]
Step 0

The parameter Δ (delta) is especially important in Mamba. It controls how much the model “focuses” on the current token vs relying on existing memory. A large Δ means the current input dominates; a small Δ means memory persists strongly. Making Δ input-dependent is what enables selective attention without the quadratic cost of Transformer attention.

05The Duality: Recurrent at Inference, Convolutional at Training

One of the most important — and least understood — properties of SSMs is that the same model can be computed two ways, and the optimal choice depends on the context:

RECURRENT MODE

🔄
Used at: Inference (token generation)

Process one token at a time. Each step: update h[k] using the previous h[k-1]. Requires only the last hidden state — O(1) memory. Cannot be parallelized (each step depends on previous).
O(1) memory Sequential Streaming

CONVOLUTIONAL MODE

Used at: Training (full sequence)

Unroll the recurrence into a convolution kernel K̄. Apply K̄ to the full input via FFT. The entire sequence is processed in parallel — O(n log n) compute. Enables GPU utilization.
O(n log n) Parallelizable GPU-friendly

This duality is the secret sauce of S4 and all subsequent SSMs. During training, the entire input sequence is known, so the model unfolds as a convolution and processes everything in parallel on GPU — matching Transformer training throughput. During inference, tokens arrive one at a time, so the model runs as a recurrence — O(1) memory, no KV cache growth, constant latency regardless of sequence length.

Direct Answer: SSMs are trained as convolutional networks (full sequence parallel, O(n log n)) and deployed as recurrent networks (one-token-at-a-time, O(1) memory). This duality is the fundamental reason SSMs can match Transformer training efficiency while being dramatically cheaper at inference — the best of both RNN and CNN computation.

06HiPPO: The Memory Initialization That Made SSMs Work

Definition — HiPPO (High-order Polynomial Projection Operators): A mathematically optimal initialization scheme for the A matrix that compresses input history into Legendre polynomial coefficients. Introduced by Gu et al. (2020), HiPPO changed the SSM A matrix from “random guess” to “principled memory” — jumping performance on sequential MNIST from 60% to 98% accuracy overnight.

The original problem with SSMs was simple: how do you initialize matrix A so it doesn’t cause vanishing or exploding gradients while also capturing long-range dependencies? Random initialization fails. Even well-conditioned random matrices lose track of inputs from hundreds of steps ago.

HiPPO solves this by grounding A in mathematics. The idea: represent the history of inputs as a projection onto orthogonal polynomial bases (Legendre polynomials). The A matrix is then derived analytically to maintain this projection optimally. The result is a hidden state that is provably a compressed, lossless representation of the recent input history — not just a hopeful approximation.

HiPPO Memory Compression — How the State Represents Input History

Each bar represents a Legendre coefficient — together they encode the full recent input history. Watch the state update as new tokens arrive:

Recent past → ← Distant past
Key insight: The S4 paper notes that “simply modifying an SSM from a random matrix A to the HiPPO matrix improved its performance on sequential MNIST from 60% to 98%” — a single initialization change, no architecture change.
  • HiPPO-LegS: uniform coverage — equal attention to all past inputs
  • HiPPO-LagT: exponentially decaying attention — recent tokens weighted more
  • S4 uses HiPPO-LegS with diagonal-plus-low-rank (DPLR) parameterization for efficient computation
  • Mamba simplifies to pure diagonal A (just storing N numbers, not N×N) while preserving the HiPPO spirit
  • Mamba 3 extends this with complex-valued diagonal A — recovering oscillatory dynamics that real-valued diagonals can’t capture

07S4: The Model That Proved SSMs Could Scale

The Structured State Space for Sequences (S4) model, published by Gu, Goel, and Ré in 2021, was the first SSM to match or beat Transformers on a diverse benchmark suite (Long Range Arena). It did this through three specific innovations stacked on top of HiPPO:

S4 Innovation Stack
DPLR Parameterization. The N×N matrix A is decomposed as diagonal plus low-rank — making the matrix exponential in the ZOH formula computable in O(N) instead of O(N³). This made S4 30× faster and 400× more memory-efficient than naive SSMs.
Convolutional kernel via Cauchy kernel. S4 computes the full convolution kernel K̄ using the Cauchy matrix, enabling efficient O(n log n) parallel training via FFT — the same trick used in fast polynomial multiplication.
HiPPO initialization for A. A is not learned from scratch — it is initialized to the HiPPO matrix and fine-tuned. This gives the model a principled memory structure from the first training step.

S4 achieved state-of-the-art on the Path-X benchmark (sequence length 16,384), demonstrated that 100,000+ token sequences were tractable, and inspired a generation of follow-on work: DSS, S4D, S5, H3, Hyena — each simplifying or extending the core architecture in different ways before Mamba made the next leap.

Trainable Sequence Length — SSM Generations vs Transformers
Transformer
4,096 tokens
~4K
LSTM/RNN
512 tokens
~512
S4 (2021)
16,384 tokens
16K+
Mamba (2023)
100K tokens
100K+
Mamba 3 (2026)
Millions+
∞ (O(1))

08Selective Scanning: How Mamba Made SSMs Context-Aware

Definition — Selective State Space Model (S6): An SSM where the B, C, and Δ matrices are functions of the current input token — not fixed learned parameters. This allows the model to dynamically choose what to remember and what to ignore, overcoming the linear time-invariant (LTI) limitation of S4. Called “selective scan” or “S6” in the Mamba paper.

S4 and all LTI SSMs share a critical weakness: the A, B, C matrices are fixed for every token. The model applies the same transformation to “the” as it does to “Transformer” as it does to “quantum”. It has no way to say “this token matters more than that one.” The state updates mechanically, equally, regardless of content.

This is exactly what Transformers do well: via attention weights, they can focus on specific past tokens. Mamba’s core innovation is recovering this capability without the quadratic cost, by making B, C, and Δ functions of the current input:

B(x) = Linear(x) Input-dependent state input — “how much to write”
C(x) = Linear(x) Input-dependent state output — “what to read”
Δ(x) = Softplus(Linear(x)) Input-dependent time step — “how much to forget old state”
Selective Scan — Which Tokens Update the State Strongly vs Weakly

Bright tokens: Δ is large — this token dominates the state update (like attending). Dim tokens: Δ is small — previous memory persists strongly (like forgetting). This happens per-token, per-channel, in parallel.

The cost of making B, C, Δ input-dependent is that the convolutional representation no longer applies — you cannot precompute a fixed kernel if the kernel changes every step. Mamba solves this with a hardware-aware parallel scan algorithm: a custom CUDA kernel that computes the selective recurrence in parallel using prefix scan operations, achieving S4-like training throughput without the fixed-kernel shortcut.

  • A remains diagonal and fixed (not input-dependent) — preserving stable gradients
  • B, C, Δ are all projected from the current token via small linear layers
  • Large Δ ≈ “pay attention to this token” (discretization zooms in)
  • Small Δ ≈ “ignore this token, preserve state” (discretization steps slowly)
  • This is mathematically equivalent to soft, content-based gating — without attention heads, without O(n²) cost
Short Extractable Answer: Mamba’s selective scan makes B, C, and Δ input-dependent — so each token can dynamically write to, read from, and gate the hidden state based on its content. Large Δ focuses the update on the current token (like attention); small Δ preserves existing memory. This replaces O(n²) attention with O(n) selective recurrence, achieving context-awareness without quadratic cost.

09SSM vs Transformer — A Precise Comparison

Complexity Scaling — Sequence Length Growth Impact
Transformer O(n²) SSM O(n) LSTM O(n) but sequential
PropertyTransformerSSM (S4/Mamba)LSTM/RNN
Training complexity O(n²) attention O(n log n) convolution O(n) but sequential
Inference memory O(n) KV cache — grows forever O(1) fixed hidden state O(1) hidden state
Inference speed per token Slows down as context grows Constant regardless of context Constant
Long-range dependencies Excellent (direct attention) Strong (HiPPO + selective) Weak (vanishing gradient)
Random access retrieval Strong — direct key-value lookup Moderate — compressed state Poor
Streaming input Expensive — re-attends all past Native — state update per token Native
Parallelism (training) Fully parallel Parallel scan (hardware kernel) Sequential only
State compression None — stores all tokens in KV Aggressive — fixed N dimensions Partial — gating mechanism
Theoretical grounding Empirical (attention is useful) Control theory + HiPPO mathematics Heuristic gating

10The SSM Family Tree — From S4 to Mamba 3

SSM Architecture Evolution Timeline
2020
HiPPO
Gu et al. prove that A can be initialized optimally using polynomial projection operators. Sequential MNIST jumps from 60% to 98%.
Foundation
2021
LSSL → S4
S4 introduces DPLR parameterization and the Cauchy kernel for O(N) matrix exponentiation. Processes sequences 16K+ tokens. 30× faster than LSSL.
First scalable SSM
2022
DSS, S4D, S5
Simplifications of S4: diagonal-only A (DSS/S4D), true MIMO formulation (S5). Faster to implement, slightly reduced theoretical power, comparable empirical results.
Simplification wave
2023
Mamba (S6)
Gu & Dao make B, C, Δ input-dependent — selective scanning. Hardware-aware parallel scan CUDA kernel. First SSM to outperform Transformers at 3B scale on language benchmarks.
Breakthrough
2024
Mamba 2
State Space Duality (SSD) framework unifies SSMs and linear attention. Multi-head structure, SMA (State Space Attention), faster training with tensor parallelism.
Unification
2026
Mamba 3 — ICLR 2026
Exponential-trapezoidal discretization + complex-valued A + MIMO formulation. +1.8pp accuracy over Gated DeltaNet, 50% smaller state, removes causal convolution entirely. Inference-first design.
State of the artICLR 2026

11Real-World Use Cases — Where SSMs Win

DomainWhy SSM excelsSpecific advantage
Long-document Q&A 100K+ token contexts fit in fixed memory No KV cache explosion — constant latency
Real-time audio/speech Streaming token-by-token O(1) Live transcription without buffer delays
Time-series forecasting Continuous dynamics modeled natively Better than Transformers on irregular intervals
Genomics / DNA sequences Sequences millions of tokens long Transformer is physically impossible; SSM scales
Agentic AI pipelines 100s of agents × constant memory = feasible Predictable memory per agent regardless of steps
Edge / mobile AI Constant RAM — no growth at runtime Fits on device without managed KV cache
RAG retrieval encoding Fixed-size embedding regardless of chunk length Consistent vector dimensions for any document size
  • Use SSMs when your sequences exceed 8K–16K tokens — Transformer KV cache becomes the bottleneck here
  • Prefer Transformers for tasks requiring precise retrieval of specific facts from a fixed-length context — SSM compression can lose exact token positions
  • Hybrid architectures (alternating SSM and attention layers) often give best-of-both results — consider this for general-purpose LLMs
  • For streaming inference (voice assistants, live analysis), SSM is the only architecture that maintains constant latency regardless of session length
  • In RAG pipelines, SSM-encoded embeddings have the advantage of fixed size regardless of input length — simplifying indexing

12FAQ — Every Real Question About State Space Models Answered

What is a State Space Model in simple terms?

Fact A State Space Model (SSM) is a sequence processing framework that maintains a fixed-size hidden state — a compact summary of all past inputs — and updates it one token at a time using matrix multiplications.

Think of it like a person with perfect short-term working memory but a fixed number of “memory slots.” Every new input updates those slots — more important things can overwrite less important things (in selective SSMs like Mamba). Unlike a Transformer, which stores the full conversation history in a growing KV cache, an SSM always uses the same amount of memory no matter how long the sequence. The tradeoff: if something important was long ago and has been gradually overwritten, it may no longer be recoverable — this is why SSMs historically struggled with exact retrieval tasks, though Mamba and Mamba 3 have largely addressed this with selective updating.

What do the A, B, C, D matrices actually do?

Fact In the SSM equations h'(t) = Ah + Bx and y = Ch + Dx: A controls how the hidden state evolves on its own (memory dynamics), B projects input into the state space, C projects the state back to output space, and D is a direct skip connection from input to output.

A is the most critical. Its eigenvalues determine stability: eigenvalues with negative real parts cause the state to decay (forget), while values near the imaginary axis create oscillatory memory (useful for periodic signals). The HiPPO initialization sets A to the mathematically optimal structure for remembering polynomial projections of input history. B and C are input-output projections that are fixed in LTI SSMs (S4) but become input-dependent in Mamba — this is the core of selective scanning. D is usually set to zero or one and adds a residual path similar to skip connections in ResNets.

Why is discretization (Δ) so important in Mamba?

Fact Δ (delta) is the time-step parameter that bridges continuous SSM theory and discrete token sequences. In Mamba, Δ is made input-dependent, acting as a content-based gating mechanism: large Δ means “absorb this token strongly into state,” small Δ means “let existing state persist, barely update.”

This is a profound insight. The discretization parameter, usually treated as a fixed hyperparameter in S4, becomes the model’s attention mechanism in Mamba — without any of the quadratic cost. When a token has large Δ, the discretized A approaches the identity matrix (Ā ≈ I), meaning the state is almost reset to the new input. When Δ is small, Ā ≈ A (original dynamics dominate), meaning memory is retained. Making Δ = Softplus(Linear(x)) gives the model full control over when to remember and when to update — per token, per dimension, per channel — all in O(n) time.

Can SSMs completely replace Transformers?

Fact Pure SSMs match or exceed Transformers on most language modeling benchmarks at 1B–3B scale, but Transformers retain advantages in tasks requiring precise random-access retrieval of specific tokens from a fixed known context — such as certain in-context learning and multi-hop reasoning tasks.

The current research consensus (2026) points toward hybrid architectures — models that alternate SSM layers with occasional attention layers — as the optimal design for general-purpose language models. Pure SSMs are the right choice for deployment-constrained scenarios: streaming inference, long contexts, edge devices, and agentic pipelines where O(1) memory is a hard requirement. Transformers remain superior for tasks that resemble database lookup (retrieving a specific fact verbatim from a known passage). The two architectures are complements, not pure substitutes.

What is the difference between S4, Mamba, and Mamba 3?

Fact S4 (2021) is a linear time-invariant SSM with fixed A/B/C matrices and HiPPO initialization. Mamba (2023) adds input-dependent B, C, Δ (selective scanning). Mamba 3 (ICLR 2026) adds complex-valued A, exponential-trapezoidal discretization, and MIMO formulation — together achieving +1.8pp accuracy over the prior best SSM with 50% smaller state.

The progression tracks increasing expressiveness at each step: S4 proves SSMs can train efficiently at scale via the convolutional representation. Mamba proves they can be context-aware via selective scanning. Mamba 2 unifies SSMs and linear attention into a single theoretical framework (State Space Duality). Mamba 3 shifts the design philosophy from training efficiency to inference efficiency — asking not “how do we train this on GPU?” but “how do we deploy this with the best quality-per-compute ratio?” The three innovations of Mamba 3 (complex A, new discretization, MIMO) work together to improve the quality-efficiency Pareto frontier without increasing inference latency.

What is the HiPPO matrix and why does it matter?

Fact HiPPO (High-order Polynomial Projection Operators) is a mathematically derived initialization for the SSM A matrix, constructed so that the hidden state always represents the optimal polynomial approximation of the input history. It was introduced by Gu et al. in 2020 and improved sequential MNIST accuracy from 60% to 98% without any other architectural change.

Without HiPPO, the A matrix must be randomly initialized — and random matrices either cause the hidden state to explode (eigenvalues > 1) or instantly forget everything (eigenvalues << 1). HiPPO solves this by grounding A in the mathematics of approximation theory: specifically, the problem of compressing a function into its Legendre polynomial coefficients updates exactly as the HiPPO-LegS matrix specifies. This means the hidden state is provably the best possible fixed-size representation of the input history — not just a heuristic. S4 uses HiPPO-LegS with DPLR structure. Mamba simplifies to diagonal A but retains the HiPPO spirit of spectral initialization. Mamba 3 extends this with complex-valued diagonal entries, recovering richer dynamics.

How do SSMs handle positional information without a positional encoding?

Fact SSMs encode positional information implicitly through the recurrence structure — each token is processed in order and the hidden state carries temporal context naturally, unlike Transformers where position is injected explicitly (sinusoidal, RoPE, ALiBi) because attention is permutation-equivariant without it.

Because the state update h[k] = Āh[k-1] + B̄x[k] processes tokens sequentially and the previous state h[k-1] depends on all prior tokens in order, the model inherently knows where it is in the sequence. Token 5’s representation in the hidden state is different from token 500’s even if the input token is identical — because the state it was added into was different. This means SSMs get position for free from the recurrent structure, without adding sinusoidal encodings, rotary embeddings, or any other positional machinery. In practice, SSMs may still benefit from relative position encodings for very long sequences, but the baseline position encoding requirement is lower than for Transformers.

Why are SSMs faster than Transformers on ARM hardware specifically?

Fact SSM inference (the recurrent mode) has much higher arithmetic intensity than Transformer KV-cache attention — meaning more useful math operations per byte of memory transferred. On ARM chips (Apple Silicon, AWS Graviton, Qualcomm), memory bandwidth is the bottleneck, so higher arithmetic intensity directly translates to faster throughput.

Transformer decoding is memory-bandwidth-bound: for each new token, the entire KV cache must be loaded from memory to compute attention scores — O(n) memory reads per token. SSM decoding is compute-bound: only the fixed-size hidden state (N parameters) is needed per step, and the matrix multiply Āh + B̄x keeps the ALUs busy. Modern SIMD instruction sets (NEON on ARM, AVX-512 on x86) are designed for exactly this kind of dense small-matrix math. This is also why TurboVec’s nibble-split lookup table maps perfectly to NEON — both TurboVec and Mamba inference are structured to maximize SIMD utilization on the same hardware.

13Where SSMs Go From Here

State Space Models began as a niche idea from control engineering, became a serious Transformer alternative with S4, crossed the quality threshold with Mamba’s selective scanning, and in 2026 stand as the most principled, theoretically grounded architecture for inference-efficient AI. The journey from the HiPPO matrix’s 60%→98% accuracy jump to Mamba 3’s 1.8pp advance over Gated DeltaNet is a story of mathematics steadily winning over heuristics.

The next frontier is clear: hybrid architectures that combine SSM layers’ O(1) inference memory with occasional attention layers’ retrieval precision; multi-modal SSMs that apply the same recurrent efficiency to vision, audio, and genomics simultaneously; and on-device inference where the constant-memory property finally makes truly local, private AI tractable on laptops and phones.

If you are building AI systems today — RAG pipelines, agentic workflows, real-time analysis, or long-context applications — understanding State Space Models is no longer optional. They are the architecture that defines what comes after the Transformer era.

Related reading: Mamba 3 vs Transformer Deep Dive · TurboVec & Google TurboQuant Guide · RAG Pipeline Implementation

Master the Architecture Replacing Transformers

Get our SSM implementation notebooks, Mamba 3 benchmarking scripts, and the complete sequence model comparison guide — free.

🚀 Get the Free Guides
logo

Oh hi there 👋
It’s nice to meet you.

Sign up to receive awesome content in your inbox.

We don’t spam! Read our privacy policy for more info.

Scroll to Top
-->