A question most inference tutorials skip
Here is a clean, specific question. You have a 7-billion parameter language model in FP16. You have an H100 SXM — Nvidia's top-of-the-line Hopper GPU, 989 TFLOPs of BF16 compute, 3.35 TB/s HBM3 bandwidth. You want to decode one token. How long does it take, at best?
Decoding one token requires a forward pass through the model. For a dense transformer, the forward pass takes roughly $2N$ FLOPs, where $N$ is the parameter count: $t_{\text{compute}} = 2N / P_{\text{peak}} = (2 \times 7 \times 10^9) / (989 \times 10^{12}) \approx 14\,\mu\text{s}$.
Fourteen microseconds. Seventy thousand tokens per second. That would be incredible.
It is also completely wrong. Real H100 inference on a 7B model in FP16 is roughly 170 tokens per second, or about 5.9 milliseconds per token. That's 400 times slower than the compute bound.
The compute model missed the actual binding constraint: to compute the forward pass, the GPU has to read every weight in the model from HBM into the cores. Once per token.
The bandwidth bound is about 4.2 milliseconds: $7 \times 10^9$ parameters at 2 bytes each is 14 GB of weights, and $14\,\text{GB} / 3.35\,\text{TB/s} \approx 4.2\,\text{ms}$. That is consistent — within scheduling and kernel-launch overhead — with the real 5.9 ms number. The compute was never the bottleneck; the memory bus was.
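Both bounds fall out of four numbers. A back-of-envelope sketch, using the H100 SXM figures quoted above (an estimate, not a benchmark):

```python
# Compute bound vs bandwidth bound for a 7B FP16 model on an H100 SXM.
PARAMS = 7e9              # parameters
BYTES_PER_PARAM = 2       # FP16
PEAK_FLOPS = 989e12       # BF16 FLOPs/s
PEAK_BW = 3.35e12         # HBM3 bytes/s

flops_per_token = 2 * PARAMS                    # ~2 FLOPs per parameter per token
compute_bound_s = flops_per_token / PEAK_FLOPS  # if compute were the limit
memory_bound_s = PARAMS * BYTES_PER_PARAM / PEAK_BW  # one full weight read

print(f"compute bound: {compute_bound_s * 1e6:.1f} us/token")  # ~14.2 us
print(f"memory bound:  {memory_bound_s * 1e3:.2f} ms/token")   # ~4.18 ms
print(f"ratio:         {memory_bound_s / compute_bound_s:.0f}x")
```

The ratio of the two bounds is about 295× — the same number that will reappear below as the roofline ridge point, which is no coincidence.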
This is the memory wall. It is the single most important fact in Act VIII. Every optimisation you are about to learn — PagedAttention, FlashAttention, KV-cache quantization, speculative decoding, chunked prefill — is a specific attack on this number. If you don't understand where the 4.2 ms comes from, the rest of the act reads like a pile of disconnected tricks.
The memory pyramid — a tour from register to HBM
Every GPU has a hierarchy of memory levels. Each level is faster but smaller than the one below it. The top levels sit inside the compute units; the bottom levels are physically separate chips connected by a bus. The further you go down the pyramid, the slower access gets, and the more each byte costs you in time.
Click through the layers on the right — the numbers are the actual H100 figures. These are the constraints every inference engine is actually optimising against.
The roofline — one picture, every bottleneck
Now we can state the memory wall precisely. A GPU has two peak numbers:
- Peak compute (FLOPs per second) — how fast it can multiply-accumulate if nothing else gets in the way.
- Peak bandwidth (bytes per second) — how fast it can move bytes from HBM to the compute units.
An operation has an arithmetic intensity $I$: FLOPs divided by bytes of data the op has to touch, $I = \text{FLOPs} / \text{bytes}$.
If your intensity is low (memory-bound), you get $P = B \cdot I$: performance grows linearly with intensity. If your intensity is high (compute-bound), you get $P = P_{\text{peak}}$: the peak, flat ceiling. The crossover happens at the ridge point: $I^{*} = P_{\text{peak}} / B$.
For an H100: $P_{\text{peak}} = 989$ TFLOPs/s, $B = 3.35$ TB/s, so $I^{*} = 989 / 3.35 \approx 295$ FLOPs per byte. Any operation with arithmetic intensity below 295 is memory-bound on an H100. LLM decode has an intensity of about 1 (two FLOPs per FP16 weight byte). It is 295 times below the ridge point.
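The entire roofline model is a single `min`. A minimal sketch, using the H100 figures above:

```python
def attainable_tflops(intensity, peak_tflops=989.0, bw_tb_s=3.35):
    """Roofline: attainable performance is the lesser of the compute roof
    and the bandwidth slope (bandwidth x arithmetic intensity)."""
    return min(peak_tflops, bw_tb_s * intensity)

ridge = 989.0 / 3.35                 # ~295 FLOPs/byte: where the slope meets the roof
decode = attainable_tflops(1.0)      # LLM decode, AI ~1 -> bandwidth-limited, ~3.35 TFLOPs/s
gemm = attainable_tflops(500.0)      # big dense GEMM -> sits on the flat roof, 989 TFLOPs/s

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"decode:      {decode:.2f} TFLOPs/s of a {989:.0f} TFLOPs/s machine")
```

Decode achieves 3.35 of 989 available TFLOPs/s — about 0.3% of peak — purely because of where it sits on the intensity axis.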
The interactive plot below is the roofline. Pick a GPU on the right. Each workload marker sits at its arithmetic intensity; its y-value is the achievable performance. Notice how LLM decode lives in the steep bandwidth-limited region and is at the mercy of $B$, not $P_{\text{peak}}$.
Decode vs prefill — different sides of the roofline
Look at where the four markers sit on the plot. Notice what happens as arithmetic intensity changes:
- Element-wise ops (ReLU, add, normalise) have AI ≈ 0.25. Almost no compute per byte. They sit on the floor of the bandwidth-limited region.
- LLM decode has AI ≈ 1. Decoding one token reads all weights once and does 2 FLOPs per parameter. In FP16, that's 2 FLOPs per 2 bytes = 1 FLOP/byte.
- LLM prefill (batched over many tokens) has AI ≈ 64 or higher. The same weight read is amortised over many token computations, so AI grows with batch size and sequence length; with enough tokens in flight it crosses the ridge point, which is why prefill is usually compute-bound.
- Dense BLAS matmul (e.g., GEMM in training on big matrices) has AI ≈ 500. Deep in the compute-bound regime. GPUs are designed to run at peak here, not during LLM decode.
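The decode-to-prefill transition has a simple first-order model: if one weight read is shared by $B$ tokens, and each FP16 parameter (2 bytes) contributes 2 FLOPs per token, the effective AI collapses to $B$. A sketch under that simplification (it ignores attention FLOPs and activation traffic):

```python
def effective_ai(tokens_per_weight_read, bytes_per_param=2):
    """First-order effective arithmetic intensity of a transformer forward pass:
    2 FLOPs per parameter per token, one weight read shared by all tokens."""
    return 2 * tokens_per_weight_read / bytes_per_param

RIDGE = 989 / 3.35  # H100 ridge point, ~295 FLOPs/byte
for b in (1, 64, 295, 1024):
    ai = effective_ai(b)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"{b:4d} tokens/read -> AI = {ai:6.1f}  ({regime})")
```

Around 295 tokens sharing each weight read, an H100 crosses the ridge — this is exactly the lever that continuous batching pulls later in the act.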
Compute vs bandwidth — see it in wall-clock time
Now actually play with it. Pick a model size, a precision, and a GPU. Watch the compute time and the memory-read time. For decode, memory time is always dramatically larger. The ratio is your bandwidth-bound multiplier.
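If you want the memory side of that experiment without the widget, it is one line of arithmetic per configuration. A sketch of how precision moves the decode floor (the byte widths are illustrative; real quantization schemes add scale/zero-point overhead):

```python
def decode_memory_ms(params, bytes_per_param, bw_bytes_s=3.35e12):
    """Time to stream every weight once from HBM: the floor on decode latency."""
    return params * bytes_per_param / bw_bytes_s * 1e3

# 7B model on an H100 SXM at three weight precisions
for name, bpp in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {decode_memory_ms(7e9, bpp):.2f} ms/token floor")
```

Halving the bytes per weight halves the floor — the compute time barely matters, which is why weight quantization is an attack on the memory wall rather than on FLOPs.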
The physical GPU — where every byte actually lives
The memory pyramid is a useful abstraction, but it hides the geography. An H100 is a physical 814 mm² chip — if you hold one in your hand, you can see the pieces. Eight processing clusters ring the die. A split L2 cache sits in the middle. Five HBM3 memory stacks hug the outside, each connected by thousands of micro-wires through a silicon interposer. This layout is not cosmetic; it determines the topology of data movement.
One Llama layer, one tile at a time
Now to answer the question you should have been asking all along: when the GPU “reads” the weights of a layer from HBM, what actually happens? Not all of the layer streams in at once — that would need gigabytes of SRAM, which we don't have. Instead, the matrix is tiled. A small tile of the weight matrix streams from HBM into an SM's shared memory, gets multiplied against a tile of the input, and produces a partial output. Meanwhile the next tile is already arriving. The 132 SMs all do this in parallel, each owning its own tiles.
Here is the math made completely concrete, for one Q projection in one Llama-7B layer.
Why the tile size is 128 and not 4096
The matrix is 4096 × 4096, but tiles are 128 × 128. Why? The answer sits in the memory hierarchy you just saw on the pyramid. An H100 SM has about 228 KB of shared memory (SRAM). A single 128 × 128 tile of FP16 weights is $128 \times 128 \times 2 = 32$ KB — it fits comfortably with room to spare for input tiles, output accumulators, and software pipelining. A 256 × 256 tile would be 128 KB — also fits, which is why Hopper's tensor cores often use that size. But you cannot fit the full 4096 × 4096 matrix in SRAM. At $4096 \times 4096 \times 2$ bytes it is 33 MB, two orders of magnitude too big. So you tile.
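The SRAM budget check takes three lines. A sketch, using the 228 KB shared-memory figure from above:

```python
SMEM_BYTES = 228 * 1024  # H100 shared memory per SM

def tile_bytes(n, dtype_bytes=2):
    """Footprint of an n x n FP16 tile."""
    return n * n * dtype_bytes

for n in (128, 256, 4096):
    b = tile_bytes(n)
    print(f"{n:4d} x {n:<4d}: {b / 1024:8.0f} KB  fits in SMEM: {b <= SMEM_BYTES}")
```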
The tiling is what makes the data-flow pipeline possible. While tile $i$ is being computed, the Tensor Memory Accelerator (TMA) is already streaming tile $i+1$ from HBM into a different SRAM buffer. When compute finishes tile $i$, the data for tile $i+1$ is already waiting. This is double buffering, and it's what keeps the 132 SMs fed. Without it, every SM would spend most of its time waiting for HBM.
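A toy timing model makes the payoff concrete. The `load_t` and `compute_t` values below are made-up illustrative units, not measured numbers; the point is the structure of the overlap:

```python
def layer_time(n_tiles, load_t, compute_t, double_buffered=True):
    """Toy model of a tiled layer: per-tile HBM load vs tensor-core compute."""
    if double_buffered:
        # Loads overlap compute: after one pipeline-fill load, total time is
        # paced by whichever stream is slower per tile.
        return load_t + n_tiles * max(load_t, compute_t)
    # Serial: every tile pays the full load + compute cost back to back.
    return n_tiles * (load_t + compute_t)

serial = layer_time(1024, load_t=10, compute_t=3, double_buffered=False)
overlapped = layer_time(1024, load_t=10, compute_t=3)
print(serial, overlapped)  # 13312 vs 10250 time units
```

Note what double buffering does and does not do: it hides the compute time entirely behind the loads, but the loads themselves remain — when `load_t` dominates, the pipeline runs at exactly HBM speed, which is the memory wall restated.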
Watch it move — data flow in motion
All the static diagrams are true, but they hide the one thing that makes the pipeline actually work: double buffering. While one tile is being computed in the tensor cores, the next tile is already streaming in from HBM. The pipeline is always full, which is the only reason the SMs don't spend their lives waiting for memory.
Press play below. Watch the copper tiles flow from HBM through L2 into SRAM and finally into the tensor cores, flash briefly while they compute, and exit as results. The dwell ratio is real — tiles spend most of their time in the HBM zone because that's where most of the wall-clock time is actually spent. Speed up the animation and you can see three, four tiles in flight at once — that's the pipeline filling.
What the animation is actually telling you
Three observations worth writing down:
- The HBM zone is wide on purpose. At any given moment, more tiles are in the HBM-to-L2 read phase than in any other phase, because that's the slowest hop in the pipeline. The visual proportions match the real time ratios: ~70% of a tile's lifetime is spent being streamed from HBM.
- Multiple tiles are always in flight. While tile 5 is computing in the tensor cores, tile 6 is in SRAM waiting, tile 7 is streaming from L2, and tile 8 is streaming from HBM. This is double buffering. It's the only reason peak throughput is achievable.
- The tensor cores flash fast. The “compute” phase — the actual math — is the shortest segment on screen. That visual is the memory wall. Compute is fast. Moving bytes is slow. Nothing in Act VIII changes that; everything in Act VIII is a different way of moving fewer bytes for the same amount of math.
How a matmul actually runs on the hardware
The word “matmul” hides an enormous amount of mechanical work. Here is what happens when an H100 executes a matrix multiply $C = A \times B$:
- The scheduler divides the matrices into tiles sized to fit in one SM's SRAM (≈ 128 KB).
- For each tile, the TMA (Tensor Memory Accelerator, a hardware DMA engine) starts streaming the tile from HBM into SRAM asynchronously.
- While one tile loads, the tensor cores multiply the previous tile. This is the single most important trick in GPU design: overlap data transfer with compute so the compute units never stall.
- The partial result stays in registers. When all tiles of a row are done, the result is written back to HBM.
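The four steps above can be sketched as a reference implementation. NumPy stands in for the tensor cores, and the innermost loop's accumulator mirrors the register accumulation in the last step — a model of the data flow, not of the hardware:

```python
import numpy as np

def tiled_matmul(A, B, tile=128):
    """Tiled matmul: each output tile is the sum of tile-sized partial
    products over K, accumulated locally and written back once."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):          # rows of output tiles
        for j in range(0, N, tile):      # cols of output tiles
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=np.float32)
            for k in range(0, K, tile):  # stream K-tiles, accumulate in "registers"
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc  # one write-back per output tile
    return C

A = np.random.rand(256, 384).astype(np.float32)
B = np.random.rand(384, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

The asynchronous TMA prefetch and the compute/transfer overlap have no analogue in this serial loop — that is precisely the machinery the hardware adds on top of this arithmetic.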
Where this leads — Act VIII as a tour of the memory wall
Every lesson that follows this one attacks the memory wall from a different angle:
- KV cache — what it is, why it is the biggest source of memory traffic in long-context decode.
- PagedAttention — making the KV cache layout coalesced and recyclable.
- Continuous batching — pack more tokens per weight read, increasing effective AI.
- FlashAttention — keep intermediate tensors in SRAM, avoid HBM round-trips.
- Speculative decoding — verify several proposed tokens in a single weight read, amortising the cost.
- RadixAttention — reuse the KV bytes of shared prefixes instead of re-reading them.
With the roofline in mind, each one reads as a different way of shifting a red dot rightward on the plot you just played with.