Microscale
Act I · The Landscape
lesson cost-curve · 6 min · 25 xp

The cost curve

Drag a slider from 1B to 500B and feel the curve

“Small” is a question about deployment, not a parameter count

When you decide how big to train a model, you're not optimising one thing — you're trading off at least three:

  • Training compute — one-time cost, paid on a cluster for weeks.
  • Inference compute — paid on every single token served, for the entire life of the model.
  • Memory footprint — determines what hardware can host it at all.

The landmark Chinchilla result (Hoffmann et al., 2022) showed that given a fixed training budget, you should scale model size and training tokens roughly equally, at a ratio near $D \approx 20N$. That gives the minimum training loss per training FLOP. But it says nothing about inference.

The 2024 follow-up “Beyond Chinchilla-Optimal” (Sardana et al.) added the missing term. Once you include the cost of serving, the optimum moves toward smaller, longer-trained models, because inference cost dominates lifetime compute the moment you serve more than a handful of billion tokens. Llama 3 8B is trained at $D/N \approx 1{,}875$ — about 94× more than Chinchilla says is optimal. That's not a mistake. It's the new frontier.
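That over-training factor is a one-liner to check. A sketch, assuming Llama 3's reported ~15T training tokens:

```python
# Compare Chinchilla-optimal training tokens with Llama 3 8B's actual recipe.
# The 15T-token figure is Meta's reported training corpus size.
N = 8e9                   # parameters
D_chinchilla = 20 * N     # Chinchilla-optimal tokens: D ≈ 20N
D_actual = 15e12          # ~15T training tokens

ratio = D_actual / D_chinchilla
print(f"Chinchilla-optimal D: {D_chinchilla/1e12:.2f}T tokens")  # 0.16T
print(f"actual D:             {D_actual/1e12:.0f}T tokens")      # 15T
print(f"over-training factor: {ratio:.0f}x")                     # 94x
```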

Play with the dial

Slide the parameter count and the serving volume. Watch the numbers on the right. In particular, watch the dashed teal curve at the bottom: it shows what fraction of lifetime compute is inference, not training, as a function of model size. For any meaningful deployment volume, that curve rises fast.

[Interactive: parameter-count slider, 0.5B–500B (from Qwen3-0.6B to frontier GPT-4-class), × serving-volume slider (from a research demo to a global product). Example readout at N = 3.0B, 10B served tokens: memory FP16 6.00 GB · memory Q4 (gguf) 1.50 GB · train 12.5 H100-days · serve 16.7 H100-hours.]

[Chart: inference share of lifetime compute vs. log₁₀ N (billions), 0–100%, with a marker at your current (N, serving volume) — what fraction of lifetime cost is inference, at your current serving volume?]

The math, made explicit

Three well-known approximations underlie the plot:

C_{\text{train}} \;\approx\; 6 \cdot N \cdot D \quad\text{[Kaplan, 2020]}
D \;\approx\; 20 \cdot N \quad\text{[Chinchilla, 2022]}
C_{\text{infer per token}} \;\approx\; 2 \cdot N

Combining the first two gives the Chinchilla training compute $C_{\text{train}} \approx 120 N^2$. The inference cost over $T$ served tokens is $C_{\text{infer}} = 2NT$. The crossover (inference equals training) happens when $T \approx 60N$ — in words: once you serve more tokens than 60× your parameter count, inference starts dominating. For a 3B model that's 180B tokens — just a few weeks of a mid-traffic product. After that, every one of those parameters is paying rent on every token, and shrinking $N$ pays back linearly.
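The crossover algebra is easy to sanity-check in code. A minimal sketch of the three approximations (`train_flops` and `infer_flops` are illustrative names, not a library API):

```python
def train_flops(n_params, n_tokens=None):
    """C_train ≈ 6·N·D; defaults to the Chinchilla recipe D = 20N."""
    if n_tokens is None:
        n_tokens = 20 * n_params
    return 6 * n_params * n_tokens

def infer_flops(n_params, served_tokens):
    """C_infer ≈ 2·N FLOPs per token, times T served tokens."""
    return 2 * n_params * served_tokens

N = 3e9
T_cross = 60 * N  # inference compute equals Chinchilla training compute
print(f"crossover at T = {T_cross/1e9:.0f}B served tokens")  # 180B
print(train_flops(N) == infer_flops(N, T_cross))             # True
```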

The 6 in Ctrain6NDC_{\text{train}} \approx 6ND comes from: 2 FLOPs for the forward pass matmul (add + multiply), 2 for the gradient w.r.t. activations, 2 for the gradient w.r.t. weights. It's approximate — real training has optimizer state, attention, normalisation, etc. — but for scaling-law reasoning it's close enough. Inference is just the forward pass, so ~2N.

Memory is a separate constraint

Compute tells you what costs money. Memory tells you what fits at all. FP16 weights are 2 bytes per parameter, so a 3B model is 6 GB of weights alone. A 70B model is 140 GB — already past any single consumer GPU. 4-bit quantization (Q4) slashes this by 4× — a 70B becomes 35 GB, fitting on a single 80 GB A100 or a Mac Studio with 96 GB unified memory. That's the whole reason Act VII exists.
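A back-of-envelope version of that memory arithmetic. Q4 is idealized here as 0.5 bytes per parameter; real GGUF files run slightly larger because quantization scales and some tensors stay at higher precision:

```python
# Weight-only memory: bytes-per-parameter × N (ignores KV cache and runtime
# overhead). Q4 at an idealized 0.5 bytes/param; real files carry extra.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_gb(n_params, fmt):
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for n in (3e9, 70e9):
    print(f"{n/1e9:.0f}B: fp16 {weight_gb(n, 'fp16'):.0f} GB, "
          f"q4 {weight_gb(n, 'q4'):.1f} GB")
# 3B: fp16 6 GB, q4 1.5 GB
# 70B: fp16 140 GB, q4 35.0 GB
```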

Translate the FLOPs into the number a CFO actually asks for. At April-2026 spot pricing on open-weight inference providers (Together, Fireworks, DeepInfra), a Llama-3.1-8B endpoint runs roughly $0.10–0.20 per million output tokens; a 70B is $0.80–0.90; GPT-4o is $10; Claude Opus 4 is $75. That's a 500× gap from the cheapest 8B workhorse to the top of the frontier, for tokens that — on a well-scoped intent classification or a RAG rewrite — are indistinguishable to the end user. The entire commercial case for SLMs lives in that ratio: if your workload fits inside what an 8B can do, paying Opus-per-token is lighting ~99.8% of your inference bill on fire.
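To see that ratio as a bill rather than a price sheet, here's a sketch using the per-million-token prices quoted above and an assumed example workload of 2B output tokens a month (prices move often; treat these as illustrative):

```python
# Monthly bill at the April-2026 per-million-output-token prices from the text.
PRICE_PER_MTOK = {
    "llama-3.1-8b":  0.15,   # midpoint of the $0.10–0.20 range
    "llama-3.1-70b": 0.85,
    "gpt-4o":        10.0,
    "claude-opus-4": 75.0,
}

monthly_tokens = 2e9  # assumed workload: 2B output tokens / month

for model, price in PRICE_PER_MTOK.items():
    print(f"{model:14s} ${monthly_tokens / 1e6 * price:>10,.0f}/mo")

savings = 1 - PRICE_PER_MTOK["llama-3.1-8b"] / PRICE_PER_MTOK["claude-opus-4"]
print(f"8B vs Opus: {savings:.1%} cheaper")  # 99.8% cheaper
```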

The “pragmatic frontier” of sub-3B isn't an aesthetic call — it's where three hard budgets intersect:

  • Disk — a Q4-quantized 3B is ~1.8 GB, small enough to ship inside an iOS app bundle (Apple's soft limit on cellular download is 200 MB, but on-device downloads routinely hit a few GB — Gemma-2-2B-Q4 at 1.6 GB ships in Google AI Edge today).
  • RAM — a 3B Q4 fits in the ~3 GB working set a mid-tier Android phone will give you without being killed by the OOM killer; a 7B at ~4 GB already needs a Pixel 9 Pro or better.
  • Latency — at the 2N-FLOPs-per-token approximation, a 3B model decodes at ~60 tok/s on an M3-class NPU and ~120 tok/s on an H100 — fast enough to stay under the ~300 ms first-token budget a voice agent needs.

A 13B model misses all three of those budgets at once. That's why Phi-3.5-mini (3.8B), Gemma-3-4B, Qwen3-4B, and Llama-3.2-3B cluster so tightly — they are each other's competitive set because the hardware envelope says so.
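The tok/s figures follow from the 2N approximation once you assume an achieved (not peak) FLOP/s for single-stream decode. The achieved-throughput numbers below are assumptions chosen to match the figures quoted above; decode is memory-bandwidth-bound, so it runs far below a chip's peak compute:

```python
# Decode speed under the 2N-FLOPs-per-token approximation.
# achieved_flops_per_s values are assumptions, not datasheet peaks.
def decode_tok_per_s(n_params, achieved_flops_per_s):
    return achieved_flops_per_s / (2 * n_params)

N = 3e9
print(decode_tok_per_s(N, 3.6e11))  # 60.0  (assumed M3-class NPU throughput)
print(decode_tok_per_s(N, 7.2e11))  # 120.0 (assumed H100 single-stream)
```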

comprehension check
comprehension · 1 / 3

What does Chinchilla say the compute-optimal ratio of training tokens to parameters is?