How Google made Gemma 4 3x faster (without changing the model)
Google released Multi-Token Prediction (MTP) drafters for Gemma 4, achieving up to 3x inference speedup with zero quality degradation. The technique -- called speculative decoding -- pairs a small fast drafter model with the large target model. The drafter guesses several tokens ahead; the target verifies them all in one parallel pass. This post explains how it works, why it works, and when it does not.
LLM 101
| Term | Simple definition |
|---|---|
| Prefill | The "reading" phase: the model processes your whole prompt to understand the context. |
| Decode | The "writing" phase: the model generates the answer one token at a time. |
| KV cache | The model's short-term memory: saved calculations from earlier tokens so it does not re-read everything. |
TL;DR
- LLM text generation (the decode phase) is slow because generating each token requires loading billions of model parameters from GPU memory -- a memory-bandwidth bottleneck, not a math bottleneck.
- Speculative decoding pairs a small "drafter" with the large target model: the drafter quickly guesses several tokens ahead, and the target verifies all of them in one parallel pass.
- If the drafter is right, you get multiple tokens for the cost of one verification. If it is wrong at position K, you discard from K onward and keep everything before it -- the worst case costs only the cheap drafter pass on top of normal target-model decoding.
- The output distribution is mathematically identical to running the target model by itself. Speedup is free -- it does not change what the model says or how it reasons.
- Gemma 4's MTP drafters share the target model's KV cache (pre-computed context), avoiding redundant work. The result: up to 3x tokens-per-second with no quality loss.
Speculative decoding in one sentence
Speculative decoding is an optimization where a tiny, fast model guesses the next few tokens, and the big, smart model checks them all at once.
Why this matters for interviews
LLM inference optimization is increasingly common in ML system design interviews. "How would you reduce LLM latency in production?" is a real question, and speculative decoding is the right answer for decode-phase bottlenecks. Understanding the prefill/decode split, KV cache, and why verification parallelizes but generation does not will make your answers precise rather than vague.
Breakdown
1. Tokens and the decode loop
LLMs generate text one token at a time. A token is roughly a word fragment -- "unbelievable" might be three tokens: "un", "believ", "able". Each time the model produces a token, it reads that token back as input and runs another full forward pass through all its layers to produce the next one. This is called autoregressive decoding: each output depends on everything before it, so you cannot skip ahead or parallelize generation. You are always waiting for token N before you can start token N+1.
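To make the loop concrete, here is a minimal sketch of autoregressive decoding. `model`, `sample`, and `eos_token_id` are placeholders standing in for whatever inference stack you use, not a specific library's API.

```python
# Minimal sketch of autoregressive decoding. `model` and `sample` are
# placeholders, not a specific library's API.
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # One full forward pass -- loading all the weights -- just to pick one token.
        next_token_logits = model.forward(tokens)
        next_token = sample(next_token_logits)
        tokens.append(next_token)   # hard dependency: needed before the next pass can start
        if next_token == model.eos_token_id:
            break
    return tokens
```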
Interview angle: Autoregressive decoding is the root cause of most LLM latency problems. If an interviewer asks why LLMs feel slow, the first-principles answer is here: sequential token generation with a hard dependency chain.
2. Two phases: prefill vs. decode
LLM inference has two distinct phases with very different performance characteristics.
Prefill is when the model processes your input prompt. All input tokens are available upfront, so the model can process them all in parallel using large matrix-matrix multiplications. This is compute-bound -- the GPU is doing heavy arithmetic and is well-utilized. A long prompt takes more time than a short one, but it completes in one shot.
Decode is when the model generates the response, one token at a time. At this point, the model is doing matrix-vector multiplications: one forward pass per token, loading all the model's weights from VRAM (the GPU's dedicated memory, separate from your laptop's regular RAM -- think of it as the GPU's working memory, with much higher bandwidth but typically 24-80GB of capacity) to the GPU compute units each time. For a 31B parameter model stored in 16-bit floats, that is roughly 62GB of data moving across the memory bus -- per token.
The problem is arithmetic intensity: the ratio of math operations performed to bytes read. Matrix-vector products at batch size 1 have very low arithmetic intensity. The GPU finishes the math almost instantly and then waits for the next batch of weights to arrive from memory. On modern hardware, the GPU is often less than 10% utilized during decode at batch size 1. The bottleneck is memory bandwidth, not compute.
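A back-of-envelope calculation makes the bottleneck concrete. The model size comes from the 31B example above; the bandwidth figure is an assumption for a modern datacenter GPU, not a measured number.

```python
# Decode throughput ceiling at batch size 1: every generated token must stream
# all the weights from VRAM. Bandwidth below is an illustrative assumption.
params = 31e9                      # 31B-parameter target model
bytes_per_param = 2                # 16-bit floats
bytes_per_token = params * bytes_per_param        # ~62 GB read per token

memory_bandwidth = 3.35e12         # ~3.35 TB/s, roughly H100-class HBM (assumption)
ceiling = memory_bandwidth / bytes_per_token
print(f"best case: ~{ceiling:.0f} tokens/s, no matter how fast the math units are")
# ~54 tokens/s -- the compute units could do far more, but they are starved for data.
```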
This distinction matters: the claim that "LLM inference is memory-bandwidth bound" is specifically true for the decode phase at small batch sizes. During prefill, or when serving many users simultaneously with large batch sizes, inference can shift toward being compute-bound.
Interview angle: The prefill/decode distinction is frequently tested. Know that prefill is compute-bound (parallel, fast) and decode is memory-bandwidth bound at small batches (sequential, slow). If asked "how would you speed up LLM inference?", the right first question back is: "are you bottlenecked on time-to-first-token (prefill) or tokens-per-second throughput (decode)?" -- they have different solutions.
3. The insight behind speculative decoding
Here is the asymmetry that makes speculative decoding possible: generating a token from scratch is expensive, but checking whether a given token is correct is cheap.
Think of it like a proofreader and a fast drafter. The drafter writes several sentences quickly -- they might be right, might need a small fix. The proofreader reads the whole draft in one pass and either stamps it approved or marks the first mistake. If the draft was right, you just got four sentences for the price of one review. If the drafter made a mistake at sentence two, you discard sentences two, three, and four -- but you kept sentence one, which you would have had to generate anyway.
In speculative decoding, the drafter is a small, fast model (Gemma 4's MTP drafter is far smaller than the 31B target). It generates K tokens quickly -- each pass is cheap because the model is tiny. The large target model then runs one forward pass over all K draft tokens simultaneously and produces a probability distribution over what each token should have been.
For each position, the target model either accepts the draft token (it agrees the token is plausible given the context) or rejects it. If all K tokens are accepted, the target model also produces one additional token of its own, giving you K+1 tokens for the cost of one target model forward pass. If the draft is rejected at position i, the token at position i and everything after it are discarded; the target model supplies a corrected token at that position and decoding continues normally from there.
Critically, this is not an approximation. With the proper rejection sampling scheme, the probability distribution of the output is mathematically identical to running the target model alone. The target model always has final authority. Speedup is free -- you are not trading quality for speed.
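Here is a sketch of one draft-and-verify cycle using the standard accept/reject rule from the speculative decoding literature. The interfaces `drafter.probs` and `target.probs_for_positions` are hypothetical (a real implementation would batch and cache differently), but the accept/resample logic is what preserves the target distribution.

```python
import numpy as np

def speculative_step(target, drafter, tokens, k, rng):
    """One draft-and-verify cycle. `drafter.probs(ctx)` and
    `target.probs_for_positions(ctx, draft)` are hypothetical interfaces
    returning next-token probability distributions over the vocabulary."""
    # 1. Drafter proposes k tokens cheaply, one at a time.
    ctx, draft, draft_probs = list(tokens), [], []
    for _ in range(k):
        p = drafter.probs(ctx)
        t = rng.choice(len(p), p=p)
        draft.append(t); draft_probs.append(p); ctx.append(t)

    # 2. Target scores every drafted position in ONE parallel pass:
    #    q[i] = target's distribution at draft position i, q[k] = the bonus position.
    q = target.probs_for_positions(tokens, draft)

    # 3. Accept/reject left to right.
    out = list(tokens)
    for i, t in enumerate(draft):
        if rng.random() < min(1.0, q[i][t] / draft_probs[i][t]):
            out.append(t)                              # target agrees: keep the draft token
        else:
            residual = np.maximum(q[i] - draft_probs[i], 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            return out                                 # discard the rest of the draft
    # 4. All k tokens accepted: the target's extra position yields one bonus token.
    out.append(rng.choice(len(q[k]), p=q[k]))
    return out
```

The residual-resampling step on rejection is the detail that keeps the overall output distribution identical to the target model's: if the drafter and target agree exactly, every token is accepted, and the more they diverge, the more often the loop falls back to the target's own sample.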
Interview angle: The key insight to articulate in an interview is the asymmetry: sequential generation requires one forward pass per token, but parallel verification only requires one forward pass for K tokens at once. The mathematical equivalence proof -- that the output distribution is unchanged -- is what makes this safe to deploy.
4. Why verification parallelizes but generation cannot
This question trips people up: why can you verify K tokens in parallel but not generate them in parallel?
During verification, you already have all K draft tokens. You can feed the entire sequence into the target model at once as a single forward pass (just like prefill). The model produces probability distributions for each position simultaneously. There is no sequential dependency between positions here -- you are just scoring a complete sequence that already exists.
During generation, you do not have the next token yet. To produce token N+1, the model must condition on token N. There is no way around this -- the whole architecture is designed to predict the next token given everything before it. You cannot run position N+1 until position N is finalized. The dependency is fundamental, not an implementation detail.
This asymmetry is precisely what speculative decoding exploits: the drafter fills in the dependency chain with guesses, then the target model evaluates the complete sequence in one cheap parallel pass.
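You can see the single-pass property directly in any causal-LM library. The snippet below uses Hugging Face `transformers` with GPT-2 purely as a small stand-in for a target model: one forward pass over context plus draft returns a next-token distribution at every position, which is all verification needs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # any causal LM works; GPT-2 is just small
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "for i in range("
draft = "0, 10):"                                 # pretend a drafter proposed this continuation
ids = tok(context + draft, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                    # shape: (1, sequence_length, vocab_size)

# logits[0, i] is the model's distribution for position i + 1, so this single pass
# tells us, for every draft token, whether the target would have produced it.
print(logits.shape)
```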
Interview angle: A strong systems answer: sequential generation has O(n) serial dependencies. Parallel verification eliminates them by treating a complete draft as a single batch. This is the same class of insight as instruction pipelining in CPUs -- you cannot start stage N+1 of a single instruction until stage N completes, but you can keep many instructions in flight at once.
5. What makes Gemma 4's MTP drafter particularly efficient
Google introduced several architectural choices that go beyond basic speculative decoding.
KV cache sharing is the most important. The KV cache is a stored record of the Key and Value matrices the model computed for every token in the context so far. It is what prevents the model from recomputing context from scratch on each new token -- without it, a 1000-token conversation would require reprocessing all 1000 tokens on every generation step. In a naive speculative decoding setup, the drafter would maintain its own separate KV cache and recompute context independently. Gemma 4's drafter instead reads directly from the target model's KV cache. It does not duplicate work the large model already did. This is significant because KV cache computation can be a non-trivial fraction of inference cost for long contexts.
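To see why a duplicate cache is worth avoiding, here is rough size arithmetic under made-up architecture numbers (illustrative only, not Gemma's actual dimensions):

```python
# Rough KV-cache cost per token: keys + values for every layer and KV head.
# All architecture numbers below are illustrative, not Gemma's.
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_value = 2                      # 16-bit
context_len = 8000

per_token = layers * kv_heads * head_dim * 2 * bytes_per_value   # x2 for keys and values
total = per_token * context_len
print(f"~{per_token/1e3:.0f} KB per token, ~{total/1e9:.2f} GB for an {context_len}-token context")
# A drafter with its own cache would pay a (smaller) copy of this in memory and,
# worse, would have to re-encode the whole context through its own layers first.
```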
For the E2B and E4B edge models (designed to run on phones and tablets), the bottleneck shifts: generating the final token logits (the raw probability scores over the entire vocabulary -- typically 250,000 tokens) becomes expensive because the vocabulary is large relative to the model size. Google added an efficient embedding clustering technique to speed this up, essentially grouping similar vocabulary entries so fewer computations are needed.
For the 26B Mixture-of-Experts (MoE) model on Apple Silicon, batch size matters more than it does for dense models. In a MoE architecture, each token is routed to only a small subset of specialized sub-networks ("experts"). At batch size 1, most experts sit idle. At batch sizes of 4-8, different tokens in the batch activate different experts in parallel, dramatically improving hardware utilization -- up to 2.2x speedup compared to batch size 1.
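A toy simulation shows the utilization effect. The expert count and routing here are made up (random routing, illustrative sizes), but the pattern is the point: a larger batch touches more of the model per step.

```python
import numpy as np

# Toy MoE routing: each token activates top_k of num_experts experts.
# Config and routing are illustrative only, not Gemma's actual setup.
num_experts, top_k = 32, 2
rng = np.random.default_rng(0)

for batch_size in (1, 4, 8):
    active = set()
    for _ in range(batch_size):
        active.update(rng.choice(num_experts, size=top_k, replace=False))
    print(f"batch {batch_size}: {len(active)}/{num_experts} experts doing useful work this step")
```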
Interview angle: KV cache is a standard interview topic. Know that it trades memory for speed by caching computed context. KV cache sharing between drafter and target is an architectural optimization that reduces drafter overhead. The MoE batch size observation is a good example of hardware-aware optimization -- the same technique behaves differently on different architectures.
6. When speculative decoding wins -- and when it does not
The speedup from speculative decoding is not constant. It depends on how accurately the drafter can predict what the target model would say -- the acceptance rate.
Speculative decoding works best on predictable text: structured code (the next token in `for i in range(` is almost always `0`), formulaic writing, repetitive patterns, and factual recall. In these cases, the drafter gets most tokens right and the acceptance rate is high. Google reports up to 3x speedup across tested hardware and frameworks.
It works less well on highly creative or unpredictable outputs where the target model might reasonably generate any of many different tokens. The drafter's guesses miss more often, sequences get discarded more frequently, and the net benefit shrinks. In the worst case, if the drafter is consistently wrong, you pay the overhead of running the drafter on top of the target model. In practice, speculative decoding is almost never slower than standard inference -- rejecting all draft tokens costs one wasted drafter pass, and you fall back to normal target model behavior.
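A rough model of the payoff: if each draft token is accepted independently with probability a and the draft length is k, the expected number of tokens per target pass is 1 + a + ... + a^k. This ignores drafter overhead and real acceptance patterns, so treat it as a sketch of the relationship, not a prediction.

```python
# Expected tokens produced per target forward pass, assuming each draft token is
# accepted independently with probability a (a simplification of real behavior).
def expected_tokens(a, k):
    return sum(a**i for i in range(k + 1))   # a rejection at position i still yields i+1 tokens

for a in (0.5, 0.7, 0.9):
    print(f"acceptance {a:.0%}, draft length 4: ~{expected_tokens(a, 4):.1f} tokens per pass")
# ~1.9, ~2.8, and ~4.1 tokens per pass -- acceptance rate drives the whole speedup.
```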
The general principle here extends beyond Gemma 4: speculative decoding is now a standard technique in production LLM serving. Anthropic, OpenAI, and most serious inference frameworks (vLLM, SGLang, TensorRT-LLM) support it. When you see "fast inference mode" or "draft model" in an API or library, this is almost always what is happening under the hood.
Interview angle: "How would you reduce LLM inference latency?" The structured answer: first identify whether you are bottlenecked on prefill or decode. For decode latency at small batch sizes, speculative decoding is the most effective technique when the drafter acceptance rate is high. Complementary approaches include quantization (smaller model weights = faster memory reads), continuous batching (fill idle GPU capacity with other requests), and KV cache offloading for long contexts. Each targets a different part of the bottleneck.
7. Why 3x matters for users
A 3x speedup is not just an infrastructure metric. It changes how the product feels.
Imagine a chat response that would normally take 9 seconds to stream. At 3x faster, it can finish in about 3 seconds. That moves the experience from "I am waiting for a slow printer" toward "someone is typing back quickly." For coding assistants, that gap is even more obvious: completions that arrive after your train of thought has moved on feel laggy, while completions that arrive immediately feel like part of the editor.
This is why speculative decoding is especially valuable for interactive LLM products. It improves tokens-per-second during the decode phase without changing the model output, so the same model can feel more responsive without retraining, shrinking, or accepting lower quality.
Interview angle: Tie performance work back to product impact. Lower decode latency improves perceived responsiveness for chat, code completion, voice assistants, and tutoring flows. The systems win is not only cheaper serving -- it is keeping the user in flow.
Key Concepts
Token
The basic unit of text an LLM processes. A token is roughly a word fragment -- most common words are one token, longer or rarer words are split into several. LLMs generate one token at a time.
Autoregressive decoding
The standard LLM generation loop: produce one token, feed it back as input, produce the next token. Each output depends on all previous outputs, so generation is inherently sequential.
Prefill
The phase where the model processes the input prompt. All input tokens are available at once, so this runs in parallel and is compute-bound.
Decode
The phase where the model generates output tokens one by one. Sequential and memory-bandwidth bound at small batch sizes -- loading model weights from VRAM is the bottleneck, not arithmetic.
VRAM
Video RAM -- the dedicated memory on a GPU. Separate from your system RAM. Has much higher bandwidth (useful for fast weight loading) but limited capacity, typically 24-80GB on high-end consumer or workstation GPUs.
Analogy: If your CPU and its RAM are a desk where you do all your thinking work, VRAM is a smaller but faster workbench right next to the machine that actually runs the LLM. Everything the GPU needs during inference must fit on that workbench.
KV cache
A cache of the Key and Value matrices the model computed for each previous token in the conversation. Reusing these avoids re-processing the entire context from scratch on every generation step. Grows linearly with sequence length.
Analogy: Like keeping your meeting notes on the table instead of re-reading the transcript every time someone asks a follow-up question. The KV cache is what lets an LLM remember what you said at the start of a long conversation without re-reading it word by word every time it generates a reply.
Speculative decoding
An inference technique where a small drafter model predicts several tokens ahead, and the larger target model verifies them all in one parallel pass. Accepted drafts give multiple tokens for the cost of one verification. The output distribution is identical to running the target model alone.
Acceptance rate
The fraction of drafter tokens that the target model agrees with during verification. Higher acceptance rate = more tokens produced per target model forward pass = larger speedup.
Mixture of Experts (MoE)
A model architecture where each token is routed to a small subset of specialized sub-networks ("experts") rather than running through all parameters. Reduces per-token compute but requires routing overhead. Gemma 4 26B is an MoE model.
Arithmetic intensity
The ratio of floating-point operations to bytes read from memory in a computation. Low arithmetic intensity means the hardware spends more time waiting for data than computing -- the memory-bandwidth bound regime.