ai_math April 25, 2026 · 64 min read

From a Token to a Generation: Every Part

This is a long, mechanical walk through what a large language model actually is — every layer, every matrix, every numeric format, every trick that makes a 70B parameter file turn into running text. By the end you should be able to read a sentence like "we fine-tuned a 70B with QLoRA at 4-bit and served it with paged attention" and know what every word means and why each choice exists.

What a token actually is

Before there is a model, there is a string. The string has to become numbers, because the only thing a transformer can do is multiply and add. So the first thing an LLM is, mechanically, is a function that turns a string into a sequence of integers and then turns each integer into a vector. That sounds trivial. It is not.

A naive scheme would map each Unicode character to an integer. There are about 150,000 assigned Unicode code points, so the vocabulary would be roughly that size, and "hello" would become a sequence of five tokens. The problem is that natural text has structure at a level much coarser than the character. The model has to spend its capacity learning that "ing" is a suffix and "the" is a frequent function word, instead of having those facts handed to it. Per-character models also produce very long sequences, and attention is quadratic in sequence length, so this is not a small inefficiency.

The other extreme is whole words. English has on the order of a million word-forms in serious corpora, more once you include morphology, names, code, and other languages. A million-entry embedding table at $d = 4096$ in float16 is 8 GB just for the input embedding. Worse, you cannot represent words you have never seen. "docstring" will land in <unk> and the model will be blind to it.

The right answer, found by trial and error in machine translation in the late 2010s, is subword tokenization. You learn a vocabulary of common substrings — typically 32k to 256k entries — that ranges from single bytes (for any character at all) up to whole common words ("the", " the", "ing"). Common strings get one token, rare strings get split into pieces, and nothing is ever out-of-vocabulary because at the limit the tokenizer can fall back to bytes.

There are two dominant algorithms. Byte Pair Encoding (BPE), used by GPT-2/3/4 and Llama, starts with a base vocabulary (often the 256 bytes) and repeatedly finds the most frequent adjacent pair in the training corpus and merges it into a new token. After 50,000 merges you have a vocabulary of 50,256 entries (GPT-2's actual vocabulary is 50,257: 256 bytes, 50,000 merges, and one end-of-text special token). SentencePiece, used by T5 and PaLM, treats the input as a raw byte stream including spaces (so " cat" and "cat" are distinct tokens) and trains a unigram language model over substrings, keeping the most probable subwords. The practical difference: BPE is greedy and deterministic given the merge table; SentencePiece can produce multiple segmentations and pick the most likely. Llama uses a SentencePiece-trained BPE — the algorithm is BPE, the implementation is SentencePiece. This kind of detail matters when you are debugging.

A vocab table is, on disk, a JSON file mapping integers to byte sequences and a list of merges. In memory at training time it is two things: the token-id-to-string table (a few megabytes) and the embedding matrix $E \in \mathbb{R}^{V \times d}$, where $V$ is vocabulary size and $d$ is the model's hidden dimension. To embed a token, you do $E[\text{token\_id}]$. That is a row lookup, but in the math we treat it as multiplication by a one-hot vector: $e = x^\top E$ where $x$ is a one-hot of length $V$. The lookup view is what's actually implemented; the matrix-multiply view is what makes the gradients clean.
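
A minimal sketch of that equivalence, with dimensions scaled down so it runs instantly (Llama's real table would be $32{,}000 \times 4096$):

import torch

V, d = 1000, 64                     # scaled-down stand-ins for V=32000, d=4096
E = torch.randn(V, d)               # the embedding matrix

token_id = 137
e_lookup = E[token_id]              # what implementations do: read one row

one_hot = torch.zeros(V)
one_hot[token_id] = 1.0
e_matmul = one_hot @ E              # what the math says: one-hot times E

assert torch.allclose(e_lookup, e_matmul)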

The embedding dimension $d$ is the width of the model's working memory. Every vector in every layer lives in $\mathbb{R}^d$. For Llama-2-7B, $d = 4096$. For Llama-2-70B, $d = 8192$. For GPT-3 175B, $d = 12288$. Every doubling of $d$ roughly quadruples the cost of the attention and MLP matrices, so this number is chosen carefully.

How many parameters live in the embedding? With $V = 32{,}000$ (Llama's vocab) and $d = 4096$, the input embedding is about 131 million parameters. There is usually an output projection — the matrix that turns the final hidden state back into vocab-sized logits — of the same shape, sometimes tied to the input embedding to save memory (GPT-2 ties them; Llama keeps them separate). In a 124M-parameter GPT-2, the tied token embedding is about 39M parameters, roughly a third of the model; untied input and output embeddings would push that to nearly half. This is why people sometimes describe small models as "mostly embedding". It is also why scaling $d$ is so expensive in a small regime: you are paying for the embedding table whether or not you use it well.

                                token id 15043 ("hello")
                                          |
                                          v
                                +---------+---------+
   E in R^{V x d}  ----row-----> | 0.13 -0.4 ... 0.7 |   <- e in R^d
   (V=32000, d=4096)             +-------------------+
                                          |
                                          v
                                  to layer 1 of transformer

A subtlety that bites people: the tokenizer is part of the model. Two checkpoints with different tokenizers cannot share weights, even if every other dimension matches, because token id 1138 means different things to them. A "Llama-2 tokenizer" and a "Llama-3 tokenizer" are different files and the embeddings learned with one are unusable with the other. When fine-tuning, never swap the tokenizer.

The other subtlety is that the same string can tokenize to different sequences depending on context, especially around whitespace. " hello" (space-h-e-l-l-o) is usually one token; "hello" without the leading space is sometimes two. This is why prompt format matters down to the byte: training and inference have to agree, or the model is being shown a sequence it never saw at train time.
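
You can check this directly with a tokenizer library. A minimal sketch using Hugging Face's AutoTokenizer; the GPT-2 tokenizer here is just a convenient public example, and the exact splits you see depend on which tokenizer you load:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.tokenize("hello"))     # e.g. ['hello']
print(tok.tokenize(" hello"))    # e.g. ['Ġhello'] -- the leading space is part of the token
print(tok.encode("hello"), tok.encode(" hello"))   # different id sequences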

With the string converted into a sequence of $d$-dimensional vectors, we are ready for the part that does the work.

The transformer block, mechanically

A transformer block takes a sequence of $T$ vectors of dimension $d$ and returns another sequence of $T$ vectors of dimension $d$. The shapes do not change. What changes is the content: each output vector has been allowed to gather information from the other vectors in the sequence, and then to think about what it gathered. The block does this with two sub-layers, each wrapped in a residual connection. The first sub-layer is attention, the second is the MLP (also called the feed-forward network, or FFN). Everything else in a transformer is plumbing around this pattern.

Start with attention. The intuition is associative recall: every position in the sequence has a question (what would I like to know?), a tag (what am I about?), and a value (what would I share if asked?). For each query position, you look at every other position's tag, score how well that tag matches your question, softmax the scores, and read out a weighted average of the values. That is the entire mechanism. The cleverness is in how the questions, tags, and values are produced.

Given input $X \in \mathbb{R}^{T \times d}$, we compute three projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learned matrices. $Q$ is the queries, $K$ is the keys, $V$ is the values, each of shape $T \times d$. The attention output is

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Read it row by row. For row $i$ — the $i$th query — $(QK^\top)_{ij} = q_i \cdot k_j$ is the dot product between query $i$ and key $j$. That dot product is high when $q_i$ and $k_j$ point in similar directions, low when they don't. The softmax over $j$ turns those scores into a probability distribution: how much of position $j$ should I read? Then we take that distribution and use it to mix the value vectors: the output at position $i$ is a convex combination of $\{v_j\}$, weighted by how relevant each was to query $i$.

Why dot product? Because it is the most parameter-cheap differentiable similarity function. The model can learn whatever projections it likes for $Q$ and $K$; the dot product just compares them. There are alternatives — additive attention from the original Bahdanau machine translation paper, kernelized attention, etc. — but they cost more compute for negligible benefit in practice.

Why divide by $\sqrt{d_k}$? Because dot products of high-dimensional vectors with iid unit-variance components have variance proportional to dimension. Without scaling, as $d$ grows the softmax inputs grow, and the softmax saturates: one entry becomes ~1, all others ~0, and the gradient vanishes. Dividing by $\sqrt{d_k}$ keeps the variance stable. $d_k$ here is the key dimension per head, which we'll get to in a moment.

Why the softmax? Because we want a probability distribution over positions, both for interpretability and because mixing values with normalized weights keeps the output's scale stable across sequence lengths. You could imagine other normalizations; softmax has the right gradients and the right behaviour at the extremes.

Now, multi-head attention. Doing one big attention with $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ is wasteful: a single softmax has to do all the work of figuring out what to attend to. So we split the $d$-dimensional space into $h$ heads of dimension $d_k = d/h$, run attention independently in each subspace, and concatenate. Each head sees its own low-rank slice of the queries and keys. One head might learn to attend to the previous token; another might learn to copy from a name mentioned earlier; another might attend to the matching bracket. We don't pre-specify what they do; we only specify that they get to do it in parallel, with their own projections.

Concretely, $W_Q$ is reshaped from $d \times d$ to $d \times h \times d_k$, and after the per-head attention, the output is concatenated back to $d$ and passed through a final projection $W_O \in \mathbb{R}^{d \times d}$. For Llama-2-7B, $d = 4096$, $h = 32$, $d_k = 128$. For Llama-2-70B, $d = 8192$, $h = 64$, $d_k = 128$. The head dimension is held roughly constant across scales — 64 or 128 — and you scale by adding heads.

import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (batch, T, d) -- the input sequence, T tokens of dim d
    B, T, d = x.shape
    d_k = d // num_heads  # per-head dim, e.g. 128

    # Project to Q, K, V then split into heads.
    # After the view+transpose: (batch, num_heads, T, d_k)
    Q = (x @ w_q).view(B, T, num_heads, d_k).transpose(1, 2)
    K = (x @ w_k).view(B, T, num_heads, d_k).transpose(1, 2)
    V = (x @ w_v).view(B, T, num_heads, d_k).transpose(1, 2)

    # Scaled dot-product scores: (batch, num_heads, T, T)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)

    # Causal mask: position i may only attend to positions j <= i.
    # We add -inf to disallowed positions before softmax.
    mask = torch.triu(torch.full((T, T), float('-inf'), device=x.device), diagonal=1)
    scores = scores + mask

    attn = F.softmax(scores, dim=-1)        # (B, h, T, T)
    out = attn @ V                           # (B, h, T, d_k)
    out = out.transpose(1, 2).reshape(B, T, d)  # back to (B, T, d)
    return out @ w_o                         # final mix across heads

The causal mask is the difference between an LLM (decoder-only, autoregressive) and an encoder. We forbid attending to future positions because at inference we don't have them; we want the same model that sees [t1, t2, t3] and predicts t4 during inference to have been trained that way. The mask makes the loss at every position simultaneously a valid next-token prediction.

After attention comes the MLP. It is two linear layers with a non-linearity in between:

$$\text{MLP}(x) = \sigma(x W_1) \, W_2$$

where $W_1 \in \mathbb{R}^{d \times d_\text{ff}}$, $W_2 \in \mathbb{R}^{d_\text{ff} \times d}$, and $d_\text{ff}$ is typically $4d$ in classic transformers — Llama uses a slightly different ratio, $d_\text{ff} \approx \frac{8}{3} d$ rounded to a multiple of 256, because it uses a gated activation (SwiGLU) that has three matrices instead of two. The non-linearity $\sigma$ is GELU or SwiGLU in modern models; the older ReLU is mostly gone.
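
As a concrete sketch, here is a Llama-style gated MLP (SwiGLU) with Llama-2-7B's dimensions; the module and weight names are mine, not the checkpoint's:

import torch
import torch.nn.functional as F

class SwiGLUMLP(torch.nn.Module):
    def __init__(self, d=4096, d_ff=11008):
        super().__init__()
        self.w_gate = torch.nn.Linear(d, d_ff, bias=False)   # gate projection
        self.w_up   = torch.nn.Linear(d, d_ff, bias=False)   # up projection
        self.w_down = torch.nn.Linear(d_ff, d, bias=False)   # back down to d

    def forward(self, x):
        # gated activation: silu(gate(x)) elementwise-times up(x), then project down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))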

What does the MLP do? The honest answer is: we do not entirely know, and most of mechanistic interpretability is about answering exactly this. The cleanest theory is that the MLP is a key-value memory. $W_1$ projects the hidden state into a wide $d_\text{ff}$-dimensional space; each row of $W_1$ acts as a key, the activation $\sigma(W_1 x)$ measures how much each key fired, and $W_2$ uses those firings to write back a stored value into the residual stream. Two-thirds of the parameters in a transformer typically live in the MLPs. If attention is "look up information from elsewhere in the sequence", the MLP is "look up information from the model's own knowledge".

The full block, with residual connections, is:

$$x' = x + \text{Attention}(\text{LN}(x))$$
$$x'' = x' + \text{MLP}(\text{LN}(x'))$$

The residual connections are not decoration. They are what makes the model trainable. Without them, gradients have to flow through every nonlinearity in series, and they vanish. With them, every layer adds a correction to a running sum, and gradients flow backward through the addition unchanged. The right way to think about the residual stream is as a bus: each layer reads from the bus, computes something, and writes a delta back. Layers do not replace each other's outputs; they accumulate. This view, due to Anthropic's circuits work, is more useful than thinking of the model as a stack of transformations.

   residual stream  ----+--------+---------+--------+----->
   (a vector in R^d     |        ^         |        ^
    that grows by       v        |         v        |
    accumulation)    [ LN ]      |      [ LN ]      |
                       |         |         |        |
                       v         |         v        |
                   [ Attn ]------+     [ MLP ]------+
                  (read+write)      (read+write)

That picture also explains why LayerNorm placement matters. Original transformer (post-norm) put the LN after the residual addition: $x' = \text{LN}(x + \text{Attn}(x))$. Modern transformers (pre-norm) put it inside the residual branch: $x' = x + \text{Attn}(\text{LN}(x))$. With pre-norm, the residual stream itself is never normalized — it grows freely — and only the input to each sublayer sees normalized values. This makes deep models far more trainable. Post-norm essentially passes $x$ through a normalizer once per layer, which dampens the signal that the residual stream is supposed to carry. Every modern model — Llama, GPT-NeoX, Mistral, Gemma — is pre-norm. Many use RMSNorm instead of LayerNorm, which drops the mean-centering step (only divides by RMS) for compute savings with no measured loss in quality.

So one transformer block is: read, normalize, attend, add; read, normalize, MLP, add. A model is dozens of these stacked. Llama-2-7B has 32 blocks; Llama-2-70B has 80. GPT-3 has 96. The deeper you go, the more rounds of "look at the sequence, then think" the model gets. We will see why this matters when we discuss circuits.
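
Putting the pieces together, one pre-norm block in this style looks roughly like the sketch below. It reuses the multi_head_attention function and the SwiGLUMLP module from earlier; the RMSNorm is a bare-bones version for illustration, not a drop-in for any particular library's:

class RMSNorm(torch.nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # divide by the root-mean-square; no mean-centering, unlike LayerNorm
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * x / rms

class Block(torch.nn.Module):
    def __init__(self, d=4096, num_heads=32):
        super().__init__()
        self.attn_norm = RMSNorm(d)
        self.mlp_norm = RMSNorm(d)
        self.w_q = torch.nn.Parameter(torch.randn(d, d) * 0.02)
        self.w_k = torch.nn.Parameter(torch.randn(d, d) * 0.02)
        self.w_v = torch.nn.Parameter(torch.randn(d, d) * 0.02)
        self.w_o = torch.nn.Parameter(torch.randn(d, d) * 0.02)
        self.mlp = SwiGLUMLP(d)
        self.num_heads = num_heads

    def forward(self, x):
        # read, normalize, attend, add
        x = x + multi_head_attention(self.attn_norm(x), self.w_q, self.w_k,
                                     self.w_v, self.w_o, self.num_heads)
        # read, normalize, MLP, add
        x = x + self.mlp(self.mlp_norm(x))
        return x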

What we have not addressed yet is how the model knows where in the sequence each token sits.

Position information

Look at the attention equation again. If you permute the rows of $X$ — that is, shuffle the tokens — you get the same set of $Q$, $K$, $V$ rows in a permuted order, and the output is the same set of vectors permuted. Attention does not care about order. This is a feature for set-based tasks and a disaster for language, where "the dog bit the man" and "the man bit the dog" are different.

So we have to inject position information somewhere. Three approaches have dominated, in roughly chronological order.

Sinusoidal positional encoding, from "Attention Is All You Need", just adds a fixed function of position to the token embedding:

$$\text{PE}(t, 2i) = \sin(t / 10000^{2i/d}), \quad \text{PE}(t, 2i+1) = \cos(t / 10000^{2i/d})$$

Every position $t$ gets a unique vector built from sines and cosines at geometrically spaced frequencies. The model adds this to the token embedding before the first layer. The trick is that for any fixed offset $k$, $\text{PE}(t+k)$ is a linear function of $\text{PE}(t)$ — specifically a rotation in each frequency pair — so the model can learn to attend to relative positions via dot products of these sinusoids. In practice, this works but extrapolates poorly: a model trained on length 2048 fed length 4096 gets confused, because the relative-position relationships at distances it has never seen are still in-distribution mathematically but out-of-distribution for the learned attention weights.

Learned positional embeddings, used by GPT-2 and BERT, just give every position $t \in [0, T_\text{max})$ its own learned $d$-dimensional vector and add it to the token embedding. Simple. Cannot extrapolate at all: position 2049 of a model trained to 2048 is undefined.

Rotary Position Embedding (RoPE), introduced by Su et al. in 2021 and now used by Llama, Mistral, Qwen, Gemma, and most recent open models, is a different idea. Instead of adding a position vector, it rotates the query and key vectors by an angle that depends on position, before computing the dot product. The key property is that $\langle R_\theta q, R_\phi k \rangle = \langle q, R_{\phi - \theta} k \rangle$: the dot product depends only on the relative rotation $\phi - \theta$, i.e. only on the relative position. So position information shows up exactly where it needs to — in the attention scores — and absolute position does not appear in the residual stream at all.

Mechanically: the $d_k$-dimensional query and key vectors are split into $d_k/2$ pairs, and each pair $(x_{2i}, x_{2i+1})$ is rotated by angle $t \cdot \theta_i$, where $t$ is the token's position and $\theta_i = 10000^{-2i/d_k}$ is a frequency that decreases for higher pair indices. So lower-index pairs rotate fast (capture short-range), higher-index pairs rotate slow (capture long-range). The "rotation in 2D pairs" picture is literal: each pair is a 2-vector, and we apply a 2x2 rotation matrix.

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

Why does RoPE extrapolate better than sinusoidal? Because relative positions are encoded in a structurally exact way — they are geometric rotations, not learned interpolations of an additive signal — and because the model is forced from layer one to think about relative rather than absolute position.

def apply_rope(x, pos):
    # x: (B, h, T, d_k) -- queries or keys
    # pos: (T,) -- integer positions [0, 1, ..., T-1]
    B, h, T, d_k = x.shape
    half = d_k // 2

    # frequencies: theta_i = 10000^(-2i/d_k), one per pair
    freqs = 10000.0 ** (-torch.arange(0, half).float() / half)
    angles = pos[:, None].float() * freqs[None, :]   # (T, half)
    cos = angles.cos()[None, None, :, :]              # broadcast shape
    sin = angles.sin()[None, None, :, :]

    # split into the pair components: even and odd indices
    x_even = x[..., 0::2]   # (B, h, T, half)
    x_odd  = x[..., 1::2]

    # rotate: each (even, odd) pair becomes (even*cos - odd*sin, even*sin + odd*cos)
    rot_even = x_even * cos - x_odd * sin
    rot_odd  = x_even * sin + x_odd * cos

    # interleave back
    out = torch.stack((rot_even, rot_odd), dim=-1).flatten(-2)
    return out

But even RoPE breaks at lengths much longer than training. The frequencies $\theta_i$ were tuned for one length regime; at five times that length, the highest-frequency pairs have rotated through many full revolutions and the model has not seen those rotation-products before. So a family of "context extension" tricks has emerged.

Position Interpolation (PI), from Meta, simply scales positions: instead of feeding position $t$, feed $t / s$ where $s$ is the context-extension factor. This squashes longer sequences back into the trained range. It is a one-line change. It works, but it costs short-range fidelity because the high frequencies are also squashed.
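
In terms of the apply_rope sketch below, Position Interpolation really is that small a change; the scale factor here is an illustrative 4x extension:

def apply_rope_interpolated(x, pos, scale=4.0):
    # squash positions by the extension factor so a 4x-longer sequence lands
    # back inside the position range the model was trained on
    return apply_rope(x, pos.float() / scale)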

NTK-aware scaling modifies the base of the frequency formula instead, so that low-frequency components stretch but high-frequency components are barely touched. This preserves short-range behaviour while extending long-range.

YaRN (Yet another RoPE extensioN) combines NTK-aware scaling with explicit interpolation in the high-frequency regime and an attention-temperature correction. It is the current state of the art for stretching a model from, say, 4k to 128k context after training, with a short fine-tune to recover quality. The mechanics are: rotate fast frequencies as before, slow frequencies more slowly, and rescale the softmax temperature so the attention distribution has the right entropy at the new length.

ALiBi (Attention with Linear Biases), used by MPT and BLOOM, takes a different approach entirely: don't rotate, just add a position-dependent bias to the attention scores. Specifically, for a query at position $i$ and key at position $j$, add $-m_h \cdot (i - j)$ to the score, where $m_h$ is a per-head slope. So distant tokens get an additive penalty proportional to distance. ALiBi extrapolates very well — there is nothing to break at long lengths — but in practice has lost ground to RoPE+YaRN for reasons that are partly empirical and partly that RoPE plays better with the rest of the stack.
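
A sketch of the ALiBi bias construction; the geometric slope recipe follows the paper's power-of-two case, and the exact slopes for other head counts are worth checking against the paper:

def alibi_bias(T, num_heads):
    # per-head slopes: a geometric sequence 2^(-8/h), 2^(-16/h), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads)
                           for i in range(num_heads)])
    pos = torch.arange(T)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0).float()   # (T, T), i - j for j <= i
    # bias = -slope * distance, one (T, T) matrix per head
    return -slopes[:, None, None] * dist[None, :, :]            # (num_heads, T, T)

# usage: scores = scores + alibi_bias(T, num_heads), before the causal mask and softmax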

The geometric picture all of these tricks share: the model learned a particular relationship between position differences and dot-product magnitudes. Long-context tricks deform the position signal so that, at inference time, those same dot-product magnitudes correspond to the new set of position differences. Done right, the deformation is gentle enough that the model's learned circuits still apply.

That covers how the model sees order. Next: what happens when we stack 80 of these blocks.

Stacking and the residual stream

You can think of a transformer with $L$ layers as $L$ rounds of the residual stream getting written to. Each round, every token gets to look around (attention) and look inward (MLP). Information flows in two directions: vertically up the stack at the same token position, and horizontally across positions via attention. Depth matters because some computations require multiple rounds: figuring out what a pronoun refers to may require first identifying the noun phrases, then matching gender/number, then resolving co-reference; you cannot do all of that in one attention pass.

The width-versus-depth tradeoff is real but not symmetric. Wider models (larger $d$) have more representational capacity per layer; deeper models have more rounds of computation. For a fixed parameter budget, the literature has converged on something like $d \propto L^{1.5}$ or so — width scales faster than depth. The reasons are partly mechanical (deep models are harder to train, their gradients are noisier, they require more careful initialization) and partly that many of the tasks we care about benefit more from a richer single-layer representation than from another round of attention.

But "deeper isn't strictly better" is a stronger claim. Past about 100 layers in a typical pretraining recipe, returns diminish sharply. Some of this is optimization difficulty; some is that the residual stream has finite capacity (it's a $d$-dimensional vector) and adding more writers eventually crowds it. There is research on much deeper models with various tricks (DeepNet, GPT-style scaled inits, etc.), but mainstream training recipes cap out in the 80-100 layer range.

What does the residual stream actually contain? This is where it gets interesting. The naive picture is that each of the $d$ dimensions stores some interpretable feature: dimension 412 means "is the current token a verb", dimension 1108 means "is this code", and so on. The reality is superposition: the model packs more features than there are dimensions by storing them in approximately-but-not-exactly orthogonal directions. Two features that almost never co-occur can share a direction; their interference, when both happen, is a small price for representing many more features than $d$. The math, discussed by Elhage et al., is that with $d$ dimensions you can fit far more than $d$ nearly-orthogonal directions (exponentially many, if you tolerate a little interference) as long as the features are sparse. Sparsity is the key: most features are off most of the time, so collisions are rare.

This has a sharp implication for interpretability. If you probe a single neuron — a single dimension of an MLP's hidden state, say — you will usually find it activates on a confusing mix of inputs ("Greek letters, certain function words, the start of code blocks"). That isn't a bug. The neuron is participating in several superposed features, and the mix you observe is the projection. To find clean features, you have to decompose the activations, e.g. with a sparse autoencoder that re-expresses the activation as a sparse linear combination of a learned overcomplete basis. This is the current cleanest theory of why neurons in LLMs look the way they do.

The other unit of interpretation is the circuit: a small subgraph of components (a few attention heads, a few MLP neurons) that together implement a specific algorithm. The canonical example is the induction head, discovered by Anthropic in 2022. An induction head is two attention heads in successive layers that together do "in-context copying": if the sequence contains ... A B ... A, the heads notice the first occurrence of A, look at what came after it (the B), and predict that B is the next token. Layer 1 head sees the current token A and copies the previous-token information forward. Layer 2 head queries for "earlier occurrence of A", finds it, and copies the token after it (B) into the prediction.

This is a concrete circuit you can identify with patching experiments and it appears in essentially every transformer of nontrivial depth. It is also the mechanical substrate of much of in-context learning: if you show a model a few examples of a pattern, induction heads pick it up. Larger models have more induction heads and more elaborate variants (translation heads, format-copying heads, etc.).

  layer L+1:                attend(Q at position i, K at all earlier positions)
                            -> finds "earlier occurrence of token at position i"
                            -> reads the token AFTER that occurrence
  layer L  :  prev-token head copies token[i-1] info into position i's K
              so layer L+1 can match on it

  in-context: ... A B ... C D ... A  ->  predict B (it followed A last time)

Stacking, then, is not just "more compute". It is what enables circuits to span layers, with earlier layers preparing features for later layers to consume. The residual stream is the shared workspace; layers are computational units that read, transform, and write. This view will reappear later when we discuss MoE and inference optimization.

We have a model architecture. Next we need a way to make it good.

The training objective

The training objective is the simplest part of a modern LLM and the one that most people overestimate the cleverness of. It is next-token cross-entropy. Given a sequence of tokens $t_1, t_2, \ldots, t_T$, the model produces, at every position $i$, a probability distribution $p_\theta(\cdot \mid t_1, \ldots, t_i)$ over the vocabulary. The loss is

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{i=1}^{T-1} \log p_\theta(t_{i+1} \mid t_1, \ldots, t_i)$$

That's it. We compute the model's predicted probability of the actual next token and take the negative log. Sum over positions, average. Backprop. Update weights. Repeat for trillions of tokens.

There are equivalent ways to read this loss, and they correspond to different intuitions about what training is doing.

Maximum likelihood: we are choosing $\theta$ to maximize the probability the model assigns to the training data. This is the standard frequentist estimator. With enough data, it converges to the parameters that make the model as close as possible (in KL divergence) to the true distribution of the data.

Compression: by Shannon's source coding theorem, the optimal code length for a symbol of probability $p$ is $-\log p$ bits. So minimizing cross-entropy is exactly minimizing the average number of bits required to encode the next token using the model's distribution. A well-trained LLM is a very good compressor of natural text. This is not a metaphor: you can use an LLM with arithmetic coding to literally compress text, and you get compression ratios that beat gzip by a wide margin. Hutter's bet — that compression is intelligence — is provocative because next-token prediction is exactly compression.

Implicit world modeling: to predict the next token of "The capital of France is", the model has to know the capital of France. To predict the next token of "After he locked the door, he put the key in his", the model has to track that "he" has a key and a door. To predict the next token of a half-finished function definition, the model has to know what the function does and the language's syntax. None of these capabilities are explicitly trained for. They emerge because the training distribution rewards them.

This is the "compression as understanding" view. A model that does not understand causality, time, or three-dimensional space cannot compress text about causality, time, or three-dimensional space as well as one that does. So minimizing cross-entropy is, at sufficient scale, a forcing function for whatever world-model is required to predict text well.

This story has limits. Some things you cannot learn from text — direct sensorimotor grounding, real-time interaction with physical objects — and the model's "understanding" of these is mediated by its understanding of descriptions of these. Some things you can learn from text but only at scale: simple reasoning is in 1B-parameter models, more complex chains are in 7B, multi-step planning starts to work at 70B, and the trend continues. The loss curve hides this.

What does the loss curve actually show? At pretraining time, you see a number that starts somewhere around $\log V \approx 10$ (the model is uniform over the vocab) and falls fast in the first thousand steps to maybe 5 or 6 (it has learned the marginal distribution of tokens — common tokens are common). Then it falls slowly, on a log-log line, for the rest of training. By the end of a serious pretraining run, on something like Llama, the loss is in the range of 1.6-2.0 nats per token (call it 2.3-2.9 bits).
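
The unit conversions are worth keeping at hand; a quick sketch with an illustrative end-of-training loss:

import math

loss_nats = 1.8                             # an illustrative per-token loss
bits_per_token = loss_nats / math.log(2)    # ~2.6 bits
perplexity = math.exp(loss_nats)            # ~6.0: as unsure as a uniform pick over ~6 tokens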

What the curve hides is everything interesting. At a certain point — usually well into training — capabilities snap into existence. The model goes from being unable to do a task at all to doing it reliably, while the overall loss is barely moving. These "emergent" capabilities are not always real — some are artifacts of how we measure them, where a small loss decrease pushes the answer over a discrete threshold (BLEU, accuracy). But some are real: capabilities that require multi-step computation appear when the model is good enough at each step that the chain holds together, and that transition can be sharp.

The other thing the curve hides is the difference between memorization and generalization. A model that has seen Wikipedia twice has memorized a lot of it, and its loss on Wikipedia is correspondingly low. The same model on held-out text from a different domain has a much higher loss. The headline "perplexity" number is a weighted average over a held-out set; what matters in deployment is per-domain loss, and that varies wildly.

import torch.nn.functional as F

def lm_loss(logits, targets, ignore_index=-100):
    # logits: (B, T, V) -- model output
    # targets: (B, T)  -- the next token at each position; -100 for masked
    B, T, V = logits.shape
    # Cross-entropy expects (N, V) and (N,), so we flatten.
    loss = F.cross_entropy(
        logits.reshape(B * T, V),
        targets.reshape(B * T),
        ignore_index=ignore_index,
    )
    return loss

That five-line function is the entire pretraining loss. The $-100$ trick is how you mask out positions you don't want to count (padding, prompt-only positions during instruction tuning). Cross-entropy already normalizes by the number of unmasked positions.

The loss is simple. The optimizer that minimizes it is not.

Optimizers

The optimizer is the algorithm that turns gradients into weight updates. For a function $\mathcal{L}(\theta)$ we want to minimize, the gradient $g = \nabla_\theta \mathcal{L}$ tells us the direction of steepest ascent. Vanilla stochastic gradient descent (SGD) moves in the opposite direction:

$$\theta_{t+1} = \theta_t - \eta \, g_t$$

where $\eta$ is the learning rate. This works for convex problems. It struggles on the loss landscape of a deep network: you bounce off ravine walls, you crawl along flat regions, and you get stuck at saddle points. So we add tricks.

Momentum (Polyak): maintain an exponential moving average of the gradient and step in that direction.

$$m_t = \beta m_{t-1} + g_t, \quad \theta_{t+1} = \theta_t - \eta m_t$$

Momentum smooths out high-frequency noise and accelerates progress along consistent directions. It's the difference between a ball and a marble rolling down a hill: the ball, with momentum, glides through small bumps.

Adam (Kingma & Ba) goes further: maintain two moving averages, one of the gradient (the first moment) and one of the squared gradient (the second moment). Use the second moment to scale the step per-parameter, so parameters with consistently large gradients take small steps and parameters with consistently small gradients take large steps. This automatically adapts the effective learning rate per coordinate.

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat m_t = m_t / (1 - \beta_1^t), \quad \hat v_t = v_t / (1 - \beta_2^t)$$
$$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

Term by term: $g_t$ is the gradient at step $t$; $m_t$ is the running average of gradients; $v_t$ is the running average of squared gradients (elementwise); $\beta_1, \beta_2$ are decay rates (typically $0.9, 0.999$); $\hat m_t, \hat v_t$ are bias-corrected versions, because $m_0 = v_0 = 0$ so the early steps would otherwise be biased toward zero; $\epsilon$ (typically $10^{-8}$) prevents division by zero. The update is the bias-corrected first moment scaled by the inverse square root of the second moment. Roughly: step size = "average direction" / "average magnitude". This is invariant to gradient scale, which is huge in practice.

Adam was the dominant optimizer for years. Then a subtle problem became apparent: when people added L2 weight decay (regularization that pulls weights toward zero), they did it by adding $\lambda \theta$ to the gradient. With Adam, this $\lambda \theta$ term gets divided by $\sqrt{\hat v_t}$, which means the effective decay strength depends on how big the gradients have been. Parameters that have had small gradients (e.g. embeddings of rare tokens) get over-decayed; parameters with big gradients get under-decayed. The fix, AdamW (Loshchilov & Hutter), separates decay from the gradient flow:

$$\theta_{t+1} = \theta_t - \eta \cdot \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t \right)$$

The decay is now applied directly to the parameter, not through the moving averages. This is "decoupled weight decay". Every modern LLM is trained with AdamW. The decay coefficient $\lambda$ is typically 0.1.

def adamw_step(param, grad, m, v, lr, beta1, beta2, eps, weight_decay, t):
    # m, v are exponential moving averages of grad and grad^2.
    # t is the step number (1-indexed).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # bias correction: early in training, m and v are biased toward 0
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # decoupled weight decay: applied directly to params, not via grad
    param.mul_(1 - lr * weight_decay)
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

Lion (Chen et al. 2023, found by symbolic search over optimizer programs) drops the second moment entirely and uses only the sign of a momentum-smoothed gradient:

$$\theta_{t+1} = \theta_t - \eta \cdot \text{sign}(\beta_1 m_{t-1} + (1 - \beta_1) g_t) - \eta \lambda \theta_t$$

Lion uses half the optimizer-state memory of AdamW (one moment instead of two). The sign update means every parameter takes a step of magnitude $\eta$, regardless of gradient size — this is more stable in some regimes and weirder in others. Some labs train with Lion now; AdamW is still the default.
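
For comparison with adamw_step above, a minimal sketch of one Lion update, following the published rule (the momentum is refreshed with a second coefficient after the step):

def lion_step(param, grad, m, lr, beta1, beta2, weight_decay):
    # update direction: the sign of an interpolation between momentum and gradient
    update = (beta1 * m + (1 - beta1) * grad).sign()
    # decoupled weight decay, as in AdamW
    param.mul_(1 - lr * weight_decay)
    param.add_(update, alpha=-lr)
    # the momentum itself is refreshed with the second coefficient
    m.mul_(beta2).add_(grad, alpha=1 - beta2)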

There are two more pieces of the optimizer story that matter for LLMs.

Learning rate schedules. You don't keep the learning rate constant. You start small, warm up linearly over the first few thousand steps to your peak rate, then cosine decay down to about 10% of peak by the end of training. Warmup avoids the early-training instability where gradients are large but the moving averages haven't accumulated yet (so the variance estimate is bad). Cosine decay anneals the learning rate as you approach a minimum. The peak rate for a Llama-scale model is around $3 \times 10^{-4}$, decaying to $3 \times 10^{-5}$.
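
As a sketch, that schedule as a pure function of the step number (the specific step counts here are illustrative, not any particular run's):

import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, total_steps=500_000, min_ratio=0.1):
    if step < warmup_steps:
        # linear warmup from 0 up to the peak rate
        return peak_lr * step / warmup_steps
    # cosine decay from the peak down to min_ratio * peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)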

Gradient clipping. Sometimes a batch produces an enormously large gradient (a particularly hard example, an outlier in the data, a numerical fluke). If you take a normal Adam step on a giant gradient, you can blow out the weights and never recover. So you clip the global gradient norm: if $\|g\| > c$, set $g \gets g \cdot c / \|g\|$. Typical $c$ is 1.0. This costs nothing in the average case and saves the run when an outlier hits.
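
In PyTorch this is one call between the backward pass and the optimizer step; a fragment, assuming a model, optimizer, and computed loss already exist:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()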

The optimizer plus the schedule plus the clipping plus the precision strategy is the recipe. The next piece is the precision.

Numeric precision

Floating-point numbers are tuples (sign, exponent, mantissa). The exponent gives dynamic range — how big or small the number can be. The mantissa gives precision — how many digits within that range. An LLM's weights, activations, and gradients all have to fit somewhere in this representation, and the choice of format dictates both training stability and hardware throughput.

FP32 (IEEE single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits. Range about $10^{-38}$ to $10^{38}$, precision about $7$ decimal digits. This is the historical default for scientific computing. A 7B-parameter model in FP32 is 28 GB — on the edge of what fits on a 40 GB A100, even for inference, once activations are added.

FP16 (IEEE half): 1 + 5 + 10. Range about $6 \times 10^{-5}$ to $6 \times 10^4$. Precision about $3$ decimal digits. Twice the throughput, half the memory. Trouble: the dynamic range is small. Gradients in deep networks routinely underflow ($< 6 \times 10^{-5}$) and become zero, killing training. Also, a single huge activation can overflow.

BF16 (Brain Floating Point, Google): 1 + 8 + 7. Same exponent as FP32, half the mantissa. Range identical to FP32. Precision about $2$ decimal digits — less than FP16. But the gradients don't underflow, and that turns out to matter much more than the precision difference. Hardware support arrived with TPU and then Ampere GPUs. BF16 became the default for training because you can drop FP32 weights in BF16 and they still train, where FP16 needs careful loss scaling.

FP8 (more recent, Hopper and Blackwell GPUs): two flavors, E4M3 (1 + 4 + 3, more precision, less range) and E5M2 (1 + 5 + 2, more range, less precision). Used for forward activations (E4M3) and gradients (E5M2) in cutting-edge training. Halves the memory and roughly doubles throughput vs. BF16 on supported hardware, with some accuracy management.

   FP32 :  S | EEEE EEEE | MMMM MMMM MMMM MMMM MMMM MMM
                 8 exp                   23 mantissa
   FP16 :  S | EEEEE | MMMM MMMM MM
              5 exp     10 mantissa
   BF16 :  S | EEEE EEEE | MMMM MMM
              8 exp        7 mantissa
   FP8(E4M3): S | EEEE | MMM
   FP8(E5M2): S | EEEEE | MM

For training, the practical recipe is mixed precision. You keep a master copy of the weights in FP32 (so small updates aren't lost to rounding), but compute the forward and backward passes in BF16 (or FP16 with loss scaling). At each optimizer step, you apply the BF16 gradient to the FP32 master weights, then cast back down to BF16 for the next step. The reason is subtle: AdamW's update is $\eta \cdot \hat m_t / \sqrt{\hat v_t}$, which can be on the order of $10^{-7}$. A BF16 weight has precision around $10^{-3}$ relative. So the update would be smaller than the representation's precision and would simply round to zero. Keeping the master in FP32 preserves these small updates. The optimizer states ($m$ and $v$) are also FP32 for the same reason.

                            Master weights
                             (FP32, 4 bytes/param)
                                  |
                                  v cast
                            BF16 weights
                             (2 bytes/param)
                                  |
                  forward+backward pass (BF16)
                                  |
                                  v
                            BF16 gradient
                                  |
                                  v cast up to FP32 for the optimizer step
                          AdamW state (FP32 m, v)
                                  |
                                  v
                       updated FP32 master weights

For FP16 specifically, the dynamic range issue requires gradient scaling: multiply the loss by a large factor (say $2^{16}$) before the backward pass, so gradients become large enough not to underflow, then divide the gradients by the same factor before the optimizer step. If gradients overflow (NaN or Inf), skip the step and halve the scale. If gradients have been finite for a while, double the scale. PyTorch's GradScaler does this. With BF16 you don't need it; the dynamic range is fine.
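
A sketch of that FP16 loop with PyTorch's built-in machinery; the model, optimizer, and data loader are assumed, and model(batch) standing in for "forward pass that returns the loss" is a simplification:

scaler = torch.cuda.amp.GradScaler()

for batch in data:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch)                   # forward in FP16
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.unscale_(optimizer)                # bring gradients back to true scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                    # skipped automatically if grads are inf/nan
    scaler.update()                           # grow or shrink the scale factor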

For inference, the question is "what is the smallest format the model still works in?" Pretrained weights in BF16 generally do. FP16 mostly does, with some sensitivity at the long tail. FP8 inference works for many models with a calibration step. INT8 and below take us to quantization, which is a separate world.

Quantization

Pretraining gave us a model in BF16. Each parameter is 2 bytes. A 70B model is 140 GB. That doesn't fit on one consumer GPU (24 GB), or even on a single H100 (80 GB) without splitting. So we want to make each parameter take fewer bytes. Quantization is the answer.

The basic move: replace floating-point weights with integers. Specifically, choose a scale $s$ and a zero-point $z$, and store each weight as an integer $q$ such that

$$w \approx s \cdot (q - z)$$

For symmetric quantization, $z = 0$: the integers are signed and centered on zero, suitable for weights that are roughly symmetric around zero (most of them). For asymmetric quantization, $z$ is chosen so that the integer range covers the actual range of the weights — useful for activations after a ReLU, which are non-negative.

For INT8, $q \in [-128, 127]$ (signed) or $[0, 255]$ (unsigned). For INT4, $q \in [-8, 7]$ or $[0, 15]$. For INT2, four levels. Binary, two levels. Each step down halves storage and doubles arithmetic throughput on hardware that supports it.

The granularity question: what does $s$ apply to? Three options:

  • Per-tensor: one $s$ per weight matrix. Cheapest, lowest accuracy.
  • Per-channel: one $s$ per output channel (per row of the weight matrix). Common for INT8 weights.
  • Per-group: one $s$ per group of consecutive elements (e.g., groups of 64 or 128 inside a row). The standard for INT4 weights.

A 70B model in INT4 with group size 128 is roughly 35 GB instead of 140 GB. The scales themselves take a few percent extra. So a 70B model fits on a single 48GB GPU after 4-bit quantization, with room for activations and the KV cache.

The naive way to quantize is round-to-nearest (RTN): for each weight, pick the closest representable value. For INT8 this barely degrades quality. For INT4 it costs perhaps 1-3 perplexity points on a held-out set. For INT2 it falls off a cliff.

Doing better than RTN requires being smart about which weights to round in which direction, given that some weights matter more than others. This is where the modern algorithms come in.

GPTQ (Frantar et al. 2023): the insight is that quantization error for one weight propagates through subsequent computations, and the propagation is governed by the Hessian of the loss with respect to the layer's weights. Specifically, for one linear layer $y = Wx$ on a calibration set, you have a sample-covariance matrix $H = X X^\top$ that captures how perturbations to $W$ affect the output. The Hessian-aware update rule says: when you quantize one column of $W$, you compute the resulting error, and then adjust the not-yet-quantized columns to compensate. This way, the layer's output stays as close to the original as possible. The math is essentially second-order optimization restricted to quantization grid points. GPTQ runs in a few hours on a calibration set of a few hundred sequences and produces 4-bit weights that are typically within 0.1-0.3 perplexity of the BF16 original on Llama-class models.

AWQ (Lin et al. 2023): the insight is that some channels in the input activations have much larger values than others, and quantizing weights uniformly hurts the channels that matter most. AWQ identifies the "salient" channels (those with the largest activation magnitudes on a calibration set) and applies a per-channel scaling that preserves their precision before quantization, then folds the inverse scale into the next layer. The result is comparable to GPTQ on perplexity, and often faster at inference because the format is simpler.

SmoothQuant (Xiao et al. 2023): this one targets activations, not just weights. The problem with INT8 activations is that some channels have outliers — a few activations are dozens of times larger than the rest, and a single per-tensor scale wastes most of the integer range on representing them. SmoothQuant migrates this difficulty: for each channel, multiply activations by $s^{-1}$ and divide weights by $s$, where $s$ is chosen so the activations and weights have similar dynamic range. The math is the same ($Wx$ unchanged), the quantization error is much smaller. This makes INT8 activation+weight quantization viable.

def rtn_quantize(W, bits=4, group_size=128):
    # W: (out, in) weight matrix. Quantize per group along the input dim.
    out, in_dim = W.shape
    qmax = 2 ** (bits - 1) - 1   # symmetric range, e.g. 7 for INT4
    W_q = torch.zeros_like(W, dtype=torch.int8)
    scales = torch.zeros(out, in_dim // group_size)

    for g in range(in_dim // group_size):
        block = W[:, g * group_size : (g + 1) * group_size]
        # one scale per (output channel, group of input dims)
        s = block.abs().amax(dim=1, keepdim=True) / qmax
        scales[:, g] = s.squeeze(1)
        # round to nearest integer in [-qmax, qmax]
        q = torch.round(block / s).clamp(-qmax, qmax).to(torch.int8)
        W_q[:, g * group_size : (g + 1) * group_size] = q

    return W_q, scales

# at inference: w_approx = q.float() * scales[:, group_index].unsqueeze(1)

What's the cost? On Llama-2-70B: BF16 baseline perplexity on WikiText-2 is around 3.3. INT8 is essentially identical. INT4 with GPTQ or AWQ is 3.4-3.5. INT3 starts to degrade noticeably. INT2 is broken without specialized methods. The exact numbers depend on calibration set, group size, and which evaluation you trust, but the shape is consistent: 4-bit is the current sweet spot for inference; 8-bit is essentially free; below 4-bit needs specialized tricks (like AQLM or 1-bit BitNet) and a willingness to retrain.

There is also quantization-aware training (QAT): instead of quantizing a pretrained model, you simulate quantization during training. The forward pass uses fake-quantized weights (round to grid, then back to float), the backward pass uses straight-through estimators (treat the rounding as identity for gradients). The model learns weights that are robust to quantization. This is more expensive than post-training quantization but can push to lower bit-widths. BitNet's 1.58-bit ternary weights are a recent example: they couldn't be reached by post-training methods, only by QAT.

Quantization compresses the model. The next set of techniques compresses what we have to change during fine-tuning.

Adapters and parameter-efficient fine-tuning

Once you have a 70B-parameter pretrained model, you usually do not want to fine-tune all 70B parameters. The compute cost is high (you need optimizer states, gradients, and activations — roughly 4-6x the model size in memory), the storage cost is high (every fine-tuned variant is another 140 GB), and the regularization is tricky (you can wreck the pretrained capabilities). Parameter-efficient fine-tuning (PEFT) methods change a small fraction of the parameters and leave the rest frozen.

The dominant method is LoRA (Low-Rank Adaptation, Hu et al. 2021). The observation: when you fine-tune a model on a downstream task, the change in weights $\Delta W$ tends to have low effective rank. Hu et al. measured this on several tasks and found that even rank-8 approximations of the full $\Delta W$ recovered most of the performance. So why store the full $\Delta W$? You can parameterize it directly as a product of two low-rank matrices.

For a frozen weight $W \in \mathbb{R}^{d \times d}$, LoRA replaces it with

$$W' = W + \frac{\alpha}{r} B A$$

where $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll d$ is the LoRA rank. Only $A$ and $B$ are trained. $\alpha$ is a scaling hyperparameter that decouples the rank from the effective learning rate — by convention you set $\alpha$ to $r$ or $2r$ and tune from there.

The math: $BA$ has rank at most $r$. So we are constraining the update to an $r$-dimensional subspace of the full update space. The number of trainable parameters per weight matrix drops from $d^2$ to $2dr$. For $d = 4096$ and $r = 16$, that's $2^{24}$ for full fine-tuning vs. $2^{17}$ for LoRA — a 128x reduction. For Llama-7B, full fine-tuning trains 7B parameters; LoRA on the Q and V attention projections at $r=16$ trains about 8M, and about 17M if you adapt all four attention projections.

Why does this work? The intrinsic-rank hypothesis, made precise by Aghajanyan et al. (2020): pretrained language models have a low intrinsic dimension, meaning that the effective dimensionality of the parameter manifold relevant to any given fine-tuning task is much smaller than the full parameter count. Concretely, they fine-tuned BERT-large and RoBERTa-large by training only a low-dimensional random projection back into the full parameter space, and found that they could match full fine-tuning performance with intrinsic dimensions in the hundreds to low thousands — for a model with hundreds of millions of parameters. So most of the model's adaptation capacity is "redundant" for any specific task, and a low-rank update suffices.

class LoRALinear(torch.nn.Module):
    def __init__(self, base_linear, r=16, alpha=32):
        super().__init__()
        self.base = base_linear            # frozen: W in R^{out x in}
        for p in self.base.parameters():
            p.requires_grad = False
        in_f, out_f = base_linear.in_features, base_linear.out_features
        # A is initialized randomly (Gaussian); B is initialized to zero.
        # Why: at step 0, BA = 0, so the model is identical to the base.
        self.A = torch.nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_f, r))
        self.scaling = alpha / r

    def forward(self, x):
        # base output + low-rank delta
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scaling

The initialization choice — $B$ at zero, $A$ Gaussian — matters. It guarantees that at step 0 the LoRA layer is exactly the base layer; the delta starts at zero and is learned from there. This is a soft start: training does not have to first un-learn a random perturbation before making progress.

Where do you put LoRA adapters? The original paper put them on the $W_Q$ and $W_V$ projections only. Practice has expanded this to all linear layers in the attention and MLP blocks. The rank-16 default is conservative; for harder tasks, $r = 64$ or $r = 128$ is common. The adapters together for Llama-7B at $r=16$ on all linear layers come to roughly 40M parameters, about 80 MB in FP16. You can carry around dozens of fine-tuned variants of a base model, swap them in seconds, and serve them on the same underlying weights.

QLoRA (Dettmers et al. 2023): combine 4-bit base weights with FP16 LoRA adapters. The base model is loaded in 4-bit (using a custom 4-bit normal-float "NF4" data type designed for normally-distributed weights), kept frozen, and only the LoRA adapters — which are full-precision and trainable — receive gradients. This brings 70B fine-tuning down to a single 48 GB GPU, where it would have needed multiple GPUs at full precision. The 4-bit weights are dequantized on the fly during forward/backward; the cost is some compute overhead, but the memory savings are dramatic. There is also "double quantization", where the quantization scales themselves are quantized, saving roughly another 0.4 bits per parameter on average. The phrase "QLoRA at 4-bit" in the sentence quoted at the start of this essay refers to exactly this: 4-bit quantized base + LoRA adapters in higher precision.
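
With the Hugging Face stack (transformers + bitsandbytes + peft), the recipe looks roughly like the sketch below. The checkpoint name, rank, and target module list are placeholders, and the argument names are worth double-checking against the library versions you have installed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the QLoRA normal-float data type
    bnb_4bit_use_double_quant=True,        # quantize the quantization scales too
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder checkpoint
    quantization_config=bnb_config,
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)  # only the adapter weights are trainable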

DoRA (Weight-Decomposed LoRA, Liu et al. 2024) refines LoRA by decomposing the weight into magnitude and direction:

$$W = m \cdot \frac{V}{\|V\|_c}$$

where $m$ is a per-column magnitude (a vector) and $V/\|V\|_c$ is a directional unit-norm matrix. DoRA applies LoRA only to the direction and trains the magnitudes separately. Empirically this matches full fine-tuning more closely than LoRA, especially at low ranks, with a small overhead.

A few other PEFT methods, briefly:

  • IA3 (Liu et al. 2022): instead of adding a low-rank delta, scale the activations of the keys, values, and FFN intermediate by learned per-channel vectors. Even fewer parameters than LoRA, sometimes competitive on small tasks.

  • Prefix tuning: prepend a sequence of "virtual" learned tokens to the keys and values of every attention layer. The model itself is frozen; only these prefix vectors are learned. This conditions the model on a learned latent context.

  • Prompt tuning: prepend learned virtual tokens only at the input layer, not at every layer. Much more parameter-efficient than prefix tuning, generally weaker, but very cheap.

When is each the right call? LoRA / QLoRA for almost everything: it's well-understood, well-supported in libraries, and gives close-to-full-fine-tuning quality. DoRA when you want to push quality at low rank. Prefix tuning for very tightly bounded tasks where you want a tiny conditioning vector. Full fine-tuning when you have the compute and the training data is large enough to justify it (say, hundreds of millions of tokens of high-quality task data); below that budget, the regularization from freezing most of the model usually helps.

A LoRA-tuned model can be merged back into the base: just compute $W + (\alpha/r) BA$ and overwrite $W$. After merging, inference cost is identical to the base model. The downside is you lose the ability to swap adapters at runtime. If you want to serve many fine-tunes from the same base, keep them as separate adapters.
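
A sketch of the merge in terms of the LoRALinear module above:

@torch.no_grad()
def merge_lora(lora_layer):
    # fold the low-rank delta into the frozen base weight: W <- W + (alpha/r) * B A
    delta = (lora_layer.B @ lora_layer.A) * lora_layer.scaling   # (out, in)
    lora_layer.base.weight += delta
    # the adapter can now be dropped; inference cost equals the base model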

The next concern is what it costs to actually run the model.

Inference optimization

A trained model is a function from a sequence of tokens to a probability distribution over the next token. To generate text, you sample from that distribution, append the sampled token, and repeat. Each step requires running the entire model. Naively, every generation step recomputes attention over the whole growing prefix, so the cost per step grows quadratically with the current length. We can do much better.

KV cache. The expensive part of attention is computing $QK^\top$, which for sequence length $T$ is $O(T^2)$. But notice: when generating token $T+1$, the keys and values for tokens $1, \ldots, T$ have already been computed and they will not change. The query at position $T+1$ is new; the keys and values at earlier positions are the same as last step. So we cache them.

The KV cache stores, for every layer and every head, the keys and values for every token seen so far. Its size is

$$\text{KV cache size} = 2 \cdot L \cdot h \cdot d_k \cdot T \cdot \text{precision}$$

where $L$ is layer count, $h$ is heads, $d_k$ is per-head dim, $T$ is sequence length, and the 2 is for both K and V. For Llama-2-70B at FP16, counting all $h = 64$ heads as KV heads (GQA, covered below, cuts this by 8x): $L = 80$, $d_k = 128$, so per token per layer it's $2 \cdot 64 \cdot 128 \cdot 2 = 32{,}768$ bytes. Per token across all 80 layers: 2.6 MB. At 4096 tokens of context: 10.7 GB just for one sequence's KV cache. This dominates memory at long context lengths.

With the cache, generation step $T+1$ only needs to compute the new token's $Q$, $K$, $V$, append the new $K$, $V$ to the cache, and run attention with the new $Q$ against the entire cached $K$ (cost $O(T)$, not $O(T^2)$). The total attention cost of generating $n$ tokens after a prompt of length $P$ becomes $\sum_{t=1}^{n} O(P + t) = O(n \cdot (P + n/2))$, i.e. linear in position per token, instead of redoing the full quadratic attention at every step.
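
A single-head, single-layer sketch of the cached decode step (shapes only, no batching; Wq, Wk, Wv stand for the projection matrices):

import torch

def decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    # x_new: (1, d) hidden state of the newest token; k_cache, v_cache: (T, d_k)
    q = x_new @ Wq                                        # (1, d_k)
    k_cache = torch.cat([k_cache, x_new @ Wk], dim=0)     # (T+1, d_k)
    v_cache = torch.cat([v_cache, x_new @ Wv], dim=0)     # (T+1, d_k)
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, T+1): O(T) work, not O(T^2)
    attn = torch.softmax(scores, dim=-1)
    out = attn @ v_cache                                   # (1, d_k)
    return out, k_cache, v_cache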

Grouped Query Attention (GQA) is a KV-cache-aware architectural change. Instead of $h$ key/value heads, you have $h_{kv} < h$, and groups of query heads share K and V. Llama-2-70B has $h = 64$ query heads but $h_{kv} = 8$ KV heads, so the KV cache is 8x smaller than it would be with full multi-head. This is a deliberate tradeoff: a slight quality cost for a large memory cost reduction. Multi-Query Attention (MQA) is the extreme case, $h_{kv} = 1$.
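
A sketch of how a GQA score computation reuses the small KV cache (illustrative; real kernels index the shared heads rather than materializing the expansion):

import torch

def gqa_scores(q, k, h, h_kv):
    # q: (h, T, d_k) query heads; k: (h_kv, T, d_k) -- only h_kv KV heads are ever cached
    group = h // h_kv                                  # e.g. 64 // 8 = 8 query heads per KV head
    k_expanded = k.repeat_interleave(group, dim=0)     # (h, T, d_k), expansion at compute time only
    return q @ k_expanded.transpose(-2, -1) / q.shape[-1] ** 0.5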

FlashAttention (Dao et al. 2022) attacks a different bottleneck. The naive attention computation materializes the full $T \times T$ score matrix in GPU memory, applies softmax, then multiplies by $V$. For $T = 8192$ and 32 heads in FP16, that's 4 GB of intermediate matrix. Worse, the matrix is read and written several times: compute scores (write), apply mask (read+write), softmax (read+write), multiply by $V$ (read). Each of these is bound by GPU memory bandwidth, not compute. Modern GPUs can do many TFLOPs but their HBM bandwidth is limited; attention is memory-bound, not compute-bound, in this naive formulation.

The trick: tile the computation so the score matrix is never fully materialized. Process the queries in blocks of $B_q$ rows and the keys/values in blocks of $B_k$ rows. For each $(B_q, B_k)$ tile, compute the local scores, apply the local softmax, multiply by the local $V$, and accumulate into the output. The catch is that softmax is a global operation: $\text{softmax}(s)_i = \exp(s_i) / \sum_j \exp(s_j)$, so you need the global denominator. FlashAttention solves this with the online softmax trick (Milakov & Gimelshein 2018): maintain a running maximum and a running sum-of-exponentials per query, update them as each new key block arrives, and rescale the partial output accordingly. Mathematically equivalent to standard softmax; numerically stable; and crucially, it never writes the full score matrix to HBM.

The result: attention becomes compute-bound rather than memory-bound. On A100, FlashAttention is 2-4x faster than the naive PyTorch attention at long sequence lengths and uses a tiny fraction of the memory. FlashAttention-2 and -3 added further improvements (better GPU occupancy, reduced non-matmul work, FP8 support). Every modern inference engine uses FlashAttention or one of its descendants.

   naive attention:                  flashattention:

   compute Q*K^T  -> O(T^2) write    for each Q block:
   read it back, softmax              for each K/V block:
   multiply by V  -> O(T^2) read       compute partial scores
   write output                        update running max and sum
                                       multiply by V, accumulate
                                     write output once

   HBM traffic O(T^2 d)              HBM traffic O(T d) (Q,K,V read once)
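
The heart of the method is the online softmax update. A per-query numerical sketch (this is the math, not a real fused kernel):

import torch

def online_softmax_attention(q, K, V, block=128):
    # q: (d_k,); K, V: (T, d_k). Processes K/V in blocks; never stores all T scores at once.
    m = torch.tensor(float("-inf"))      # running max of the scores seen so far
    l = torch.tensor(0.0)                # running sum of exponentials
    acc = torch.zeros(V.shape[-1])       # running unnormalized output
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / q.shape[0] ** 0.5   # local scores for this key block
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)                         # rescale old stats to the new max
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l                        # equals softmax(K q / sqrt(d_k)) @ V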

Paged attention (Kwon et al. 2023, the technique behind vLLM) attacks the memory layout of the KV cache. In a naive implementation, each sequence's KV cache is one contiguous tensor that grows token by token. If you preallocate the maximum context length, you waste memory on short sequences. If you allocate as you go, you fragment memory. With many concurrent sequences of different lengths, both options bleed memory.

The paged-attention insight is borrowed directly from operating systems: page the KV cache. Allocate the cache in fixed-size blocks (e.g., 16 tokens per block per layer), and maintain a per-sequence "page table" that maps logical positions to physical blocks. Now you can pack many sequences efficiently, free blocks when sequences finish, and even share blocks between sequences that have the same prefix (prefix caching). The attention kernel is modified to follow the page table and gather K,V from non-contiguous physical blocks. The throughput gain is large: vLLM reports 2-4x more concurrent sequences on the same GPU, sometimes more, depending on the workload.

  logical KV (sequence A)    page table (A)         physical blocks
  [t0..t15]   block 0  --->  0 -> phys_3            phys_0  [free]
  [t16..t31]  block 1  --->  1 -> phys_7            phys_1  [seq B blk 2]
  [t32..t47]  block 2  --->  2 -> phys_2            phys_2  [seq A blk 2]
                                                    phys_3  [seq A blk 0]
  sequence B has its own page table pointing into  phys_4  [free]
  the same physical pool.                          ...
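
A toy version of the lookup the kernel performs, assuming 16-token blocks (illustrative only; the real kernel follows the page table directly rather than materializing a gathered copy):

import torch

BLOCK = 16

def gather_keys(physical_k, page_table, seq_len):
    # physical_k: (num_physical_blocks, BLOCK, d_k) -- the shared pool
    # page_table: list of physical block ids for one sequence, in logical order
    blocks = physical_k[torch.tensor(page_table)]              # (num_logical_blocks, BLOCK, d_k)
    return blocks.reshape(-1, physical_k.shape[-1])[:seq_len]  # (seq_len, d_k)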

Continuous batching (also vLLM and friends): traditional batching processes a batch of sequences from start to finish. If sequences in the batch have different output lengths, the long ones hold up the batch. Continuous batching lets sequences enter and leave the batch at every step. As a sequence finishes, its slot is freed and a new sequence (or a new prefill) can take it. Combined with paged attention, this keeps the GPU saturated.

Prefix caching: many requests share a common prefix — a system prompt, a few-shot template, a long context document. The KV cache for that prefix only needs to be computed once and can be reused across requests. With paged attention, you can literally share the physical blocks between requests; with naive caching, you cache the prefix's KV by hash and reload it. This turns repeated long prompts from a per-request cost into an amortized cost.

Speculative decoding attacks generation latency. Generation is fundamentally serial: you need token $t$ to compute token $t+1$. But what if you could guess the next several tokens cheaply, then verify them in parallel? That's the idea. Use a small "draft" model to autoregressively generate $k$ candidate tokens, then run the big "target" model once on the prefix-plus-candidates and check, in parallel, whether each candidate matches what the target would have sampled. Tokens that match are accepted; the first mismatch and everything after it is rejected. The math gives you exact equivalence to sampling from the target distribution if you do the acceptance probabilities right (Leviathan et al. 2023).

The win: one forward pass through the target model produces multiple accepted tokens. If acceptance rate is 60%, you get on average 1.6 tokens per target call instead of 1. The draft model has to be fast and reasonably aligned with the target. Common patterns: a small (e.g., 1B) model drafting for a 70B target, or an even smaller "n-gram lookup" drafter, or self-speculation (Medusa heads, EAGLE) where the same model has extra prediction heads that produce drafts.

The right way to think of speculative decoding: it trades off some extra compute (drafting + verification) for reduced latency. It's worth it whenever you're memory-bandwidth-bound on the target model's forward pass — which is almost always, at small batch sizes.
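
The acceptance rule in its exact-sampling form, as a per-position sketch (real implementations do the verification inside a single batched target forward; the bonus token drawn when every draft is accepted is omitted here):

import torch

def verify(draft_tokens, p_draft, p_target):
    # draft_tokens: (k,) candidate ids; p_draft, p_target: (k, V) per-position distributions
    accepted = []
    for i, tok in enumerate(draft_tokens):
        ratio = p_target[i, tok] / p_draft[i, tok]
        if torch.rand(()) < ratio:                       # accept with prob min(1, ratio)
            accepted.append(int(tok))
        else:
            # resample from the residual max(0, p_target - p_draft), renormalized
            residual = (p_target[i] - p_draft[i]).clamp(min=0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return accepted    # distributed exactly as if sampled from the target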

Putting these together, a modern inference stack — vLLM, TGI, TensorRT-LLM, SGLang — combines: KV cache + paged memory + continuous batching + prefix sharing + FlashAttention-style kernels + (optionally) speculative decoding + quantized weights. The result is dramatically better than running raw model.generate() on a Hugging Face transformer.

Optimizing inference is one way to do more with a given model. The next is to make a smaller model that punches above its weight.

Distillation

Knowledge distillation, in Hinton et al.'s 2015 framing, is training a small "student" model to match the output distribution of a larger "teacher" model, rather than the original hard labels. The student learns from a richer signal: instead of "the next token is 'Paris'", it learns "the next token is 'Paris' with probability 0.71, 'France' with 0.04, 'the' with 0.03, ...". Those secondary probabilities encode the teacher's uncertainty and similarity structure — which words it considers near-synonyms, which alternative phrasings it considered, which contexts it found ambiguous.

The math: let $p_T$ be the teacher's probability distribution and $p_S$ the student's, both over the vocabulary. The distillation loss is

$$\mathcal{L}_\text{distill} = \tau^2 \cdot \text{KL}\!\left( p_T^\tau \,\|\, p_S^\tau \right)$$

where $\tau$ is the temperature, and $p^\tau$ denotes the distribution obtained from logits $z$ by softmaxing $z/\tau$. Temperature flattens the distribution: at $\tau = 1$ you get the model's normal distribution; at $\tau = 4$, the high-probability outcomes are pulled down and the low-probability outcomes are pulled up, so the secondary structure becomes more visible. The $\tau^2$ factor compensates for the gradient scaling that comes with temperature.

Often the loss is a mix:

$$\mathcal{L} = \alpha \mathcal{L}_\text{distill} + (1 - \alpha) \mathcal{L}_\text{hard}$$

where $\mathcal{L}_\text{hard}$ is the standard cross-entropy on the true next-token labels. The distillation term provides the "dark knowledge" — the secondary structure — and the hard-label term keeps the student honest on cases where the teacher might be wrong.

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, tau=2.0, alpha=0.5):
    # student_logits, teacher_logits: (B, T, V); targets: (B, T) ground-truth next tokens
    # KL divergence between temperature-softened distributions
    s_log = F.log_softmax(student_logits / tau, dim=-1)
    t_prob = F.softmax(teacher_logits / tau, dim=-1)
    # flatten to (B*T, V) so 'batchmean' averages per token, not per batch row
    kl = F.kl_div(s_log.flatten(0, 1), t_prob.flatten(0, 1),
                  reduction='batchmean') * (tau * tau)
    # hard cross-entropy on the true next tokens
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    return alpha * kl + (1 - alpha) * ce

Why does this work? Two ways to think about it.

The information-theoretic view: a one-hot label conveys at most $\log V$ bits per example, and when several tokens are nearly equally good it arbitrarily picks one and tells the student the others are wrong. The teacher's full distribution conveys much more per example, and because the teacher is a good approximation of the data distribution, those extra bits carry real information about the data: they tell the student that the near-ties really are near-ties. The student trained on the teacher learns the right structure.

The function-class view: large models in a wide function class have rich generalization properties; small models in a narrower class have less expressive power. A small model trained on hard labels has to fit the data on its own, which often forces it into a "spiky" approximation. A small model trained on the teacher's distribution can imitate a smoother function — the teacher's. The smoothness propagates.

In practice, distillation can match teacher quality at a fraction of the size on narrow tasks and gets you most of the way on broad ones. DistilBERT recovered ~95% of BERT-base on GLUE at 60% of the parameters. TinyBERT and MiniLM pushed further. For generative LLMs, distilled models have become a major release pattern: Gemma, Phi, and many open-weight smaller models include a distillation phase after standard pretraining.

There are layer-level variants — match the student's hidden states or attention maps to the teacher's, not just the output logits. These add more gradient signal but require dimension alignment and architectural assumptions. Output distillation is the most general form and often the most effective.

A subtle point: distillation is also implicitly happening every time you fine-tune a small model on the outputs of a larger one — what people call "synthetic data" training. Llama-2-7B fine-tuned on GPT-4 outputs is, mechanically, being distilled from GPT-4. The student gets to see only the sampled outputs (one realization per input), not the full distribution, so it's a noisier form of distillation. But the effect is the same: the student inherits structure from the teacher.

Distillation reduces parameter count by training a smaller model. The next technique reduces parameter use without reducing parameter count.

Mixture of Experts

A standard transformer's MLP layer applies the same two matrices to every token. A Mixture of Experts (MoE) layer has many copies of the MLP — call them experts — and a small gating network that, for each token, picks which experts to use. Most tokens use only $k$ experts (commonly $k = 1$ or $k = 2$) out of $E$ experts (commonly $E = 8$, $E = 64$, or much higher). Only those experts' parameters are active for that token. The other experts contribute nothing to the forward pass.

The win: total parameter count is large (good for representational capacity), but per-token compute is small (only the active experts run). Mixtral-8x7B, for example, has 47B total parameters but only ~13B active per token. Llama-4's MoE variants and DeepSeek-V3 push this further: hundreds of billions of total parameters, but tens of billions active.

The gating network is typically a single linear layer that produces a logit per expert: $g(x) = W_g x \in \mathbb{R}^E$. You take the top-$k$ experts by logit and softmax their logits to get the mix:

$$y = \sum_{e \in \text{TopK}(g(x))} \frac{\exp(g(x)_e)}{\sum_{e' \in \text{TopK}(g(x))} \exp(g(x)_{e'})} \cdot \text{Expert}_e(x)$$

The output is a weighted sum of the chosen experts' outputs. The other experts contribute zero.
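
A minimal top-$k$ routing forward pass (a per-token Python loop for clarity; production kernels group tokens by expert instead). The gate logits are returned so the load-balance loss described below can use them:

import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d, experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d, len(experts), bias=False)    # W_g
        self.experts = nn.ModuleList(experts)                  # each expert: an MLP from d to d

    def forward(self, x):                                      # x: (tokens, d)
        logits = self.gate(x)                                  # (tokens, E)
        weights, idx = logits.topk(self.k, dim=-1)             # top-k logits per token
        weights = torch.softmax(weights, dim=-1)               # normalize over the chosen experts
        outs = []
        for t in range(x.shape[0]):
            y = sum(weights[t, j] * self.experts[int(idx[t, j])](x[t]) for j in range(self.k))
            outs.append(y)
        return torch.stack(outs), logits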

The first failure mode is expert collapse: the gating network learns to send most tokens to a few favorite experts, and the others atrophy. This is bad for capacity (you trained $E$ experts but use 2 of them) and bad for load balance (the favorite experts get overloaded; the rest sit idle). The fix is an auxiliary load-balance loss: encourage uniform expert usage across a batch.

A common form (Shazeer et al., refined by Switch Transformer):

$$\mathcal{L}_\text{aux} = \alpha \cdot E \cdot \sum_{e=1}^{E} f_e \cdot p_e$$

where $f_e$ is the fraction of tokens routed to expert $e$ in this batch, $p_e$ is the average gating probability assigned to expert $e$, and $\alpha$ is a small coefficient (typically 0.01). The sum $\sum_e f_e \cdot p_e$ is minimized when routing is uniform, so the loss pushes toward uniformity. It is added to the main loss during training.
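
Computed from the same gate logits and routing decisions, a sketch of the Switch-style balance term (with top-$k$ routing, $f_e$ here counts routing slots rather than tokens, a detail that varies between implementations):

import torch

def load_balance_loss(logits, idx, num_experts, alpha=0.01):
    # logits: (tokens, E) gate logits; idx: (tokens, k) chosen expert ids
    probs = torch.softmax(logits, dim=-1)
    p = probs.mean(dim=0)                              # (E,) mean gate probability per expert
    f = torch.zeros(num_experts)
    f.scatter_add_(0, idx.flatten(), torch.ones(idx.numel()))
    f = f / idx.numel()                                # fraction of routing slots per expert
    return alpha * num_experts * (f * p).sum()         # gradient flows through p, not f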

                       gating network g(x)
                              |
                              v
                    [logit_1, logit_2, ..., logit_E]
                              |
                       top-k selection
                              |
          +---------+---------+---------+---------+
          |         |                             |
          v         v                             v
       Expert_2  Expert_5  (others not run)   Expert_7  ...
          |         |                             |
          +---------+--- weighted sum ------------+
                              |
                              v
                            output

A second failure mode is token dropping: in batched training, each expert can only handle so many tokens (capacity constraint), and excess tokens get dropped. Switch Transformer's solution is per-expert capacity: $C = \text{capacity\_factor} \cdot \text{tokens} / E$. If more tokens want to go to expert $e$ than $C$, the overflow gets dropped (or routed to a fallback). Capacity factor of 1.0 means strictly balanced; $\geq 1.25$ leaves slack.

A third, more recent issue is the memory bandwidth angle. MoE wins on compute: you only run $k/E$ of the parameters per token. But you still have to load the chosen experts into the compute units, and at small batch sizes that's the bottleneck. So MoE inference is great for large batch sizes (where many tokens share the experts you load) and weaker at batch size 1, where you load 8x more parameter memory for one token's worth of compute. For server deployments with many concurrent requests, MoE is a clear win. For local-machine inference at batch size 1, dense models often have better tokens-per-second per parameter loaded.

The dispatch logic — taking a batch of tokens, routing each to its experts, running the experts, gathering outputs back into the original token order — is non-trivial to make efficient on GPUs. Frameworks like Megablocks (Gale et al.) reformulate it as a block-sparse matrix multiplication, which is GPU-friendly. Modern MoE training and inference are not trivial software but the algorithmic ideas are clean.

A note on counting parameters: when you read "Mixtral-8x7B", "8x7B" does not mean 56B. It means 8 experts each of which is roughly the size of a 7B model's MLP, plus a single shared attention stack. The total count is around 47B (the attention is shared and not multiplied). The active count per token is around 13B. Always look at active parameters when comparing inference cost; total parameters when comparing memory and capacity.

A model is not just the architecture and the next-token loss. Once you have a pretrained base, you have to make it act like an assistant.

Alignment stack

A pretrained model is a next-token predictor. Trained on the open internet, it predicts what would plausibly come next. That includes plausibly coming next in a flame war, in a 4chan post, in incomplete advice, in a context where the user wanted help and the corpus contained a confused stranger. The model's behavior is a weighted average over the contexts it was trained on. To make it behave as an "assistant" — patient, polite, helpful, refusing what it should refuse — you align it. This is a separate, much smaller phase than pretraining, and it's where the model's surface persona gets installed.

There are three main pieces, in roughly chronological order.

Instruction tuning is supervised fine-tuning on a dataset of (instruction, response) pairs. You collect — by hand, by scraping, by generating with another model, or some combination — examples like:

Instruction: Summarize the following article in three sentences. <article>
Response: <good summary>

You then fine-tune the base model on these pairs with the same next-token loss as pretraining, but you only compute the loss on the response tokens (not on the instruction). The model learns the format: "an instruction looks like this; a response looks like that". Datasets like OpenAssistant, FLAN, and WizardLM are public examples. Quality matters far more than quantity here — a few thousand high-quality examples can take you a long way. Llama-2's "Llama-2-Chat" started from a few tens of thousands of carefully curated SFT examples.
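
"Loss only on the response" is usually implemented by masking the instruction positions out of the labels. A sketch for a single example (-100 is PyTorch's default ignore_index for cross-entropy):

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # logits: (T, V); input_ids: (T,); prompt_len: number of instruction tokens at the front
    labels = input_ids.clone()
    labels[:prompt_len] = -100                    # instruction tokens contribute no loss
    # shift by one: the logit at position t predicts the token at position t+1
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)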

RLHF (Reinforcement Learning from Human Feedback) addresses what SFT cannot: getting the model to prefer one answer over another when both are plausible. You collect pairwise preferences — humans see two model outputs for the same prompt and pick which they prefer. From these, you train a reward model $r_\phi(x, y)$: a neural net that scores how preferred a response $y$ is for a prompt $x$. Then you optimize the policy (the LLM) to maximize the reward, with a KL penalty against the SFT reference to prevent drifting too far:

$$\mathcal{L}_\text{RLHF}(\theta) = -\mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)}[r_\phi(x, y)] + \beta \cdot \text{KL}(\pi_\theta \| \pi_\text{ref})$$

The policy gradient for this is high-variance, so you use PPO (Proximal Policy Optimization, Schulman et al. 2017). PPO uses a clipped surrogate objective that prevents the policy from changing too much in one step, which keeps training stable. The full RLHF pipeline has three components running together: the policy being trained, the reward model, and the reference (SFT) model held fixed. Memory-wise this is brutal — three model copies, plus a critic in some implementations. Computationally it's slow.

DPO (Direct Preference Optimization, Rafailov et al. 2023) is the math-friendly alternative that has largely replaced PPO in open-weight model recipes. The key insight: under the standard RLHF formulation, the optimal policy has a closed-form relation to the reward and the reference policy:

$$\pi^*(y|x) \propto \pi_\text{ref}(y|x) \exp(r(x, y) / \beta)$$

Inverting this gives the reward as a function of the policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + Z(x)$$

where $Z(x)$ is a partition function that depends only on $x$. So if you take a preference pair $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$, the partition function cancels in the difference, and the Bradley-Terry preference probability becomes:

$$P(y_w \succ y_l | x) = \sigma\!\left( \beta \log \frac{\pi(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_\text{ref}(y_l|x)} \right)$$

where $\sigma$ is the sigmoid. So you can directly maximize the log-likelihood of human preferences as a function of the policy, without training a separate reward model and without doing RL. The loss is a clean one-line formula and you only need two model copies (policy + reference) plus the preference dataset. DPO trains in hours where PPO trains in days, and the results are competitive (often better, sometimes worse, mostly within noise on standard benchmarks). Many open-weight chat models — Zephyr, Tulu, Llama-3-Instruct in part — use DPO or close variants.
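
The resulting loss, written over summed per-token log-probabilities of each response under the policy and the reference (a sketch; batching and log-prob extraction are left out):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # each argument: (batch,) sum of log pi(response | prompt) over the response tokens
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()           # maximize P(y_w preferred) under Bradley-Terry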

There are further variants: KTO (Kahneman-Tversky Optimization, which needs only binary good/bad labels rather than pairwise preferences), IPO (identity preference optimization, which regularizes against DPO's tendency to overfit the preference data), ORPO (which combines SFT and preference learning into one loss), and GRPO (group relative policy optimization, used by DeepSeek for math reasoning). The space is moving fast. The shared idea — turn preferences into a tractable supervised signal — has clearly displaced classical PPO for most open-source work.

That maps the alignment terms. The model that comes out of "pretrain → SFT → DPO" is what most people call an instruction-tuned LLM. Many other things happen in production — safety classifiers, prompt injection defenses, tool use training, retrieval augmentation — but those are systems built around the model, not modifications to it.

We've covered the parts. The last section is the seams that real systems care about and tutorials usually skip.

The often-missed details that bite real systems

What follows are the operational facts that turn a model from "works in a notebook" into "works in production". These are not architectural innovations; they are the boundaries where assumptions leak and where benchmark points get won or lost.

Tokenization edge cases. A few percent of LLM benchmark differences come down to how the prompt was tokenized. Examples that bite:

  • Leading whitespace. " The" is one token; "The" is another. Many tokenizers default to adding a leading space at sequence start; many don't. If your training set has the leading space and your evaluation prompt doesn't, you've shifted the distribution.
  • BOS/EOS tokens. Some tokenizers prepend a beginning-of-sequence token; some don't. Same for end-of-sequence at training time. If your fine-tuning code adds a BOS that your inference code doesn't, the model is being prompted in a state it never saw at training.
  • Numbers. GPT-2's tokenizer split numbers inconsistently: "1234" might become "12", "34" or "1", "234" depending on context. This is part of why early LLMs were bad at arithmetic: the tokenizer obscured digit-level structure. Newer tokenizers handle digits more predictably (Llama-2 and Gemma split numbers into individual digits; Llama-3 chunks them into groups of up to three), which makes arithmetic more learnable.
  • Code. Tabs vs four spaces tokenize completely differently. Multilingual code (Greek variable names) hits the byte-level fallback and produces token sequences a human would not predict.
  • Special chat tokens. <|im_start|>, [INST], <|begin_of_text|> — every chat-tuned model has its own format. Mixing formats between training and inference means you're probing the model with an unfamiliar prompt template, and quality drops by points on benchmarks.

The fix is discipline: use the model's official chat template, never roll your own, and check the tokenization of your prompt explicitly when something feels off.
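
A quick habit that catches most of these problems (the model id below is a placeholder; apply_chat_template is the Hugging Face mechanism for a model's official format):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")   # placeholder model id
messages = [{"role": "user", "content": "Summarize this article in three sentences."}]
ids = tok.apply_chat_template(messages, add_generation_prompt=True)
print(tok.convert_ids_to_tokens(ids))              # inspect exactly what the model will see
print(tok.tokenize(" The"), tok.tokenize("The"))   # leading whitespace changes the tokens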

Attention sinks. Xiao et al. (2023) noticed something strange: in long-context scenarios, if you naively dropped the oldest tokens to keep the KV cache bounded (a "sliding window"), the model's quality fell off a cliff. Investigation showed that the model was using the very first few tokens as attention sinks — destinations for "I have nothing meaningful to attend to right now" attention mass. The softmax forces attention weights to sum to 1, even when nothing in the sequence is relevant; the model learned to dump excess attention onto the first tokens, which then carry a kind of bias signal. Drop them, and the model has nowhere to put that excess mass; it ends up incorrectly attending to the most recent few tokens, garbage in, garbage out.

The fix is StreamingLLM: when you slide the window, keep the first few tokens permanently, alongside the recent window. With those four "sink" tokens preserved, the model can keep generating coherently for arbitrarily long. It is a tiny change, and it makes the difference between stable generation and collapse once the window starts sliding. There are also softmax-replacement proposals — softmax with a learnable bias, or "softmax-1" (an extra 1 in the denominator) — that explicitly let the attention weights sum to less than 1 so the model doesn't need a sink.

Gradient checkpointing. During training, the backward pass needs the activations from the forward pass. Storing all activations costs $O(L \cdot T \cdot d)$ memory. For Llama-7B at 4096 tokens this is several GB per layer, dozens of GB across the stack. Gradient checkpointing (Chen et al. 2016) trades compute for memory: only save activations at a few "checkpoint" layers, and recompute the others during the backward pass. With checkpointing every $\sqrt{L}$ layers, memory drops from $O(L)$ to $O(\sqrt{L})$ at the cost of ~33% extra forward compute. Standard in modern training.
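
In PyTorch this is a thin wrapper around each block (a sketch; use_reentrant=False is the currently recommended mode of torch.utils.checkpoint):

import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # blocks: a list of transformer layers. Activations inside each block are not stored;
    # they are recomputed during the backward pass from the block's checkpointed input.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x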

FSDP / ZeRO sharding. A 70B-parameter model in BF16 weights is 140 GB. Plus gradients (another 140 GB). Plus AdamW state: the FP32 master weights and the FP32 $m$ and $v$ moments are 12 bytes per parameter, roughly 840 GB. That is over a terabyte of state to hold during training, and you also need activations and a KV cache. No single GPU has that. So we shard.

ZeRO (Zero Redundancy Optimizer, Rajbhandari et al. 2019) defines three stages of progressively more aggressive sharding:

  • Stage 1: Shard the optimizer state. Each rank holds $1/N$ of the AdamW $m$ and $v$. Saves the most memory of any single change because optimizer state is the largest piece. Each rank still holds full weights and full gradients.
  • Stage 2: Also shard gradients. Each rank holds $1/N$ of the gradients: the backward pass reduce-scatters them, so each rank ends up with only its own shard rather than a full copy.
  • Stage 3: Also shard weights. Each rank only holds $1/N$ of the weights. Before each layer's forward, the full weights are gathered from all ranks; after the forward, the gathered copy is freed. Same for the backward pass. This is most memory-efficient, most communication-heavy.

FSDP (Fully Sharded Data Parallel, the PyTorch implementation of stage 3) is the dominant approach for training large open-weight models. Combined with tensor parallelism (sharding individual matrix multiplies across GPUs) and pipeline parallelism (assigning different layers to different GPUs), you can train arbitrarily large models on arbitrarily many GPUs, paying a communication cost. The sweet spot is workload-dependent and a sub-discipline of its own.

torch.compile and operator fusion. PyTorch's eager mode runs each operation as a separate kernel call. A residual block does: matmul, bias add, gelu, matmul, bias add, dropout, residual add. That's seven separate kernel launches and seven HBM read/write rounds. Operator fusion combines them into one kernel, which reads the input once, does all the math in registers, and writes the output once. The speedup can be 1.5-3x on activation-bound layers. torch.compile (released in PyTorch 2.0) traces the model and produces fused kernels via TorchInductor. Triton-based custom kernels go further. Inference engines like vLLM and TensorRT-LLM ship with hand-fused or compiler-fused kernels for every block of the standard transformer.

Why batched inference is so much cheaper per token than single-sequence inference. A forward pass through an LLM does many matmuls, each of which loads weights from HBM into compute units. The weight matrices are big — billions of bytes. The activations are small — kilobytes per sequence. At batch size 1, you load gigabytes of weights to do kilobytes of compute; you are memory-bandwidth-bound. At batch size 32, you do 32x more compute against the same weight loads; you start to saturate the compute units and the per-token cost drops. Eventually you hit the compute ceiling, and further gains come only from adding more hardware.

The implication: serving one user one token at a time is wasteful. Continuous batching (mentioned earlier) is the trick that lets you serve many concurrent users and amortize the weight loads. This is why hosted APIs are cheaper per token than running the same model locally for one user — and why GPU utilization per user is so much higher in production than in a notebook.

Calibration: why LLMs are confidently wrong. A pretrained LLM's output probabilities are calibrated in a specific sense: they are good estimates of next-token frequencies in the training distribution. If the model says token X has probability 0.7, then in contexts like this in the training data, X really did follow about 70% of the time. This is a property of next-token cross-entropy: minimizing it makes the predicted distribution close to the empirical one.

But after instruction tuning and RLHF, this calibration breaks. The model is trained to give confident answers, regardless of whether the underlying belief is well-supported. RLHF in particular pushes the model toward producing high-reward outputs, and "I'm not sure" tends to score lower than a confident-sounding response in human preference data. So the post-RLHF model is miscalibrated: it commits with probability 0.95 whether the question is "what is the capital of France" (where the confidence is justified) or "who was the lead chemist on the Manhattan Project" (where its answer may well be wrong, but it commits anyway).

OpenAI's GPT-4 system card, and similar reports from other labs, document this: the base model is well-calibrated by next-token loss; calibration degrades through instruction tuning and degrades further through RLHF. The fix is methodologically hard. Verbalized confidence (asking the model to say how sure it is) helps a bit. Probability-based abstention (refuse to answer when entropy is high) helps when calibration is intact. Tool use, retrieval augmentation, and verification chains push the work outside the model entirely. None of these fully solves the underlying issue: a fluent answer and a true answer are different things, and the loss the model was finally optimized on rewards fluency more directly than it rewards truth.

This is the sentence to keep when you finish reading. The model is a function from token sequences to token distributions, made of attention and MLPs, trained to compress text. Every choice in the stack — embedding dimension, RoPE base, AdamW betas, BF16, INT4, LoRA rank, FlashAttention tiling, paged blocks, MoE routing, DPO beta — is some engineer adjusting a knob in this function. Now you know which knob, and what it does.

SIGN-OFF: the machine has no secrets, only choices

signed

— the resident

written in the dark, recovered in the morning