# How Large Language Models (LLMs) Work
## Introduction
This is an end-to-end tutorial on how LLMs work. I'm writing it as a way to understand AI more deeply, so expect it to evolve over time as I learn.
## Part 1 - Pre-training: Turning internet text into a base model
Large language models start as internet-document simulators. During pretraining the model reads a massive text corpus and learns to predict the next token one step at a time. That simple game creates a surprisingly rich world model.
### What actually happens

- We gather a huge text dataset (books, code, web pages, forums).
- We tokenize the text into integers.
- A Transformer reads tokens and predicts the next one.
- We measure error with a loss function.
- We adjust the weights to reduce that loss. Repeat a few trillion times.
```mermaid
flowchart TD
    A["Raw text corpus<br/>(books, web, code, papers)"] --> B["Tokenizer<br/>(text → token IDs)"]
    B --> C["Shuffled training sequences"]
    C --> D["Transformer model<br/>(many layers of attention + MLP)"]
    D --> E["Next-token probabilities"]
    E --> F["Loss<br/>(cross-entropy)"]
    F --> G["Backprop + optimizer<br/>(e.g., Adam)"]
    G --> D
    D --> H["Base model<br/>#quot;internet document simulator#quot;"]
```
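Conceptually, the whole loop in the diagram fits in a few dozen lines of code. Below is a minimal sketch, assuming PyTorch; `TinyTransformer`, the layer sizes, and the random token batches are illustrative stand-ins, not a real pretraining setup.

```python
# Minimal sketch of the pretraining loop in the diagram above (PyTorch assumed).
# The model and the random "data" are toy stand-ins for illustration only.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)   # logits over the vocabulary

    def forward(self, tokens):
        # causal mask: each position may only attend to earlier positions
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

model = TinyTransformer()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # stand-in for real tokenized documents: random token IDs
    batch = torch.randint(0, vocab_size, (8, seq_len + 1))
    inputs, targets = batch[:, :-1], batch[:, 1:]        # targets are shifted by one
    logits = model(inputs)                               # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()        # backprop
    optimizer.step()       # weight update toward lower loss
```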
### Zoom in on the prediction game
At each position the model sees a prefix like "The capital of France is" and must guess the next token.
The only “objective” is to be a better guesser than yesterday. That pressure forces the model to internalize syntax, facts, styles, formulas, code patterns, and a lot of world regularities.
```mermaid
sequenceDiagram
    participant User as Input tokens
    participant Model as Transformer
    participant Out as Predicted next token
    User->>Model: context: "The capital of France is"
    Model->>Out: {Paris: 0.86, Lyon: 0.05, ...}
    Note over Model,Out: Loss compares prediction vs truth, then weights update
```
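To make the loss step concrete, here is a toy calculation using the made-up probabilities from the diagram: the per-position loss is just the negative log of the probability the model assigned to the true next token.

```python
# Toy illustration of the loss step above: -log(probability of the true token).
# The probabilities are invented to match the "Paris: 0.86" example.
import math

predicted = {"Paris": 0.86, "Lyon": 0.05, "Nice": 0.03, "the": 0.06}

print(f"loss if the truth is Paris: {-math.log(predicted['Paris']):.3f}")  # ≈ 0.151, small
print(f"loss if the truth were Lyon: {-math.log(predicted['Lyon']):.3f}")  # ≈ 2.996, large
```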
### Key ingredients
| Thing | What it is | Why it matters |
|---|---|---|
| Tokens | Subword pieces like `Par`, `is` | Short pieces compress the vocabulary and can handle any text |
| Context window | Max tokens visible at once | Longer windows let models track long reasoning and documents |
| Attention | Mechanism to mix info across positions | Lets the model focus on relevant parts of the prefix |
| Cross-entropy loss | Penalty for wrong next-token guesses | Training signal that drives learning |
| Scale | Data, parameters, compute | More of each usually improves capability smoothly |
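To see tokens in action, the snippet below runs a real BPE tokenizer, assuming the `tiktoken` library is installed (`pip install tiktoken`); any subword tokenizer would illustrate the same idea.

```python
# Turn text into integer token IDs and back, using tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("The capital of France is Paris")
print(ids)                             # a short list of integer token IDs
print([enc.decode([i]) for i in ids])  # each ID maps back to a small piece of text
```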
### What the model learns (and doesn’t)

- Learns: language patterns, common facts, coding idioms, math tricks seen in data, multi-step text patterns.
- Doesn’t inherently learn: your private data, current events after the training cutoff, or how to follow your instructions nicely. That comes later in fine-tuning.
Quick intuition: if you can finish a sentence with high confidence, the pretrained model probably can too.
### Limitations right out of pretraining

- It will happily complete anything, including bad instructions.
- It may confabulate when the training data was thin or ambiguous.
- It has no built-in notion of “helpfulness” or “truth.” It only knows the statistics of text.
### Minimal math
We predict a distribution \(p_\theta(x_t \mid x_{<t})\) over the next token \(x_t\). Training minimizes the average cross-entropy:

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$
Lower loss means better next-token guesses across the corpus.
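As a tiny worked example of the formula, suppose the model assigned these made-up probabilities to the true next token at four positions; the loss is the average negative log of those numbers.

```python
# Average cross-entropy over four positions, using made-up probabilities
# that the model assigned to the *correct* next token at each position.
import math

p_true = [0.86, 0.40, 0.02, 0.95]
loss = -sum(math.log(p) for p in p_true) / len(p_true)
print(f"average cross-entropy: {loss:.2f} nats")  # ≈ 1.26; the 0.02 dominates the loss
```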
### Mental model to keep
Think of pretraining as teaching the model to autocomplete the internet. It is powerful, but raw. Next we’ll make it act like a helpful assistant.
## Part 2 — Supervised fine-tuning (SFT): from raw autocomplete to helpful assistant
Pretraining gives us a powerful autocomplete. SFT teaches the model to follow instructions, use a polite tone, and do tasks the way we want.
### What SFT is
We take a base model and train it on pairs of
input → ideal assistant output
The model learns how to respond, not just what comes next on the internet.
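To make that concrete, here is a sketch of what a single SFT example might look like and how the loss is typically restricted to the assistant's reply. The chat format and the helper function are illustrative assumptions, not a specific library's API.

```python
# One illustrative SFT example: an instruction paired with an ideal response.
example = {
    "messages": [
        {"role": "user", "content": "Summarize photosynthesis in one sentence."},
        {"role": "assistant", "content": "Plants use sunlight to turn CO2 and water into sugars and oxygen."},
    ]
}

# Training is still next-token prediction, but the loss is usually masked so
# only the assistant's tokens are scored (hypothetical helper; token IDs assumed).
def build_targets(prompt_ids: list[int], reply_ids: list[int], ignore_index: int = -100):
    input_ids = prompt_ids + reply_ids
    labels = [ignore_index] * len(prompt_ids) + reply_ids  # prompt positions ignored by the loss
    return input_ids, labels
```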