# How Large Language Models (LLMs) Work
## Introduction
This is an end-to-end tutorial on how LLMs work. I'm writing it as a way to understand AI more deeply, so expect it to evolve over time as I learn.
## Part 1 - Pre-training: Turning internet text into a base model
Large language models start as internet-document simulators. During pretraining the model reads a massive text corpus and learns to predict the next token one step at a time. That simple game creates a surprisingly rich world model.
### What actually happens

- We gather a huge text dataset (books, code, web pages, forums).
- We tokenize the text into integers.
- A Transformer reads tokens and predicts the next one.
- We measure error with a loss function.
- We adjust the weights to reduce that loss. Repeat a few trillion times.
```mermaid
flowchart TD
    A["Raw text corpus<br/>(books, web, code, papers)"] --> B["Tokenizer<br/>(text → token IDs)"]
    B --> C["Shuffled training sequences"]
    C --> D["Transformer model<br/>(many layers of attention + MLP)"]
    D --> E["Next-token probabilities"]
    E --> F["Loss<br/>(cross-entropy)"]
    F --> G["Backprop + optimizer<br/>(e.g., Adam)"]
    G --> D
    D --> H["Base model<br/>#quot;internet document simulator#quot;"]
```
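Conceptually, the whole loop in the diagram fits in a few dozen lines of code. Below is a minimal sketch, assuming PyTorch; `TinyTransformer`, the layer sizes, and the random token batches are illustrative stand-ins, not a real pretraining setup.

```python
# Minimal sketch of the pretraining loop in the diagram above (PyTorch assumed).
# The model and the random "data" are toy stand-ins for illustration only.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)   # logits over the vocabulary

    def forward(self, tokens):
        # causal mask: each position may only attend to earlier positions
        n = tokens.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

model = TinyTransformer()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # stand-in for real tokenized documents: random token IDs
    batch = torch.randint(0, vocab_size, (8, seq_len + 1))
    inputs, targets = batch[:, :-1], batch[:, 1:]        # targets are shifted by one
    logits = model(inputs)                               # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()        # backprop
    optimizer.step()       # weight update toward lower loss
```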
### Zoom in on the prediction game
At each position the model sees a prefix like "The capital of France is" and must guess the next token.
The only “objective” is to be a better guesser than yesterday. That pressure forces the model to internalize syntax, facts, styles, formulas, code patterns, and a lot of world regularities.
```mermaid
sequenceDiagram
    participant User as Input tokens
    participant Model as Transformer
    participant Out as Predicted next token
    User->>Model: context: "The capital of France is"
    Model->>Out: {Paris: 0.86, Lyon: 0.05, ...}
    Note over Model,Out: Loss compares prediction vs truth, then weights update
```
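To make the loss step concrete, here is a toy calculation using the made-up probabilities from the diagram: the per-position loss is just the negative log of the probability the model assigned to the true next token.

```python
# Toy illustration of the loss step above: -log(probability of the true token).
# The probabilities are invented to match the "Paris: 0.86" example.
import math

predicted = {"Paris": 0.86, "Lyon": 0.05, "Nice": 0.03, "the": 0.06}

print(f"loss if the truth is Paris: {-math.log(predicted['Paris']):.3f}")  # ≈ 0.151, small
print(f"loss if the truth were Lyon: {-math.log(predicted['Lyon']):.3f}")  # ≈ 2.996, large
```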
### Key ingredients
| Thing | What it is | Why it matters |
|---|---|---|
| Tokens | Subword pieces like `Par`, `is` | Short pieces compress the vocabulary and can handle any text |
| Context window | Max tokens visible at once | Longer windows let models track long reasoning and documents |
| Attention | Mechanism to mix info across positions | Lets the model focus on relevant parts of the prefix |
| Cross-entropy loss | Penalty for wrong next-token guesses | Training signal that drives learning |
| Scale | Data, parameters, compute | More of each usually improves capability smoothly |
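To see tokens in action, the snippet below runs a real BPE tokenizer, assuming the `tiktoken` library is installed (`pip install tiktoken`); any subword tokenizer would illustrate the same idea.

```python
# Turn text into integer token IDs and back, using tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("The capital of France is Paris")
print(ids)                             # a short list of integer token IDs
print([enc.decode([i]) for i in ids])  # each ID maps back to a small piece of text
```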
### What the model learns (and doesn’t)

- Learns: language patterns, common facts, coding idioms, math tricks seen in data, multi-step text patterns.
- Doesn’t inherently learn: your private data, current events after the training cutoff, or how to follow your instructions nicely. That comes later in fine-tuning.
Quick intuition: if you can finish a sentence with high confidence, the pretrained model probably can too.
### Limitations right out of pretraining

- It will happily complete anything, including bad instructions.
- It may confabulate when the training data was thin or ambiguous.
- It has no built-in notion of “helpfulness” or “truth.” It only knows the statistics of text.
### Minimal math
We predict a distribution \(p_\theta(x_t \mid x_{<t})\) over the next token \(x_t\). Training minimizes the average cross-entropy:

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$
Lower loss means better next-token guesses across the corpus.
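As a tiny worked example of the formula, suppose the model assigned these made-up probabilities to the true next token at four positions; the loss is the average negative log of those numbers.

```python
# Average cross-entropy over four positions, using made-up probabilities
# that the model assigned to the *correct* next token at each position.
import math

p_true = [0.86, 0.40, 0.02, 0.95]
loss = -sum(math.log(p) for p in p_true) / len(p_true)
print(f"average cross-entropy: {loss:.2f} nats")  # ≈ 1.26; the 0.02 dominates the loss
```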
### Mental model to keep
Think of pretraining as teaching the model to autocomplete the internet. It is powerful, but raw. Next we’ll make it act like a helpful assistant.
## Part 2 — Supervised fine-tuning (SFT): from raw autocomplete to helpful assistant
Pretraining gives us a powerful autocomplete. SFT teaches the model to follow instructions, use a polite tone, and do tasks the way we want.
### What SFT is
We take a base model and train it on pairs of
input → ideal assistant output
The model learns how to respond, not just what comes next on the internet.
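To make that concrete, here is a sketch of what a single SFT example might look like and how the loss is typically restricted to the assistant's reply. The chat format and the helper function are illustrative assumptions, not a specific library's API.

```python
# One illustrative SFT example: an instruction paired with an ideal response.
example = {
    "messages": [
        {"role": "user", "content": "Summarize photosynthesis in one sentence."},
        {"role": "assistant", "content": "Plants use sunlight to turn CO2 and water into sugars and oxygen."},
    ]
}

# Training is still next-token prediction, but the loss is usually masked so
# only the assistant's tokens are scored (hypothetical helper; token IDs assumed).
def build_targets(prompt_ids: list[int], reply_ids: list[int], ignore_index: int = -100):
    input_ids = prompt_ids + reply_ids
    labels = [ignore_index] * len(prompt_ids) + reply_ids  # prompt positions ignored by the loss
    return input_ids, labels
```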