Based on Andrej Karpathy's video introduction to LLMs like ChatGPT. This guide walks through the entire pipeline of how these systems are built and how to think about what they are.
The goal of this guide is to give you mental models for thinking through what LLMs are. They're magical and impressive in some respects, very good at certain things, not good at others, and have a number of sharp edges to be aware of. We'll go through the entire pipeline of how these systems are built, covering cognitive and psychological implications along the way.
There are three major sequential stages to building a system like ChatGPT:
- Pre-training — acquiring knowledge from the internet
- Supervised Fine-Tuning (SFT) — shaping an assistant personality
- Reinforcement Learning (RL) — perfecting reasoning and problem-solving
The first step is to collect a massive amount of text from publicly available internet sources. A useful reference for understanding what this looks like in practice is HuggingFace's FineWeb dataset, which is similar to what major LLM providers like OpenAI, Anthropic, and Google use internally.
What are we trying to get?
- A huge quantity of high-quality documents
- A large diversity of documents, so the model has broad knowledge
Achieving both is complicated and takes multiple filtering stages.
The FineWeb dataset, for example, ends up being about 44 terabytes of disk space — large but not impossibly so (it could almost fit on a single hard drive). The internet itself is vastly larger, but aggressive filtering brings it down to this.
The starting point: Common Crawl
Most of the data comes from Common Crawl, an organization that has been scouring the internet since 2007. As of 2024, Common Crawl has indexed 2.7 billion web pages. The process works by starting from a few seed web pages, following all links, and recursively indexing everything encountered.
Filtering pipeline:
Raw Common Crawl data goes through many stages of filtering:
- URL Filtering — Blocklists of domains you don't want: malware sites, spam, marketing sites, racist sites, adult sites, etc. These are eliminated entirely.
- Text Extraction — Web pages are stored as raw HTML (with markup, CSS, JavaScript, navigation menus, etc.). All of that has to be stripped away to extract just the clean, readable text.
- Language Filtering — A language classifier guesses the language of each page. FineWeb, for example, keeps only pages that are more than 65% English. This is a design decision: companies that want multilingual models will keep more non-English content, while models that filter out Spanish heavily will be correspondingly worse at Spanish.
- Deduplication — Duplicate or near-duplicate documents are removed.
- PII Removal — Personally identifiable information (addresses, Social Security numbers, etc.) is detected and filtered out.
At the end of this pipeline, you have something like FineWeb: ~44TB of clean text — essentially web pages filtered down to their meaningful content. Examples include an article about tornadoes in 2012, a medical article about adrenal glands, and thousands of other diverse documents.
What does this data look like?
If you took the first 200 web pages and concatenated all their text together, you'd get a massive tapestry of raw text — a giant texture of characters. This text has patterns in it, and the goal of the next steps is to train neural networks to internalize and model how this text flows.
Before text can be fed into a neural network, we need to decide how to represent it. Neural networks expect a one-dimensional sequence of symbols from a finite set. So we need to:
- Define what the symbols (tokens) are
- Represent our text as a sequence of those tokens
From raw bytes to tokens:
Text on a computer is stored as binary (bits: 0s and 1s). We could represent text as a sequence of bits, but that creates an extremely long sequence using only two symbols. Instead, we can group bits into bytes (8 bits = 256 possible combinations), which gives us a shorter sequence with 256 possible symbols.
The sequence length is a precious, finite resource in neural networks — we want to trade off symbol count for sequence length: more symbols = shorter sequences.
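This byte-level view is easy to see directly in Python (a quick illustration, not part of any training pipeline):

```python
# UTF-8 stores text as a sequence of bytes: integers in the range 0-255.
text = "Hello world"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(byte_ids))  # 11 symbols, each between 0 and 255

# Non-ASCII characters take multiple bytes, which is one reason
# byte sequences get long quickly.
print(list("é".encode("utf-8")))  # [195, 169]
```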
Byte-Pair Encoding (BPE):
In practice, we go further than bytes using an algorithm called Byte-Pair Encoding. The process:
- Start with sequences of bytes (256 possible symbols)
- Find pairs of consecutive bytes that appear very frequently (e.g., the pair 116, 32 appears often)
- Mint a new token (symbol ID 256) representing that pair
- Replace every instance of that pair with the new token
- Repeat — each iteration mints a new token and further compresses the sequence
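The merge loop above can be sketched in a few lines of Python. This is a toy illustration of the algorithm, not a production tokenizer (real tokenizers like GPT-4's also handle regex pre-splitting and store the merges for reuse):

```python
from collections import Counter

def bpe_train(ids, num_merges):
    """Toy BPE: repeatedly replace the most frequent adjacent pair
    of symbols with a newly minted token ID (256, 257, ...)."""
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merges[pair] = next_id
        # Replace every occurrence of the pair with the new token.
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

ids = list("the cat sat on the mat".encode("utf-8"))
compressed, merges = bpe_train(ids, num_merges=5)
print(len(ids), "->", len(compressed))  # shorter sequence, bigger vocabulary
```

Each merge trades vocabulary size for sequence length, which is exactly the trade-off described above.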
In practice, a vocabulary size of around 100,000 tokens works well. GPT-4 uses 100,277 tokens.
Tokenization in action:
You can explore tokenization at tiktokenizer.vercel.app. Using the cl100k_base tokenizer (GPT-4's base tokenizer):
- "Hello world" → 2 tokens: Hello (ID 15339) and  world (ID 11917)
- "Hello  world" (two spaces) → 3 tokens (the extra space is its own token)
- "HELLO world" → different tokens (tokenization is case-sensitive)
Important: tokenization is not intuitive. A line of text that looks simple to you might be 62 tokens. The FineWeb dataset is described not just in bytes (44TB) but also as approximately 15 trillion tokens.
Key insight: All those token IDs are just unique identifiers. Think of them like emoji — they represent little chunks of text, and the numbers themselves are meaningless except as IDs.
Now we have a 15-trillion-token sequence. The neural network's job is to model the statistical relationships of how these tokens follow each other.
The training loop:
- Sample a window of tokens from the dataset (e.g., up to 8,000 tokens — the "context length")
- Feed those tokens into the neural network as input (the "context")
- The network outputs a probability distribution over all ~100,000 tokens — a prediction for what token comes next
- Compare that prediction to the actual next token in the dataset (the "label")
- Compute how wrong the prediction was (the loss)
- Update the network's parameters slightly so the correct token gets a higher probability
- Repeat billions of times across the entire dataset
In the very beginning, the network is randomly initialized, so its predictions are random. Through iteration, the parameters get adjusted until the predictions match the statistical patterns of the training data.
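The shape of this loop can be sketched with a deliberately tiny stand-in: a count-based bigram model instead of a neural network, so "update the parameters" becomes "increment a counter." Everything here (the data, the single-token context) is a toy, but the structure — context in, next-token distribution out, compared against the label — is the same:

```python
from collections import defaultdict

# Toy stand-in for the training loop: a bigram count model.
# A real LLM uses a long context window and gradient updates;
# here the "context" is one token and the "update" is a count.
counts = defaultdict(lambda: defaultdict(int))

data = list("hello hello hello")  # tokens are single characters here
for context, label in zip(data, data[1:]):
    counts[context][label] += 1   # learn: this token followed that one

def next_token_probs(context):
    """Probability distribution over the next token, given one context token."""
    following = counts[context]
    total = sum(following.values())
    return {tok: c / total for tok, c in following.items()}

print(next_token_probs("l"))  # {'l': 0.5, 'o': 0.5}
```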
Visualizing the network (inputs and outputs):
- Input: A sequence of tokens (0 to ~8,000 tokens)
- Output: one probability for each possible next token (100,277 values with GPT-4's vocabulary)
- These probabilities indicate how likely each token is to come next
For example, given a short sequence of context tokens, the network might assign " Direction" a 4% probability, token 11799 a 2% probability, and " post" (token 3962) a 3% probability. If the correct next token in the data is " post", we nudge the network to make " post" slightly more probable next time.
What's inside these networks?
Parameters (weights): Modern neural networks have billions of parameters — numerical "knobs." In the beginning, they're randomly set. Training adjusts them so the network's predictions match the training data. Think of parameters like knobs on a DJ set: as you twiddle them, you get different predictions.
The mathematical expression: The neural network is a giant mathematical function that takes input tokens and parameters, mixes them together using operations like multiplication, addition, and exponentiation, and produces output probabilities. The specific structure of this mathematical expression is the subject of neural network architecture research.
The Transformer architecture: Modern LLMs use an architecture called the Transformer. Inside a Transformer:
- Input tokens are first "embedded" — each token gets a vector representation
- Information flows through a series of "attention" blocks and "MLP" blocks
- Each block applies simple mathematical transformations (layer norms, matrix multiplications, softmax operations)
- Eventually, output probabilities are produced
A key clarification: these "neurons" are extremely simple compared to biological neurons. Biological neurons are complex dynamical processes with memory. These are stateless mathematical operations — no memory within a single forward pass, just inputs mixed with parameters to produce outputs.
Once a network is trained, we can generate new text from it. This is called inference.
How inference works:
- Start with some prefix tokens (e.g., token 91)
- Feed them into the network → get a probability distribution
- Sample from that distribution — flip a weighted coin biased by those probabilities
- Append the sampled token to the sequence
- Feed the longer sequence back in → get the next probability distribution
- Repeat
This sampling process is stochastic — you're flipping coins. For the same input, you'll get different outputs each time. Sometimes you'll get tokens that happen to form sequences identical to something in the training data; usually you get remixes and variations that have similar statistical properties to the training data but aren't verbatim copies.
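The sampling loop above can be sketched with Python's random.choices as the "weighted coin." The model here is a stub (a fixed lookup table keyed on the last token only); a real network returns ~100,000 probabilities computed from the whole context:

```python
import random

def fake_model(tokens):
    """Stub standing in for the trained network: returns candidate
    next tokens and their probabilities, based on the last token only."""
    table = {
        "the": (["cat", "dog", "sky"], [0.5, 0.3, 0.2]),
        "cat": (["sat", "ran"], [0.7, 0.3]),
        "dog": (["ran"], [1.0]),
        "sky": (["is"], [1.0]),
    }
    return table.get(tokens[-1], (["<end>"], [1.0]))

tokens = ["the"]
while tokens[-1] != "<end>" and len(tokens) < 6:
    candidates, probs = fake_model(tokens)                       # distribution
    tokens.append(random.choices(candidates, weights=probs)[0])  # weighted coin flip
print(tokens)  # different runs give different continuations
```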
Training vs. inference: In training, you're repeatedly adjusting the model's parameters. Once training is done, those parameters are fixed. When you're chatting with a model on ChatGPT, the model was trained months ago; all that's happening now is inference — the model's fixed weights are being used to complete token sequences.
OpenAI's GPT-2 (published 2019) was the first time a recognizably modern LLM stack came together. Its specs:
- 1.6 billion parameters
- Maximum context length: 1,024 tokens (modern models: ~100,000–1,000,000)
- Trained on ~100 billion tokens (modern datasets: ~15 trillion)
The cost to train GPT-2 in 2019 was approximately $40,000. Today it could be reproduced for around $100, due to better data quality, better hardware (GPUs), and better software.
What training looks like in practice:
Each line of a training run represents one update to the model, improving predictions across ~1 million tokens at once. The key number to watch is the loss — a single number measuring how well the network is currently predicting. Low loss is good. You watch the loss decrease over time as the network improves.
Every 20 steps or so, you run inference to see what the model generates. Early on (step 1–20), it produces completely random gibberish. At 1% through training, it produces text with some local coherence but still largely nonsensical. By the end (32,000 steps, ~33 billion tokens processed), it generates coherent English.
The hardware: This training runs on GPUs in the cloud. A commonly used setup is 8× NVIDIA H100 GPUs in a single node, rented for around $3/GPU/hour. Multiple nodes can be combined into entire data centers. The reason Nvidia's stock has exploded is precisely this: every major AI company needs massive quantities of GPUs to run these token prediction loops at scale.
The output of pre-training is a base model: a neural network with a fixed set of parameters that acts as a token sequence simulator. It doesn't answer questions — it continues token sequences based on the statistical patterns of internet text. It essentially "dreams" internet pages.
Trying a base model (Llama 3.1 405B):
Meta's Llama 3.1 405B (405 billion parameters, trained on 15 trillion tokens) was released publicly. Hosted at hyperbolic.xyz, you can try the base model directly.
Key observations when using a base model:
- It is not an assistant. If you type "What is 2+2?", it doesn't say "4." It treats those tokens as a prefix and autocompletes — perhaps going off into philosophical territory, or repeating variations of the question.
- It is stochastic. The same prompt gives different outputs every time.
- It can recite training data. Paste the opening of a Wikipedia article (e.g., for "zebra"), and the model will often recite the rest of that article nearly verbatim, because it's been trained on that page multiple times and has essentially memorized it. This is called regurgitation — generally undesirable.
- It hallucinates "parallel universe" completions. Ask it to complete a sentence about the 2024 election (which occurred after its training cutoff), and it will generate plausible-but-wrong completions — different running mates, different opponents, etc. Each sample produces a different fictional history.
Using a base model for practical tasks (few-shot prompting):
Even without fine-tuning, base models can be used cleverly via few-shot prompts — providing several examples of the input-output pattern you want, so the model continues the pattern:
English: house
Korean: 집
English: book
Korean: 책
...
English: teacher
Korean: [model completes here]
The model has in-context learning abilities: as it reads the context, it learns the pattern and continues it. This is how apps can be built on top of base models.
You can even simulate an assistant by crafting a prompt that looks like a conversation between a helpful AI and a human, and the base model will continue the conversation in that format.
Pre-training gives us an internet document simulator. What we actually want is an assistant — something we can ask questions and get answers from. That requires post-training.
The base model was trained on internet documents. For SFT, we swap out the dataset for a dataset of conversations — examples of the ideal interaction between a human and an assistant. We continue training the model on this new dataset.
This is computationally much cheaper than pre-training:
- Pre-training: ~3 months on thousands of computers, millions of dollars
- SFT: ~3 hours, much smaller dataset
The algorithm is identical — we're still predicting the next token. Only the data changes.
A conversation dataset might contain entries like:
Human: What is 2+2?
Assistant: 2+2 is 4.
More broadly, such a dataset covers many situations:
- A human asks "What is 2+2?" and the assistant responds "2+2 is 4"
- A human asks for something the model should decline, and the assistant refuses politely
- The assistant maintains a consistent, helpful personality throughout
These conversations are created by human labelers hired by companies. The labelers follow labeling instructions — often hundreds of pages written by the company — describing what good responses look like: helpful, truthful, harmless.
Key insight: When you use ChatGPT, what comes back is a statistical simulation of a human data labeler following OpenAI's instructions. Not a magical AI. A skilled person's response pattern, learned statistically.
Since ~2022-2023, LLMs themselves help generate training conversations. Labelers use existing models to draft responses and then edit them. Datasets like UltraChat contain millions of mostly-synthetic conversations. The core is still human judgment (prompts, instructions, evaluation), but LLMs dramatically accelerate the process.
Conversations must become token sequences. GPT-4 uses a protocol with special tokens:
- <|im_start|> — start of a turn
- user or assistant — whose turn it is
- <|im_sep|> — separator
- The actual message
- <|im_end|> — end of turn
A two-turn conversation becomes about 49 tokens — one flat, one-dimensional sequence, just like any other token stream. The model then applies all the same training and inference machinery.
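Rendering a conversation into this flat format is straightforward string assembly. A minimal sketch (in real systems the special markers are single token IDs in the vocabulary, not literal strings the user could type):

```python
def render_conversation(turns):
    """Flatten (role, message) pairs into the special-token protocol above."""
    parts = []
    for role, message in turns:
        parts.append(f"<|im_start|>{role}<|im_sep|>{message}<|im_end|>")
    return "".join(parts)

text = render_conversation([
    ("user", "What is 2+2?"),
    ("assistant", "2+2 is 4."),
])
print(text)
```

The result is one flat, one-dimensional token stream, which is why the same training and inference machinery applies unchanged.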
During inference on ChatGPT: your message is wrapped in this format, the sequence is sent to the model, and the model generates the next tokens — the assistant's response.
Hallucinations — what they are: LLMs confidently make up facts, invent citations, and describe people who don't exist.
Why they happen: In SFT training data, "Who is Tom Cruise?" is answered confidently with correct information. The model learns the style of confident answers. When you ask about someone it's barely seen, it doesn't say "I don't know" — it generates a statistically plausible answer in that confident style, which happens to be fabricated.
Mitigation 1: Teaching the model to say "I don't know"
- Take random documents from the training set
- Use an LLM to generate factual questions about each document
- Ask the model those questions multiple times
- If the model consistently gets the answer wrong (checked by an LLM judge), it doesn't know
- Add a training example: question → "I'm sorry, I don't know"
This creates the association between internal uncertainty (which likely exists as a neuron state) and actually saying "I don't know" — something that wasn't previously wired up.
Mitigation 2: Tool use (web search)
The model can emit special tokens like <search>query</search>. When the inference code sees the closing token, it:
- Pauses generation
- Runs the actual web search
- Pastes results into the context window
- Resumes generation
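A toy inference driver shows the pause-and-resume mechanic. The model step, the token names, and the search function here are all stand-ins; real systems use dedicated special token IDs and a real search backend:

```python
def run_with_tools(model_step, prompt, search_fn, max_tokens=100):
    """Toy driver: when the model emits </search>, pause generation,
    run the real search, paste results into the context, resume."""
    context = prompt
    for _ in range(max_tokens):
        token = model_step(context)
        context += token
        if token == "</search>":
            # Extract the query the model wrote between the search tokens.
            query = context.rsplit("<search>", 1)[1].removesuffix("</search>")
            context += f"<result>{search_fn(query)}</result>"
        if token == "<end>":
            break
    return context

# Scripted "model" that decides to search, then answers from the results.
script = iter(["<search>", "weather today", "</search>", " It is sunny.", "<end>"])
out = run_with_tools(
    lambda ctx: next(script),
    "Q: weather? ",
    search_fn=lambda q: f"[search hits for: {q}]",
)
print(out)
```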
The context window is the model's working memory. Data there is directly accessible — unlike vague parameter-stored recollections.
Analogy:
- Knowledge in parameters = something you read a month ago (vague, unreliable for details)
- Knowledge in the context window = something you just looked up (directly accessible, reliable)
This is why pasting a document directly into your prompt almost always gives better results than asking the model to recall it from memory.
There is a fixed, finite amount of computation per token. You cannot expect a model to do complex work in a single token.
Math example: "Emily buys 3 apples and 2 oranges. Each orange is $2. Total is $13. What does each apple cost?"
Bad assistant response:
The answer is $3. [then justifications follow]
The answer is produced in a single token — before any computation. Everything after is post-hoc rationalization.
Good assistant response:
Cost of 2 oranges = 2 × $2 = $4. Remaining for apples = $13 - $4 = $9. Per apple = $9 ÷ 3 = $3.
Each step is simple. Computation is distributed across tokens. By the time the model reaches the answer, all intermediate results are in context.
ChatGPT's verbose step-by-step answers aren't just for your benefit — they're the model thinking. Without them, the model is more likely to fail.
Practical rule: If the numbers get bigger (23 apples and 177 oranges instead of 3 and 2), the single-token approach fails immediately. Use code interpreter for anything numerical.
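This is what "use code" means in practice: the worked example above, done by Python instead of token-by-token mental arithmetic, is correct regardless of how big the numbers get:

```python
# Emily buys 3 apples and 2 oranges. Each orange is $2. Total is $13.
total = 13
orange_price, num_oranges = 2, 2
num_apples = 3
apple_price = (total - orange_price * num_oranges) / num_apples
print(apple_price)  # 3.0
```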
Counting fails for the same reason: tallying all elements in a single forward pass is too much work per token. If you paste 177 dots and ask "how many?", the model will likely get it wrong.
Fix: ask it to use code. The model copy-pastes the content into Python and calls .count(). Python does the counting — not the model's mental arithmetic.
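When the model writes this code instead of counting in its head, the result is exact:

```python
dots = "." * 177        # stand-in for whatever the user pasted
print(dots.count("."))  # 177, exact no matter how many dots there are
```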
Models don't see characters — they see tokens (chunks of characters). The word "ubiquitous" is 3 tokens. The model cannot easily index into individual letters. This causes failures on:
- "Print every third character of 'ubiquitous'"
- "How many R's are in 'strawberry'?" ← went viral; models long insisted there were 2 R's (there are 3)
The strawberry problem combines two weaknesses: not seeing characters + poor counting.
Fix: always ask the model to use code for character-level tasks.
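Both viral examples are trivial once Python sees the actual characters (taking "every third character" to mean positions 0, 3, 6, 9):

```python
word = "ubiquitous"
print(word[::3])                # every third character: 'uqts'
print("strawberry".count("r"))  # 3
```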
Models can solve graduate-level physics but fail at "Is 9.11 bigger than 9.9?" This is jagged intelligence — extraordinary capability with unexpected random holes.
One documented explanation for the 9.11/9.9 failure: neurons associated with Bible verse notation activate (where 9:11 does come after 9:9 in that system), interfering with the arithmetic.
Mental model: Swiss cheese — excellent across vast areas, but with random unpredictable holes. Check their work. Use them as tools, not oracles.
Models don't have persistent identity. Every conversation starts from scratch — no memory, no continuous existence. Asking "who built you?" without explicit programming yields hallucinated answers.
Older models from non-OpenAI companies would say "I was built by OpenAI" — not because they were, but because the internet has massive amounts of ChatGPT responses, making that the statistical best guess.
How companies fix this:
- Hardcoded training examples — add ~240 conversations where "what are you?" gets the correct answer
- System messages — a hidden message at conversation start reminds the model of its identity
Learning from textbooks has three phases:
- Exposition → building background knowledge → equivalent to pre-training
- Worked examples → imitating expert solutions → equivalent to SFT
- Practice problems → discovering solutions yourself, with only the final answer given → equivalent to RL
Practice problems are critical because they force you to discover solution strategies through trial and error, not just imitate.
Human labelers writing ideal SFT responses can't know which token sequences are actually easiest for the model. Human cognition and LLM cognition are different. What seems like a natural solution step to a human might be a massive leap for the model — or what seems elaborate might be trivial.
The model needs to discover the token sequences that work for it, not blindly imitate human solutions.
- Take a prompt with a known correct answer (e.g., a math problem where answer = $3)
- Sample many independent solutions from the model (perhaps 1,000)
- Score each: did it reach $3? (Check the boxed answer against the key)
- Encourage solutions that worked — train on them; discard failures
- Repeat across thousands of diverse prompts
The model plays in a playground, tries many things, discovers what works. No human decides which solution is "best" — the verifiable answer determines it automatically.
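One round of this loop can be sketched with stubs standing in for the expensive parts. The sampler here is a random guesser and the checker a string match; in reality sampling means running the LLM and "encourage" means a gradient update on the winning token sequences:

```python
import random

def rl_round(sample_solution, check_answer, prompt, n_samples=1000):
    """Sample many independent solutions, keep the ones whose final
    answer verifies against the known key. (Both callables are
    stand-ins for the model and the answer checker.)"""
    solutions = [sample_solution(prompt) for _ in range(n_samples)]
    winners = [s for s in solutions if check_answer(s)]
    return winners  # in real RL, these become the next training batch

# Toy demo: "solutions" are random guesses; the answer key is $3.
winners = rl_round(
    sample_solution=lambda p: f"... so each apple costs ${random.randint(1, 5)}",
    check_answer=lambda s: s.endswith("$3"),
    prompt="Emily buys 3 apples and 2 oranges...",
)
print(len(winners), "of 1000 solutions verified")
```

Note that no human judgment appears anywhere in the loop; the verifiable answer does all the selecting.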
DeepSeek-R1's key finding: as RL trains on math problems, solutions get longer and more accurate. The model discovers that it's more accurate when it:
- Tries multiple approaches
- Backtracks when something seems off ("Wait, let me re-evaluate")
- Checks results from different angles
- Uses exploratory language: "Let me try setting up an equation instead"
These chains of thought emerge from RL without being hardcoded. No human told the model to do this. It discovered that this style of thinking leads to more correct answers.
Where to access thinking models:
- DeepSeek R1: chat.deepseek.com (enable Deep Think) or via together.ai
- OpenAI o1, o3-mini: confirmed to use similar RL techniques
- Google Gemini 2.0 Flash Thinking Experimental
When to use them: Hard math, code, multi-step logic. Overkill for simple factual questions.
RL's power was demonstrated in Go. DeepMind's AlphaGo showed:
- Supervised learning (imitating expert players) → gets good but plateaus; can never exceed human ceiling
- Reinforcement learning (self-play) → exceeds even the best humans
Famous example: Move 37 — AlphaGo played a move humans estimated had a 1-in-10,000 chance of any human attempting. It looked like a mistake. It was actually brilliant, and humans had never discovered it.
RL found a strategy not present in any human training data.
The same potential exists for LLMs on open-domain reasoning. In principle, RL-trained models could discover reasoning strategies, analogies, or problem approaches that no human has ever conceived. The model isn't even constrained to English — it could potentially develop its own internal language better suited to reasoning. We're in early days (early 2025), but the trajectory is extraordinary.
RL on verifiable domains (math, code) works because you can check answers objectively. For unverifiable domains (creative writing, jokes, summaries), there's no objective answer to check.
The RLHF solution:
- Train a Reward Model — a separate neural network that predicts human scores
- Collect a small amount of human supervision: have people rank 5 candidate responses (best to worst) for ~5,000 prompts
- Train the reward model to match human rankings
- Run RL against the reward model (queried automatically, no humans needed)
The reward model becomes a simulator of human preferences. Humans contribute by discriminating (ranking), not generating — which is a much easier task.
Why rankings, not scores? Easier and more consistent for humans to say "this joke is funnier than that one" than to assign precise numerical scores.
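Pairwise rankings are typically turned into a training signal with a Bradley-Terry-style loss: the reward model only has to score the preferred response above the rejected one. A minimal sketch, where a single scalar per response stands in for the reward model (a real reward model is a full neural network mapping prompt and response text to that scalar):

```python
import math

# Toy pairwise-preference training. Humans preferred joke_a over
# joke_b in 50 comparisons; learn scores consistent with that.
scores = {"joke_a": 0.0, "joke_b": 0.0}
comparisons = [("joke_a", "joke_b")] * 50

lr = 0.1
for winner, loser in comparisons:
    # Probability the model currently assigns to the human's ranking
    # (sigmoid of the score difference).
    p = 1 / (1 + math.exp(scores[loser] - scores[winner]))
    # Gradient step on the pairwise log-loss: push the winner up,
    # the loser down, in proportion to how surprised the model was.
    scores[winner] += lr * (1 - p)
    scores[loser] -= lr * (1 - p)

print(scores)  # joke_a ends up with the higher score
```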
RLHF limitations:
The reward model is gameable. RL will eventually find adversarial inputs — nonsensical outputs that inexplicably score highly on the reward model. For example, after many RL steps, the model might discover that "the the the the the" receives a score of 1.0 from the reward model for joke quality, even though it's obviously not a joke.
This happens because reward models are neural networks with billions of parameters — complex enough that RL can find the cracks.
Practical consequence: You run RLHF for a limited number of steps (a few hundred), get a modest improvement, and stop. Running too long causes the model to game the reward model and degrade.
The key distinction: RLHF is not "real RL" in the transformative sense. Real RL on verifiable problems can run indefinitely and produce genuinely novel, superhuman strategies. RLHF is more like a small fine-tuning nudge — useful, but not magic.
When you type a message into ChatGPT:
- Your query becomes tokens, wrapped in the conversation protocol
- The model — a fixed mathematical function — generates the next tokens one at a time
- Each token is sampled from a probability distribution
For a standard GPT-4o response (SFT model): A neural network simulation of a human data labeler following OpenAI's instructions. Not a magical AI — a statistical pattern learned from skilled humans' example responses.
For a thinking model (o3, DeepSeek-R1): Something more novel — an emergent reasoning process discovered through RL. The chains of thought were not written by any human; they were discovered by a system that learned what thinking strategies reliably lead to correct answers.
Work with their strengths:
- Paste documents directly into your prompt rather than relying on model memory
- For math/counting/calculations, ask for code rather than mental arithmetic
- Use them for first drafts, brainstorming, and exploration
Guard against their weaknesses:
- Always verify factual claims, especially specific names, dates, numbers
- Let models show their work — don't force single-token answers on complex problems
- Spelling and character-level tasks need code tools
- Simple comparisons (9.11 vs 9.9) may fail
- Random holes appear in unexpected places — check the output
Mental model: Stochastic Swiss cheese — brilliant across most areas, with random unpredictable holes, and never producing the exact same output twice.
Multimodality: Audio and images can be tokenized (audio spectrograms, image patches) and fed into the same context window as text. The same training machinery applies.
Agents: Long-running multi-step autonomous tasks, with humans supervising rather than directly doing. Human-to-agent ratios will become a key metric.
Computer use: Models taking direct keyboard/mouse actions on your behalf (early examples: ChatGPT Operator).
Test-time training: Current models freeze after training. Active research into models that continue updating during inference — more like biological learning.
| Type | Where |
|---|---|
| ChatGPT / GPT-4o | chat.openai.com |
| Gemini (Google) | gemini.google.com or aistudio.google.com |
| Open-weight models | together.ai (playground) |
| Base models | hyperbolic.xyz |
| DeepSeek R1 | chat.deepseek.com or together.ai |
| Local/on-device | LM Studio (smaller distilled models on your laptop) |
| Model leaderboard | LMSYS Chatbot Arena (lmsys.org/chatbot-arena) |
| Stage | What happens | Analogy | Output |
|---|---|---|---|
| Pre-training | Train on internet text; predict next token | Reading all the exposition in all the textbooks | Base model (internet text simulator) |
| SFT | Train on human-curated conversations; imitate ideal responses | Studying worked examples from experts | Assistant model (simulates a human labeler) |
| RL | Practice problems with known answers; model discovers what works | Doing practice problems yourself | Thinking model (emergent novel reasoning) |
The overall trajectory: building textbooks and practice problems for AI across all domains of human knowledge, running increasingly sophisticated training algorithms at ever-larger scale. RL is the newest and most powerful stage — still early in development — but already producing models that verify their reasoning, try multiple approaches, and in principle could discover strategies humans have never thought of.