Based on Andrej Karpathy's video introduction to LLMs like ChatGPT. This guide walks through the entire pipeline of how these systems are built and how to think about what they are.
The goal of this guide is to give you mental models for thinking through what LLMs are. They're magical and impressive in some respects, very good at certain things, not good at others, and have a number of sharp edges to be aware of. We'll go through the entire pipeline of how these systems are built, covering cognitive and psychological implications along the way.
There are three major sequential stages to building a system like ChatGPT:
- Pre-training — acquiring knowledge from the internet
- Supervised Fine-Tuning (SFT) — shaping an assistant personality
- Reinforcement Learning (RL) — perfecting reasoning and problem-solving
The first step is to collect a massive amount of text from publicly available internet sources. A useful reference for understanding what this looks like in practice is HuggingFace's FineWeb dataset, which is similar to what major LLM providers like OpenAI, Anthropic, and Google use internally.
What are we trying to get?
- A huge quantity of high-quality documents
- A large diversity of documents, so the model has broad knowledge
Achieving both is complicated and takes multiple filtering stages.
The FineWeb dataset, for example, ends up being about 44 terabytes of disk space — large but not impossibly so (it could almost fit on a single hard drive). The internet itself is vastly larger, but aggressive filtering brings it down to this.
The starting point: Common Crawl
Most of the data comes from Common Crawl, an organization that has been scouring the internet since 2007. As of 2024, Common Crawl has indexed 2.7 billion web pages. The process works by starting from a few seed web pages, following all links, and recursively indexing everything encountered.
Filtering pipeline:
Raw Common Crawl data goes through many stages of filtering:
- URL Filtering — Blocklists of domains you don't want: malware sites, spam, marketing sites, racist sites, adult sites, etc. These are eliminated entirely.
- Text Extraction — Web pages are stored as raw HTML (with markup, CSS, JavaScript, navigation menus, etc.). All of that has to be stripped away to extract just the clean, readable text.
- Language Filtering — A language classifier guesses the language of each page. FineWeb, for example, keeps only pages that are more than 65% English. This is a design decision: companies that want multilingual models will keep more non-English content, while models that filter out Spanish heavily will be correspondingly worse at Spanish.
- Deduplication — Duplicate or near-duplicate documents are removed.
- PII Removal — Personally identifiable information (addresses, Social Security numbers, etc.) is detected and filtered out.
At the end of this pipeline, you have something like FineWeb: ~44TB of clean text — essentially web pages filtered down to their meaningful content. Examples include an article about tornadoes in 2012, a medical article about adrenal glands, and thousands of other diverse documents.
What does this data look like?
If you took the first 200 web pages and concatenated all their text together, you'd get a massive tapestry of raw text — a giant texture of characters. This text has patterns in it, and the goal of the next steps is to train neural networks to internalize and model how this text flows.
Before text can be fed into a neural network, we need to decide how to represent it. Neural networks expect a one-dimensional sequence of symbols from a finite set. So we need to:
- Define what the symbols (tokens) are
- Represent our text as a sequence of those tokens
From raw bytes to tokens:
Text on a computer is stored as binary (bits: 0s and 1s). We could represent text as a sequence of bits, but that creates an extremely long sequence using only two symbols. Instead, we can group bits into bytes (8 bits = 256 possible combinations), which gives us a shorter sequence with 256 possible symbols.
The sequence length is a precious, finite resource in neural networks — we want to trade off symbol count for sequence length: more symbols = shorter sequences.
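This byte-level view is easy to see directly in Python (a quick illustration, not part of any training pipeline):

```python
# UTF-8 stores text as a sequence of bytes: integers in the range 0-255.
text = "Hello world"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(byte_ids))  # 11 symbols, each between 0 and 255

# Non-ASCII characters take multiple bytes, which is one reason
# byte sequences get long quickly.
print(list("é".encode("utf-8")))  # [195, 169]
```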
Byte-Pair Encoding (BPE):
In practice, we go further than bytes using an algorithm called Byte-Pair Encoding. The process:
- Start with sequences of bytes (256 possible symbols)
- Find pairs of consecutive bytes that appear very frequently (e.g., the pair 116, 32 appears often)
- Mint a new token (symbol ID 256) representing that pair
- Replace every instance of that pair with the new token
- Repeat — each iteration mints a new token and further compresses the sequence
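The merge loop above can be sketched in a few lines of Python. This is a toy illustration of the algorithm, not a production tokenizer (real tokenizers like GPT-4's also handle regex pre-splitting and store the merges for reuse):

```python
from collections import Counter

def bpe_train(ids, num_merges):
    """Toy BPE: repeatedly replace the most frequent adjacent pair
    of symbols with a newly minted token ID (256, 257, ...)."""
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merges[pair] = next_id
        # Replace every occurrence of the pair with the new token.
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

ids = list("the cat sat on the mat".encode("utf-8"))
compressed, merges = bpe_train(ids, num_merges=5)
print(len(ids), "->", len(compressed))  # shorter sequence, bigger vocabulary
```

Each merge trades vocabulary size for sequence length, which is exactly the trade-off described above.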
In practice, a vocabulary size of around 100,000 tokens works well. GPT-4 uses 100,277 tokens.
Tokenization in action:
You can explore tokenization at tiktokenizer.vercel.app. Using the cl100k_base tokenizer (GPT-4's base tokenizer):
- "Hello world" → 2 tokens: Hello (ID 15339) and  world (ID 11917)
- "Hello  world" (two spaces) → 3 tokens (the extra space is its own token)
- "HELLO world" → different tokens (tokenization is case-sensitive)
Important: tokenization is not intuitive. A line of text that looks simple to you might be 62 tokens. The FineWeb dataset is described not just in bytes (44TB) but also as approximately 15 trillion tokens.
Key insight: All those token IDs are just unique identifiers. Think of them like emoji — they represent little chunks of text, and the numbers themselves are meaningless except as IDs.
Now we have a 15-trillion-token sequence. The neural network's job is to model the statistical relationships of how these tokens follow each other.
The training loop:
- Sample a window of tokens from the dataset (e.g., up to 8,000 tokens — the "context length")
- Feed those tokens into the neural network as input (the "context")
- The network outputs a probability distribution over all ~100,000 tokens — a prediction for what token comes next
- Compare that prediction to the actual next token in the dataset (the "label")
- Compute how wrong the prediction was (the loss)
- Update the network's parameters slightly so the correct token gets a higher probability
- Repeat billions of times across the entire dataset
In the very beginning, the network is randomly initialized, so its predictions are random. Through iteration, the parameters get adjusted until the predictions match the statistical patterns of the training data.
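The shape of this loop can be sketched with a deliberately tiny stand-in: a count-based bigram model instead of a neural network, so "update the parameters" becomes "increment a counter." Everything here (the data, the single-token context) is a toy, but the structure — context in, next-token distribution out, compared against the label — is the same:

```python
from collections import defaultdict

# Toy stand-in for the training loop: a bigram count model.
# A real LLM uses a long context window and gradient updates;
# here the "context" is one token and the "update" is a count.
counts = defaultdict(lambda: defaultdict(int))

data = list("hello hello hello")  # tokens are single characters here
for context, label in zip(data, data[1:]):
    counts[context][label] += 1   # learn: this token followed that one

def next_token_probs(context):
    """Probability distribution over the next token, given one context token."""
    following = counts[context]
    total = sum(following.values())
    return {tok: c / total for tok, c in following.items()}

print(next_token_probs("l"))  # {'l': 0.5, 'o': 0.5}
```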
Visualizing the network (inputs and outputs):
- Input: A sequence of tokens (0 to ~8,000 tokens)
- Output: one probability for each possible next token (100,277 values with GPT-4's vocabulary)
- These probabilities indicate how likely each token is to come next
For example, given a short sequence of context tokens, the network might assign " Direction" a 4% probability, token 11799 a 2% probability, and " post" (token 3962) a 3% probability. If the correct next token in the data is " post", we nudge the network to make " post" slightly more probable next time.
What's inside these networks?
Parameters (weights): Modern neural networks have billions of parameters — numerical "knobs." In the beginning, they're randomly set. Training adjusts them so the network's predictions match the training data. Think of parameters like knobs on a DJ set: as you twiddle them, you get different predictions.
The mathematical expression: The neural network is a giant mathematical function that takes input tokens and parameters, mixes them together using operations like multiplication, addition, and exponentiation, and produces output probabilities. The specific structure of this mathematical expression is the subject of neural network architecture research.
The Transformer architecture: Modern LLMs use an architecture called the Transformer. Inside a Transformer:
- Input tokens are first "embedded" — each token gets a vector representation
- Information flows through a series of "attention" blocks and "MLP" blocks
- Each block applies simple mathematical transformations (layer norms, matrix multiplications, softmax operations)
- Eventually, output probabilities are produced
A key clarification: these "neurons" are extremely simple compared to biological neurons. Biological neurons are complex dynamical processes with memory. These are stateless mathematical operations — no memory within a single forward pass, just inputs mixed with parameters to produce outputs.
Once a network is trained, we can generate new text from it. This is called inference.
How inference works:
- Start with some prefix tokens (e.g., token 91)
- Feed them into the network → get a probability distribution
- Sample from that distribution — flip a weighted coin biased by those probabilities
- Append the sampled token to the sequence
- Feed the longer sequence back in → get the next probability distribution
- Repeat
This sampling process is stochastic — you're flipping coins. For the same input, you'll get different outputs each time. Sometimes you'll get tokens that happen to form sequences identical to something in the training data; usually you get remixes and variations that have similar statistical properties to the training data but aren't verbatim copies.
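The sampling loop above can be sketched with Python's random.choices as the "weighted coin." The model here is a stub (a fixed lookup table keyed on the last token only); a real network returns ~100,000 probabilities computed from the whole context:

```python
import random

def fake_model(tokens):
    """Stub standing in for the trained network: returns candidate
    next tokens and their probabilities, based on the last token only."""
    table = {
        "the": (["cat", "dog", "sky"], [0.5, 0.3, 0.2]),
        "cat": (["sat", "ran"], [0.7, 0.3]),
        "dog": (["ran"], [1.0]),
        "sky": (["is"], [1.0]),
    }
    return table.get(tokens[-1], (["<end>"], [1.0]))

tokens = ["the"]
while tokens[-1] != "<end>" and len(tokens) < 6:
    candidates, probs = fake_model(tokens)                       # distribution
    tokens.append(random.choices(candidates, weights=probs)[0])  # weighted coin flip
print(tokens)  # different runs give different continuations
```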
Training vs. inference: In training, you're repeatedly adjusting the model's parameters. Once training is done, those parameters are fixed. When you're chatting with a model on ChatGPT, the model was trained months ago; all that's happening now is inference — the model's fixed weights are being used to complete token sequences.
OpenAI's GPT-2 (published 2019) was the first time a recognizably modern LLM stack came together. Its specs:
- 1.6 billion parameters
- Maximum context length: 1,024 tokens (modern models: ~100,000–1,000,000)
- Trained on ~100 billion tokens (modern datasets: ~15 trillion)
The cost to train GPT-2 in 2019 was approximately $40,000. Today it could be reproduced for around $100, due to better data quality, better hardware (GPUs), and better software.
What training looks like in practice:
Each line of a training run represents one update to the model, improving predictions across ~1 million tokens at once. The key number to watch is the loss — a single number measuring how well the network is currently predicting. Low loss is good. You watch the loss decrease over time as the network improves.
Every 20 steps or so, you run inference to see what the model generates. Early on (step 1–20), it produces completely random gibberish. At 1% through training, it produces text with some local coherence but still largely nonsensical. By the end (32,000 steps, ~33 billion tokens processed), it generates coherent English.
The hardware: This training runs on GPUs in the cloud. A commonly used setup is 8× NVIDIA H100 GPUs in a single node, rented for around $3/GPU/hour. Multiple nodes can be combined into entire data centers. The reason Nvidia's stock has exploded is precisely this: every major AI company needs massive quantities of GPUs to run these token prediction loops at scale.
The output of pre-training is a base model: a neural network with a fixed set of parameters that acts as a token sequence simulator. It doesn't answer questions — it continues token sequences based on the statistical patterns of internet text. It essentially "dreams" internet pages.
Trying a base model (Llama 3.1 405B):
Meta's Llama 3.1 405B (405 billion parameters, trained on 15 trillion tokens) was released publicly. Hosted at hyperbolic.xyz, you can try the base model directly.
Key observations when using a base model:
- It is not an assistant. If you type "What is 2+2?", it doesn't say "4." It treats those tokens as a prefix and autocompletes — perhaps going off into philosophical territory, or repeating variations of the question.
- It is stochastic. The same prompt gives different outputs every time.
- It can recite training data. Paste the opening of a Wikipedia article (e.g., for "zebra"), and the model will often recite the rest of that article nearly verbatim, because it's been trained on that page multiple times and has essentially memorized it. This is called regurgitation — generally undesirable.
- It hallucinates "parallel universe" completions. Ask it to complete a sentence about the 2024 election (which occurred after its training cutoff), and it will generate plausible-but-wrong completions — different running mates, different opponents, etc. Each sample produces a different fictional history.
Using a base model for practical tasks (few-shot prompting):
Even without fine-tuning, base models can be used cleverly via few-shot prompts — providing several examples of the input-output pattern you want, so the model continues the pattern:
English: house
Korean: 집
English: book
Korean: 책
...
English: teacher
Korean: [model completes here]
The model has in-context learning abilities: as it reads the context, it learns the pattern and continues it. This is how apps can be built on top of base models.
You can even simulate an assistant by crafting a prompt that looks like a conversation between a helpful AI and a human, and the base model will continue the conversation in that format.
Pre-training gives us an internet document simulator. What we actually want is an assistant — something we can ask questions and get answers from. That requires post-training.
The base model was trained on internet documents. For SFT, we swap out the dataset for a dataset of conversations — examples of the ideal interaction between a human and an assistant. We continue training the model on this new dataset.
This is computationally much cheaper than pre-training:
- Pre-training: ~3 months on thousands of computers, millions of dollars
- SFT: ~3 hours, much smaller dataset
The algorithm is identical — we're still predicting the next token. Only the data changes.
A conversation dataset might contain entries like:
Human: What is 2+2?
Assistant: 2+2 is 4.
More broadly, such a dataset covers many situations:
- A human asks "What is 2+2?" and the assistant responds "2+2 is 4"
- A human asks for something the model should decline, and the assistant refuses politely
- The assistant maintains a consistent, helpful personality throughout
These conversations are created by human labelers hired by companies. The labelers follow labeling instructions — often hundreds of pages written by the company — describing what good responses look like: helpful, truthful, harmless.
Key insight: When you use ChatGPT, what comes back is a statistical simulation of a human data labeler following OpenAI's instructions. Not a magical AI. A skilled person's response pattern, learned statistically.
Since ~2022-2023, LLMs themselves help generate training conversations. Labelers use existing models to draft responses and then edit them. Datasets like UltraChat contain millions of mostly-synthetic conversations. The core is still human judgment (prompts, instructions, evaluation), but LLMs dramatically accelerate the process.
Conversations must become token sequences. GPT-4 uses a protocol with special tokens:
- <|im_start|> — start of a turn
- user or assistant — whose turn it is
- <|im_sep|> — separator
- The actual message
- <|im_end|> — end of turn
A two-turn conversation becomes about 49 tokens — one flat, one-dimensional sequence, just like any other token stream. The model then applies all the same training and inference machinery.
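Rendering a conversation into this flat format is straightforward string assembly. A minimal sketch (in real systems the special markers are single token IDs in the vocabulary, not literal strings the user could type):

```python
def render_conversation(turns):
    """Flatten (role, message) pairs into the special-token protocol above."""
    parts = []
    for role, message in turns:
        parts.append(f"<|im_start|>{role}<|im_sep|>{message}<|im_end|>")
    return "".join(parts)

text = render_conversation([
    ("user", "What is 2+2?"),
    ("assistant", "2+2 is 4."),
])
print(text)
```

The result is one flat, one-dimensional token stream, which is why the same training and inference machinery applies unchanged.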
During inference on ChatGPT: your message is wrapped in this format, the sequence is sent to the model, and the model generates the next tokens — the assistant's response.
Hallucinations — what they are: LLMs confidently make up facts, invent citations, and describe people who don't exist.
Why they happen: In SFT training data, "Who is Tom Cruise?" is answered confidently with correct information. The model learns the style of confident answers. When you ask about someone it's barely seen, it doesn't say "I don't know" — it generates a statistically plausible answer in that confident style, which happens to be fabricated.
Mitigation 1: Teaching the model to say "I don't know"
- Take random documents from the training set
- Use an LLM to generate factual questions about each document
- Ask the model those questions multiple times
- If the model consistently gets the answer wrong (checked by an LLM judge), it doesn't know
- Add a training example: question → "I'm sorry, I don't know"
This creates the association between internal uncertainty (which likely exists as a neuron state) and actually saying "I don't know" — something that wasn't previously wired up.
Mitigation 2: Tool use (web search)
The model can emit special tokens like <search>query</search>. When the inference code sees the closing token, it:
- Pauses generation
- Runs the actual web search
- Pastes results into the context window
- Resumes generation
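A toy inference driver shows the pause-and-resume mechanic. The model step, the token names, and the search function here are all stand-ins; real systems use dedicated special token IDs and a real search backend:

```python
def run_with_tools(model_step, prompt, search_fn, max_tokens=100):
    """Toy driver: when the model emits </search>, pause generation,
    run the real search, paste results into the context, resume."""
    context = prompt
    for _ in range(max_tokens):
        token = model_step(context)
        context += token
        if token == "</search>":
            # Extract the query the model wrote between the search tokens.
            query = context.rsplit("<search>", 1)[1].removesuffix("</search>")
            context += f"<result>{search_fn(query)}</result>"
        if token == "<end>":
            break
    return context

# Scripted "model" that decides to search, then answers from the results.
script = iter(["<search>", "weather today", "</search>", " It is sunny.", "<end>"])
out = run_with_tools(
    lambda ctx: next(script),
    "Q: weather? ",
    search_fn=lambda q: f"[search hits for: {q}]",
)
print(out)
```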
The context window is the model's working memory. Data there is directly accessible — unlike vague parameter-stored recollections.
Analogy:
- Knowledge in parameters = something you read a month ago (vague, unreliable for details)
- Knowledge in the context window = something you just looked up (directly accessible, reliable)
This is why pasting a document directly into your prompt almost always gives better results than asking the model to recall it from memory.
There is a fixed, finite amount of computation per token. You cannot expect a model to do complex work in a single token.
Math example: "Emily buys 3 apples and 2 oranges. Each orange is $2. Total is $13. What does each apple cost?"
Bad assistant response:
The answer is $3. [then justifications follow]
The answer is produced in a single token — before any computation. Everything after is post-hoc rationalization.
Good assistant response:
Cost of 2 oranges = 2 × $2 = $4. Remaining for apples = $13 - $4 = $9. Per apple = $9 ÷ 3 = $3.
Each step is simple. Computation is distributed across tokens. By the time the model reaches the answer, all intermediate results are in context.
ChatGPT's verbose step-by-step answers aren't just for your benefit — they're the model thinking. Without them, the model is more likely to fail.
Practical rule: If the numbers get bigger (23 apples and 177 oranges instead of 3 and 2), the single-token approach fails immediately. Use code interpreter for anything numerical.
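This is what "use code" means in practice: the worked example above, done by Python instead of token-by-token mental arithmetic, is correct regardless of how big the numbers get:

```python
# Emily buys 3 apples and 2 oranges. Each orange is $2. Total is $13.
total = 13
orange_price, num_oranges = 2, 2
num_apples = 3
apple_price = (total - orange_price * num_oranges) / num_apples
print(apple_price)  # 3.0
```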
Counting fails for the same reason: tallying all elements in a single forward pass is too much work per token. If you paste 177 dots and ask "how many?", the model will likely get it wrong.
Fix: ask it to use code. The model copy-pastes the content into Python and calls .count(). Python does the counting — not the model's mental arithmetic.
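When the model writes this code instead of counting in its head, the result is exact:

```python
dots = "." * 177        # stand-in for whatever the user pasted
print(dots.count("."))  # 177, exact no matter how many dots there are
```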
Models don't see characters — they see tokens (chunks of characters). The word "ubiquitous" is 3 tokens. The model cannot easily index into individual letters. This causes failures on:
- "Print every third character of 'ubiquitous'"
- "How many R's are in 'strawberry'?" ← went viral; models long insisted there were 2 R's (there are 3)
The strawberry problem combines two weaknesses: not seeing characters + poor counting.
Fix: always ask the model to use code for character-level tasks.
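Both viral examples are trivial once Python sees the actual characters (taking "every third character" to mean positions 0, 3, 6, 9):

```python
word = "ubiquitous"
print(word[::3])                # every third character: 'uqts'
print("strawberry".count("r"))  # 3
```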
Models can solve graduate-level physics but fail at "Is 9.11 bigger than 9.9?" This is jagged intelligence — extraordinary capability with unexpected random holes.
One documented explanation for the 9.11/9.9 failure: neurons associated with Bible verse notation activate (where 9:11 does come after 9:9 in that system), interfering with the arithmetic.
Mental model: Swiss cheese — excellent across vast areas, but with random unpredictable holes. Check their work. Use them as tools, not oracles.
Models don't have persistent identity. Every conversation starts from scratch — no memory, no continuous existence. Asking "who built you?" without explicit programming yields hallucinated answers.
Older models from non-OpenAI companies would say "I was built by OpenAI" — not because they were, but because the internet has massive amounts of ChatGPT responses, making that the statistical best guess.
How companies fix this:
- Hardcoded training examples — add ~240 conversations where "what are you?" gets the correct answer
- System messages — a hidden message at conversation start reminds the model of its identity
Learning from textbooks has three phases:
- Exposition → building background knowledge → equivalent to pre-training
- Worked examples → imitating expert solutions → equivalent to SFT
- Practice problems → discovering solutions yourself, with only the final answer given → equivalent to RL
Practice problems are critical because they force you to discover solution strategies through trial and error, not just imitate.
Human labelers writing ideal SFT responses can't know which token sequences are actually easiest for the model. Human cognition and LLM cognition are different. What seems like a natural solution step to a human might be a massive leap for the model — or what seems elaborate might be trivial.
The model needs to discover the token sequences that work for it, not blindly imitate human solutions.
- Take a prompt with a known correct answer (e.g., a math problem where answer = $3)
- Sample many independent solutions from the model (perhaps 1,000)
- Score each: did it reach $3? (Check the boxed answer against the key)
- Encourage solutions that worked — train on them; discard failures
- Repeat across thousands of diverse prompts
The model plays in a playground, tries many things, discovers what works. No human decides which solution is "best" — the verifiable answer determines it automatically.
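One round of this loop can be sketched with stubs standing in for the expensive parts. The sampler here is a random guesser and the checker a string match; in reality sampling means running the LLM and "encourage" means a gradient update on the winning token sequences:

```python
import random

def rl_round(sample_solution, check_answer, prompt, n_samples=1000):
    """Sample many independent solutions, keep the ones whose final
    answer verifies against the known key. (Both callables are
    stand-ins for the model and the answer checker.)"""
    solutions = [sample_solution(prompt) for _ in range(n_samples)]
    winners = [s for s in solutions if check_answer(s)]
    return winners  # in real RL, these become the next training batch

# Toy demo: "solutions" are random guesses; the answer key is $3.
winners = rl_round(
    sample_solution=lambda p: f"... so each apple costs ${random.randint(1, 5)}",
    check_answer=lambda s: s.endswith("$3"),
    prompt="Emily buys 3 apples and 2 oranges...",
)
print(len(winners), "of 1000 solutions verified")
```

Note that no human judgment appears anywhere in the loop; the verifiable answer does all the selecting.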
DeepSeek-R1's key finding: as RL trains on math problems, solutions get longer and more accurate. The model discovers that it's more accurate when it:
- Tries multiple approaches
- Backtracks when something seems off ("Wait, let me re-evaluate")
- Checks results from different angles
- Uses exploratory language: "Let me try setting up an equation instead"
These chains of thought emerge from RL without being hardcoded. No human told the model to do this. It discovered that this style of thinking leads to more correct answers.
Where to access thinking models:
- DeepSeek R1: chat.deepseek.com (enable Deep Think) or via together.ai
- OpenAI o1, o3-mini: confirmed to use similar RL techniques
- Google Gemini 2.0 Flash Thinking Experimental
When to use them: Hard math, code, multi-step logic. Overkill for simple factual questions.
RL's power was demonstrated in Go. DeepMind's AlphaGo showed:
- Supervised learning (imitating expert players) → gets good but plateaus; can never exceed human ceiling
- Reinforcement learning (self-play) → exceeds even the best humans
Famous example: Move 37 — AlphaGo played a move humans estimated had a 1-in-10,000 chance of any human attempting. It looked like a mistake. It was actually brilliant, and humans had never discovered it.
RL found a strategy not present in any human training data.
The same potential exists for LLMs on open-domain reasoning. In principle, RL-trained models could discover reasoning strategies, analogies, or problem approaches that no human has ever conceived. The model isn't even constrained to English — it could potentially develop its own internal language better suited to reasoning. We're in early days (early 2025), but the trajectory is extraordinary.
RL on verifiable domains (math, code) works because you can check answers objectively. For unverifiable domains (creative writing, jokes, summaries), there's no objective answer to check.
The RLHF solution:
- Train a Reward Model — a separate neural network that predicts human scores
- Collect a small amount of human supervision: have people rank 5 candidate responses (best to worst) for ~5,000 prompts
- Train the reward model to match human rankings
- Run RL against the reward model (queried automatically, no humans needed)
The reward model becomes a simulator of human preferences. Humans contribute by discriminating (ranking), not generating — which is a much easier task.
Why rankings, not scores? Easier and more consistent for humans to say "this joke is funnier than that one" than to assign precise numerical scores.
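Pairwise rankings are typically turned into a training signal with a Bradley-Terry-style loss: the reward model only has to score the preferred response above the rejected one. A minimal sketch, where a single scalar per response stands in for the reward model (a real reward model is a full neural network mapping prompt and response text to that scalar):

```python
import math

# Toy pairwise-preference training. Humans preferred joke_a over
# joke_b in 50 comparisons; learn scores consistent with that.
scores = {"joke_a": 0.0, "joke_b": 0.0}
comparisons = [("joke_a", "joke_b")] * 50

lr = 0.1
for winner, loser in comparisons:
    # Probability the model currently assigns to the human's ranking
    # (sigmoid of the score difference).
    p = 1 / (1 + math.exp(scores[loser] - scores[winner]))
    # Gradient step on the pairwise log-loss: push the winner up,
    # the loser down, in proportion to how surprised the model was.
    scores[winner] += lr * (1 - p)
    scores[loser] -= lr * (1 - p)

print(scores)  # joke_a ends up with the higher score
```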
RLHF limitations:
The reward model is gameable. RL will eventually find adversarial inputs — nonsensical outputs that inexplicably score highly on the reward model. For example, after many RL steps, the model might discover that "the the the the the" receives a score of 1.0 from the reward model for joke quality, even though it's obviously not a joke.
This happens because reward models are neural networks with billions of parameters — complex enough that RL can find the cracks.
Practical consequence: You run RLHF for a limited number of steps (a few hundred), get a modest improvement, and stop. Running too long causes the model to game the reward model and degrade.
The key distinction: RLHF is not "real RL" in the transformative sense. Real RL on verifiable problems can run indefinitely and produce genuinely novel, superhuman strategies. RLHF is more like a small fine-tuning nudge — useful, but not magic.
When you type a message into ChatGPT:
- Your query becomes tokens, wrapped in the conversation protocol
- The model — a fixed mathematical function — generates the next tokens one at a time
- Each token is sampled from a probability distribution
For a standard GPT-4o response (SFT model): A neural network simulation of a human data labeler following OpenAI's instructions. Not a magical AI — a statistical pattern learned from skilled humans' example responses.
For a thinking model (o3, DeepSeek-R1): Something more novel — an emergent reasoning process discovered through RL. The chains of thought were not written by any human; they were discovered by a system that learned what thinking strategies reliably lead to correct answers.
Work with their strengths:
- Paste documents directly into your prompt rather than relying on model memory
- For math/counting/calculations, ask for code rather than mental arithmetic
- Use them for first drafts, brainstorming, and exploration
Guard against their weaknesses:
- Always verify factual claims, especially specific names, dates, numbers
- Let models show their work — don't force single-token answers on complex problems
- Spelling and character-level tasks need code tools
- Simple comparisons (9.11 vs 9.9) may fail
- Random holes appear in unexpected places — check the output
Mental model: Stochastic Swiss cheese — brilliant across most areas, with random unpredictable holes, and never producing the exact same output twice.
Multimodality: Audio and images can be tokenized (audio spectrograms, image patches) and fed into the same context window as text. The same training machinery applies.
Agents: Long-running multi-step autonomous tasks, with humans supervising rather than directly doing. Human-to-agent ratios will become a key metric.
Computer use: Models taking direct keyboard/mouse actions on your behalf (early examples: ChatGPT Operator).
Test-time training: Current models freeze after training. Active research into models that continue updating during inference — more like biological learning.
| Type | Where |
|---|---|
| ChatGPT / GPT-4o | chat.openai.com |
| Gemini (Google) | gemini.google.com or aistudio.google.com |
| Open-weight models | together.ai (playground) |
| Base models | hyperbolic.xyz |
| DeepSeek R1 | chat.deepseek.com or together.ai |
| Local/on-device | LM Studio (smaller distilled models on your laptop) |
| Model leaderboard | LMSYS Chatbot Arena (lmsys.org/chatbot-arena) |
| Stage | What happens | Analogy | Output |
|---|---|---|---|
| Pre-training | Train on internet text; predict next token | Reading all the exposition in all the textbooks | Base model (internet text simulator) |
| SFT | Train on human-curated conversations; imitate ideal responses | Studying worked examples from experts | Assistant model (simulates a human labeler) |
| RL | Practice problems with known answers; model discovers what works | Doing practice problems yourself | Thinking model (emergent novel reasoning) |
The overall trajectory: building textbooks and practice problems for AI across all domains of human knowledge, running increasingly sophisticated training algorithms at ever-larger scale. RL is the newest and most powerful stage — still early in development — but already producing models that verify their reasoning, try multiple approaches, and in principle could discover strategies humans have never thought of.