Model → Inference Server → Client
(Local vs SaaS)
Most developers use AI coding tools every day without understanding what's happening to their context under the hood. Here's what's actually going on and why the same prompt behaves differently depending on where you run it.
Disclaimer: Written with AI — Claude Opus 4.6 via claude.ai alongside human feedback.
- The Gap Nobody Has Filled
- What Is Context, Really?
- Layer 0: The Model Itself
- Claude Web: You're a Thin Client
- Local: You Own the Whole Stack
- Cursor: An IDE, Not an Inference Engine
- Claude Code: Agentic Context Management
- What Happens at the Limit: A Direct Comparison
- Practical Takeaways
- The Full Landscape: Every Tool Compared
- Conclusion
- Sources
You're probably using at least one AI coding tool daily. Maybe Cursor. Maybe Claude Code. Maybe you've got a local model running through Ollama on the side. All of these tools talk about "context windows," and most developers have a rough sense that bigger is better.
But no two of these tools handle context the same way. The advertised token count on a marketing page tells you almost nothing about how much of your codebase the tool can actually reason about at any given moment. Three layers sit between your prompt and the model's response: the model itself, the inference server, and the client. Most developers only interact with the client and never think about the other two.
| | Model | Inference Server | Client |
|---|---|---|---|
| Local | Model (e.g. Llama 3.1) | Ollama / llama.cpp / vLLM | Cursor / terminal |
| SaaS | Model (e.g. Opus 4.6) | vLLM / SGLang / TensorRT-LLM / custom | claude.ai / Cursor |
A note on terminology: the middle layer is often called the "inference server," but it actually contains two parts. The inference engine (vLLM, llama.cpp, TensorRT-LLM, SGLang) is the core software that loads model weights into GPU memory, computes attention, and manages the KV (key-value) cache, which stores previously computed token data so the model doesn't have to reprocess the entire context from scratch for each new token. The inference server is the API wrapper that exposes the engine as an HTTP endpoint so clients can talk to it.

Sometimes these are the same tool (vLLM includes its own OpenAI-compatible server), sometimes they're separate (TensorRT-LLM is often deployed behind NVIDIA's Triton Inference Server). Ollama is a good example: it's the server you interact with at localhost:11434, but under the hood it uses llama.cpp as its inference engine.

For most of this post, the distinction doesn't matter. What matters is that this layer is where your context settings live, whether you're running locally or hitting a cloud API.
This post breaks down that chain. By the end, you'll understand why Cursor can silently forget code it wrote five minutes ago 1, why your local Ollama model might be running at 2% of its capacity 2, and why Claude Code can refactor 23 files in one shot 3 while other tools choke at 10.
The context window is the model's working memory. It's the total amount of text, measured in tokens, that the model can "see" at once when generating a response.
A token is roughly 4 characters or three-quarters of a word 4. A 200,000-token context window translates to approximately 150,000 words, or about 500 pages of text.
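Those conversions are simple arithmetic under the rule of thumb above (the words-per-page figure is an assumption for a typical manuscript page, not from any spec):

```python
# Back-of-the-envelope token math, assuming ~4 chars/token and
# ~0.75 words/token -- both rough averages for English text.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 300  # assumption: a typical manuscript page

def window_in_words(tokens: int) -> int:
    return int(tokens * WORDS_PER_TOKEN)

def window_in_pages(tokens: int) -> int:
    return window_in_words(tokens) // WORDS_PER_PAGE

tokens = 200_000
print(window_in_words(tokens))  # 150000 words
print(window_in_pages(tokens))  # 500 pages
```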
Here's the part most developers miss: the context window is shared. The model doesn't get 200k tokens for your message. It gets 200k tokens total, and that budget is split across everything 5:
- The system prompt (instructions telling the model how to behave)
- Tool definitions (function schemas for code execution, file reading, web search, etc.)
- Conversation history (every message you and the model have exchanged)
- File contents (any code or documents pulled into the session)
- The model's own responses
By the time you type your first message in a fresh session, a significant chunk of that window is already consumed. In a tool like Claude Code working in a monorepo, the baseline overhead can be around 20,000 tokens. That's 10% of the window gone before you've asked a single question 6.
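The arithmetic of that baseline is worth spelling out. The line items below are illustrative assumptions chosen to sum to the ~20k figure; real overheads vary by tool and repository:

```python
# Illustrative budget accounting for a fresh agent session.
# Every line item is an assumed number, not a measured one.
WINDOW = 200_000

baseline = {
    "system_prompt": 3_000,      # behavioral instructions
    "tool_definitions": 12_000,  # function schemas for file I/O, search, etc.
    "project_memory": 5_000,     # e.g. a CLAUDE.md-style file
}

used = sum(baseline.values())
remaining = WINDOW - used
print(f"{used} tokens ({used / WINDOW:.0%}) consumed before your first message")
print(f"{remaining} tokens left for conversation, files, and responses")
```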
Every model has a trained context ceiling. This is a hard architectural limit baked in during training:
- Llama 4 Scout: 10M tokens (practical: 128-256k) 7
- Llama 4 Maverick: 1M tokens 7
- Claude Opus 4.6: 200k tokens (1M in beta) 8
- Claude Sonnet 4.6: 200k tokens (1M available) 8
- GPT-5.3 Codex: up to 400k tokens 9
- Gemini 3 Pro: 1M tokens 10
- Gemini 3 Flash: 200k tokens 10
- Qwen3-Coder-Next: 256k tokens 11
- DeepSeek V3.2: 128k tokens 12
- Mistral Large 3: 256k tokens 13
- GPT-OSS-120B: 128k tokens 14
- Llama 3.1: 128k tokens 15
You cannot exceed this ceiling without architectural tricks like RoPE scaling or YaRN, and even then, quality degrades. Bigger windows also aren't free: the transformer attention mechanism scales quadratically with sequence length, which means doubling the context roughly quadruples the compute cost 4.
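The quadratic claim follows directly from how naive self-attention works: every token is compared against every other token, so a proportional cost model is enough to show the effect:

```python
# Naive self-attention compares every token with every other token,
# so compute grows with the square of sequence length.
def attention_cost(n_tokens: int) -> int:
    return n_tokens ** 2  # pairwise token comparisons (proportional cost)

base = attention_cost(100_000)
doubled = attention_cost(200_000)
print(f"doubling context multiplies attention compute by {doubled / base:.0f}x")
```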
There's also the "lost in the middle" problem. Research has consistently shown that models are better at recalling information from the beginning and end of their context window 16. Details buried in the middle are more likely to be missed. A 1M token window doesn't work like 1M tokens of perfect memory.
When you use claude.ai, your computer does almost nothing. It sends your text over HTTPS to Anthropic's servers, receives a stream of tokens back, and renders them in the browser. That's it.
Models like Claude Opus 4.6 have hundreds of billions of parameters and require specialized GPU clusters to run inference. Your browser tab is almost entirely JavaScript application code, React components, and DOM overhead. The raw text of even a very long conversation is a few hundred KB at most.
Extended thinking? Also entirely server-side. When Claude "thinks" for 30 seconds before responding, your machine is just waiting for the stream to start. No extra CPU, no extra memory.
In Claude web, the context is assembled from your conversation history: system prompt, every message you've sent, every response Claude has generated, and any files you've uploaded. There's no codebase indexing, no file-tree awareness, no agentic search. You get exactly what you put in.
As conversations get long, older messages are truncated to make room. You'll notice this when Claude "forgets" something you discussed an hour ago. There's no compaction or summarization. The early messages simply fall out of the window.
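A minimal sketch of that truncation behavior, using a word count as a stand-in tokenizer (real systems count actual tokens, and providers' exact policies are not public):

```python
# Chat-style truncation: when history exceeds the window, the oldest
# messages fall out first. Newest messages are kept intact.
def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def fit_to_window(messages: list[str], window: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > window:
            break                    # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["we chose Zustand for state", "styling question", "new feature ask"]
print(fit_to_window(history, window=6))
# The earliest message -- the Zustand decision -- is the first to go.
```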
For most chat-based usage, this is fine. For multi-session coding projects, it means you need to re-establish context at the start of each conversation. Summarizing key decisions in your opening prompt becomes a practical necessity.
Running a model locally through Ollama, llama.cpp, vLLM, or LM Studio means you control every layer of the stack. That's powerful. It's also where the most common context misconfigurations happen.
1. The model's trained ceiling. A Llama 3.1 model supports up to 128k tokens 15. Qwen3-Coder supports up to 256k 11. This is the theoretical maximum.
2. Your hardware. Every token in the context window requires memory for the KV cache. This is separate from the memory needed to hold the model weights themselves. A 7B parameter model at Q4 quantization might need 4-5GB for the weights, but a 32k context window on top of that can add another 4-8GB depending on the model architecture. If you're running on a GPU with 24GB of VRAM, the model weights plus KV cache for a large context might not fit, forcing you to either reduce context or offload to slower system RAM.
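A rough KV-cache estimate falls out of the model architecture. The formula below is the standard one (a key plus a value vector, per layer, per KV head, per token); the example numbers assume a Llama-2-7B-like config with full multi-head attention at fp16, so models using grouped-query attention (fewer KV heads) need proportionally less:

```python
# KV-cache memory estimate: each token stores one key and one value
# vector per layer. bytes_per_val=2 corresponds to fp16.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val

# Assumed architecture: 32 layers, 32 KV heads, head_dim 128 (7B-class MHA)
gib = kv_cache_bytes(32, 32, 128, 32_768) / 2**30
print(f"~{gib:.0f} GiB of KV cache for a 32k context")  # 16 GiB at fp16
```

With grouped-query attention (say, 8 KV heads instead of 32), the same 32k context drops to a quarter of that, which is how modern models land in the 4-8GB range mentioned above.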
3. Your inference framework settings. This is the most commonly missed piece. Ollama defaults to a context window of 4,096 tokens, regardless of what the model actually supports 2. If you downloaded a model capable of 128k tokens and never changed the OLLAMA_CONTEXT_LENGTH environment variable, you've been running at roughly 3% of its capacity.
In llama.cpp, the equivalent setting is n_ctx. In vLLM, it's max_model_len. Each framework has its own default, and none of them automatically use the model's full capacity because doing so would consume too much memory on most consumer hardware.
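Ollama also accepts a per-request context size through the `num_ctx` option in its HTTP API, which is useful when you don't want to change the global default. This sketch only builds the JSON payload you would POST to `http://localhost:11434/api/generate`; no request is actually sent:

```python
import json

# Build an Ollama /api/generate payload with a per-request context size.
# options.num_ctx overrides the 4,096-token default for this call only.
def ollama_payload(model: str, prompt: str, num_ctx: int = 32_768) -> str:
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx},
    })

payload = ollama_payload("llama3.1", "Explain this function")
print(payload)
```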
Local inference gives you complete privacy, zero per-token cost, and total control. The tradeoff is significantly smaller practical context windows, slower inference (often 5-50x slower than cloud APIs 17), and models that are less capable than the frontier cloud options.
Hard cutoff. The model stops seeing earlier tokens with no graceful degradation. There's no summarization, no compaction, no warning. Your prompt simply gets truncated from the beginning, and the model responds based on whatever remains in the window.
This is the most misunderstood piece of the stack, and it came up in a conversation I had recently that crystallized the issue.
Cursor is not running your model. Cursor is a VS Code fork, a code editor with AI features layered on top. When you use Claude, GPT, or Gemini inside Cursor, the model is running on Anthropic's, OpenAI's, or Google's servers. Cursor is deciding what to send to those servers.
This distinction matters enormously because Cursor makes aggressive decisions about context that most developers never see.
Cursor uses a RAG-like system with embeddings to index your codebase and select which files are relevant to your current prompt 18. It doesn't send your entire project to the model. It selects what it thinks matters, assembles a prompt, and sends that.
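The selection step can be pictured as a greedy packing problem: rank candidate files by relevance and add them until the budget runs out. All scores, paths, and token costs below are invented for illustration; Cursor's actual scoring (embedding similarity, recency, open tabs) is not public:

```python
# Toy sketch of budgeted context assembly: rank files by a relevance
# score (in a real system, embedding similarity to the prompt) and pack
# the best-scoring ones until the token budget is exhausted.
def select_files(files: dict[str, tuple[float, int]], budget: int) -> list[str]:
    ranked = sorted(files, key=lambda f: files[f][0], reverse=True)
    chosen, used = [], 0
    for name in ranked:
        cost = files[name][1]
        if used + cost <= budget:
            chosen.append(name)
            used += cost
        # files that don't fit are skipped silently -- the model never sees them
    return chosen

repo = {  # path -> (relevance_score, token_cost); all values invented
    "auth/login.ts": (0.92, 8_000),
    "auth/session.ts": (0.88, 7_000),
    "auth/tokens.ts": (0.85, 9_000),
    "utils/dates.ts": (0.30, 4_000),
}
print(select_files(repo, budget=20_000))
```

Note the failure mode baked into the greedy loop: a highly relevant file (`auth/tokens.ts`) can be dropped simply because it doesn't fit, with no error raised.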
The default context budget for a chat session is approximately 20,000 tokens. Inline commands (Cmd-K) get around 10,000 tokens 19. That's a fraction of the underlying model's capacity.
"Max Mode" extends the window to the model's full capacity: up to 200k tokens for Claude, potentially 1M for Gemini. But even in Max Mode, developers consistently report that the effective usable context falls between 70k and 120k tokens 20. Cursor applies internal truncation and performance safeguards that silently reduce what actually reaches the model 19.
This is Cursor's most consequential behavior. When context gets too large, Cursor doesn't error out. It doesn't tell you it's dropping files. It silently deprioritizes and removes older or less relevant content to maintain responsiveness and manage API costs 19.
The practical result: you ask Cursor to modify a component, it does so, you switch tabs to work on something else, come back, and Cursor has no memory of the component structure you just designed together. Each tab maintains its own context, and switching between them can mean starting from scratch 1.
If you want to run a local model inside Cursor, the architecture looks like this:
You → Cursor (IDE) → Local inference server (Ollama/llama.cpp) → Your model
Cursor points at a local OpenAI-compatible API endpoint instead of a cloud API. But the context settings (n_ctx, VRAM allocation, all of it) live in the inference server, not in Cursor. Cursor is still just the client, deciding what to send. The inference server determines how much it can receive.
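Concretely, "pointing the IDE at a local endpoint" means the client issues a standard OpenAI-style chat completion request, and only the base URL differs from a cloud provider. The sketch below just constructs that request against Ollama's OpenAI-compatible `/v1` surface (llama.cpp's server and vLLM expose the same shape); the model name is a placeholder for whatever your server has loaded:

```python
import json

# Only the base URL distinguishes a local server from a cloud API here.
BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

request = {
    "url": f"{BASE_URL}/chat/completions",
    "body": json.dumps({
        "model": "qwen2.5-coder",  # placeholder: whatever the server has loaded
        "messages": [{"role": "user", "content": "Refactor this function"}],
    }),
}
print(request["url"])
```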
Claude Code is a CLI-first tool that runs in your terminal. It takes a fundamentally different approach to context than any IDE-based tool.
Unlike Cursor, which maintains a static embeddings index of your codebase, Claude Code doesn't pre-index anything. Instead, it uses agentic search: when you give it a task, it actively explores your repository at runtime 21. It reads files, follows import chains, greps for references, runs tests, and builds its understanding dynamically.
This means Claude Code's context is always fresh. It's never working from a stale index. But it also means each task starts with an exploration phase that consumes tokens as the agent reads files to understand your project.
One of Claude Code's most distinctive features is CLAUDE.md, a markdown file that lives in your repo root and acts as institutional memory 22. It stores project conventions, architecture decisions, directory structure, key patterns, and anything else the agent should know at the start of every session.
This is loaded into context automatically when Claude Code starts. It's the one piece of context that persists across sessions. Everything else (conversation history, file reads, tool results) is ephemeral.
Claude Code consistently delivers the advertised 200k token context window 19. With Claude Opus 4.6, there's a 1M token beta that scores 76% on the MRCR v2 long-context retrieval benchmark 23. For large codebases, this changes what's possible in a single session.
You can check your current consumption with the /context command, which shows how much of the 200k window you've used and what's filling it 6. A fresh session in a monorepo might start with a ~20k token baseline, leaving 180k for your actual work, which can still fill up fast during complex multi-file refactors.
When conversations get long, Claude Code uses compaction, automatic server-side summarization of earlier conversation turns 1. This is what enables effectively unlimited session length, but it comes with a cost.
Compacted context is a summary, not the original. Details get lost. The most common symptom: Claude Code starts making suggestions that contradict decisions you made an hour ago. It might recommend adding Redux for state management when you explicitly told it you're using Zustand, because that decision was in a part of the conversation that got compacted away 1.
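The mechanics can be sketched in a few lines. The summarizer here is a trivial stand-in; real tools ask the model itself to produce the summary, and that lossy summary is exactly where details like the Zustand decision disappear:

```python
# Minimal compaction sketch: older turns are replaced by a summary,
# recent turns are kept verbatim.
def summarize(turns: list[str]) -> str:
    return f"[summary of {len(turns)} earlier turns]"  # detail is lost here

def compact(turns: list[str], keep_recent: int = 2) -> list[str]:
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent

history = ["use Zustand, not Redux", "built the store", "added tests", "now refactor"]
print(compact(history))
# The Zustand decision now survives only inside the lossy summary.
```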
Claude Code can spin up child agents (subagents), each with their own context window 6. This is a context management strategy: the main agent stays clean and focused while subagents handle specific tasks like running tests, reviewing security, and scanning for patterns. The results are returned as summaries, keeping the main agent's window from filling with raw output.
Each tool handles context exhaustion differently, and the differences have real consequences:
Claude Web (claude.ai + Anthropic's inference server) — Older messages silently drop out. The model keeps responding but loses awareness of early conversation context. No warning, no summarization.
Local (e.g. Ollama + Cursor) — Behavior depends on the inference server. Ollama truncates older conversation turns to fit the window. llama.cpp's server may error out if you exceed n_ctx. The client has no say in this — it's the inference server deciding what gets cut.
Cursor (Cursor IDE + cloud inference server) — Silent truncation. Files get deprioritized, older context gets dropped, and you receive no notification 19. The model keeps generating, but based on incomplete information. This is the sneakiest failure mode because the output looks confident.
Claude Code (CLI + Anthropic's inference server) — Compaction kicks in. Earlier turns are summarized to free space 1. The session continues, but the summary may lose important details. You can at least monitor this with /context.
Gemini CLI (CLI + Google's inference server) — Similar to Claude Code. Uses a 1M token window with /compress for manual summarization and /clear for hard resets 24. Warns when approaching the limit.
GitHub Copilot CLI (CLI + Microsoft/GitHub's inference server) — Auto-compaction at 95% of the token limit 25. Creates checkpoint files so you can rewind to pre-compaction state. Uses /compact for manual compression and /context for monitoring.
If you're using Cursor: Accept that you're working with 60-120k effective tokens, not 200k 20. Turn on Max Mode for large refactors. Be aware that tab switching resets your context 1. Keep critical decisions in .cursorrules or AGENTS.md files that get loaded automatically.
If you're using Claude Code: Use /context regularly to monitor consumption 6. Start fresh sessions for distinct tasks rather than letting compaction degrade your context. Invest time in your CLAUDE.md. It's the one thing that persists and directly improves every session 22.
If you're running local models: Check your actual context setting right now. If you're on Ollama and never changed it 2, run:
```bash
# Check current context
ollama ps

# Set a larger default
# In /etc/systemd/system/ollama.service.d/override.conf:
# Environment="OLLAMA_CONTEXT_LENGTH=32768"
```

Even going from 4,096 to 16,384 or 32,768 tokens will dramatically improve your experience.
If you're using Claude web: Recognize that very long conversations will lose early context. For multi-session projects, start each new conversation with a summary of prior decisions.
In general: Context is a shared, finite resource. System prompts, tool definitions, and file contents eat into it before you type a word. The skill isn't finding the tool with the biggest window. It's being intentional about what goes into the window you have.
| Tool | Provider | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Cost |
|---|---|---|---|---|---|---|
| claude.ai | Anthropic | 200k tokens | ~200k | Conversation history only | Older messages drop | Free / $20 Pro / $100-200 Max |
| ChatGPT | OpenAI | 128k tokens (GPT-5.3) | ~128k | Conversation + uploaded files | Older messages drop | Free / $20 Plus / $200 Pro |
| Gemini | Google | 1M tokens (Gemini 3 Pro) 10 | ~1M (consumer often 128-200k) | Conversation + uploaded files | Performance degrades | Free / $20 AI Premium / $250 Ultra |
| Grok | xAI | 128k tokens | ~128k | Conversation history | Older messages drop | Free / Premium subscription |
| Tool | Type | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Models | Cost |
|---|---|---|---|---|---|---|---|
| Cursor | VS Code fork | 200k (Max Mode) | 60-120k 20 | RAG + embeddings + silent truncation 18 | Silent file dropping 19 | Claude, GPT, Gemini, Composer | Free / $20 Pro / $40 Ultra |
| Windsurf | Standalone IDE | Model-dependent | 50-100k 3 | RAG (M-Query) + Cascade indexing 26 | Context drops without warning | Claude, GPT, Gemini, SWE-1 | Free / $15 Pro / $60 Teams |
| GitHub Copilot (IDE) | VS Code / JetBrains extension | 64-128k (model-dependent) 9 | ~64k typical | Codebase search + file context | Truncation, yellow warnings | GPT-5.x, Claude, Gemini | Free / $10 Pro / $39 Enterprise |
| Zed | Standalone editor (Rust) | Model-dependent | Varies | Direct model integration | Model-dependent | Claude, GPT, Gemini, local | Free (open source) + API costs |
| Cline | VS Code extension | Model-dependent | Varies | Full file reading, agentic | Model-dependent | Any (bring your own API key) | Free (open source) + API costs |
| Tool | Provider | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Models | Cost |
|---|---|---|---|---|---|---|---|
| Claude Code | Anthropic | 200k (1M beta on Opus 4.6) 23 | ~150-200k 3 | Agentic search + CLAUDE.md 21 | Compaction (summarization) 1 | Claude only | $20 Pro / $100-200 Max |
| GitHub Copilot CLI | GitHub / Microsoft | Model-dependent (64-128k) | ~64-128k | Auto-compaction at 95%, checkpoints 25 | Compaction + checkpoint rewind | GPT-5.x, Claude, Gemini | $10 Pro / $39 Enterprise |
| Gemini CLI | Google | 1M tokens 24 | ~1M | Full codebase awareness + /compress | Warning + manual /compress | Gemini 2.5/3 Pro | Free (with Google account) |
| Codex CLI | OpenAI | Up to 400k (GPT-5.3 Codex) 9 | ~200-400k | Agentic, file reading | Session-based | GPT-5.x Codex variants | Included with ChatGPT Plus/Pro |
| OpenCode | Anomaly Innovations | Model-dependent | Model-dependent | Depends on provider | Provider-dependent | 75+ providers incl. local | Free (open source) + API costs |
| Aider | Open source | Model-dependent | Varies | Git-aware, repo map | Model-dependent | Any (bring your own API key) | Free (open source) + API costs |
These are the models running on provider infrastructure when you use cloud APIs, IDEs like Cursor, or CLI tools like Claude Code. You don't control the inference server. Context window and max output are set by the provider.
| Model | Provider | Context Window | Max Output | Coding Strength | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 200k (1M beta) 8 | 128k | SWE-bench 80.8%, Terminal-Bench SOTA | Adaptive thinking, fastest frontier reasoning |
| Claude Sonnet 4.6 | Anthropic | 200k (1M beta) 8 | 64k | SWE-bench 79.6%, near-Opus performance | Best value frontier model, default in claude.ai |
| Claude Haiku 4.5 | Anthropic | 200k | 8k | Routing, classification, simple extraction | Lightweight, fast, good for task routing |
| GPT-5.2 | OpenAI | 400k 9 | 128k | LiveCodeBench 89%, strong multi-file reasoning | Thinking/Instant/Pro variants |
| GPT-5.3 Codex | OpenAI | 400k 9 | 128k | Optimized for Codex CLI workflows | Coding-tuned variant of GPT-5.x |
| Gemini 3 Pro | Google | 1M 10 | Varies | Strong multimodal, competitive SWE-bench | Best multimodal reasoning, largest standard window |
| Gemini 3 Flash | Google | 200k | Varies | Pro-grade reasoning at Flash speed | 3x faster than 2.5 Pro |
| Grok 4 | xAI | 128k | Varies | General purpose reasoning | Available via Grok subscription |
Max output matters for context because output tokens consume the shared window. A model with 400k context but 128k max output (GPT-5.2) can generate much longer responses per turn than one with 200k context and 8k max output (Haiku 4.5). For multi-file refactors, that ceiling determines how much code the model can write in a single pass.
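The shared-window arithmetic is a straight subtraction: whatever is reserved for the response is unavailable for input in the same turn, and the output cap bounds how much code can come back per pass. Numbers below are taken from the table above:

```python
# Tokens reserved for output come out of the same shared window.
def max_input_tokens(context_window: int, reserved_output: int) -> int:
    return context_window - reserved_output

print(max_input_tokens(400_000, 128_000))  # GPT-5.2-style budget: 272000 in
print(max_input_tokens(200_000, 8_000))    # Haiku-4.5-style budget: 192000 in,
# but the 8k output cap limits how much code fits in a single pass.
```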
| Tool | Type | Default Context | Max Context | Context Setting | Notes | Cost |
|---|---|---|---|---|---|---|
| Ollama | Model runner + API | 4,096 tokens 2 | Model-dependent (up to 256k+) | `OLLAMA_CONTEXT_LENGTH` env var | Most common misconfiguration; default is far below model capacity | Free |
| llama.cpp | Inference engine | Varies | Model-dependent | `--ctx-size` / `-c` flag | Most flexible, bare-metal control | Free |
| LM Studio | Desktop GUI | Auto-detected | Model-dependent | GUI slider for context length | Most user-friendly local option | Free |
| vLLM | Production server | Model-dependent | Model-dependent | `--max-model-len` | Optimized for throughput, paged attention | Free |
| text-generation-webui | Browser-based GUI | Varies | Model-dependent | UI parameter | Multiple backend support (GGUF, GPTQ, AWQ) | Free |
These are the models you'd run through the inference tools listed in the table above. Context window is the trained ceiling. Actual usable context depends on your hardware (VRAM/RAM for KV cache) and inference framework settings.
| Model | Provider | Total Params | Active Params | Arch | Context Window | Min RAM/VRAM (Quantized) | Coding Strength | License |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Coder-Next | Alibaba | 80B | 3B | MoE | 256k 11 | ~46GB (Q4) | Strong agentic coding, tool use | Apache 2.0 |
| Qwen3-Coder 480B | Alibaba | 480B | 35B | MoE | 256k 11 | ~250GB | Frontier open-source coding | Apache 2.0 |
| Qwen3.5-397B | Alibaba | 397B | 17B | MoE | 262k 27 | Multi-GPU required | Multimodal reasoning + coding | Custom (Qwen) |
| DeepSeek V3.2 | DeepSeek | 685B | 37B | MoE | 128k 12 | Multi-GPU (4-5x H100) | SWE-bench 73.1%, strong refactoring | MIT |
| Llama 4 Scout | Meta | 109B | 17B | MoE (16 experts) | 10M (practical: 128-256k) 7 | ~24GB (Q4, single H100) | General purpose, multimodal | Llama 4 Community |
| Llama 4 Maverick | Meta | 400B | 17B | MoE (128 experts) | 1M 7 | H100 DGX system | General purpose, multimodal | Llama 4 Community |
| Kimi K2.5 | Moonshot AI | ~1T | 32B | MoE | 256k 28 | Multi-GPU, high-end | SWE-bench 76.8%, research-grade reasoning | Modified MIT |
| GLM-4.7 | Zhipu AI | ~355B | 32B | MoE | 200k 28 | Multi-GPU required | SWE-bench 73.8%, structured reasoning | Zhipu License |
| Mistral Large 3 | Mistral AI | 675B | 41B | MoE | 256k 13 | 8x H200 (FP8) | General purpose, multimodal, agentic | Apache 2.0 |
| Devstral 2 | Mistral AI | 123B | 123B (dense) | Dense | 256k 29 | 4x H100 minimum | SWE-bench 72.2%, code-agent focused | Modified MIT |
| Devstral Small 2 | Mistral AI | 24B | 24B (dense) | Dense | 256k 29 | Single GPU / consumer hardware | SWE-bench 68.0%, local-first coding | Apache 2.0 |
| GPT-OSS-120B | OpenAI | 117B | 5.1B | MoE | 128k 14 | ~80GB (single H100) | Reasoning, agentic tasks, tool use | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 21B | 3.6B | MoE | 128k 14 | ~16GB (MXFP4) | Reasoning, local/edge deployment | Apache 2.0 |
| Llama 3.1 | Meta | 405B / 70B / 8B | Same as total (dense) | Dense | 128k 15 | Varies (8B: ~6GB, 70B: ~48GB) | General purpose, widely supported | Llama 3.1 Community |
A few things to note. Most of the frontier open-weight models are MoE (Mixture-of-Experts), which means the total parameter count is much larger than what's active per token. This is how Qwen3-Coder-Next fits 80B parameters into ~46GB of RAM: only 3B parameters fire for each token. The tradeoff is that you still need to load all the weights into memory, even though only a fraction are used at inference time.
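The memory arithmetic behind that: at Q4, weights take roughly half a byte per parameter (an approximation; real GGUF file sizes vary with the quantization mix), and every parameter must be resident even though only the active experts run per token:

```python
# Why an 80B-total / 3B-active MoE still needs ~40+ GB: ALL expert
# weights must be loaded, even though only a few fire per token.
# bytes_per_param=0.5 approximates Q4 quantization -- an assumption,
# not an exact GGUF size.
def weights_gb(total_params_billions: float, bytes_per_param: float = 0.5) -> float:
    return total_params_billions * bytes_per_param  # 1e9 params * bytes / 1e9

print(weights_gb(80))  # 40.0 GB of weights, before KV cache and runtime overhead
print(weights_gb(3))   # 1.5 GB -- the active parameters alone would be tiny
```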
The context windows listed above are architectural ceilings. Your actual usable context is determined by the inference framework (see the previous table) and your available VRAM for the KV cache. A model that supports 256k tokens doesn't give you 256k tokens on a 24GB GPU.
For coding-specific local use, the practical sweet spot in early 2026 is Qwen3-Coder-Next (strong agentic coding, runs on consumer hardware with enough RAM), Devstral Small 2 (dense 24B, fits on a single GPU), or GPT-OSS-20B (fits on 16GB, solid reasoning). Everything above that requires datacenter hardware or multi-GPU setups.
- Effective Context: What developers actually get in practice, based on community reports and testing, not the marketed number.
- Context Strategy: How the tool decides what goes into the model's context window.
- At Limit: What happens when the context window fills up.
- Cost: Individual pricing as of early 2026. Team/enterprise tiers vary.
Context management is the new skill nobody taught you. The quality of code these tools produce isn't just a function of which model they're running. It's a function of what they're putting in the context window and what they're leaving out.
The tools that win aren't necessarily the ones with the biggest windows. They're the ones that are smartest about what they put in the window. Claude Code's agentic search, Cursor's RAG-based selection, Copilot CLI's checkpoint system. These are all different answers to the same fundamental question: given finite working memory, what should the model be paying attention to right now?
Understanding how your tools answer that question, not just how many tokens they advertise, is what separates a developer who's frustrated by AI "forgetting things" from one who knows exactly why it happened and how to prevent it.
Have a correction or something to add? This space is evolving fast. Context management strategies that are current today may be outdated in six months. Feedback welcome.
Footnotes

1. Cursor vs Claude Code vs Windsurf: Which One Handles Context Loss the Worst? — Real-world testing of compaction, tab-based context loss, and silent truncation across tools.
2. Optimizing the Ollama Context Window — Ollama's 4,096-token default and how to increase it via systemd, modelfiles, and CLI.
3. Cursor vs Claude Code vs Windsurf: The Honest Comparison — Effective context numbers from hands-on testing: 60-80k Cursor, 50-70k Windsurf, 150k+ Claude Code. Also covers the 23-file authentication migration.
4. LLM Context Windows: What They Are & How They Work — Redis blog covering tokenization, transformer architecture, quadratic attention scaling, and KV cache mechanics.
5. What Is a Context Window? — IBM's overview of how context windows work, including shared token budgets across prompts, history, and responses.
6. How I Use Every Claude Code Feature — Shrivu Shankar's walkthrough covering `/context`, the ~20k baseline in monorepos, subagents, and CLAUDE.md strategies.
7. The Llama 4 Herd — Meta's announcement: Scout (109B/17B active, 10M context, 16 experts), Maverick (400B/17B active, 1M context, 128 experts), MoE architecture, 30T+ training tokens.
8. Claude Code vs Cursor: What to Choose in 2026 — Builder.io's comparison covering Claude's 200k reliable delivery and 1M beta on Opus 4.6.
9. Why can't we fully utilize context_window? — GitHub community discussion on Copilot API context windows, including GPT-5.3 Codex's 400k window and the gap between advertised and usable limits.
10. Google Gemini Context Window: Token Limits and Workflow Strategies — Gemini 3 Pro's 1M default, Gemini 3 Flash's 200k, and consumer-facing limits.
11. Qwen3-Coder-Next on Hugging Face — Official model card: 80B total / 3B active MoE, 256k native context (extendable to 1M via YaRN), Apache 2.0 license, agentic coding focus.
12. DeepSeek-V3.2 Technical Report — Architecture details: 685B total / 37B active MoE with DeepSeek Sparse Attention, 128k context, MIT license.
13. Mistral Large 3 on Hugging Face — Official model card: 675B total / 41B active MoE, 256k context, multimodal (2.5B vision encoder), Apache 2.0 license.
14. Introducing GPT-OSS — OpenAI's open-weight models: gpt-oss-120b (117B total / 5.1B active) and gpt-oss-20b (21B total / 3.6B active), 128k context, MoE with alternating sparse attention, Apache 2.0.
15. Best LLMs for Extended Context Windows in 2026 — AIMultiple's comparison of context window sizes across major models, including Llama 3.1 at 128k.
16. Context Window: What It Is and Why It Matters for AI Agents — Comet's deep dive on the "lost in the middle" problem and why models miss information at mid-context positions.
17. Running a Local LLM for Code Assistance — Benchmarks comparing local model inference (110-598s) against cloud tools (52-56s).
18. Testing AI Coding Agents: Cursor vs Claude, OpenAI, and Gemini — Render blog noting Cursor's RAG-like system on the local filesystem for gathering codebase context.
19. Claude Code vs Cursor: Deep Comparison for Dev Teams — Qodo's breakdown of Cursor's 20k default chat sessions, 10k inline commands, Max Mode tradeoffs, silent truncation, and Claude Code's reliable 200k.
20. Claude Code vs Cursor: The Honest Comparison — Developer reports of Cursor's effective context falling to 70k-120k despite 200k advertising.
21. Claude Code vs Cursor: A Power-User's Playbook — Arize AI's comparison covering Claude Code's agentic context assembly vs Cursor's embeddings index.
22. Cursor vs Claude Code — Data Science Collective piece on CLAUDE.md as institutional memory, project conventions, and persistent context.
23. Cursor vs Claude Code 2026: Which AI Coding Tool Wins? — AI Tool VS comparison citing the 1M token beta on Opus 4.6 and 76% on the MRCR v2 benchmark.
24. Gemini CLI Tutorial: Understanding Context, Memory and Conversational Branching — Romin Irani's guide to Gemini CLI's 1M token window, `/compress`, `/clear`, and session management.
25. GitHub Copilot CLI: Enhanced Agents, Context Management, and New Ways to Install — Official GitHub changelog covering auto-compaction at 95%, `/compact`, `/context`, and checkpoint files.
26. Context Awareness for Windsurf — Official Windsurf docs on the RAG-based indexing engine, M-Query retrieval techniques, and context scoping.
27. Qwen3.5-397B-A17B on Hugging Face — Official model card: 397B total / 17B active MoE, 262k default context, multimodal reasoning, native tool use.
28. Kimi K2.5 vs GLM-4.7 Comparison — Technical comparison: Kimi K2.5 (~1T params, 32B active, 256k context), GLM-4.7 (~355B params, 32B active, 200k context), architecture and benchmark details.
29. Introducing Devstral 2 and Mistral Vibe CLI — Devstral 2 (123B dense, 256k context, SWE-bench 72.2%) and Devstral Small 2 (24B dense, 256k context, SWE-bench 68.0%, Apache 2.0).