Model → Inference Server → Client
(Local vs SaaS)
Most developers use AI coding tools every day without understanding what's happening to their context under the hood. Here's what's actually going on and why the same prompt behaves differently depending on where you run it.
Disclaimer: Written with AI — Claude Opus 4.6 via claude.ai alongside human feedback.
- The Gap Nobody Has Filled
- What Is Context, Really?
- Layer 0: The Model Itself
- Claude Web: You're a Thin Client
- Local: You Own the Whole Stack
- Cursor: An IDE, Not an Inference Engine
- Claude Code: Agentic Context Management
- What Happens at the Limit: A Direct Comparison
- Practical Takeaways
- The Full Landscape: Every Tool Compared
- Conclusion
- Sources
You're probably using at least one AI coding tool daily. Maybe Cursor. Maybe Claude Code. Maybe you've got a local model running through Ollama on the side. All of these tools talk about "context windows," and most developers have a rough sense that bigger is better.
But no two of these tools handle context the same way. The advertised token count on a marketing page tells you almost nothing about how much of your codebase the tool can actually reason about at any given moment. Three layers sit between your prompt and the model's response: the model itself, the inference server, and the client. Most developers only interact with the client and never think about the other two.
| | Model | Inference Server | Client |
|---|---|---|---|
| Local | Model (e.g. Llama 3.1) | Ollama / llama.cpp / vLLM | Cursor / terminal |
| SaaS | Model (e.g. Opus 4.6) | vLLM / SGLang / TensorRT-LLM / custom | claude.ai / Cursor |
A note on terminology: the middle layer is often called the "inference server," but it actually contains two parts. The inference engine (vLLM, llama.cpp, TensorRT-LLM, SGLang) is the core software that loads model weights into GPU memory, computes attention, and manages the KV (key-value) cache, which stores previously computed token data so the model doesn't have to reprocess the entire context from scratch for each new token. The inference server is the API wrapper that exposes the engine as an HTTP endpoint so clients can talk to it.

Sometimes these are the same tool (vLLM includes its own OpenAI-compatible server), sometimes they're separate (TensorRT-LLM is often deployed behind NVIDIA's Triton Inference Server). Ollama is a good example: it's the server you interact with at localhost:11434, but under the hood it uses llama.cpp as its inference engine.

For most of this post, the distinction doesn't matter. What matters is that this layer is where your context settings live, whether you're running locally or hitting a cloud API.
This post breaks down that chain. By the end, you'll understand why Cursor can silently forget code it wrote five minutes ago 1, why your local Ollama model might be running at 2% of its capacity 2, and why Claude Code can refactor 23 files in one shot 3 while other tools choke at 10.
The context window is the model's working memory. It's the total amount of text, measured in tokens, that the model can "see" at once when generating a response.
A token is roughly 4 characters or three-quarters of a word 4. A 200,000-token context window translates to approximately 150,000 words, or about 500 pages of text.
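Those conversions are simple arithmetic under the rule of thumb above (the words-per-page figure is an assumption for a typical manuscript page, not from any spec):

```python
# Back-of-the-envelope token math, assuming ~4 chars/token and
# ~0.75 words/token -- both rough averages for English text.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 300  # assumption: a typical manuscript page

def window_in_words(tokens: int) -> int:
    return int(tokens * WORDS_PER_TOKEN)

def window_in_pages(tokens: int) -> int:
    return window_in_words(tokens) // WORDS_PER_PAGE

tokens = 200_000
print(window_in_words(tokens))  # 150000 words
print(window_in_pages(tokens))  # 500 pages
```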
Here's the part most developers miss: the context window is shared. The model doesn't get 200k tokens for your message. It gets 200k tokens total, and that budget is split across everything 5:
- The system prompt (instructions telling the model how to behave)
- Tool definitions (function schemas for code execution, file reading, web search, etc.)
- Conversation history (every message you and the model have exchanged)
- File contents (any code or documents pulled into the session)
- The model's own responses
By the time you type your first message in a fresh session, a significant chunk of that window is already consumed. In a tool like Claude Code working in a monorepo, the baseline overhead can be around 20,000 tokens. That's 10% of the window gone before you've asked a single question 6.
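The arithmetic of that baseline is worth spelling out. The line items below are illustrative assumptions chosen to sum to the ~20k figure; real overheads vary by tool and repository:

```python
# Illustrative budget accounting for a fresh agent session.
# Every line item is an assumed number, not a measured one.
WINDOW = 200_000

baseline = {
    "system_prompt": 3_000,      # behavioral instructions
    "tool_definitions": 12_000,  # function schemas for file I/O, search, etc.
    "project_memory": 5_000,     # e.g. a CLAUDE.md-style file
}

used = sum(baseline.values())
remaining = WINDOW - used
print(f"{used} tokens ({used / WINDOW:.0%}) consumed before your first message")
print(f"{remaining} tokens left for conversation, files, and responses")
```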
Every model has a trained context ceiling. This is a hard architectural limit baked in during training:
- Llama 4 Scout: 10M tokens (practical: 128-256k) 7
- Llama 4 Maverick: 1M tokens 7
- Claude Opus 4.6: 200k tokens (1M in beta) 8
- Claude Sonnet 4.6: 200k tokens (1M available) 8
- GPT-5.3 Codex: up to 400k tokens 9
- Gemini 3 Pro: 1M tokens 10
- Gemini 3 Flash: 200k tokens 10
- Qwen3-Coder-Next: 256k tokens 11
- DeepSeek V3.2: 128k tokens 12
- Mistral Large 3: 256k tokens 13
- GPT-OSS-120B: 128k tokens 14
- Llama 3.1: 128k tokens 15
You cannot exceed this ceiling without architectural tricks like RoPE scaling or YaRN, and even then, quality degrades. Bigger windows also aren't free: the transformer attention mechanism scales quadratically with sequence length, which means doubling the context roughly quadruples the compute cost 4.
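The quadratic claim follows directly from how naive self-attention works: every token is compared against every other token, so a proportional cost model is enough to show the effect:

```python
# Naive self-attention compares every token with every other token,
# so compute grows with the square of sequence length.
def attention_cost(n_tokens: int) -> int:
    return n_tokens ** 2  # pairwise token comparisons (proportional cost)

base = attention_cost(100_000)
doubled = attention_cost(200_000)
print(f"doubling context multiplies attention compute by {doubled / base:.0f}x")
```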
There's also the "lost in the middle" problem. Research has consistently shown that models are better at recalling information from the beginning and end of their context window 16. Details buried in the middle are more likely to be missed. A 1M token window doesn't work like 1M tokens of perfect memory.
When you use claude.ai, your computer does almost nothing. It sends your text over HTTPS to Anthropic's servers, receives a stream of tokens back, and renders them in the browser. That's it.
Models like Claude Opus 4.6 have hundreds of billions of parameters and require specialized GPU clusters to run inference. Your browser tab is almost entirely JavaScript application code, React components, and DOM overhead. The raw text of even a very long conversation is a few hundred KB at most.
Extended thinking? Also entirely server-side. When Claude "thinks" for 30 seconds before responding, your machine is just waiting for the stream to start. No extra CPU, no extra memory.
In Claude web, the context is assembled from your conversation history: system prompt, every message you've sent, every response Claude has generated, and any files you've uploaded. There's no codebase indexing, no file-tree awareness, no agentic search. You get exactly what you put in.
As conversations get long, older messages are truncated to make room. You'll notice this when Claude "forgets" something you discussed an hour ago. There's no compaction or summarization. The early messages simply fall out of the window.
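A minimal sketch of that truncation behavior, using a word count as a stand-in tokenizer (real systems count actual tokens, and providers' exact policies are not public):

```python
# Chat-style truncation: when history exceeds the window, the oldest
# messages fall out first. Newest messages are kept intact.
def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def fit_to_window(messages: list[str], window: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > window:
            break                    # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["we chose Zustand for state", "styling question", "new feature ask"]
print(fit_to_window(history, window=6))
# The earliest message -- the Zustand decision -- is the first to go.
```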
For most chat-based usage, this is fine. For multi-session coding projects, it means you need to re-establish context at the start of each conversation. Summarizing key decisions in your opening prompt becomes a practical necessity.
Running a model locally through Ollama, llama.cpp, vLLM, or LM Studio means you control every layer of the stack. That's powerful. It's also where the most common context misconfigurations happen.
1. The model's trained ceiling. A Llama 3.1 model supports up to 128k tokens 15. Qwen3-Coder supports up to 256k 11. This is the theoretical maximum.
2. Your hardware. Every token in the context window requires memory for the KV cache. This is separate from the memory needed to hold the model weights themselves. A 7B parameter model at Q4 quantization might need 4-5GB for the weights, but a 32k context window on top of that can add another 4-8GB depending on the model architecture. If you're running on a GPU with 24GB of VRAM, the model weights plus KV cache for a large context might not fit, forcing you to either reduce context or offload to slower system RAM.
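A rough KV-cache estimate falls out of the model architecture. The formula below is the standard one (a key plus a value vector, per layer, per KV head, per token); the example numbers assume a Llama-2-7B-like config with full multi-head attention at fp16, so models using grouped-query attention (fewer KV heads) need proportionally less:

```python
# KV-cache memory estimate: each token stores one key and one value
# vector per layer. bytes_per_val=2 corresponds to fp16.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val

# Assumed architecture: 32 layers, 32 KV heads, head_dim 128 (7B-class MHA)
gib = kv_cache_bytes(32, 32, 128, 32_768) / 2**30
print(f"~{gib:.0f} GiB of KV cache for a 32k context")  # 16 GiB at fp16
```

With grouped-query attention (say, 8 KV heads instead of 32), the same 32k context drops to a quarter of that, which is how modern models land in the 4-8GB range mentioned above.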
3. Your inference framework settings. This is the most commonly missed piece. Ollama defaults to a context window of 4,096 tokens, regardless of what the model actually supports 2. If you downloaded a model capable of 128k tokens and never changed the OLLAMA_CONTEXT_LENGTH environment variable, you've been running at roughly 3% of its capacity.
In llama.cpp, the equivalent setting is n_ctx. In vLLM, it's max_model_len. Each framework has its own default, and none of them automatically use the model's full capacity because doing so would consume too much memory on most consumer hardware.
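Ollama also accepts a per-request context size through the `num_ctx` option in its HTTP API, which is useful when you don't want to change the global default. This sketch only builds the JSON payload you would POST to `http://localhost:11434/api/generate`; no request is actually sent:

```python
import json

# Build an Ollama /api/generate payload with a per-request context size.
# options.num_ctx overrides the 4,096-token default for this call only.
def ollama_payload(model: str, prompt: str, num_ctx: int = 32_768) -> str:
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx},
    })

payload = ollama_payload("llama3.1", "Explain this function")
print(payload)
```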
Local inference gives you complete privacy, zero per-token cost, and total control. The tradeoff is significantly smaller practical context windows, slower inference (often 5-50x slower than cloud APIs 17), and models that are less capable than the frontier cloud options.
Hard cutoff. The model stops seeing earlier tokens with no graceful degradation. There's no summarization, no compaction, no warning. Your prompt simply gets truncated from the beginning, and the model responds based on whatever remains in the window.
This is the most misunderstood piece of the stack, and it came up in a conversation I had recently that crystallized the issue.
Cursor is not running your model. Cursor is a VS Code fork, a code editor with AI features layered on top. When you use Claude, GPT, or Gemini inside Cursor, the model is running on Anthropic's, OpenAI's, or Google's servers. Cursor is deciding what to send to those servers.
This distinction matters enormously because Cursor makes aggressive decisions about context that most developers never see.
Cursor uses a RAG-like system with embeddings to index your codebase and select which files are relevant to your current prompt 18. It doesn't send your entire project to the model. It selects what it thinks matters, assembles a prompt, and sends that.
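The selection step can be pictured as a greedy packing problem: rank candidate files by relevance and add them until the budget runs out. All scores, paths, and token costs below are invented for illustration; Cursor's actual scoring (embedding similarity, recency, open tabs) is not public:

```python
# Toy sketch of budgeted context assembly: rank files by a relevance
# score (in a real system, embedding similarity to the prompt) and pack
# the best-scoring ones until the token budget is exhausted.
def select_files(files: dict[str, tuple[float, int]], budget: int) -> list[str]:
    ranked = sorted(files, key=lambda f: files[f][0], reverse=True)
    chosen, used = [], 0
    for name in ranked:
        cost = files[name][1]
        if used + cost <= budget:
            chosen.append(name)
            used += cost
        # files that don't fit are skipped silently -- the model never sees them
    return chosen

repo = {  # path -> (relevance_score, token_cost); all values invented
    "auth/login.ts": (0.92, 8_000),
    "auth/session.ts": (0.88, 7_000),
    "auth/tokens.ts": (0.85, 9_000),
    "utils/dates.ts": (0.30, 4_000),
}
print(select_files(repo, budget=20_000))
```

Note the failure mode baked into the greedy loop: a highly relevant file (`auth/tokens.ts`) can be dropped simply because it doesn't fit, with no error raised.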
The default context budget for a chat session is approximately 20,000 tokens. Inline commands (Cmd-K) get around 10,000 tokens 19. That's a fraction of the underlying model's capacity.
"Max Mode" extends the window to the model's full capacity: up to 200k tokens for Claude, potentially 1M for Gemini. But even in Max Mode, developers consistently report that the effective usable context falls between 70k and 120k tokens 20. Cursor applies internal truncation and performance safeguards that silently reduce what actually reaches the model 19.
This is Cursor's most consequential behavior. When context gets too large, Cursor doesn't error out. It doesn't tell you it's dropping files. It silently deprioritizes and removes older or less relevant content to maintain responsiveness and manage API costs 19.
The practical result: you ask Cursor to modify a component, it does so, you switch tabs to work on something else, come back, and Cursor has no memory of the component structure you just designed together. Each tab maintains its own context, and switching between them can mean starting from scratch 1.
If you want to run a local model inside Cursor, the architecture looks like this:
You → Cursor (IDE) → Local inference server (Ollama/llama.cpp) → Your model
Cursor points at a local OpenAI-compatible API endpoint instead of a cloud API. But the context settings (n_ctx, VRAM allocation, all of it) live in the inference server, not in Cursor. Cursor is still just the client, deciding what to send. The inference server determines how much it can receive.
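Concretely, "pointing the IDE at a local endpoint" means the client issues a standard OpenAI-style chat completion request, and only the base URL differs from a cloud provider. The sketch below just constructs that request against Ollama's OpenAI-compatible `/v1` surface (llama.cpp's server and vLLM expose the same shape); the model name is a placeholder for whatever your server has loaded:

```python
import json

# Only the base URL distinguishes a local server from a cloud API here.
BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

request = {
    "url": f"{BASE_URL}/chat/completions",
    "body": json.dumps({
        "model": "qwen2.5-coder",  # placeholder: whatever the server has loaded
        "messages": [{"role": "user", "content": "Refactor this function"}],
    }),
}
print(request["url"])
```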
Claude Code is a CLI-first tool that runs in your terminal. It takes a fundamentally different approach to context than any IDE-based tool.
Unlike Cursor, which maintains a static embeddings index of your codebase, Claude Code doesn't pre-index anything. Instead, it uses agentic search: when you give it a task, it actively explores your repository at runtime 21. It reads files, follows import chains, greps for references, runs tests, and builds its understanding dynamically.
This means Claude Code's context is always fresh. It's never working from a stale index. But it also means each task starts with an exploration phase that consumes tokens as the agent reads files to understand your project.
One of Claude Code's most distinctive features is CLAUDE.md, a markdown file that lives in your repo root and acts as institutional memory 22. It stores project conventions, architecture decisions, directory structure, key patterns, and anything else the agent should know at the start of every session.
This is loaded into context automatically when Claude Code starts. It's the one piece of context that persists across sessions. Everything else (conversation history, file reads, tool results) is ephemeral.
Claude Code consistently delivers the advertised 200k token context window 19. With Claude Opus 4.6, there's a 1M token beta that scores 76% on the MRCR v2 long-context retrieval benchmark 23. For large codebases, this changes what's possible in a single session.
You can check your current consumption with the /context command, which shows how much of the 200k window you've used and what's filling it 6. A fresh session in a monorepo might start with a ~20k token baseline, leaving 180k for your actual work, which can still fill up fast during complex multi-file refactors.
When conversations get long, Claude Code uses compaction, automatic server-side summarization of earlier conversation turns 1. This is what enables effectively unlimited session length, but it comes with a cost.
Compacted context is a summary, not the original. Details get lost. The most common symptom: Claude Code starts making suggestions that contradict decisions you made an hour ago. It might recommend adding Redux for state management when you explicitly told it you're using Zustand, because that decision was in a part of the conversation that got compacted away 1.
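The mechanics can be sketched in a few lines. The summarizer here is a trivial stand-in; real tools ask the model itself to produce the summary, and that lossy summary is exactly where details like the Zustand decision disappear:

```python
# Minimal compaction sketch: older turns are replaced by a summary,
# recent turns are kept verbatim.
def summarize(turns: list[str]) -> str:
    return f"[summary of {len(turns)} earlier turns]"  # detail is lost here

def compact(turns: list[str], keep_recent: int = 2) -> list[str]:
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent

history = ["use Zustand, not Redux", "built the store", "added tests", "now refactor"]
print(compact(history))
# The Zustand decision now survives only inside the lossy summary.
```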
Claude Code can spin up child agents (subagents), each with their own context window 6. This is a context management strategy: the main agent stays clean and focused while subagents handle specific tasks like running tests, reviewing security, and scanning for patterns. The results are returned as summaries, keeping the main agent's window from filling with raw output.
Each tool handles context exhaustion differently, and the differences have real consequences:
Claude Web (claude.ai + Anthropic's inference server) — Older messages silently drop out. The model keeps responding but loses awareness of early conversation context. No warning, no summarization.
Local (e.g. Ollama + Cursor) — Behavior depends on the inference server. Ollama truncates older conversation turns to fit the window. llama.cpp's server may error out if you exceed n_ctx. The client has no say in this — it's the inference server deciding what gets cut.
Cursor (Cursor IDE + cloud inference server) — Silent truncation. Files get deprioritized, older context gets dropped, and you receive no notification 19. The model keeps generating, but based on incomplete information. This is the sneakiest failure mode because the output looks confident.
Claude Code (CLI + Anthropic's inference server) — Compaction kicks in. Earlier turns are summarized to free space 1. The session continues, but the summary may lose important details. You can at least monitor this with /context.
Gemini CLI (CLI + Google's inference server) — Similar to Claude Code. Uses a 1M token window with /compress for manual summarization and /clear for hard resets 24. Warns when approaching the limit.
GitHub Copilot CLI (CLI + Microsoft/GitHub's inference server) — Auto-compaction at 95% of the token limit 25. Creates checkpoint files so you can rewind to pre-compaction state. Uses /compact for manual compression and /context for monitoring.
If you're using Cursor: Accept that you're working with 60-120k effective tokens, not 200k 20. Turn on Max Mode for large refactors. Be aware that tab switching resets your context 1. Keep critical decisions in .cursorrules or AGENTS.md files that get loaded automatically.
If you're using Claude Code: Use /context regularly to monitor consumption 6. Start fresh sessions for distinct tasks rather than letting compaction degrade your context. Invest time in your CLAUDE.md. It's the one thing that persists and directly improves every session 22.
If you're running local models: Check your actual context setting right now. If you're on Ollama and never changed it 2, run:
```bash
# Check current context
ollama ps

# Set a larger default
# In /etc/systemd/system/ollama.service.d/override.conf:
# Environment="OLLAMA_CONTEXT_LENGTH=32768"
```

Even going from 4,096 to 16,384 or 32,768 tokens will dramatically improve your experience.
If you're using Claude web: Recognize that very long conversations will lose early context. For multi-session projects, start each new conversation with a summary of prior decisions.
In general: Context is a shared, finite resource. System prompts, tool definitions, and file contents eat into it before you type a word. The skill isn't finding the tool with the biggest window. It's being intentional about what goes into the window you have.
| Tool | Provider | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Cost |
|---|---|---|---|---|---|---|
| claude.ai | Anthropic | 200k tokens | ~200k | Conversation history only | Older messages drop | Free / $20 Pro / $100-200 Max |
| ChatGPT | OpenAI | 128k tokens (GPT-5.3) | ~128k | Conversation + uploaded files | Older messages drop | Free / $20 Plus / $200 Pro |
| Gemini | Google | 1M tokens (Gemini 3 Pro) 10 | ~1M (consumer often 128-200k) | Conversation + uploaded files | Performance degrades | Free / $20 AI Premium / $250 Ultra |
| Grok | xAI | 128k tokens | ~128k | Conversation history | Older messages drop | Free / Premium subscription |
| Tool | Type | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Models | Cost |
|---|---|---|---|---|---|---|---|
| Cursor | VS Code fork | 200k (Max Mode) | 60-120k 20 | RAG + embeddings + silent truncation 18 | Silent file dropping 19 | Claude, GPT, Gemini, Composer | Free / $20 Pro / $40 Ultra |
| Windsurf | Standalone IDE | Model-dependent | 50-100k 3 | RAG (M-Query) + Cascade indexing 26 | Context drops without warning | Claude, GPT, Gemini, SWE-1 | Free / $15 Pro / $60 Teams |
| GitHub Copilot (IDE) | VS Code / JetBrains extension | 64-128k (model-dependent) 9 | ~64k typical | Codebase search + file context | Truncation, yellow warnings | GPT-5.x, Claude, Gemini | Free / $10 Pro / $39 Enterprise |
| Zed | Standalone editor (Rust) | Model-dependent | Varies | Direct model integration | Model-dependent | Claude, GPT, Gemini, local | Free (open source) + API costs |
| Cline | VS Code extension | Model-dependent | Varies | Full file reading, agentic | Model-dependent | Any (bring your own API key) | Free (open source) + API costs |
| Tool | Provider | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Models | Cost |
|---|---|---|---|---|---|---|---|
| Claude Code | Anthropic | 200k (1M beta on Opus 4.6) 23 | ~150-200k 3 | Agentic search + CLAUDE.md 21 | Compaction (summarization) 1 | Claude only | $20 Pro / $100-200 Max |
| GitHub Copilot CLI | GitHub / Microsoft | Model-dependent (64-128k) | ~64-128k | Auto-compaction at 95%, checkpoints 25 | Compaction + checkpoint rewind | GPT-5.x, Claude, Gemini | $10 Pro / $39 Enterprise |
| Gemini CLI | Google | 1M tokens 24 | ~1M | Full codebase awareness + /compress | Warning + manual /compress | Gemini 2.5/3 Pro | Free (with Google account) |
| Codex CLI | OpenAI | Up to 400k (GPT-5.3 Codex) 9 | ~200-400k | Agentic, file reading | Session-based | GPT-5.x Codex variants | Included with ChatGPT Plus/Pro |
| OpenCode | Anomaly Innovations | Model-dependent | Model-dependent | Depends on provider | Provider-dependent | 75+ providers incl. local | Free (open source) + API costs |
| Aider | Open source | Model-dependent | Varies | Git-aware, repo map | Model-dependent | Any (bring your own API key) | Free (open source) + API costs |
These are the models running on provider infrastructure when you use cloud APIs, IDEs like Cursor, or CLI tools like Claude Code. You don't control the inference server. Context window and max output are set by the provider.
| Model | Provider | Context Window | Max Output | Coding Strength | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 200k (1M beta) 8 | 128k | SWE-bench 80.8%, Terminal-Bench SOTA | Adaptive thinking, fastest frontier reasoning |
| Claude Sonnet 4.6 | Anthropic | 200k (1M beta) 8 | 64k | SWE-bench 79.6%, near-Opus performance | Best value frontier model, default in claude.ai |
| Claude Haiku 4.5 | Anthropic | 200k | 8k | Routing, classification, simple extraction | Lightweight, fast, good for task routing |
| GPT-5.2 | OpenAI | 400k 9 | 128k | LiveCodeBench 89%, strong multi-file reasoning | Thinking/Instant/Pro variants |
| GPT-5.3 Codex | OpenAI | 400k 9 | 128k | Optimized for Codex CLI workflows | Coding-tuned variant of GPT-5.x |
| Gemini 3 Pro | Google | 1M 10 | Varies | Strong multimodal, competitive SWE-bench | Best multimodal reasoning, largest standard window |
| Gemini 3 Flash | Google | 200k | Varies | Pro-grade reasoning at Flash speed | 3x faster than 2.5 Pro |
| Grok 4 | xAI | 128k | Varies | General purpose reasoning | Available via Grok subscription |
Max output matters for context because output tokens consume the shared window. A model with 400k context but 128k max output (GPT-5.2) can generate much longer responses per turn than one with 200k context and 8k max output (Haiku 4.5). For multi-file refactors, that ceiling determines how much code the model can write in a single pass.
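The shared-window arithmetic is a straight subtraction: whatever is reserved for the response is unavailable for input in the same turn, and the output cap bounds how much code can come back per pass. Numbers below are taken from the table above:

```python
# Tokens reserved for output come out of the same shared window.
def max_input_tokens(context_window: int, reserved_output: int) -> int:
    return context_window - reserved_output

print(max_input_tokens(400_000, 128_000))  # GPT-5.2-style budget: 272000 in
print(max_input_tokens(200_000, 8_000))    # Haiku-4.5-style budget: 192000 in,
# but the 8k output cap limits how much code fits in a single pass.
```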
| Tool | Type | Default Context | Max Context | Context Setting | Notes | Cost |
|---|---|---|---|---|---|---|
| Ollama | Model runner + API | 4,096 tokens 2 | Model-dependent (up to 256k+) | `OLLAMA_CONTEXT_LENGTH` env var | Most common misconfiguration; default is far below model capacity | Free |
| llama.cpp | Inference engine | Varies | Model-dependent | `--ctx-size` / `-c` flag | Most flexible, bare-metal control | Free |
| LM Studio | Desktop GUI | Auto-detected | Model-dependent | GUI slider for context length | Most user-friendly local option | Free |
| vLLM | Production server | Model-dependent | Model-dependent | `--max-model-len` | Optimized for throughput, paged attention | Free |
| text-generation-webui | Browser-based GUI | Varies | Model-dependent | UI parameter | Multiple backend support (GGUF, GPTQ, AWQ) | Free |
These are the models you'd run through the inference tools listed in the table above. Context window is the trained ceiling. Actual usable context depends on your hardware (VRAM/RAM for KV cache) and inference framework settings.
| Model | Provider | Total Params | Active Params | Arch | Context Window | Min RAM/VRAM (Quantized) | Coding Strength | License |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Coder-Next | Alibaba | 80B | 3B | MoE | 256k 11 | ~46GB (Q4) | Strong agentic coding, tool use | Apache 2.0 |
| Qwen3-Coder 480B | Alibaba | 480B | 35B | MoE | 256k 11 | ~250GB | Frontier open-source coding | Apache 2.0 |
| Qwen3.5-397B | Alibaba | 397B | 17B | MoE | 262k 27 | Multi-GPU required | Multimodal reasoning + coding | Custom (Qwen) |
| DeepSeek V3.2 | DeepSeek | 685B | 37B | MoE | 128k 12 | Multi-GPU (4-5x H100) | SWE-bench 73.1%, strong refactoring | MIT |
| Llama 4 Scout | Meta | 109B | 17B | MoE (16 experts) | 10M (practical: 128-256k) 7 | ~24GB (Q4, single H100) | General purpose, multimodal | Llama 4 Community |
| Llama 4 Maverick | Meta | 400B | 17B | MoE (128 experts) | 1M 7 | H100 DGX system | General purpose, multimodal | Llama 4 Community |
| Kimi K2.5 | Moonshot AI | ~1T | 32B | MoE | 256k 28 | Multi-GPU, high-end | SWE-bench 76.8%, research-grade reasoning | Modified MIT |
| GLM-4.7 | Zhipu AI | ~355B | 32B | MoE | 200k 28 | Multi-GPU required | SWE-bench 73.8%, structured reasoning | Zhipu License |
| Mistral Large 3 | Mistral AI | 675B | 41B | MoE | 256k 13 | 8x H200 (FP8) | General purpose, multimodal, agentic | Apache 2.0 |
| Devstral 2 | Mistral AI | 123B | 123B (dense) | Dense | 256k 29 | 4x H100 minimum | SWE-bench 72.2%, code-agent focused | Modified MIT |
| Devstral Small 2 | Mistral AI | 24B | 24B (dense) | Dense | 256k 29 | Single GPU / consumer hardware | SWE-bench 68.0%, local-first coding | Apache 2.0 |
| GPT-OSS-120B | OpenAI | 117B | 5.1B | MoE | 128k 14 | ~80GB (single H100) | Reasoning, agentic tasks, tool use | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 21B | 3.6B | MoE | 128k 14 | ~16GB (MXFP4) | Reasoning, local/edge deployment | Apache 2.0 |
| Llama 3.1 | Meta | 405B / 70B / 8B | Same as total (dense) | Dense | 128k 15 | Varies (8B: ~6GB, 70B: ~48GB) | General purpose, widely supported | Llama 3.1 Community |
A few things to note. Most of the frontier open-weight models are MoE (Mixture-of-Experts), which means the total parameter count is much larger than what's active per token. This is how Qwen3-Coder-Next fits 80B parameters into ~46GB of RAM: only 3B parameters fire for each token. The tradeoff is that you still need to load all the weights into memory, even though only a fraction are used at inference time.
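The memory arithmetic behind that: at Q4, weights take roughly half a byte per parameter (an approximation; real GGUF file sizes vary with the quantization mix), and every parameter must be resident even though only the active experts run per token:

```python
# Why an 80B-total / 3B-active MoE still needs ~40+ GB: ALL expert
# weights must be loaded, even though only a few fire per token.
# bytes_per_param=0.5 approximates Q4 quantization -- an assumption,
# not an exact GGUF size.
def weights_gb(total_params_billions: float, bytes_per_param: float = 0.5) -> float:
    return total_params_billions * bytes_per_param  # 1e9 params * bytes / 1e9

print(weights_gb(80))  # 40.0 GB of weights, before KV cache and runtime overhead
print(weights_gb(3))   # 1.5 GB -- the active parameters alone would be tiny
```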
The context windows listed above are architectural ceilings. Your actual usable context is determined by the inference framework (see the previous table) and your available VRAM for the KV cache. A model that supports 256k tokens doesn't give you 256k tokens on a 24GB GPU.
For coding-specific local use, the practical sweet spot in early 2026 is Qwen3-Coder-Next (strong agentic coding, runs on consumer hardware with enough RAM), Devstral Small 2 (dense 24B, fits on a single GPU), or GPT-OSS-20B (fits on 16GB, solid reasoning). Everything above that requires datacenter hardware or multi-GPU setups.
- Effective Context: What developers actually get in practice, based on community reports and testing, not the marketed number.
- Context Strategy: How the tool decides what goes into the model's context window.
- At Limit: What happens when the context window fills up.
- Cost: Individual pricing as of early 2026. Team/enterprise tiers vary.
Context management is the new skill nobody taught you. The quality of code these tools produce isn't just a function of which model they're running. It's a function of what they're putting in the context window and what they're leaving out.
The tools that win aren't necessarily the ones with the biggest windows. They're the ones that are smartest about what they put in the window. Claude Code's agentic search, Cursor's RAG-based selection, Copilot CLI's checkpoint system. These are all different answers to the same fundamental question: given finite working memory, what should the model be paying attention to right now?
Understanding how your tools answer that question, not just how many tokens they advertise, is what separates a developer who's frustrated by AI "forgetting things" from one who knows exactly why it happened and how to prevent it.
Have a correction or something to add? This space is evolving fast. Context management strategies that are current today may be outdated in six months. Feedback welcome.
Footnotes

1. Cursor vs Claude Code vs Windsurf: Which One Handles Context Loss the Worst? — Real-world testing of compaction, tab-based context loss, and silent truncation across tools.
2. Optimizing the Ollama Context Window — Ollama's 4,096-token default and how to increase it via systemd, modelfiles, and CLI.
3. Cursor vs Claude Code vs Windsurf: The Honest Comparison — Effective context numbers from hands-on testing: 60-80k Cursor, 50-70k Windsurf, 150k+ Claude Code. Also covers the 23-file authentication migration.
4. LLM Context Windows: What They Are & How They Work — Redis blog covering tokenization, transformer architecture, quadratic attention scaling, and KV cache mechanics.
5. What Is a Context Window? — IBM's overview of how context windows work, including shared token budgets across prompts, history, and responses.
6. How I Use Every Claude Code Feature — Shrivu Shankar's walkthrough covering `/context`, the ~20k baseline in monorepos, subagents, and CLAUDE.md strategies.
7. The Llama 4 Herd — Meta's announcement: Scout (109B/17B active, 10M context, 16 experts), Maverick (400B/17B active, 1M context, 128 experts), MoE architecture, 30T+ training tokens.
8. Claude Code vs Cursor: What to Choose in 2026 — Builder.io's comparison covering Claude's 200k reliable delivery and 1M beta on Opus 4.6.
9. Why can't we fully utilize context_window? — GitHub community discussion on Copilot API context windows, including GPT-5.3 Codex's 400k window and the gap between advertised and usable limits.
10. Google Gemini Context Window: Token Limits and Workflow Strategies — Gemini 3 Pro's 1M default, Gemini 3 Flash's 200k, and consumer-facing limits.
11. Qwen3-Coder-Next on Hugging Face — Official model card: 80B total / 3B active MoE, 256k native context (extendable to 1M via YaRN), Apache 2.0 license, agentic coding focus.
12. DeepSeek-V3.2 Technical Report — Architecture details: 685B total / 37B active MoE with DeepSeek Sparse Attention, 128k context, MIT license.
13. Mistral Large 3 on Hugging Face — Official model card: 675B total / 41B active MoE, 256k context, multimodal (2.5B vision encoder), Apache 2.0 license.
14. Introducing GPT-OSS — OpenAI's open-weight models: gpt-oss-120b (117B total / 5.1B active) and gpt-oss-20b (21B total / 3.6B active), 128k context, MoE with alternating sparse attention, Apache 2.0.
15. Best LLMs for Extended Context Windows in 2026 — AIMultiple's comparison of context window sizes across major models, including Llama 3.1 at 128k.
16. Context Window: What It Is and Why It Matters for AI Agents — Comet's deep dive on the "lost in the middle" problem and why models miss information at mid-context positions.
17. Running a Local LLM for Code Assistance — Benchmarks comparing local model inference (110-598s) against cloud tools (52-56s).
18. Testing AI Coding Agents: Cursor vs Claude, OpenAI, and Gemini — Render blog noting Cursor's RAG-like system on the local filesystem for gathering codebase context.
19. Claude Code vs Cursor: Deep Comparison for Dev Teams — Qodo's breakdown of Cursor's 20k default chat sessions, 10k inline commands, Max Mode tradeoffs, silent truncation, and Claude Code's reliable 200k.
20. Claude Code vs Cursor: The Honest Comparison — Developer reports of Cursor's effective context falling to 70k-120k despite 200k advertising.
21. Claude Code vs Cursor: A Power-User's Playbook — Arize AI's comparison covering Claude Code's agentic context assembly vs Cursor's embeddings index.
22. Cursor vs Claude Code — Data Science Collective piece on CLAUDE.md as institutional memory, project conventions, and persistent context.
23. Cursor vs Claude Code 2026: Which AI Coding Tool Wins? — AI Tool VS comparison citing the 1M token beta on Opus 4.6 and 76% on the MRCR v2 benchmark.
24. Gemini CLI Tutorial: Understanding Context, Memory and Conversational Branching — Romin Irani's guide to Gemini CLI's 1M token window, `/compress`, `/clear`, and session management.
25. GitHub Copilot CLI: Enhanced Agents, Context Management, and New Ways to Install — Official GitHub changelog covering auto-compaction at 95%, `/compact`, `/context`, and checkpoint files.
26. Context Awareness for Windsurf — Official Windsurf docs on the RAG-based indexing engine, M-Query retrieval techniques, and context scoping.
27. Qwen3.5-397B-A17B on Hugging Face — Official model card: 397B total / 17B active MoE, 262k default context, multimodal reasoning, native tool use.
28. Kimi K2.5 vs GLM-4.7 Comparison — Technical comparison: Kimi K2.5 (~1T params, 32B active, 256k context), GLM-4.7 (~355B params, 32B active, 200k context), architecture and benchmark details.
29. Introducing Devstral 2 and Mistral Vibe CLI — Devstral 2 (123B dense, 256k context, SWE-bench 72.2%) and Devstral Small 2 (24B dense, 256k context, SWE-bench 68.0%, Apache 2.0).