draft: how context actually works

How Context Actually Works:

Model → Inference Server → Client
(Local vs SaaS)

Most developers use AI coding tools every day without understanding what's happening to their context under the hood. Here's what's actually going on and why the same prompt behaves differently depending on where you run it.

Disclaimer: Written with AI — Claude Opus 4.6 via claude.ai alongside human feedback.


Contents

  1. The Gap Nobody Has Filled
  2. What Is Context, Really?
  3. Layer 0: The Model Itself
  4. Claude Web: You're a Thin Client
  5. Local: You Own the Whole Stack
  6. Cursor: An IDE, Not an Inference Engine
  7. Claude Code: Agentic Context Management
  8. What Happens at the Limit: A Direct Comparison
  9. Practical Takeaways
  10. The Full Landscape: Every Tool Compared
  11. Conclusion
  12. Sources

The Gap Nobody Has Filled

You're probably using at least one AI coding tool daily. Maybe Cursor. Maybe Claude Code. Maybe you've got a local model running through Ollama on the side. All of these tools talk about "context windows," and most developers have a rough sense that bigger is better.

But context doesn't work the same way across any of these tools. The advertised token count on a marketing page tells you almost nothing about how much of your codebase the tool can actually reason about at any given moment. Three layers sit between your prompt and the model's response: the model itself, the inference server, and the client. Most developers only interact with the client and never think about the other two.

| Model | Inference Server | Client |
|---|---|---|
| Local model (e.g. Llama 3.1) | Ollama / llama.cpp / vLLM | Cursor / terminal |
| SaaS model (e.g. Opus 4.6) | vLLM / SGLang / TensorRT-LLM / custom | claude.ai / Cursor |

A note on terminology: the middle layer is often called the "inference server," but it actually contains two parts. The inference engine (vLLM, llama.cpp, TensorRT-LLM, SGLang) is the core software that loads model weights into GPU memory, computes attention, and manages the KV (key-value) cache, which stores previously computed token data so the model doesn't have to reprocess the entire context from scratch for each new token. The inference server is the API wrapper that exposes the engine as an HTTP endpoint so clients can talk to it.

Sometimes these are the same tool (vLLM includes its own OpenAI-compatible server), sometimes they're separate (TensorRT-LLM is often deployed behind NVIDIA's Triton Inference Server). Ollama is a good example: it's the server you interact with at localhost:11434, but under the hood it uses llama.cpp as its inference engine.

For most of this post, the distinction doesn't matter. What matters is that this layer is where your context settings live, whether you're running locally or hitting a cloud API.

This post breaks down that chain. By the end, you'll understand why Cursor can silently forget code it wrote five minutes ago 1, why your local Ollama model might be running at 2% of its capacity 2, and why Claude Code can refactor 23 files in one shot 3 while other tools choke at 10.


What Is Context, Really?

The context window is the model's working memory. It's the total amount of text, measured in tokens, that the model can "see" at once when generating a response.

A token is roughly 4 characters or three-quarters of a word 4. A 200,000-token context window translates to approximately 150,000 words, or about 500 pages of text.

Here's the part most developers miss: the context window is shared. The model doesn't get 200k tokens for your message. It gets 200k tokens total, and that budget is split across everything 5:

  • The system prompt (instructions telling the model how to behave)
  • Tool definitions (function schemas for code execution, file reading, web search, etc.)
  • Conversation history (every message you and the model have exchanged)
  • File contents (any code or documents pulled into the session)
  • The model's own responses

By the time you type your first message in a fresh session, a significant chunk of that window is already consumed. In a tool like Claude Code working in a monorepo, the baseline overhead can be around 20,000 tokens. That's 10% of the window gone before you've asked a single question 6.
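To make the arithmetic concrete, here's a toy budget for a fresh session. The per-category numbers are invented for illustration; only the rough ~20k total mirrors the baseline figure above.

```python
# Toy context-budget arithmetic. The category breakdown is an illustrative
# assumption, not a measured value; only the rough total matches the post.
WINDOW = 200_000  # advertised window

overhead = {
    "system prompt": 3_000,
    "tool definitions": 12_000,
    "project memory / config": 5_000,
}

used = sum(overhead.values())
print(f"baseline: {used} tokens ({used / WINDOW:.0%} of the window)")
print(f"left for conversation, files, and responses: {WINDOW - used:,}")
```

The exact split varies by tool and project; the point is that the window is debited before your first keystroke.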


Layer 0: The Model Itself

Every model has a trained context ceiling. This is a hard architectural limit baked in during training:

  • Llama 4 Scout: 10M tokens (practical: 128-256k) 7
  • Llama 4 Maverick: 1M tokens 7
  • Claude Opus 4.6: 200k tokens (1M in beta) 8
  • Claude Sonnet 4.6: 200k tokens (1M available) 8
  • GPT-5.3 Codex: up to 400k tokens 9
  • Gemini 3 Pro: 1M tokens 10
  • Gemini 3 Flash: 200k tokens 10
  • Qwen3-Coder-Next: 256k tokens 11
  • DeepSeek V3.2: 128k tokens 12
  • Mistral Large 3: 256k tokens 13
  • GPT-OSS-120B: 128k tokens 14
  • Llama 3.1: 128k tokens 15

You cannot exceed this ceiling without architectural tricks like RoPE scaling or YaRN, and even then, quality degrades. Bigger windows also aren't free: the transformer attention mechanism scales quadratically with sequence length, which means doubling the context roughly quadruples the compute cost 4.
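A one-line sketch of that scaling, under the simplifying assumption that attention cost is exactly n²:

```python
# Simplified cost model: full self-attention touches every pair of tokens,
# so compute grows with the square of the sequence length.
def attention_cost(n_tokens: int) -> int:
    return n_tokens * n_tokens  # O(n^2) pairwise interactions

# Doubling the context (8k -> 16k) roughly quadruples the attention compute.
print(attention_cost(16_384) / attention_cost(8_192))  # 4.0
```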

There's also the "lost in the middle" problem. Research has consistently shown that models are better at recalling information from the beginning and end of their context window 16. Details buried in the middle are more likely to be missed. A 1M token window doesn't work like 1M tokens of perfect memory.


Claude Web: You're a Thin Client

When you use claude.ai, your computer does almost nothing. It sends your text over HTTPS to Anthropic's servers, receives a stream of tokens back, and renders them in the browser. That's it.

Models like Claude Opus 4.6 have hundreds of billions of parameters and require specialized GPU clusters to run inference. Your browser tab is almost entirely JavaScript application code, React components, and DOM overhead. The raw text of even a very long conversation is a few hundred KB at most.

Extended thinking? Also entirely server-side. When Claude "thinks" for 30 seconds before responding, your machine is just waiting for the stream to start. No extra CPU, no extra memory.

What fills the window

In Claude web, the context is assembled from your conversation history: system prompt, every message you've sent, every response Claude has generated, and any files you've uploaded. There's no codebase indexing, no file-tree awareness, no agentic search. You get exactly what you put in.

When you hit the limit

As conversations get long, older messages are truncated to make room. You'll notice this when Claude "forgets" something you discussed an hour ago. There's no compaction or summarization. The early messages simply fall out of the window.

For most chat-based usage, this is fine. For multi-session coding projects, it means you need to re-establish context at the start of each conversation. Summarizing key decisions in your opening prompt becomes a practical necessity.


Local: You Own the Whole Stack

Running a model locally through Ollama, llama.cpp, vLLM, or LM Studio means you control every layer of the stack. That's powerful. It's also where the most common context misconfigurations happen.

Three things determine your local context

1. The model's trained ceiling. A Llama 3.1 model supports up to 128k tokens 15. Qwen3-Coder supports up to 256k 11. This is the theoretical maximum.

2. Your hardware. Every token in the context window requires memory for the KV cache. This is separate from the memory needed to hold the model weights themselves. A 7B parameter model at Q4 quantization might need 4-5GB for the weights, but a 32k context window on top of that can add another 4-8GB depending on the model architecture. If you're running on a GPU with 24GB of VRAM, the model weights plus KV cache for a large context might not fit, forcing you to either reduce context or offload to slower system RAM.

3. Your inference framework settings. This is the most commonly missed piece. Ollama defaults to a context window of 4,096 tokens, regardless of what the model actually supports 2. If you downloaded a model capable of 128k tokens and never changed the OLLAMA_CONTEXT_LENGTH environment variable, you've been running at roughly 3% of its capacity.

In llama.cpp, the equivalent setting is n_ctx. In vLLM, it's max_model_len. Each framework has its own default, and none of them automatically use the model's full capacity because doing so would consume too much memory on most consumer hardware.
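The hardware point (2, above) can be roughed out with the standard KV-cache size formula. The default shape parameters below match Llama 3.1 8B's published config as I understand it (32 layers, 8 KV heads under grouped-query attention, head dimension 128); treat the result as an order-of-magnitude estimate, not a precise number.

```python
# Rough KV-cache size: key + value tensors per layer, per KV head, per token.
# Defaults approximate Llama 3.1 8B (GQA: 8 KV heads, head_dim 128, 32 layers).
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for the separate key and value tensors
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

At fp16, 32k of context adds about 4 GiB on top of the weights, which is why the full 128k window rarely fits alongside the model on a consumer GPU.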

The tradeoff

Local inference gives you complete privacy, zero per-token cost, and total control. The tradeoff is significantly smaller practical context windows, slower inference (often 5-50x slower than cloud APIs 17), and models that are less capable than the frontier cloud options.

When you hit the limit

Hard cutoff. The model stops seeing earlier tokens with no graceful degradation. There's no summarization, no compaction, no warning. Your prompt simply gets truncated from the beginning, and the model responds based on whatever remains in the window.
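For reference, here is the same knob in each of the frameworks mentioned above. Flag names are to the best of my knowledge current, but check your version's docs before relying on them:

```shell
# Ollama: environment variable, applies to models the server loads
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# llama.cpp: -c / --ctx-size on the server binary
llama-server -m model.gguf --ctx-size 32768

# vLLM: --max-model-len when launching the OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 32768
```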


Cursor: An IDE, Not an Inference Engine

This is the most misunderstood piece of the stack, and it came up in a conversation I had recently that crystallized the issue.

Cursor is not running your model. Cursor is a VS Code fork, a code editor with AI features layered on top. When you use Claude, GPT, or Gemini inside Cursor, the model is running on Anthropic's, OpenAI's, or Google's servers. Cursor is deciding what to send to those servers.

This distinction matters enormously because Cursor makes aggressive decisions about context that most developers never see.

How Cursor manages context

Cursor uses a RAG-like system with embeddings to index your codebase and select which files are relevant to your current prompt 18. It doesn't send your entire project to the model. It selects what it thinks matters, assembles a prompt, and sends that.

The default context budget for a chat session is approximately 20,000 tokens. Inline commands (Cmd-K) get around 10,000 tokens 19. That's a fraction of the underlying model's capacity.

"Max Mode" extends the window to the model's full capacity: up to 200k tokens for Claude, potentially 1M for Gemini. But even in Max Mode, developers consistently report that the effective usable context falls between 70k and 120k tokens 20. Cursor applies internal truncation and performance safeguards that silently reduce what actually reaches the model 19.

The silent truncation problem

This is Cursor's most consequential behavior. When context gets too large, Cursor doesn't error out. It doesn't tell you it's dropping files. It silently deprioritizes and removes older or less relevant content to maintain responsiveness and manage API costs 19.

The practical result: you ask Cursor to modify a component, it does so, you switch tabs to work on something else, come back, and Cursor has no memory of the component structure you just designed together. Each tab maintains its own context, and switching between them can mean starting from scratch 1.

Using local models through Cursor

If you want to run a local model inside Cursor, the architecture looks like this:

You → Cursor (IDE) → Local inference server (Ollama/llama.cpp) → Your model

Cursor points at a local OpenAI-compatible API endpoint instead of a cloud API. But the context settings (n_ctx, VRAM allocation, all of it) live in the inference server, not in Cursor. Cursor is still just the client, deciding what to send. The inference server determines how much it can receive.
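You can sanity-check that endpoint independently of Cursor. Assuming Ollama's default port and its OpenAI-compatible API, with a hypothetical model name:

```shell
# Hit the local OpenAI-compatible endpoint directly (model name is whatever
# you pulled; "llama3.1" here is just an example).
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain this function."}]
  }'
```

If this works but Cursor misbehaves, the problem is in the client layer, not your inference server.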


Claude Code: Agentic Context Management

Claude Code is a CLI-first tool that runs in your terminal. It takes a fundamentally different approach to context than any IDE-based tool.

No pre-built index

Unlike Cursor, which maintains a static embeddings index of your codebase, Claude Code doesn't pre-index anything. Instead, it uses agentic search: when you give it a task, it actively explores your repository at runtime 21. It reads files, follows import chains, greps for references, runs tests, and builds its understanding dynamically.

This means Claude Code's context is always fresh. It's never working from a stale index. But it also means each task starts with an exploration phase that consumes tokens as the agent reads files to understand your project.

CLAUDE.md: Persistent project memory

One of Claude Code's most distinctive features is CLAUDE.md, a markdown file that lives in your repo root and acts as institutional memory 22. It stores project conventions, architecture decisions, directory structure, key patterns, and anything else the agent should know at the start of every session.

This is loaded into context automatically when Claude Code starts. It's the one piece of context that persists across sessions. Everything else (conversation history, file reads, tool results) is ephemeral.
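What goes in a CLAUDE.md is up to you. Here's a hypothetical example (every detail below is invented) of the kind of decisions worth persisting:

```markdown
## Stack
- Next.js 14, TypeScript strict mode
- State: Zustand (do not suggest Redux; we migrated away from it)

## Conventions
- Tests live next to source files as *.test.ts
- Run `pnpm test` before declaring a task done

## Architecture
- All network calls go through src/lib/api.ts; never fetch directly
```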

The full 200k, reliably

Claude Code consistently delivers the advertised 200k token context window 19. With Claude Opus 4.6, there's a 1M token beta that scores 76% on the MRCR v2 long-context retrieval benchmark 23. For large codebases, this changes what's possible in a single session.

You can check your current consumption with the /context command, which shows how much of the 200k window you've used and what's filling it 6. A fresh session in a monorepo might start with a ~20k token baseline, leaving 180k for your actual work, which can still fill up fast during complex multi-file refactors.

Compaction: The double-edged sword

When conversations get long, Claude Code uses compaction, automatic server-side summarization of earlier conversation turns 1. This is what enables effectively unlimited session length, but it comes with a cost.

Compacted context is a summary, not the original. Details get lost. The most common symptom: Claude Code starts making suggestions that contradict decisions you made an hour ago. It might recommend adding Redux for state management when you explicitly told it you're using Zustand, because that decision was in a part of the conversation that got compacted away 1.
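A toy sketch of the mechanic (real tools use an LLM to write the summary; here it's a placeholder string):

```python
# Toy compaction: when the transcript exceeds the budget, replace the oldest
# turns with a lossy summary stub and keep only the most recent turns intact.
def compact(turns: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    if sum(len(t) for t in turns) <= budget:
        return turns
    summary = f"[summary of {len(turns) - keep_recent} earlier turns]"
    return [summary] + turns[-keep_recent:]

history = ["picked Zustand for state", "built the store", "now fix this bug"]
print(compact(history, budget=30))
```

After compaction, the Zustand decision survives only as whatever the summary happens to retain, which is exactly how contradictory suggestions creep in.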

Subagents: Delegated context

Claude Code can spin up child agents (subagents), each with their own context window 6. This is a context management strategy: the main agent stays clean and focused while subagents handle specific tasks like running tests, reviewing security, and scanning for patterns. The results are returned as summaries, keeping the main agent's window from filling with raw output.


What Happens at the Limit: A Direct Comparison

Each tool handles context exhaustion differently, and the differences have real consequences:

Claude Web (claude.ai + Anthropic's inference server) — Older messages silently drop out. The model keeps responding but loses awareness of early conversation context. No warning, no summarization.

Local (e.g. Ollama + Cursor) — Behavior depends on the inference server. Ollama truncates older conversation turns to fit the window. llama.cpp's server may error out if you exceed n_ctx. The client has no say in this — it's the inference server deciding what gets cut.

Cursor (Cursor IDE + cloud inference server) — Silent truncation. Files get deprioritized, older context gets dropped, and you receive no notification 19. The model keeps generating, but based on incomplete information. This is the sneakiest failure mode because the output looks confident.

Claude Code (CLI + Anthropic's inference server) — Compaction kicks in. Earlier turns are summarized to free space 1. The session continues, but the summary may lose important details. You can at least monitor this with /context.

Gemini CLI (CLI + Google's inference server) — Similar to Claude Code. Uses a 1M token window with /compress for manual summarization and /clear for hard resets 24. Warns when approaching the limit.

GitHub Copilot CLI (CLI + Microsoft/GitHub's inference server) — Auto-compaction at 95% of the token limit 25. Creates checkpoint files so you can rewind to pre-compaction state. Uses /compact for manual compression and /context for monitoring.


Practical Takeaways

If you're using Cursor: Accept that you're working with 60-120k effective tokens, not 200k 20. Turn on Max Mode for large refactors. Be aware that tab switching resets your context 1. Keep critical decisions in .cursorrules or AGENTS.md files that get loaded automatically.

If you're using Claude Code: Use /context regularly to monitor consumption 6. Start fresh sessions for distinct tasks rather than letting compaction degrade your context. Invest time in your CLAUDE.md. It's the one thing that persists and directly improves every session 22.

If you're running local models: Check your actual context setting right now. If you're on Ollama and never changed it 2, run:

```shell
# Check loaded models (recent Ollama versions also show each model's context size)
ollama ps

# Set a larger default, e.g. via systemd:
# in /etc/systemd/system/ollama.service.d/override.conf:
#   Environment="OLLAMA_CONTEXT_LENGTH=32768"
```

Even going from 4,096 to 16,384 or 32,768 tokens will dramatically improve your experience.

If you're using Claude web: Recognize that very long conversations will lose early context. For multi-session projects, start each new conversation with a summary of prior decisions.

In general: Context is a shared, finite resource. System prompts, tool definitions, and file contents eat into it before you type a word. The skill isn't finding the tool with the biggest window. It's being intentional about what goes into the window you have.


The Full Landscape: Every Tool Compared

Web Interfaces

| Tool | Provider | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Cost |
|---|---|---|---|---|---|---|
| claude.ai | Anthropic | 200k tokens | ~200k | Conversation history only | Older messages drop | Free / $20 Pro / $100-200 Max |
| ChatGPT | OpenAI | 128k tokens (GPT-5.3) | ~128k | Conversation + uploaded files | Older messages drop | Free / $20 Plus / $200 Pro |
| Gemini | Google | 1M tokens (Gemini 3 Pro) 10 | ~1M (consumer often 128-200k) | Conversation + uploaded files | Performance degrades | Free / $20 AI Premium / $250 Ultra |
| Grok | xAI | 128k tokens | ~128k | Conversation history | Older messages drop | Free / Premium subscription |

AI-Powered IDEs

| Tool | Type | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Models | Cost |
|---|---|---|---|---|---|---|---|
| Cursor | VS Code fork | 200k (Max Mode) | 60-120k 20 | RAG + embeddings + silent truncation 18 | Silent file dropping 19 | Claude, GPT, Gemini, Composer | Free / $20 Pro / $40 Ultra |
| Windsurf | Standalone IDE | Model-dependent | 50-100k 3 | RAG (M-Query) + Cascade indexing 26 | Context drops without warning | Claude, GPT, Gemini, SWE-1 | Free / $15 Pro / $60 Teams |
| GitHub Copilot (IDE) | VS Code / JetBrains extension | 64-128k (model-dependent) 9 | ~64k typical | Codebase search + file context | Truncation, yellow warnings | GPT-5.x, Claude, Gemini | Free / $10 Pro / $39 Enterprise |
| Zed | Standalone editor (Rust) | Model-dependent | Varies | Direct model integration | Model-dependent | Claude, GPT, Gemini, local | Free (open source) + API costs |
| Cline | VS Code extension | Model-dependent | Varies | Full file reading, agentic | Model-dependent | Any (bring your own API key) | Free (open source) + API costs |

CLI / Terminal Agents

| Tool | Provider | Max Context (Advertised) | Effective Context | Context Strategy | At Limit | Models | Cost |
|---|---|---|---|---|---|---|---|
| Claude Code | Anthropic | 200k (1M beta on Opus 4.6) 23 | ~150-200k 3 | Agentic search + CLAUDE.md 21 | Compaction (summarization) 1 | Claude only | $20 Pro / $100-200 Max |
| GitHub Copilot CLI | GitHub/Microsoft | Model-dependent (64-128k) | ~64-128k | Auto-compaction at 95%, checkpoints 25 | Compaction + checkpoint rewind | GPT-5.x, Claude, Gemini | $10 Pro / $39 Enterprise |
| Gemini CLI | Google | 1M tokens 24 | ~1M | Full codebase awareness + /compress | Warning + manual /compress | Gemini 2.5/3 Pro | Free (with Google account) |
| Codex CLI | OpenAI | Up to 400k (GPT-5.3 Codex) 9 | ~200-400k | Agentic, file reading | Session-based | GPT-5.x Codex variants | Included with ChatGPT Plus/Pro |
| OpenCode | Anomaly Innovations | Model-dependent | Model-dependent | Depends on provider | Provider-dependent | 75+ providers incl. local | Free (open source) + API costs |
| Aider | Open source | Model-dependent | Varies | Git-aware, repo map | Model-dependent | Any (bring your own API key) | Free (open source) + API costs |

API Usage (Model + Inference Server)

These are the models running on provider infrastructure when you use cloud APIs, IDEs like Cursor, or CLI tools like Claude Code. You don't control the inference server. Context window and max output are set by the provider.

| Model | Provider | Context Window | Max Output | Coding Strength | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 200k (1M beta) 8 | 128k | SWE-bench 80.8%, Terminal-Bench SOTA | Adaptive thinking, fastest frontier reasoning |
| Claude Sonnet 4.6 | Anthropic | 200k (1M beta) 8 | 64k | SWE-bench 79.6%, near-Opus performance | Best value frontier model, default in claude.ai |
| Claude Haiku 4.5 | Anthropic | 200k | 8k | Routing, classification, simple extraction | Lightweight, fast, good for task routing |
| GPT-5.2 | OpenAI | 400k 9 | 128k | LiveCodeBench 89%, strong multi-file reasoning | Thinking/Instant/Pro variants |
| GPT-5.3 Codex | OpenAI | 400k 9 | 128k | Optimized for Codex CLI workflows | Coding-tuned variant of GPT-5.x |
| Gemini 3 Pro | Google | 1M 10 | Varies | Strong multimodal, competitive SWE-bench | Best multimodal reasoning, largest standard window |
| Gemini 3 Flash | Google | 200k | Varies | Pro-grade reasoning at Flash speed | 3x faster than 2.5 Pro |
| Grok 4 | xAI | 128k | Varies | General purpose reasoning | Available via Grok subscription |

Max output matters for context because output tokens consume the shared window. A model with 400k context but 128k max output (GPT-5.2) can generate much longer responses per turn than one with 200k context and 8k max output (Haiku 4.5). For multi-file refactors, that ceiling determines how much code the model can write in a single pass.

Local Inference (Self-Hosted)

| Tool | Type | Default Context | Max Context | Context Setting | Notes | Cost |
|---|---|---|---|---|---|---|
| Ollama | Model runner + API | 4,096 tokens 2 | Model-dependent (up to 256k+) | OLLAMA_CONTEXT_LENGTH env var | Most common misconfiguration; default is far below model capacity | Free |
| llama.cpp | Inference engine | Varies | Model-dependent | --ctx-size / -c flag | Most flexible, bare-metal control | Free |
| LM Studio | Desktop GUI | Auto-detected | Model-dependent | GUI slider for context length | Most user-friendly local option | Free |
| vLLM | Production server | Model-dependent | Model-dependent | --max-model-len | Optimized for throughput, paged attention | Free |
| text-generation-webui | Browser-based GUI | Varies | Model-dependent | UI parameter | Multiple backend support (GGUF, GPTQ, AWQ) | Free |

Local / Open-Weight Models

These are the models you'd run through the inference tools listed in the table above. Context window is the trained ceiling. Actual usable context depends on your hardware (VRAM/RAM for KV cache) and inference framework settings.

| Model | Provider | Total Params | Active Params | Arch | Context Window | Min RAM/VRAM (Quantized) | Coding Strength | License |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Coder-Next | Alibaba | 80B | 3B | MoE | 256k 11 | ~46GB (Q4) | Strong agentic coding, tool use | Apache 2.0 |
| Qwen3-Coder 480B | Alibaba | 480B | 35B | MoE | 256k 11 | ~250GB | Frontier open-source coding | Apache 2.0 |
| Qwen3.5-397B | Alibaba | 397B | 17B | MoE | 262k 27 | Multi-GPU required | Multimodal reasoning + coding | Custom (Qwen) |
| DeepSeek V3.2 | DeepSeek | 685B | 37B | MoE | 128k 12 | Multi-GPU (4-5x H100) | SWE-bench 73.1%, strong refactoring | MIT |
| Llama 4 Scout | Meta | 109B | 17B | MoE (16 experts) | 10M (practical: 128-256k) 7 | ~24GB (Q4, single H100) | General purpose, multimodal | Llama 4 Community |
| Llama 4 Maverick | Meta | 400B | 17B | MoE (128 experts) | 1M 7 | H100 DGX system | General purpose, multimodal | Llama 4 Community |
| Kimi K2.5 | Moonshot AI | ~1T | 32B | MoE | 256k 28 | Multi-GPU, high-end | SWE-bench 76.8%, research-grade reasoning | Modified MIT |
| GLM-4.7 | Zhipu AI | ~355B | 32B | MoE | 200k 28 | Multi-GPU required | SWE-bench 73.8%, structured reasoning | Zhipu License |
| Mistral Large 3 | Mistral AI | 675B | 41B | MoE | 256k 13 | 8x H200 (FP8) | General purpose, multimodal, agentic | Apache 2.0 |
| Devstral 2 | Mistral AI | 123B | 123B (dense) | Dense | 256k 29 | 4x H100 minimum | SWE-bench 72.2%, code-agent focused | Modified MIT |
| Devstral Small 2 | Mistral AI | 24B | 24B (dense) | Dense | 256k 29 | Single GPU / consumer hardware | SWE-bench 68.0%, local-first coding | Apache 2.0 |
| GPT-OSS-120B | OpenAI | 117B | 5.1B | MoE | 128k 14 | ~80GB (single H100) | Reasoning, agentic tasks, tool use | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 21B | 3.6B | MoE | 128k 14 | ~16GB (MXFP4) | Reasoning, local/edge deployment | Apache 2.0 |
| Llama 3.1 | Meta | 405B / 70B / 8B | Dense | Dense | 128k 15 | Varies (8B: ~6GB, 70B: ~48GB) | General purpose, widely supported | Llama 3.1 Community |

A few things to note. Most of the frontier open-weight models are MoE (Mixture-of-Experts), which means the total parameter count is much larger than what's active per token. This is how Qwen3-Coder-Next fits 80B parameters into ~46GB of RAM: only 3B parameters fire for each token. The tradeoff is that you still need to load all the weights into memory, even though only a fraction are used at inference time.

The context windows listed above are architectural ceilings. Your actual usable context is determined by the inference framework (see the previous table) and your available VRAM for the KV cache. A model that supports 256k tokens doesn't give you 256k tokens on a 24GB GPU.

For coding-specific local use, the practical sweet spot in early 2026 is Qwen3-Coder-Next (strong agentic coding, runs on consumer hardware with enough RAM), Devstral Small 2 (dense 24B, fits on a single GPU), or GPT-OSS-20B (fits on 16GB, solid reasoning). Everything above that requires datacenter hardware or multi-GPU setups.

Key for the Charts

  • Effective Context: What developers actually get in practice, based on community reports and testing, not the marketed number.
  • Context Strategy: How the tool decides what goes into the model's context window.
  • At Limit: What happens when the context window fills up.
  • Cost: Individual pricing as of early 2026. Team/enterprise tiers vary.

Conclusion

Context management is the new skill nobody taught you. The quality of code these tools produce isn't just a function of which model they're running. It's a function of what they're putting in the context window and what they're leaving out.

The tools that win aren't necessarily the ones with the biggest windows. They're the ones that are smartest about what they put in the window. Claude Code's agentic search, Cursor's RAG-based selection, Copilot CLI's checkpoint system. These are all different answers to the same fundamental question: given finite working memory, what should the model be paying attention to right now?

Understanding how your tools answer that question, not just how many tokens they advertise, is what separates a developer who's frustrated by AI "forgetting things" from one who knows exactly why it happened and how to prevent it.


Have a correction or something to add? This space is evolving fast. Context management strategies that are current today may be outdated in six months. Feedback welcome.


Sources

Footnotes

  1. Cursor vs Claude Code vs Windsurf: Which One Handles Context Loss the Worst? — Real-world testing of compaction, tab-based context loss, and silent truncation across tools.

  2. Optimizing the Ollama Context Window — Ollama's 4,096-token default and how to increase it via systemd, modelfiles, and CLI.

  3. Cursor vs Claude Code vs Windsurf: The Honest Comparison — Effective context numbers from hands-on testing: 60-80k Cursor, 50-70k Windsurf, 150k+ Claude Code. Also covers the 23-file authentication migration.

  4. LLM Context Windows: What They Are & How They Work — Redis blog covering tokenization, transformer architecture, quadratic attention scaling, and KV cache mechanics.

  5. What Is a Context Window? — IBM's overview of how context windows work, including shared token budgets across prompts, history, and responses.

  6. How I Use Every Claude Code Feature — Shrivu Shankar's walkthrough covering /context, ~20k baseline in monorepos, subagents, and CLAUDE.md strategies.

  7. The Llama 4 Herd — Meta's announcement: Scout (109B/17B active, 10M context, 16 experts), Maverick (400B/17B active, 1M context, 128 experts), MoE architecture, 30T+ training tokens.

  8. Claude Code vs Cursor: What to Choose in 2026 — Builder.io's comparison covering Claude's 200k reliable delivery and 1M beta on Opus 4.6.

  9. Why can't we fully utilize context_window? — GitHub community discussion on Copilot API context windows, including GPT-5.3 Codex's 400k window and the gap between advertised and usable limits.

  10. Google Gemini Context Window: Token Limits and Workflow Strategies — Gemini 3 Pro's 1M default, Gemini 3 Flash's 200k, and consumer-facing limits.

  11. Qwen3-Coder-Next on Hugging Face — Official model card: 80B total / 3B active MoE, 256k native context (extendable to 1M via YaRN), Apache 2.0 license, agentic coding focus.

  12. DeepSeek-V3.2 Technical Report — Architecture details: 685B total / 37B active MoE with DeepSeek Sparse Attention, 128k context, MIT license.

  13. Mistral Large 3 on Hugging Face — Official model card: 675B total / 41B active MoE, 256k context, multimodal (2.5B vision encoder), Apache 2.0 license.

  14. Introducing GPT-OSS — OpenAI's open-weight models: gpt-oss-120b (117B total / 5.1B active) and gpt-oss-20b (21B total / 3.6B active), 128k context, MoE with alternating sparse attention, Apache 2.0.

  15. Best LLMs for Extended Context Windows in 2026 — AIMultiple's comparison of context window sizes across major models including Llama 3.1 at 128k.

  16. Context Window: What It Is and Why It Matters for AI Agents — Comet's deep dive on the "lost in the middle" problem and why models miss information at mid-context positions.

  17. Running a Local LLM for Code Assistance — Benchmarks comparing local model inference (110-598s) against cloud tools (52-56s).

  18. Testing AI Coding Agents: Cursor vs Claude, OpenAI, and Gemini — Render blog noting Cursor's RAG-like system on the local filesystem for gathering codebase context.

  19. Claude Code vs Cursor: Deep Comparison for Dev Teams — Qodo's breakdown of Cursor's 20k default chat sessions, 10k inline commands, Max Mode tradeoffs, silent truncation, and Claude Code's reliable 200k.

  20. Claude Code vs Cursor: The Honest Comparison — Developer reports of Cursor's effective context falling to 70k-120k despite 200k advertising.

  21. Claude Code vs Cursor: A Power-User's Playbook — Arize AI's comparison covering Claude Code's agentic context assembly vs Cursor's embeddings index.

  22. Cursor vs Claude Code — Data Science Collective piece on CLAUDE.md as institutional memory, project conventions, and persistent context.

  23. Cursor vs Claude Code 2026: Which AI Coding Tool Wins? — AI Tool VS comparison citing the 1M token beta on Opus 4.6 and 76% on the MRCR v2 benchmark.

  24. Gemini CLI Tutorial: Understanding Context, Memory and Conversational Branching — Romin Irani's guide to Gemini CLI's 1M token window, /compress, /clear, and session management.

  25. GitHub Copilot CLI: Enhanced Agents, Context Management, and New Ways to Install — Official GitHub changelog covering auto-compaction at 95%, /compact, /context, and checkpoint files.

  26. Context Awareness for Windsurf — Official Windsurf docs on the RAG-based indexing engine, M-Query retrieval techniques, and context scoping.

  27. Qwen3.5-397B-A17B on Hugging Face — Official model card: 397B total / 17B active MoE, 262k default context, multimodal reasoning, native tool use.

  28. Kimi K2.5 vs GLM-4.7 Comparison — Technical comparison: Kimi K2.5 (~1T params, 32B active, 256k context), GLM-4.7 (~355B params, 32B active, 200k context), architecture and benchmark details.

  29. Introducing Devstral 2 and Mistral Vibe CLI — Devstral 2 (123B dense, 256k context, SWE-bench 72.2%) and Devstral Small 2 (24B dense, 256k context, SWE-bench 68.0%, Apache 2.0).
