Reverse-engineered from the github.copilot-chat-0.37.9 extension bundle. Source: `~/.vscode/extensions/github.copilot-chat-0.37.9/dist/extension.js`
- Architecture Overview
- Prompt Tree & JSX-Like Element System
- Flexbox-Style Token Budget Allocation
- Priority-Based Pruning
- Tree-Sitter Document Summarization
- Conversation History Summarization
- Tool Result Truncation
- Prompt Caching (Cache Breakpoints)
- Workspace Search (TF-IDF & Embeddings)
- Token Counting & Estimation
- Inline Completion Context Budgets
Copilot Chat models the entire LLM prompt as a tree of typed nodes (JSX-like elements). Each node declares layout properties (flexGrow, flexBasis, flexReserve, priority) that control two distinct phases:
- Allocation phase — distributes the model's token budget across prompt sections using a CSS-flexbox-inspired algorithm.
- Pruning phase — if the rendered prompt exceeds the budget, iteratively removes the lowest-priority leaf nodes until it fits.
| Component | Role |
|---|---|
| Prompt Tree | JSX-like tree of PromptElement nodes |
| Flex Allocator | Distributes token budget proportionally across sibling groups |
| Priority Pruner | Removes lowest-priority nodes when over budget |
| Token Budget Tracker (`N8`) | Per-element budget accounting |
| Document Summarizer (`QC`) | Tree-sitter-based code summarization |
| History Summarizer (`KAe`) | LLM-based conversation compaction |
| Cache Breakpoints | Marks positions for API-level prefix caching |
| TF-IDF / Embeddings Search | Budget-aware workspace context retrieval |
| Tiktoken Tokenizer | Precise BPE token counting in a worker thread |
JSX Element Tree
│
▼
┌─────────────────────────────┐
│ 1. Flex Budget Allocation │ Groups sorted by flexGrow (desc).
│ (_processPromptPieces) │ Each group gets proportional budget.
│ │ flexReserve holds tokens for later groups.
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 2. Element Rendering │ Each element's prepare() + render()
│ (prepare → render) │ called with its allocated N8 budget.
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 3. Materialization │ Render tree nodes → runtime nodes:
│ (materialize) │ DD (containers), Wx (messages),
│ │ Sde (text chunks), D8 (images),
│ │ Hx (cache breakpoints)
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 4. Growth Phase │ If under budget, re-render Expandable
│ (_grow) │ elements with the remaining slack.
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 5. Pruning Loop │ While tokenCount > limit:
│ (_getFinalElementTree) │ removeLowestPriorityChild()
│ │ Uses 1.25x heuristic to reduce recounts.
└─────────────┬───────────────┘
│
▼
Final Chat Messages
The materialized tree consists of these node types:
| Class | Type | Description |
|---|---|---|
| `DD` (GenericMaterializedContainer) | Container | Groups children; has priority, flags, metadata |
| `Wx` (MaterializedChatMessage) | Message | A chat message (system/user/assistant/tool) with role, content children, optional tool calls |
| `Sde` (MaterializedChatMessageTextChunk) | Leaf | A text segment within a message |
| `D8` (MaterializedChatMessageImage) | Leaf | An image content part |
| `rL` (MaterializedChatMessageOpaque) | Leaf | Opaque content with pre-computed token usage |
| `Hx` (MaterializedChatMessageBreakpoint) | Leaf | Cache breakpoint marker (protected from pruning) |
| Flag | Value | Meaning |
|---|---|---|
| `LegacyPrioritization` | 1 | Use flat leaf-walk pruning instead of hierarchical |
| `Chunk` | 2 | Treat entire subtree as atomic unit for pruning |
| `passPriority` | 4 | Flatten children into parent's priority comparison |
| `EmptyAlternate` | 8 | Pick between two children based on emptiness |
Before materialization, the tree exists as vBe render nodes:
```javascript
vBe = class {
  parent; childIndex; id;
  _obj;      // The PromptElement instance
  _state;    // State from prepare() call
  _children; // Child render nodes
  _metadata;
  _objFlags; // Bitmask (LegacyPrioritization | Chunk | passPriority | EmptyAlternate)
  materialize(parent) {
    // Image → MaterializedChatMessageImage
    // BaseChatMessage → MaterializedChatMessage
    // Everything else → GenericMaterializedContainer
  }
};
```

All materialized nodes memoize their token counts via `once()`. When a child is removed during pruning, `onChunksChange()` propagates upward, clearing cached values:
```javascript
onChunksChange() {
  this._tokenCount.clear();
  this._upperBound.clear();
  this._text.clear();
  this.parent?.onChunksChange();
}
```

Each prompt element can declare these flex properties (analogous to CSS flexbox):
| Property | Type | Default | Description |
|---|---|---|---|
| `flexGrow` | number | Infinity | Render-order priority. Higher = rendered first, gets first pick of budget. |
| `flexBasis` | number | 1 | Relative weight when splitting budget among siblings in the same flexGrow group. |
| `flexReserve` | number \| string | none | Tokens to hold back for lower-priority groups. String form `"/N"` means 1/N of remaining budget. |
| `priority` | number | MAX_SAFE_INTEGER | Used during post-allocation pruning. Lower = pruned first. |
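As a sketch of these semantics (the helper names `parseFlexReserve` and `splitByFlexBasis` are illustrative, not the bundle's minified identifiers), the `"/N"` reserve form and the `flexBasis` split can be modeled as:

```javascript
// "/N" reserves 1/N of the remaining budget; a number reserves that many tokens.
function parseFlexReserve(flexReserve, remaining) {
  if (typeof flexReserve === "string") {
    return Math.floor(remaining / Number(flexReserve.slice(1)));
  }
  return flexReserve ?? 0;
}

// Siblings in the same flexGrow group split the budget by flexBasis weight.
function splitByFlexBasis(elements, remaining) {
  const totalBasis = elements.reduce((sum, e) => sum + (e.flexBasis ?? 1), 0);
  return elements.map(e => Math.floor(remaining * ((e.flexBasis ?? 1) / totalBasis)));
}
```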
```javascript
N8 = class {
  tokenBudget;   // Total budget allocated
  _consumed = 0; // Tokens used so far
  endpoint;      // Model endpoint metadata
  get remainingTokenBudget() {
    return Math.max(0, this.tokenBudget - this._consumed);
  }
  consume(amount) {
    this._consumed += amount; // Can be negative (release reservation)
  }
};
```

The algorithm runs in 4 phases:
Elements are bucketed by their flexGrow value into a Map<number, Element[]>.
Groups are sorted by flexGrow descending. Higher flexGrow groups render first.
For each group (highest flexGrow first):

- Reserve — Temporarily consume tokens for all lower-priority groups that declared `flexReserve`:

  ```javascript
  // String form: "/3" means "reserve 1/3 of remaining budget"
  let reserved = typeof flexReserve === "string"
    ? Math.floor(remaining / Number(flexReserve.slice(1)))
    : flexReserve;
  budget.consume(reserved);
  ```

- Cap detection — For elements with a `TokenLimit`: if their proportional share exceeds the cap, lock them at the cap and remove their weight from the distribution pool.

- Budget calculation — Uncapped elements split the remaining budget by `flexBasis` ratios:

  ```javascript
  tokenBudget = capped
    ? tokenLimit
    : Math.floor((remaining - lockedTokens) * (flexBasis / totalBasis));
  ```

- Release reservation — `budget.consume(-reserved)` gives the tokens back.

- Render — Call `prepare()` then `render()` on all elements in the group (in parallel via `Promise.all`).

- Consume actual — Deduct the real token consumption from the parent budget.

After all groups render, `_getFinalElementTree` enforces `TokenLimit` constraints from innermost to outermost, pruning lowest-priority children as needed.
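Under simplifying assumptions (one flat element list per group, reservations declared at the group level, and actual consumption known up front), the reserve/allocate/release cycle can be sketched as:

```javascript
// Budget and allocateGroups are illustrative stand-ins for the minified classes.
class Budget {
  constructor(tokenBudget) {
    this.tokenBudget = tokenBudget;
    this._consumed = 0;
  }
  get remaining() { return Math.max(0, this.tokenBudget - this._consumed); }
  consume(n) { this._consumed += n; } // negative releases a reservation
}

function allocateGroups(groups, budget) {
  // Higher flexGrow renders first and gets first pick of the budget.
  const sorted = [...groups].sort((a, b) => b.flexGrow - a.flexGrow);
  const allocations = new Map();
  sorted.forEach((group, i) => {
    // 1. Reserve tokens declared by later (lower-flexGrow) groups.
    let reserved = 0;
    for (const later of sorted.slice(i + 1)) {
      const r = later.flexReserve;
      if (typeof r === "string") reserved += Math.floor(budget.remaining / Number(r.slice(1)));
      else if (r) reserved += r;
    }
    budget.consume(reserved);
    // 2. Split what is left across this group by flexBasis weight.
    const totalBasis = group.elements.reduce((s, e) => s + (e.flexBasis ?? 1), 0);
    for (const el of group.elements) {
      allocations.set(el.name, Math.floor(budget.remaining * ((el.flexBasis ?? 1) / totalBasis)));
    }
    // 3. Release the reservation, then charge what was actually used.
    budget.consume(-reserved);
    for (const el of group.elements) budget.consume(el.actualTokens ?? 0);
  });
  return allocations;
}
```

With a 1000-token budget, a flexGrow=7 query group sees only half the budget while the history group's `"/2"` reservation is held, and history later receives whatever the query left behind.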
flexGrow=∞ : System messages (always rendered first)
flexGrow=7 : User query text, custom instructions
flexGrow=5 : Chat variable attachments
flexGrow=3 : File attachments (capped at budget/6 via TokenLimit)
flexGrow=2 : Current tool call rounds, current user context
flexGrow=1 : Conversation history (rendered last, gets remaining budget)
When the materialized prompt exceeds the token budget, the system iteratively finds and removes the lowest-priority node. This continues until the prompt fits.
```javascript
async _getFinalElementTree(maxBudget) {
  let tree = this._root.materialize();
  let limits = [{ limit: maxBudget, id: tree.id }, ...this._tokenLimits];
  // Process limits from innermost to outermost
  for (let i = limits.length - 1; i >= 0; i--) {
    let { limit, id } = limits[i];
    let subtree = tree.findById(id);
    let count = await subtree.tokenCount(tokenizer);
    // If under budget, try to grow Expandable elements
    if (count < limit) { this._grow(subtree, count, limit); continue; }
    // Prune loop
    while (count > limit) {
      do {
        for (let removed of subtree.removeLowestPriorityChild()) {
          let savings = removed.upperBoundTokenCount(tokenizer);
          count -= savings * 1.25; // Heuristic: 25% margin to reduce recounts
        }
      } while (count > limit);
      count = await subtree.tokenCount(tokenizer); // Precise recount
    }
  }
}
```

The walk to find the lowest-priority node:
- LegacyPrioritization (flag 1): Flat recursive walk across all leaves; find the single leaf with the lowest `priority`.

- Standard walk: Iterate direct children of the container:
  - Skip nodes containing cache breakpoints (at root level) — breakpoints are protected.
  - Flatten containers with the `passPriority` flag (flag 4) — their children compete directly with siblings.
  - Track the child with the lowest `priority`.
  - Tie-break equal priorities using `tHt()` — the node whose children have the lower minimum priority loses first.

- Recurse vs. remove:
  - If the target is a leaf, a `Chunk` (flag 2), or an empty container → remove it directly.
  - Otherwise → recurse into the container to find its lowest-priority leaf.
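A minimal sketch of the standard walk, with illustrative names in place of the minified ones (the `tHt()` tie-break is omitted, and breakpoint protection is reduced to a `protected` field):

```javascript
const PASS_PRIORITY = 4;

function* effectiveChildren(container) {
  for (const child of container.children) {
    // passPriority containers are flattened: their children compete directly.
    if (child.flags & PASS_PRIORITY) yield* effectiveChildren(child);
    else yield child;
  }
}

function lowestPriorityChild(container) {
  let lowest;
  for (const child of effectiveChildren(container)) {
    if (child.protected) continue; // e.g. nodes containing cache breakpoints
    if (!lowest || child.priority < lowest.priority) lowest = child;
  }
  return lowest;
}
```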
```javascript
function Tde(node, removedList) {
  let parent = node.parent;
  parent.children.splice(parent.children.indexOf(node), 1);
  removedList.push(node);
  cascadeKeepWith(node, removedList); // Remove related nodes (e.g., tool call + result pairs)
  if (parent.isEmpty) Tde(parent, removedList); // Cascade empty parents
  else parent.onChunksChange(); // Invalidate cached token counts
}
```

When a node is removed, all nodes sharing the same `keepWithId` are also removed. This ensures paired content (a tool call and its result) is always removed together.
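The `keepWithId` cascade can be illustrated with a small stand-in for `cascadeKeepWith` (names and node shape are illustrative):

```javascript
// Removing a node also removes every sibling sharing its keepWithId
// (e.g. a tool call and its result).
function removeWithKeepWith(parent, node) {
  const removed = [node];
  if (node.keepWithId !== undefined) {
    for (const sibling of parent.children) {
      if (sibling !== node && sibling.keepWithId === node.keepWithId) removed.push(sibling);
    }
  }
  parent.children = parent.children.filter(c => !removed.includes(c));
  return removed;
}
```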
If the prompt is under budget after rendering, Expandable elements are re-rendered with the full available budget and swapped into the tree:
```javascript
async _grow(tree, currentTokens, limit) {
  for (let growable of this._growables) {
    let budget = limit - currentTokens + growable.initialConsume;
    // Re-render with expanded budget, swap into tree
    tree.replaceNode(growable.id, rerendered);
  }
}
```

Source Code
│
▼
┌─────────────────────┐ ┌──────────────────────┐
│ Tree-Sitter WASM │────▶│ Overlay Node Tree │
│ (parse AST) │ │ (TR nodes: FOLD/LINE)│
└─────────────────────┘ └──────────┬───────────┘
│
┌────────────────────────┘
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ QC Summarizer │ │ Fallback: A4 │
│ (budget-aware) │ │ (returns full) │
└────────┬─────────┘ └──────────────────┘
│
▼
Summarized Document
(elided regions → "...")
The overlay node is a lightweight tree representing document regions:
```javascript
TR = class {
  startIndex; // Start character offset
  endIndex;   // End character offset
  kind;       // "LINE" (leaf) or "FOLD" (collapsible region)
  children;   // Child overlay nodes
};
```

Construction priority:
- Tree-sitter AST → `parserService.getTreeSitterAST(doc).getStructure()` — rich structural overlay
- Folding ranges (fallback) → indentation-based heuristics via `Zcn()`, with language-specific adjustments for offside-rule languages (Python, YAML) vs. brace-based (JS, Java)

- WASM-based parsing via `tree-sitter.wasm` + per-language grammar files (`tree-sitter-{language}.wasm`)
- Singleton parser service (`MHe`) with an LRU cache of 5 parse trees per language
- Parse trees are ref-counted for safe sharing across consumers
Supported languages: JavaScript, TypeScript, TSX, Python, Ruby, Rust, Go, Java, C++, C#, PHP
The core algorithm uses a greedy cost-based approach:
- Convert overlay nodes into a "summarizable tree" where each node can be toggled between showing its full text or an ellipsis (`...`).

- Mark selection-intersecting nodes as must-survive (never elided).

- Compute per-node cost — determines how "expensive" it is to keep a node visible:

  ```
  cost = 100 * min_selection_distance + depth + 10 * distance_ratio
  ```

  Key factors:

  - Selection proximity: Nodes intersecting the selection cost 0.
  - Asymmetric distance: Nodes AFTER the selection get a 3× distance penalty (context before the cursor is more valuable).
  - Tree depth: Deeper nodes cost slightly more.
  - Import statements: Cost 0 when `tryPreserveTypeChecking` is enabled.

- Greedy fill: Sort nodes cheapest-first, add them one by one until the character budget is exceeded.

- Produce edits: Replace elided regions with `"..."` markers.
When fitting summarized documents into the prompt, the system uses a shrink loop:
```javascript
let budget = promptTokenBudget * 0.85 - 300; // Start at 85% minus overhead
let summary = summarizer.summarizeDocument(budget);
for (let i = 0; i < 5; i++) {
  if (await countTokens(summary) <= budget) break;
  budget *= 0.85; // Shrink by 15%
  summary = summarizer.summarizeDocument(budget);
}
// Effective minimum after 5 iterations: original × 0.85^6 ≈ 38%
```

When code changes exceed the budget, the system progressively reduces documents by finding the enclosing definition (function/class) for each change hunk via tree-sitter:
┌─────────────────────────┐
│ class MyService { │ ← definition header (kept)
│ ... │ ← elided (removed)
│ handleRequest() { │ ← enclosing definition (kept)
│ + const x = validate │ ← changed line (always kept)
│ + return process(x) │ ← changed line (always kept)
│ } │ ← closing brace (kept)
│ ... │ ← elided (removed)
│ } │ ← closing brace (kept)
└─────────────────────────┘
If the result still exceeds the budget with multiple files, a "split_input" error is thrown to trigger prompt splitting.
Summarization is triggered in two ways:

- Reactively: When `BudgetExceededError` is thrown during prompt rendering.
- Preemptively: When the previous turn's token usage exceeds a budget threshold.
BudgetExceededError or threshold exceeded
│
▼
┌─────────────────────────┐
│ Execute PreCompact │ Run registered extension hooks
│ hooks │ (e.g., MCP extensions)
└─────────┬───────────────┘
│
▼
┌─────────────────────────┐
│ Render summarization │ Mode: "full" (with tools, tool_choice:"none")
│ prompt │ or "simple" (lightweight fallback)
└─────────┬───────────────┘
│
▼
┌─────────────────────────┐
│ Send to LLM │ Model: gpt-4.1 (if available & sufficient context)
│ (temperature=0, │ Otherwise: current model
│ stream=false) │
└─────────┬───────────────┘
│
▼
┌─────────────────────────┐
│ Validate summary │ Token count must fit within budget.
│ │ If too large → throw, use "simple" fallback.
└─────────┬───────────────┘
│
▼
Store summary in conversation history.
Subsequent renders use condensed text.
| Mode | Description |
|---|---|
| "full" | Renders complete conversation (including tool schemas with tool_choice: "none"). If prompt cache is enabled, uses 105% of normal budget for extra headroom. |
| "simple" | Lightweight rendering without tool schemas. Used as fallback if "full" mode errors or if config forces it. |
The system prompt instructs the LLM to produce a structured summary with sections:
- Conversation Overview
- Technical Foundation
- Codebase Status
- Problem Resolution
- Progress Tracking
- Active Work State
- Recent Operations
- Continuation Plan
Includes an `<analysis>` step for chain-of-thought before the final summary. For Claude models, an extra instruction is appended: "Do NOT call any tools."
When enabled (the `LargeToolResultsToDiskEnabled` experiment flag) and the result exceeds a configurable threshold:

- If the result is JSON, pretty-print it and extract a schema (`lDe`).
- Write the full content to a session-specific directory on disk.
- Replace the tool result with a pointer message:

```
Large tool result (42KB) written to file.
Use the read_file tool to access the content at: /path/to/content.json
Data schema found at: /path/to/schema.json
```
When the result exceeds the truncate token limit:
```javascript
let ratio = text.length / tokenCount; // chars per token
let keepChars = ratio * (budget - marker.length);
let head = Math.round(keepChars * 0.4); // 40% from the start
let tail = keepChars - head;            // 60% from the end
return text.slice(0, head)
  + "\n[Tool response was too long and was truncated.]\n"
  + text.slice(-tail);
```

Rationale: The tail (most recent output) is typically more relevant than the head, so it gets 60% of the budget.
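Wrapped into a self-contained function (the name is illustrative), the truncation reads:

```javascript
function truncateToolResult(text, tokenCount, tokenBudget) {
  const marker = "\n[Tool response was too long and was truncated.]\n";
  const ratio = text.length / tokenCount;    // chars per token
  const keepChars = ratio * (tokenBudget - marker.length);
  const head = Math.round(keepChars * 0.4);  // 40% from the start
  const tail = Math.floor(keepChars - head); // 60% from the end
  return text.slice(0, head) + marker + text.slice(-tail);
}
```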
Cache breakpoints are markers placed at strategic positions in the prompt, enabling API-level prefix caching (supported by both Anthropic and OpenAI). The cache type is always "ephemeral".
Breakpoints are placed at two levels:
| Location | When |
|---|---|
| After system/environment info (`HNe`) | New chats |
| After each user message (`k4`, `Jdt`) | Always |
| After the last tool result in each round (`Vrt`) | When `enableCacheBreakpoints=true` |
Historical turns do not get their own breakpoints (`enableCacheBreakpoints: false`).
After materialization, up to 4 cache breakpoints total are placed at strategic message boundaries:
First pass (reverse):
- Tool-to-non-tool boundaries (first tool message in a sequence)
- Most recent user message
- Pure assistant messages (no tool calls)
Second pass (forward):
- Early System/User messages (the stable prompt prefix)
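A simplified sketch of the two-pass placement. It marks every qualifying boundary in reverse until the cap is hit, which approximates but does not replicate the bundle's exact heuristics; the function name and message shape are illustrative:

```javascript
const MAX_CACHE_BREAKPOINTS = 4;

function placeBreakpoints(messages) {
  const marks = new Set();
  // Reverse pass: recent user messages, tool-sequence starts, pure assistant turns.
  for (let i = messages.length - 1; i >= 0 && marks.size < MAX_CACHE_BREAKPOINTS; i--) {
    const m = messages[i];
    const isToolBoundary = m.role === "tool" && messages[i - 1]?.role !== "tool";
    const isPureAssistant = m.role === "assistant" && !m.toolCalls;
    if (m.role === "user" || isToolBoundary || isPureAssistant) marks.add(i);
  }
  // Forward pass: the stable system/user prefix.
  for (let i = 0; i < messages.length && marks.size < MAX_CACHE_BREAKPOINTS; i++) {
    if (messages[i].role === "system" || messages[i].role === "user") marks.add(i);
  }
  return [...marks].sort((a, b) => a - b);
}
```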
The API response handler tracks:
```javascript
{
  prompt_tokens: inputTokens + cacheCreationTokens + cacheReadTokens,
  prompt_tokens_details: {
    cached_tokens: cacheReadTokens // Cache hit metric
  }
}
```

Both `cache_creation_input_tokens` and `cache_read_input_tokens` are tracked from Anthropic's `message_start` and `message_delta` events.
The WorkspaceChunkSearch orchestrator tries strategies in order:
1. Full Workspace Search (EM)
│ Available? ──yes──▶ Return immediately
│ no
▼
2. Remote Code Search (YE) ──── 12.5s timeout
│ │
│ timeout? │ success?
│ ▼ ▼
│ Race against local Return result
│ ▼
3. Local Embeddings Search (cw) ── 8s timeout
│ │
│ timeout? │ success?
│ ▼ ▼
│ Fall through Return result
│
▼
4. TF-IDF + Semantic Reranking (yU)
│
▼
5. Pure TF-IDF (IM) ──── always available
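The cascade can be sketched with `Promise.race` timeouts; the strategy interfaces and the `withTimeout` helper are illustrative, while the 12.5s and 8s timeouts come from the diagram above:

```javascript
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise(resolve => setTimeout(() => resolve(undefined), ms)),
  ]);
}

async function searchWorkspace(strategies, query) {
  const { fullIndex, remote, localEmbeddings, tfidf } = strategies;
  if (fullIndex) return fullIndex(query);          // 1. available? return immediately
  const remoteHit = remote && await withTimeout(remote(query), 12500);
  if (remoteHit) return remoteHit;                 // 2. remote code search
  const localHit = localEmbeddings && await withTimeout(localEmbeddings(query), 8000);
  if (localHit) return localHit;                   // 3. local embeddings
  return tfidf(query);                             // 4/5. TF-IDF floor, always available
}
```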
- Runs in a dedicated Web Worker (`tfidfWorker.js`)
- Backed by a SQLite database (`local-index.1.db`) for a persistent index
- Indexes up to 25,000 files on initialization
- Subscribes to file create/change/delete events for incremental updates
- Uses `maxSpread: 0.75` — only returns results within 75% of the best score
- Queries are built by joining extracted keywords with commas

- Uses the `text-embedding-3-small-512` model for vector similarity
- Optional reranking service for result-quality improvement
```javascript
const TOKENS_PER_CHUNK = 250;
maxChunks = Math.floor(tokenBudget / TOKENS_PER_CHUNK);
```

Encodings:

- `o200k_base` — GPT-4o and newer (200K vocab)
- `cl100k_base` — GPT-4/3.5-turbo (100K vocab)
Architecture:

- Runs in a dedicated worker thread (`tikTokenizerWorker.js`)
- LRU cache of 5,000 entries (text → token count)
- Worker auto-terminates after 15 seconds of inactivity
- Falls back to in-process mode if the worker is unavailable
Token counting for special content types:
| Content Type | Method |
|---|---|
| Text | BPE tokenization (cached) |
| Opaque | Pre-computed tokenUsage field |
| Image | Vision token formula (GBe) |
| Cache breakpoint | 0 tokens |
| Tool definitions | 16 base + 8 per tool + countObjectTokens() × 1.1 (10% overhead) |
| Tool calls | countMessageObjectTokens() × 1.5 (50% overhead) |
| Messages array | 3 base tokens + sum of message tokens |
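The per-message and tool-definition bookkeeping from the table can be sketched as follows; the tokenizer callbacks (`countTokens`, `countObjectTokens`) are stand-ins for the real BPE counters, and the exact rounding is an assumption:

```javascript
const BASE_TOKENS_PER_MESSAGE = 3;

// 3 base tokens for the array plus (3 + content tokens) per message.
function countMessagesTokens(messages, countTokens) {
  return messages.reduce(
    (sum, m) => sum + BASE_TOKENS_PER_MESSAGE + countTokens(m.content),
    BASE_TOKENS_PER_MESSAGE
  );
}

// 16 base + 8 per tool + object tokens with a 10% overhead multiplier.
function countToolDefinitionTokens(tools, countObjectTokens) {
  const objectTokens = tools.reduce((s, t) => s + countObjectTokens(t), 0);
  return 16 + tools.length * 8 + Math.ceil(objectTokens * 1.1);
}
```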
When precise counting is too expensive:

```javascript
estimatedTokens = text.length * 3 / 4; // ~0.75 tokens per character
```

The inverse is used for character budgets:

```javascript
characterBudget = tokenBudget * 4; // ~4 characters per token
```

```javascript
const DEFAULT_TOKEN_BUDGET = 8192; // 8K tokens

// Available budget after accounting for the current document:
availableBudget = 8192 - (documentLength / 4) - 256;
//                       └─ estimated tokens ─┘  └ overhead ┘

// Character budget conversion:
primaryCharacterBudget = (tokenBudget ?? 7168) * 4; // ~28,672 chars
secondaryCharacterBudget = 8192 * 4;                // ~32,768 chars
```

Context items are split into mandatory (priority ≥ 0.7) and optional pools:
```javascript
Z7 = class {
  mandatory; // High-priority items consume from this
  optional;  // Lower-priority items consume from this
  spend(amount) {
    this.mandatory -= amount;
    this.optional -= amount;
  }
  isExhausted() { return this.mandatory <= 0; }
  isOptionalExhausted() { return this.optional <= 0; }
};
```

| Mode | Optional Budget | Use Case |
|---|---|---|
| `"minimal"` | 0 | Only mandatory context |
| `"fillHalf"` | budget / 2 | Moderate context |
| `"double"` | min(budget, docLength) | Up to document size extra |
| `"fill"` | budget | Maximum context (2× mandatory) |
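A usage sketch for the dual-pool budget; `ContextBudget` stands in for the minified `Z7`, and `admitItems` is an illustrative admission loop, not the bundle's code:

```javascript
class ContextBudget {
  constructor(mandatory, optional) {
    this.mandatory = mandatory;
    this.optional = optional;
  }
  spend(amount) {
    this.mandatory -= amount; // every admitted item charges both pools
    this.optional -= amount;
  }
  isExhausted() { return this.mandatory <= 0; }
  isOptionalExhausted() { return this.optional <= 0; }
}

function admitItems(items, budget) {
  const admitted = [];
  for (const item of items) {
    if (budget.isExhausted()) break;
    // priority >= 0.7 is mandatory; lower-priority items need optional headroom
    if (item.priority < 0.7 && budget.isOptionalExhausted()) continue;
    admitted.push(item);
    budget.spend(item.cost);
  }
  return admitted;
}
```

Once the optional pool is drained, only mandatory items continue to be admitted until the mandatory pool itself runs out.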
- Tracks up to 32 recently active/visible editors in an LRU cache
- Provides up to 10 most recently active neighbor files (excluding current document)
- Used for inline completion context and TypeScript server plugin
- Proactive cache warming triggered on: cursor moves, text changes, inline completion requests
- Time budget: 50ms for cache population
- Race timeout: 20ms — if the TypeScript server doesn't respond in time, yield what we have
| Constant | Value | Description |
|---|---|---|
| `modelMaxPromptTokens` | Model-specific | Maximum prompt tokens for the model |
| `TOKENS_PER_CHUNK` (`yhe`) | 250 | Average tokens per workspace search chunk |
| `MAX_CACHE_BREAKPOINTS` (`Lai`) | 4 | Maximum cache breakpoints in rendered prompt |
| `LRU_TOKENIZER_CACHE_SIZE` | 5,000 | Tiktoken LRU cache entries |
| `LRU_PARSE_TREE_CACHE_SIZE` | 5 per language | Tree-sitter parse tree cache |
| `MAX_WORKSPACE_FILES` | 25,000 | Maximum files indexed by TF-IDF |
| `WORKER_IDLE_TIMEOUT` | 15,000ms | Tokenizer worker auto-termination |
| `DEFAULT_INLINE_TOKEN_BUDGET` (`zFt`) | 8,192 | Default inline completion token budget |
| `MAX_NEIGHBOR_FILES` | 10 | Neighbor files for inline completions |
| `NEIGHBOR_FILE_LRU_SIZE` | 32 | LRU capacity for tracking recent editors |
| `SUMMARIZATION_SHRINK_FACTOR` | 0.85 | Per-iteration budget reduction for document fitting |
| `SUMMARIZATION_MAX_ITERATIONS` | 5 | Maximum shrink iterations |
| `TOOL_RESULT_HEAD_RATIO` | 0.4 | Head portion of truncated tool results |
| `TOOL_RESULT_TAIL_RATIO` | 0.6 | Tail portion of truncated tool results |
| `TOOL_TOKEN_OVERHEAD` | 1.1× | Overhead multiplier for tool definition tokens |
| `TOOL_CALL_OVERHEAD` | 1.5× | Overhead multiplier for tool call tokens |
| `FAST_TOKEN_ESTIMATE` | length × 0.75 | Approximate tokens from character count |
| `CHAR_PER_TOKEN_ESTIMATE` | 4 | Approximate characters per token |
| `BASE_TOKENS_PER_MESSAGE` | 3 | Overhead tokens per chat message |
| `PRUNING_SAVINGS_MARGIN` | 1.25× | Heuristic margin to reduce token recounts during pruning |
| `TFIDF_MAX_SPREAD` | 0.75 | Only return TF-IDF results within 75% of best score |
| `REMOTE_SEARCH_TIMEOUT` | 12,500ms | Timeout for remote code search |
| `LOCAL_EMBEDDINGS_TIMEOUT` | 8,000ms | Timeout for local embeddings search |
| `CACHE_POPULATION_TIMEOUT` | 50ms | Time budget for proactive cache warming |
| `CACHE_RACE_TIMEOUT` | 20ms | Race timeout for TypeScript server response |