Reverse-engineered from the github.copilot-chat-0.37.9 extension bundle. Source: `~/.vscode/extensions/github.copilot-chat-0.37.9/dist/extension.js`
- Architecture Overview
- Prompt Tree & JSX-Like Element System
- Flexbox-Style Token Budget Allocation
- Priority-Based Pruning
- Tree-Sitter Document Summarization
- Conversation History Summarization
- Tool Result Truncation
- Prompt Caching (Cache Breakpoints)
- Workspace Search (TF-IDF & Embeddings)
- Token Counting & Estimation
- Inline Completion Context Budgets
Copilot Chat models the entire LLM prompt as a tree of typed nodes (JSX-like elements). Each node declares layout properties (flexGrow, flexBasis, flexReserve, priority) that control two distinct phases:
- Allocation phase — distributes the model's token budget across prompt sections using a CSS-flexbox-inspired algorithm.
- Pruning phase — if the rendered prompt exceeds the budget, iteratively removes the lowest-priority leaf nodes until it fits.
| Component | Role |
|---|---|
| Prompt Tree | JSX-like tree of PromptElement nodes |
| Flex Allocator | Distributes token budget proportionally across sibling groups |
| Priority Pruner | Removes lowest-priority nodes when over budget |
| Token Budget Tracker (`N8`) | Per-element budget accounting |
| Document Summarizer (`QC`) | Tree-sitter-based code summarization |
| History Summarizer (`KAe`) | LLM-based conversation compaction |
| Cache Breakpoints | Marks positions for API-level prefix caching |
| TF-IDF / Embeddings Search | Budget-aware workspace context retrieval |
| Tiktoken Tokenizer | Precise BPE token counting in a worker thread |
JSX Element Tree
│
▼
┌─────────────────────────────┐
│ 1. Flex Budget Allocation │ Groups sorted by flexGrow (desc).
│ (_processPromptPieces) │ Each group gets proportional budget.
│ │ flexReserve holds tokens for later groups.
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 2. Element Rendering │ Each element's prepare() + render()
│ (prepare → render) │ called with its allocated N8 budget.
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 3. Materialization │ Render tree nodes → runtime nodes:
│ (materialize) │ DD (containers), Wx (messages),
│ │ Sde (text chunks), D8 (images),
│ │ Hx (cache breakpoints)
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 4. Growth Phase │ If under budget, re-render Expandable
│ (_grow) │ elements with the remaining slack.
└─────────────┬───────────────┘
│
▼
┌─────────────────────────────┐
│ 5. Pruning Loop │ While tokenCount > limit:
│ (_getFinalElementTree) │ removeLowestPriorityChild()
│ │ Uses 1.25x heuristic to reduce recounts.
└─────────────┬───────────────┘
│
▼
Final Chat Messages
The materialized tree consists of these node types:
| Class | Type | Description |
|---|---|---|
| `DD` (GenericMaterializedContainer) | Container | Groups children; has priority, flags, metadata |
| `Wx` (MaterializedChatMessage) | Message | A chat message (system/user/assistant/tool) with role, content children, optional tool calls |
| `Sde` (MaterializedChatMessageTextChunk) | Leaf | A text segment within a message |
| `D8` (MaterializedChatMessageImage) | Leaf | An image content part |
| `rL` (MaterializedChatMessageOpaque) | Leaf | Opaque content with pre-computed token usage |
| `Hx` (MaterializedChatMessageBreakpoint) | Leaf | Cache breakpoint marker (protected from pruning) |
| Flag | Value | Meaning |
|---|---|---|
| `LegacyPrioritization` | 1 | Use flat leaf-walk pruning instead of hierarchical |
| `Chunk` | 2 | Treat entire subtree as atomic unit for pruning |
| `passPriority` | 4 | Flatten children into parent's priority comparison |
| `EmptyAlternate` | 8 | Pick between two children based on emptiness |
Before materialization, the tree exists as vBe render nodes:
```javascript
vBe = class {
  parent; childIndex; id;
  _obj;      // The PromptElement instance
  _state;    // State from prepare() call
  _children; // Child render nodes
  _metadata;
  _objFlags; // Bitmask (LegacyPrioritization | Chunk | passPriority | EmptyAlternate)
  materialize(parent) {
    // Image → MaterializedChatMessageImage
    // BaseChatMessage → MaterializedChatMessage
    // Everything else → GenericMaterializedContainer
  }
};
```

All materialized nodes memoize their token counts via `once()`. When a child is removed during pruning, `onChunksChange()` propagates upward, clearing cached values:
```javascript
onChunksChange() {
  this._tokenCount.clear();
  this._upperBound.clear();
  this._text.clear();
  this.parent?.onChunksChange();
}
```

Each prompt element can declare these flex properties (analogous to CSS flexbox):
| Property | Type | Default | Description |
|---|---|---|---|
| `flexGrow` | number | Infinity | Render-order priority. Higher = rendered first, gets first pick of budget. |
| `flexBasis` | number | 1 | Relative weight when splitting budget among siblings in the same flexGrow group. |
| `flexReserve` | number \| string | none | Tokens to hold back for lower-priority groups. String form `"/N"` means 1/N of remaining budget. |
| `priority` | number | MAX_SAFE_INTEGER | Used during post-allocation pruning. Lower = pruned first. |
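As a sketch of these semantics (the helper names `parseFlexReserve` and `splitByFlexBasis` are illustrative, not the bundle's minified identifiers), the `"/N"` reserve form and the `flexBasis` split can be modeled as:

```javascript
// "/N" reserves 1/N of the remaining budget; a number reserves that many tokens.
function parseFlexReserve(flexReserve, remaining) {
  if (typeof flexReserve === "string") {
    return Math.floor(remaining / Number(flexReserve.slice(1)));
  }
  return flexReserve ?? 0;
}

// Siblings in the same flexGrow group split the budget by flexBasis weight.
function splitByFlexBasis(elements, remaining) {
  const totalBasis = elements.reduce((sum, e) => sum + (e.flexBasis ?? 1), 0);
  return elements.map(e => Math.floor(remaining * ((e.flexBasis ?? 1) / totalBasis)));
}
```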
```javascript
N8 = class {
  tokenBudget;   // Total budget allocated
  _consumed = 0; // Tokens used so far
  endpoint;      // Model endpoint metadata
  get remainingTokenBudget() {
    return Math.max(0, this.tokenBudget - this._consumed);
  }
  consume(amount) {
    this._consumed += amount; // Can be negative (release reservation)
  }
};
```

The algorithm runs in 4 phases:
Elements are bucketed by their flexGrow value into a Map<number, Element[]>.
Groups are sorted by flexGrow descending. Higher flexGrow groups render first.
For each group (highest flexGrow first):

- Reserve — Temporarily consume tokens for all lower-priority groups that declared `flexReserve`:

  ```javascript
  // String form: "/3" means "reserve 1/3 of remaining budget"
  let reserved = typeof flexReserve === "string"
    ? Math.floor(remaining / Number(flexReserve.slice(1)))
    : flexReserve;
  budget.consume(reserved);
  ```

- Cap detection — For elements with a `TokenLimit`: if their proportional share exceeds the cap, lock them at the cap and remove their weight from the distribution pool.

- Budget calculation — Uncapped elements split the remaining budget by `flexBasis` ratios:

  ```javascript
  tokenBudget = capped
    ? tokenLimit
    : Math.floor((remaining - lockedTokens) * (flexBasis / totalBasis));
  ```

- Release reservation — `budget.consume(-reserved)` gives the tokens back.

- Render — Call `prepare()` then `render()` on all elements in the group (in parallel via `Promise.all`).

- Consume actual — Deduct the real token consumption from the parent budget.

After all groups render, `_getFinalElementTree` enforces `TokenLimit` constraints from innermost to outermost, pruning lowest-priority children as needed.
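Under simplifying assumptions (one flat element list per group, reservations declared at the group level, and actual consumption known up front), the reserve/allocate/release cycle can be sketched as:

```javascript
// Budget and allocateGroups are illustrative stand-ins for the minified classes.
class Budget {
  constructor(tokenBudget) {
    this.tokenBudget = tokenBudget;
    this._consumed = 0;
  }
  get remaining() { return Math.max(0, this.tokenBudget - this._consumed); }
  consume(n) { this._consumed += n; } // negative releases a reservation
}

function allocateGroups(groups, budget) {
  // Higher flexGrow renders first and gets first pick of the budget.
  const sorted = [...groups].sort((a, b) => b.flexGrow - a.flexGrow);
  const allocations = new Map();
  sorted.forEach((group, i) => {
    // 1. Reserve tokens declared by later (lower-flexGrow) groups.
    let reserved = 0;
    for (const later of sorted.slice(i + 1)) {
      const r = later.flexReserve;
      if (typeof r === "string") reserved += Math.floor(budget.remaining / Number(r.slice(1)));
      else if (r) reserved += r;
    }
    budget.consume(reserved);
    // 2. Split what is left across this group by flexBasis weight.
    const totalBasis = group.elements.reduce((s, e) => s + (e.flexBasis ?? 1), 0);
    for (const el of group.elements) {
      allocations.set(el.name, Math.floor(budget.remaining * ((el.flexBasis ?? 1) / totalBasis)));
    }
    // 3. Release the reservation, then charge what was actually used.
    budget.consume(-reserved);
    for (const el of group.elements) budget.consume(el.actualTokens ?? 0);
  });
  return allocations;
}
```

With a 1000-token budget, a flexGrow=7 query group sees only half the budget while the history group's `"/2"` reservation is held, and history later receives whatever the query left behind.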
flexGrow=∞ : System messages (always rendered first)
flexGrow=7 : User query text, custom instructions
flexGrow=5 : Chat variable attachments
flexGrow=3 : File attachments (capped at budget/6 via TokenLimit)
flexGrow=2 : Current tool call rounds, current user context
flexGrow=1 : Conversation history (rendered last, gets remaining budget)
When the materialized prompt exceeds the token budget, the system iteratively finds and removes the lowest-priority node. This continues until the prompt fits.
```javascript
async _getFinalElementTree(maxBudget) {
  let tree = this._root.materialize();
  let limits = [{ limit: maxBudget, id: tree.id }, ...this._tokenLimits];
  // Process limits from innermost to outermost
  for (let i = limits.length - 1; i >= 0; i--) {
    let { limit, id } = limits[i];
    let subtree = tree.findById(id);
    let count = await subtree.tokenCount(tokenizer);
    // If under budget, try to grow Expandable elements
    if (count < limit) { this._grow(subtree, count, limit); continue; }
    // Prune loop
    while (count > limit) {
      do {
        for (let removed of subtree.removeLowestPriorityChild()) {
          let savings = removed.upperBoundTokenCount(tokenizer);
          count -= savings * 1.25; // Heuristic: 25% margin to reduce recounts
        }
      } while (count > limit);
      count = await subtree.tokenCount(tokenizer); // Precise recount
    }
  }
}
```

The walk to find the lowest-priority node:
- LegacyPrioritization (flag 1): Flat recursive walk across all leaves; find the single leaf with the lowest `priority`.

- Standard walk: Iterate direct children of the container:
  - Skip nodes containing cache breakpoints (at root level) — breakpoints are protected.
  - Flatten containers with the `passPriority` flag (flag 4) — their children compete directly with siblings.
  - Track the child with the lowest `priority`.
  - Tie-break equal priorities using `tHt()` — the node whose children have the lower minimum priority loses first.

- Recurse vs. remove:
  - If the target is a leaf, a `Chunk` (flag 2), or an empty container → remove it directly.
  - Otherwise → recurse into the container to find its lowest-priority leaf.
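A minimal sketch of the standard walk, with illustrative names in place of the minified ones (the `tHt()` tie-break is omitted, and breakpoint protection is reduced to a `protected` field):

```javascript
const PASS_PRIORITY = 4;

function* effectiveChildren(container) {
  for (const child of container.children) {
    // passPriority containers are flattened: their children compete directly.
    if (child.flags & PASS_PRIORITY) yield* effectiveChildren(child);
    else yield child;
  }
}

function lowestPriorityChild(container) {
  let lowest;
  for (const child of effectiveChildren(container)) {
    if (child.protected) continue; // e.g. nodes containing cache breakpoints
    if (!lowest || child.priority < lowest.priority) lowest = child;
  }
  return lowest;
}
```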
```javascript
function Tde(node, removedList) {
  let parent = node.parent;
  parent.children.splice(parent.children.indexOf(node), 1);
  removedList.push(node);
  cascadeKeepWith(node, removedList); // Remove related nodes (e.g., tool call + result pairs)
  if (parent.isEmpty) Tde(parent, removedList); // Cascade empty parents
  else parent.onChunksChange(); // Invalidate cached token counts
}
```

When a node is removed, all nodes sharing the same `keepWithId` are also removed. This ensures paired content (a tool call and its result) is always removed together.
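The `keepWithId` cascade can be illustrated with a small stand-in for `cascadeKeepWith` (names and node shape are illustrative):

```javascript
// Removing a node also removes every sibling sharing its keepWithId
// (e.g. a tool call and its result).
function removeWithKeepWith(parent, node) {
  const removed = [node];
  if (node.keepWithId !== undefined) {
    for (const sibling of parent.children) {
      if (sibling !== node && sibling.keepWithId === node.keepWithId) removed.push(sibling);
    }
  }
  parent.children = parent.children.filter(c => !removed.includes(c));
  return removed;
}
```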
If the prompt is under budget after rendering, Expandable elements are re-rendered with the full available budget and swapped into the tree:
```javascript
async _grow(tree, currentTokens, limit) {
  for (let growable of this._growables) {
    let budget = limit - currentTokens + growable.initialConsume;
    // Re-render with expanded budget, swap into tree
    tree.replaceNode(growable.id, rerendered);
  }
}
```

Source Code
│
▼
┌─────────────────────┐ ┌──────────────────────┐
│ Tree-Sitter WASM │────▶│ Overlay Node Tree │
│ (parse AST) │ │ (TR nodes: FOLD/LINE)│
└─────────────────────┘ └──────────┬───────────┘
│
┌────────────────────────┘
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ QC Summarizer │ │ Fallback: A4 │
│ (budget-aware) │ │ (returns full) │
└────────┬─────────┘ └──────────────────┘
│
▼
Summarized Document
(elided regions → "...")
The overlay node is a lightweight tree representing document regions:
```javascript
TR = class {
  startIndex; // Start character offset
  endIndex;   // End character offset
  kind;       // "LINE" (leaf) or "FOLD" (collapsible region)
  children;   // Child overlay nodes
};
```

Construction priority:
- Tree-sitter AST → `parserService.getTreeSitterAST(doc).getStructure()` — rich structural overlay
- Folding ranges (fallback) → indentation-based heuristics via `Zcn()`, with language-specific adjustments for offside-rule languages (Python, YAML) vs. brace-based (JS, Java)

- WASM-based parsing via `tree-sitter.wasm` + per-language grammar files (`tree-sitter-{language}.wasm`)
- Singleton parser service (`MHe`) with an LRU cache of 5 parse trees per language
- Parse trees are ref-counted for safe sharing across consumers
Supported languages: JavaScript, TypeScript, TSX, Python, Ruby, Rust, Go, Java, C++, C#, PHP
The core algorithm uses a greedy cost-based approach:
- Convert overlay nodes into a "summarizable tree" where each node can be toggled between showing its full text or an ellipsis (`...`).

- Mark selection-intersecting nodes as must-survive (never elided).

- Compute per-node cost — determines how "expensive" it is to keep a node visible:

  ```
  cost = 100 * min_selection_distance + depth + 10 * distance_ratio
  ```

  Key factors:

  - Selection proximity: Nodes intersecting the selection cost 0.
  - Asymmetric distance: Nodes AFTER the selection get a 3× distance penalty (context before the cursor is more valuable).
  - Tree depth: Deeper nodes cost slightly more.
  - Import statements: Cost 0 when `tryPreserveTypeChecking` is enabled.

- Greedy fill: Sort nodes cheapest-first, add them one by one until the character budget is exceeded.

- Produce edits: Replace elided regions with `"..."` markers.
When fitting summarized documents into the prompt, the system uses a shrink loop:
```javascript
let budget = promptTokenBudget * 0.85 - 300; // Start at 85% minus overhead
let summary = summarizer.summarizeDocument(budget);
for (let i = 0; i < 5; i++) {
  if (await countTokens(summary) <= budget) break;
  budget *= 0.85; // Shrink by 15%
  summary = summarizer.summarizeDocument(budget);
}
// Effective minimum after 5 iterations: original × 0.85^6 ≈ 38%
```

When code changes exceed the budget, the system progressively reduces documents by finding the enclosing definition (function/class) for each change hunk via tree-sitter:
┌─────────────────────────┐
│ class MyService { │ ← definition header (kept)
│ ... │ ← elided (removed)
│ handleRequest() { │ ← enclosing definition (kept)
│ + const x = validate │ ← changed line (always kept)
│ + return process(x) │ ← changed line (always kept)
│ } │ ← closing brace (kept)
│ ... │ ← elided (removed)
│ } │ ← closing brace (kept)
└─────────────────────────┘
If the result still exceeds the budget with multiple files, a "split_input" error is thrown to trigger prompt splitting.
Summarization is triggered in two ways:

- Reactively: When `BudgetExceededError` is thrown during prompt rendering.
- Preemptively: When the previous turn's token usage exceeds a budget threshold.
BudgetExceededError or threshold exceeded
│
▼
┌─────────────────────────┐
│ Execute PreCompact │ Run registered extension hooks
│ hooks │ (e.g., MCP extensions)
└─────────┬───────────────┘
│
▼
┌─────────────────────────┐
│ Render summarization │ Mode: "full" (with tools, tool_choice:"none")
│ prompt │ or "simple" (lightweight fallback)
└─────────┬───────────────┘
│
▼
┌─────────────────────────┐
│ Send to LLM │ Model: gpt-4.1 (if available & sufficient context)
│ (temperature=0, │ Otherwise: current model
│ stream=false) │
└─────────┬───────────────┘
│
▼
┌─────────────────────────┐
│ Validate summary │ Token count must fit within budget.
│ │ If too large → throw, use "simple" fallback.
└─────────┬───────────────┘
│
▼
Store summary in conversation history.
Subsequent renders use condensed text.
| Mode | Description |
|---|---|
| "full" | Renders complete conversation (including tool schemas with tool_choice: "none"). If prompt cache is enabled, uses 105% of normal budget for extra headroom. |
| "simple" | Lightweight rendering without tool schemas. Used as fallback if "full" mode errors or if config forces it. |
The system prompt instructs the LLM to produce a structured summary with sections:
- Conversation Overview
- Technical Foundation
- Codebase Status
- Problem Resolution
- Progress Tracking
- Active Work State
- Recent Operations
- Continuation Plan
Includes an `<analysis>` step for chain-of-thought before the final summary. For Claude models, an extra instruction is appended: "Do NOT call any tools."
When enabled (the `LargeToolResultsToDiskEnabled` experiment flag) and the result exceeds a configurable threshold:

- If the result is JSON, pretty-print it and extract a schema (`lDe`).
- Write the full content to a session-specific directory on disk.
- Replace the tool result with a pointer message:

```
Large tool result (42KB) written to file.
Use the read_file tool to access the content at: /path/to/content.json
Data schema found at: /path/to/schema.json
```
When the result exceeds the truncate token limit:
```javascript
let ratio = text.length / tokenCount; // chars per token
let keepChars = ratio * (budget - marker.length);
let head = Math.round(keepChars * 0.4); // 40% from the start
let tail = keepChars - head;            // 60% from the end
return text.slice(0, head)
  + "\n[Tool response was too long and was truncated.]\n"
  + text.slice(-tail);
```

Rationale: The tail (most recent output) is typically more relevant than the head, so it gets 60% of the budget.
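Wrapped into a self-contained function (the name is illustrative), the truncation reads:

```javascript
function truncateToolResult(text, tokenCount, tokenBudget) {
  const marker = "\n[Tool response was too long and was truncated.]\n";
  const ratio = text.length / tokenCount;    // chars per token
  const keepChars = ratio * (tokenBudget - marker.length);
  const head = Math.round(keepChars * 0.4);  // 40% from the start
  const tail = Math.floor(keepChars - head); // 60% from the end
  return text.slice(0, head) + marker + text.slice(-tail);
}
```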
Cache breakpoints are markers placed at strategic positions in the prompt, enabling API-level prefix caching (supported by both Anthropic and OpenAI). The cache type is always "ephemeral".
Breakpoints are placed at two levels:
| Location | When |
|---|---|
| After system/environment info (`HNe`) | New chats |
| After each user message (`k4`, `Jdt`) | Always |
| After the last tool result in each round (`Vrt`) | When `enableCacheBreakpoints=true` |
Historical turns do not get their own breakpoints (`enableCacheBreakpoints: false`).
After materialization, up to 4 cache breakpoints total are placed at strategic message boundaries:
First pass (reverse):
- Tool-to-non-tool boundaries (first tool message in a sequence)
- Most recent user message
- Pure assistant messages (no tool calls)
Second pass (forward):
- Early System/User messages (the stable prompt prefix)
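A simplified sketch of the two-pass placement. It marks every qualifying boundary in reverse until the cap is hit, which approximates but does not replicate the bundle's exact heuristics; the function name and message shape are illustrative:

```javascript
const MAX_CACHE_BREAKPOINTS = 4;

function placeBreakpoints(messages) {
  const marks = new Set();
  // Reverse pass: recent user messages, tool-sequence starts, pure assistant turns.
  for (let i = messages.length - 1; i >= 0 && marks.size < MAX_CACHE_BREAKPOINTS; i--) {
    const m = messages[i];
    const isToolBoundary = m.role === "tool" && messages[i - 1]?.role !== "tool";
    const isPureAssistant = m.role === "assistant" && !m.toolCalls;
    if (m.role === "user" || isToolBoundary || isPureAssistant) marks.add(i);
  }
  // Forward pass: the stable system/user prefix.
  for (let i = 0; i < messages.length && marks.size < MAX_CACHE_BREAKPOINTS; i++) {
    if (messages[i].role === "system" || messages[i].role === "user") marks.add(i);
  }
  return [...marks].sort((a, b) => a - b);
}
```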
The API response handler tracks:
```javascript
{
  prompt_tokens: inputTokens + cacheCreationTokens + cacheReadTokens,
  prompt_tokens_details: {
    cached_tokens: cacheReadTokens // Cache hit metric
  }
}
```

Both `cache_creation_input_tokens` and `cache_read_input_tokens` are tracked from Anthropic's `message_start` and `message_delta` events.
The WorkspaceChunkSearch orchestrator tries strategies in order:
1. Full Workspace Search (EM)
│ Available? ──yes──▶ Return immediately
│ no
▼
2. Remote Code Search (YE) ──── 12.5s timeout
│ │
│ timeout? │ success?
│ ▼ ▼
│ Race against local Return result
│ ▼
3. Local Embeddings Search (cw) ── 8s timeout
│ │
│ timeout? │ success?
│ ▼ ▼
│ Fall through Return result
│
▼
4. TF-IDF + Semantic Reranking (yU)
│
▼
5. Pure TF-IDF (IM) ──── always available
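The cascade can be sketched with `Promise.race` timeouts; the strategy interfaces and the `withTimeout` helper are illustrative, while the 12.5s and 8s timeouts come from the diagram above:

```javascript
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise(resolve => setTimeout(() => resolve(undefined), ms)),
  ]);
}

async function searchWorkspace(strategies, query) {
  const { fullIndex, remote, localEmbeddings, tfidf } = strategies;
  if (fullIndex) return fullIndex(query);          // 1. available? return immediately
  const remoteHit = remote && await withTimeout(remote(query), 12500);
  if (remoteHit) return remoteHit;                 // 2. remote code search
  const localHit = localEmbeddings && await withTimeout(localEmbeddings(query), 8000);
  if (localHit) return localHit;                   // 3. local embeddings
  return tfidf(query);                             // 4/5. TF-IDF floor, always available
}
```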
- Runs in a dedicated Web Worker (`tfidfWorker.js`)
- Backed by a SQLite database (`local-index.1.db`) for a persistent index
- Indexes up to 25,000 files on initialization
- Subscribes to file create/change/delete events for incremental updates
- Uses `maxSpread: 0.75` — only returns results within 75% of the best score
- Queries are built by joining extracted keywords with commas

- Uses the `text-embedding-3-small-512` model for vector similarity
- Optional reranking service for result-quality improvement
```javascript
const TOKENS_PER_CHUNK = 250;
maxChunks = Math.floor(tokenBudget / TOKENS_PER_CHUNK);
```

Encodings:

- `o200k_base` — GPT-4o and newer (200K vocab)
- `cl100k_base` — GPT-4/3.5-turbo (100K vocab)
Architecture:

- Runs in a dedicated worker thread (`tikTokenizerWorker.js`)
- LRU cache of 5,000 entries (text → token count)
- Worker auto-terminates after 15 seconds of inactivity
- Falls back to in-process mode if the worker is unavailable
Token counting for special content types:
| Content Type | Method |
|---|---|
| Text | BPE tokenization (cached) |
| Opaque | Pre-computed tokenUsage field |
| Image | Vision token formula (GBe) |
| Cache breakpoint | 0 tokens |
| Tool definitions | 16 base + 8 per tool + countObjectTokens() × 1.1 (10% overhead) |
| Tool calls | countMessageObjectTokens() × 1.5 (50% overhead) |
| Messages array | 3 base tokens + sum of message tokens |
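The per-message and tool-definition bookkeeping from the table can be sketched as follows; the tokenizer callbacks (`countTokens`, `countObjectTokens`) are stand-ins for the real BPE counters, and the exact rounding is an assumption:

```javascript
const BASE_TOKENS_PER_MESSAGE = 3;

// 3 base tokens for the array plus (3 + content tokens) per message.
function countMessagesTokens(messages, countTokens) {
  return messages.reduce(
    (sum, m) => sum + BASE_TOKENS_PER_MESSAGE + countTokens(m.content),
    BASE_TOKENS_PER_MESSAGE
  );
}

// 16 base + 8 per tool + object tokens with a 10% overhead multiplier.
function countToolDefinitionTokens(tools, countObjectTokens) {
  const objectTokens = tools.reduce((s, t) => s + countObjectTokens(t), 0);
  return 16 + tools.length * 8 + Math.ceil(objectTokens * 1.1);
}
```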
When precise counting is too expensive:

```javascript
estimatedTokens = text.length * 3 / 4; // ~0.75 tokens per character
```

The inverse is used for character budgets:

```javascript
characterBudget = tokenBudget * 4; // ~4 characters per token
```

```javascript
const DEFAULT_TOKEN_BUDGET = 8192; // 8K tokens

// Available budget after accounting for the current document:
availableBudget = 8192 - (documentLength / 4) - 256;
//                       └─ estimated tokens ─┘  └ overhead ┘

// Character budget conversion:
primaryCharacterBudget = (tokenBudget ?? 7168) * 4; // ~28,672 chars
secondaryCharacterBudget = 8192 * 4;                // ~32,768 chars
```

Context items are split into mandatory (priority ≥ 0.7) and optional pools:
```javascript
Z7 = class {
  mandatory; // High-priority items consume from this
  optional;  // Lower-priority items consume from this
  spend(amount) {
    this.mandatory -= amount;
    this.optional -= amount;
  }
  isExhausted() { return this.mandatory <= 0; }
  isOptionalExhausted() { return this.optional <= 0; }
};
```

| Mode | Optional Budget | Use Case |
|---|---|---|
| `"minimal"` | 0 | Only mandatory context |
| `"fillHalf"` | budget / 2 | Moderate context |
| `"double"` | min(budget, docLength) | Up to document size extra |
| `"fill"` | budget | Maximum context (2× mandatory) |
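A usage sketch for the dual-pool budget; `ContextBudget` stands in for the minified `Z7`, and `admitItems` is an illustrative admission loop, not the bundle's code:

```javascript
class ContextBudget {
  constructor(mandatory, optional) {
    this.mandatory = mandatory;
    this.optional = optional;
  }
  spend(amount) {
    this.mandatory -= amount; // every admitted item charges both pools
    this.optional -= amount;
  }
  isExhausted() { return this.mandatory <= 0; }
  isOptionalExhausted() { return this.optional <= 0; }
}

function admitItems(items, budget) {
  const admitted = [];
  for (const item of items) {
    if (budget.isExhausted()) break;
    // priority >= 0.7 is mandatory; lower-priority items need optional headroom
    if (item.priority < 0.7 && budget.isOptionalExhausted()) continue;
    admitted.push(item);
    budget.spend(item.cost);
  }
  return admitted;
}
```

Once the optional pool is drained, only mandatory items continue to be admitted until the mandatory pool itself runs out.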
- Tracks up to 32 recently active/visible editors in an LRU cache
- Provides up to 10 most recently active neighbor files (excluding current document)
- Used for inline completion context and TypeScript server plugin
- Proactive cache warming triggered on: cursor moves, text changes, inline completion requests
- Time budget: 50ms for cache population
- Race timeout: 20ms — if the TypeScript server doesn't respond in time, yield what we have
| Constant | Value | Description |
|---|---|---|
| `modelMaxPromptTokens` | Model-specific | Maximum prompt tokens for the model |
| `TOKENS_PER_CHUNK` (`yhe`) | 250 | Average tokens per workspace search chunk |
| `MAX_CACHE_BREAKPOINTS` (`Lai`) | 4 | Maximum cache breakpoints in rendered prompt |
| `LRU_TOKENIZER_CACHE_SIZE` | 5,000 | Tiktoken LRU cache entries |
| `LRU_PARSE_TREE_CACHE_SIZE` | 5 per language | Tree-sitter parse tree cache |
| `MAX_WORKSPACE_FILES` | 25,000 | Maximum files indexed by TF-IDF |
| `WORKER_IDLE_TIMEOUT` | 15,000ms | Tokenizer worker auto-termination |
| `DEFAULT_INLINE_TOKEN_BUDGET` (`zFt`) | 8,192 | Default inline completion token budget |
| `MAX_NEIGHBOR_FILES` | 10 | Neighbor files for inline completions |
| `NEIGHBOR_FILE_LRU_SIZE` | 32 | LRU capacity for tracking recent editors |
| `SUMMARIZATION_SHRINK_FACTOR` | 0.85 | Per-iteration budget reduction for document fitting |
| `SUMMARIZATION_MAX_ITERATIONS` | 5 | Maximum shrink iterations |
| `TOOL_RESULT_HEAD_RATIO` | 0.4 | Head portion of truncated tool results |
| `TOOL_RESULT_TAIL_RATIO` | 0.6 | Tail portion of truncated tool results |
| `TOOL_TOKEN_OVERHEAD` | 1.1× | Overhead multiplier for tool definition tokens |
| `TOOL_CALL_OVERHEAD` | 1.5× | Overhead multiplier for tool call tokens |
| `FAST_TOKEN_ESTIMATE` | length × 0.75 | Approximate tokens from character count |
| `CHAR_PER_TOKEN_ESTIMATE` | 4 | Approximate characters per token |
| `BASE_TOKENS_PER_MESSAGE` | 3 | Overhead tokens per chat message |
| `PRUNING_SAVINGS_MARGIN` | 1.25× | Heuristic margin to reduce token recounts during pruning |
| `TFIDF_MAX_SPREAD` | 0.75 | Only return TF-IDF results within 75% of best score |
| `REMOTE_SEARCH_TIMEOUT` | 12,500ms | Timeout for remote code search |
| `LOCAL_EMBEDDINGS_TIMEOUT` | 8,000ms | Timeout for local embeddings search |
| `CACHE_POPULATION_TIMEOUT` | 50ms | Time budget for proactive cache warming |
| `CACHE_RACE_TIMEOUT` | 20ms | Race timeout for TypeScript server response |