|
/** |
|
* NanoGPT API Transformer v0.1.4 |
|
* |
|
 * This transformer handles requests and responses for the NanoGPT API, with comprehensive reasoning-to-thinking conversion,
|
* force reasoning injection, and pseudo-tool syntax cleanup. |
|
* |
|
* ## Purpose: |
|
* 1. Transform requests to always include stream_options: { include_usage: true } parameter (configurable) |
|
* 2. Transform outgoing responses to convert reasoning content to thinking format (Claude Code compatible) (configurable) |
|
* 3. Handle both streaming and non-streaming responses |
|
* 4. Support v1 and v1legacy endpoint formats (uses `reasoning` or `reasoning_content` field in delta) |
|
 * 5. Generate fake responses for max_tokens=1 as a workaround for models that don't follow OpenAI API standards (configurable)
|
* 6. Inject reasoning prompts for non-reasoning models to force step-by-step thinking (configurable) |
|
* 7. Clean up pseudo-tool syntax that may leak into reasoning/content from certain providers (configurable) |
|
* 8. Comprehensive error handling for common HTTP status codes (401, 403, 404, 409, 422, 429, 500) with friendly user messages |
|
* |
|
* ## Core Architecture: |
|
* |
|
* ### Main Class: NanoGPTProductionTransformer |
|
* - Extends base transformer functionality with NanoGPT-specific transformations |
|
* - Handles both request transformation and response transformation |
|
* - Supports streaming and non-streaming response processing |
|
* - Provides comprehensive configuration options for all features |
|
* |
|
* ### Key Components: |
|
* |
|
* #### Configuration System: |
|
 * - Core features (stream options, reasoning-to-thinking, fake responses) are enabled by default; force reasoning and pseudo-tool sanitization are opt-in
|
* - Granular control over each transformation feature |
|
* - Parameter omission support (set to -99 to omit from request) |
|
* - Custom parameter injection via `extra` object |
|
* |
|
* #### Reasoning Model Detection: |
|
* - Maintains comprehensive list of known reasoning models (REASONING_MODELS) |
|
* - Intelligent model name matching with fallback logic |
|
* - Automatic exclusion of non-reasoning variants |
|
* |
|
* #### Stream Processing Pipeline: |
|
* - Line-by-line SSE stream processing with state tracking |
|
* - Reasoning content buffering and conversion |
|
* - Pseudo-tool block detection and removal |
|
* - Force reasoning tag parsing with state machine |
|
* |
|
* #### Response Transformation: |
|
* - Non-streaming JSON response processing |
|
* - Streaming response processing with TransformStream |
|
* - Reasoning-to-thinking format conversion |
|
* - Content sanitization and cleanup |
|
* |
|
* ## Configuration Options: |
|
* |
|
* ### Feature Toggles: |
|
* |
|
* - **enableStreamOptions** (default: true) |
|
* WHY: Required for statusline to display token counts (inputTokens/outputTokens) properly. |
|
* RISK if disabled: Token usage statistics won't be available in Claude Code UI statusline. |
|
* SAFE to disable if: You don't need token count visibility OR your API doesn't support stream_options. |
|
* |
|
* - **enableReasoningToThinking** (default: true) |
|
* WHY: Converts NanoGPT's `reasoning`/`reasoning_content` fields to Claude Code's `thinking` format. |
|
* This makes the model's reasoning process visible in Claude Code's "Thinking" mode UI. |
|
* RISK if disabled: Reasoning content will be lost or appear malformed in Claude Code. |
|
* Users won't see the model's step-by-step thinking process. |
|
* SAFE to disable if: Your model doesn't output reasoning content OR you're using a different client. |
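 *
 *   Example (streaming delta before and after conversion, per the stream processing code below):
 *   ```javascript
 *   // NanoGPT delta as received from the API:
 *   const incoming = { choices: [{ delta: { reasoning: "Let me check the edge cases..." } }] };
 *   // Same delta as forwarded to Claude Code (reasoning moved into a thinking object):
 *   const forwarded = { choices: [{ delta: { thinking: { content: "Let me check the edge cases..." } } }] };
 *   ```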
|
* |
|
* - **enableFakeResponse** (default: true) |
|
* WHY: Workaround for models that hang or return malformed responses when max_tokens=1. |
|
* Claude Code uses max_tokens=1 when switching models via the /model command. |
|
* RISK if disabled: Model switching via /model command may hang, timeout, or crash the client |
|
* if your NanoGPT model doesn't properly handle max_tokens=1 edge case. |
|
* SAFE to disable if: Your model correctly handles max_tokens=1 requests |
|
* OR you've implemented custom timeout/retry logic |
|
* OR you're not using Claude Code's /model switching feature. |
|
* |
|
* - **enableForceReasoning** (default: false) |
|
* WHY: Forces non-reasoning models to think step-by-step using <reasoning_content> tags. |
|
* Injects a reasoning prompt into the conversation to elicit structured thinking. |
|
* RISK if enabled on reasoning models: Double reasoning, wasted tokens, confused output. |
|
* NOTE: Automatically skipped for models in REASONING_MODELS list. |
|
* |
|
* - **sanitizeToolSyntaxInReasoning** (default: false) |
|
* WHY: Removes pseudo-tool markers (like <|tool_call_begin|>) that may leak into reasoning content. |
|
* Some providers (Kimi, NanoGPT) include these markers in reasoning output. |
|
* RISK if disabled: Pseudo-tool syntax may appear in Claude Code's thinking display. |
|
* SAFE to disable if: Your provider doesn't leak pseudo-tool syntax OR you want to preserve original content. |
|
* |
|
* - **sanitizeToolSyntaxInContent** (default: false) |
|
* WHY: Removes pseudo-tool markers that may leak into regular response content. |
|
* Prevents tool-like syntax from appearing in user-facing responses. |
|
* RISK if disabled: Pseudo-tool syntax may appear in user responses. |
|
* SAFE to disable if: Your provider doesn't leak pseudo-tool syntax into content. |
|
* |
|
* ### Sampling Parameters: |
|
* |
|
* **IMPORTANT: Only 5 parameters are included in API requests by default** |
|
* When no user options are provided, the transformer ONLY sends these 5 parameters to the API: |
|
* - temperature, max_tokens, top_p, frequency_penalty, presence_penalty |
|
* All other parameters are omitted unless explicitly set by the user. |
|
* |
|
* **Range Parameter Support** |
|
* All numeric parameters support range specification using the format "min-max". |
|
* When a range is specified, a random value within that range will be generated for each request. |
|
* This enables dynamic parameter variation while maintaining backward compatibility with exact values. |
|
* |
|
* Range Examples: |
|
* - `"temperature": "0.1-0.8"` - Random temperature between 0.1 and 0.8 for each request |
|
* - `"top_p": "0.7-0.95"` - Random top_p between 0.7 and 0.95 for each request |
|
* - `"frequency_penalty": "-0.5-0.5"` - Random frequency penalty between -0.5 and 0.5 |
|
* |
|
* **Default Parameters (included in all requests):** |
|
* |
|
* - **temperature** (default: 0.1, range: 0-2, supports: "0.1-0.8") |
|
* Controls randomness in the output. Higher values make output more random, lower values more deterministic. |
|
* Range example: `"temperature": "0.1-0.8"` generates random values between 0.1 and 0.8. |
|
* |
|
* - **max_tokens** (default: -99, supports: "1000-4000") |
|
* The maximum number of tokens to generate in the response. |
|
 * NOTE: An explicit max_tokens=1 (used for model switching) is preserved; the default of -99 omits max_tokens from the request so Claude Code can configure it.
|
* Range example: `"max_tokens": "1000-4000"` generates random token counts between 1000 and 4000. |
|
* |
|
* - **top_p** (default: 0.95, range: 0-1, supports: "0.7-0.95") |
|
* Controls diversity via nucleus sampling. Lower values make output more focused, higher values more diverse. |
|
* Range example: `"top_p": "0.7-0.95"` generates random values between 0.7 and 0.95. |
|
* |
|
* - **frequency_penalty** (default: 0, range: -2 to 2, supports: "-0.5-0.5") |
|
* Reduces likelihood of repeating the same tokens. Positive values decrease repetition. |
|
* Range example: `"frequency_penalty": "-0.5-0.5"` generates random values between -0.5 and 0.5. |
|
* |
|
* - **presence_penalty** (default: 0, range: -2 to 2, supports: "-0.3-0.3") |
|
* Reduces likelihood of repeating the same topics. Positive values decrease repetition. |
|
* Range example: `"presence_penalty": "-0.3-0.3"` generates random values between -0.3 and 0.3. |
|
* |
|
* **Optional Parameters (only included if explicitly set by user):** |
|
* |
|
 * - **parallel_tool_calls** (default: omitted, supports: "0-1" for boolean randomization)
|
* Whether to enable parallel tool execution in supported models. |
|
* Only included in request if user explicitly sets this parameter. |
|
* Range example: `"parallel_tool_calls": "0-1"` randomly enables/disables parallel tools. |
|
* |
|
* - **top_k** (default: omitted, range: 1-100, supports: "20-80") |
|
* Limits vocabulary to top K tokens. Only included in request if user explicitly sets this parameter. |
|
* Range example: `"top_k": "20-80"` generates random top_k values between 20 and 80. |
|
* |
|
* - **repetition_penalty** (default: omitted, range: 0.1-2.0, supports: "0.8-1.2") |
|
* Alternative repetition control parameter used by some models. Only included if user explicitly sets this parameter. |
|
* Values >1.0 reduce repetition, <1.0 increase it. |
|
* Range example: `"repetition_penalty": "0.8-1.2"` generates random values between 0.8 and 1.2. |
|
* |
|
* ### Optional Parameters (only included if explicitly set by user): |
|
* |
|
* - **reasoning_effort** (default: omitted) |
|
* Controls how much computational effort the model puts into reasoning before generating a response. |
|
* Higher values result in more thorough reasoning but slower responses and higher costs. |
|
* Only applicable to reasoning-capable models. Only included in request if user explicitly sets this parameter. |
|
* |
|
* Valid values: |
|
* - "none": Disables reasoning entirely (fastest) |
|
* - "minimal": Allocates ~10% of max_tokens for reasoning |
|
* - "low": Allocates ~20% of max_tokens for reasoning |
|
* - "medium": Allocates ~50% of max_tokens for reasoning |
|
* - "high": Allocates ~80% of max_tokens for reasoning (slowest but most thorough) |
|
* |
|
* ### Claude Code Integration: |
|
* |
|
* The transformer treats Claude Code's reasoning as a simple on/off switch: |
|
* |
|
* - **When Thinking is ON**: Claude Code sends `{"reasoning": {"effort": "high", "enabled": true}}` |
|
* → Transformer uses the configured `reasoning_effort` from options |
|
* - **When Thinking is OFF**: No reasoning parameter is sent |
|
* → Transformer sets `reasoning_effort` to "none" |
|
* |
|
* **Precedence Order**: |
|
* 1. User's explicit reasoning_effort setting (highest priority) |
|
* 2. Claude Code thinking toggle (ON = configured effort, OFF = "none") |
|
* 3. Transformer configuration default (none) |
|
* |
|
* **Behavior**: |
|
 * - Claude Code's Thinking toggle acts as an enable/disable switch for reasoning
|
* - The actual reasoning effort level is controlled by the transformer's `reasoning_effort` option |
|
* - Users can override by explicitly setting `reasoning_effort` in their request |
|
* |
|
* The transformer creates both `reasoning` and `reasoning_effort` parameters for maximum compatibility with NanoGPT. |
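 *
 * A minimal sketch of the precedence logic described above (illustrative only; the helper
 * name and exact field access are assumptions, not the transformer's actual code):
 * ```javascript
 * // hypothetical helper for illustration
 * function resolveEffort(request, options) {
 *   if (request.reasoning_effort) return request.reasoning_effort;   // 1. explicit user setting wins
 *   if (request.reasoning?.enabled) return options.reasoning_effort; // 2. Thinking ON -> configured effort
 *   return "none";                                                   // Thinking OFF or no config -> "none"
 * }
 * ```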
|
* |
|
* **IMPORTANT - UltraThink Feature**: |
|
* When a user types "UltraThink" in their prompt, Claude Code will automatically toggle the Thinking mode ON. |
|
* This is a built-in Claude Code feature that triggers extended thinking/reasoning for complex tasks. |
|
* The transformer will then receive the reasoning parameter with `enabled: true` and apply the configured effort level. |
|
* |
|
* ### Optional Parameters (only included if explicitly set by user): |
|
* |
|
* - **cache_control** (default: omitted) |
|
* Enables caching for Claude models. Only applicable to Claude models. |
|
* Only included in request if user explicitly sets this parameter. |
|
* - enabled: Whether to enable caching |
|
* - ttl: Cache time-to-live (e.g., "5m", "1h", "1d") |
|
* |
|
* - **extra** (default: omitted) |
|
* Custom parameters to include in the request. Only included if user explicitly sets this parameter. |
|
* Merged into the request object. |
|
* Example: { "custom_param_1": "value1", "custom_param_2": "value2" } |
|
* |
|
* ## Advanced Features: |
|
* |
|
* ### Parameter Omission: |
|
* Set any numeric parameter to -99 to omit it from the request entirely. |
|
* This is useful when you want to let the model use its default values. |
|
* |
|
* ### Force Reasoning System: |
|
* When enableForceReasoning is true, the transformer: |
|
* 1. Checks if the model is in the REASONING_MODELS list |
|
* 2. If not a reasoning model, injects FORCE_REASONING_PROMPT |
|
* 3. Parses responses for <reasoning_content> tags |
|
* 4. Converts tagged content to thinking format |
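 *
 * A rough sketch of the injection step (placing the prompt as a system message is an
 * assumption for illustration; the real logic lives in the request transformation code):
 * ```javascript
 * // hypothetical shape of the check-and-inject step
 * if (options.enableForceReasoning && !isReasoningModel(request.model)) {
 *   // prepend instructions so the model wraps its thinking in <reasoning_content> tags
 *   request.messages.unshift({ role: "system", content: FORCE_REASONING_PROMPT });
 * }
 * ```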
|
* |
|
* ### Pseudo-Tool Syntax Cleanup: |
|
* Handles various pseudo-tool markers that may leak from providers: |
|
* - <|tool_calls_section_begin|> / <|tool_calls_section_end|> |
|
* - <|tool_call_begin|> / <|tool_call_end|> |
|
* - <|tool_call_argument_begin|> / <|tool_call_argument_end|> |
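 *
 * For example, stripPseudoToolSyntax() (defined below) removes only the markers and keeps
 * the surrounding text:
 * ```javascript
 * stripPseudoToolSyntax("Calling <|tool_call_begin|> the tool <|tool_call_end|> now");
 * // => "Calling the tool now" (markers removed, excess whitespace collapsed)
 * ```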
|
* |
|
* ## Usage Examples: |
|
* |
|
* ### Basic Usage (All Features Enabled): |
|
* ```javascript |
|
* const transformer = new NanoGPTProductionTransformer({ enable: true }); |
|
* ``` |
|
* |
|
* ### Custom Configuration: |
|
* ```javascript |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* enableStreamOptions: true, // Keep token counts |
|
* enableReasoningToThinking: true, // Convert reasoning to thinking |
|
* enableFakeResponse: true, // Handle max_tokens=1 edge case |
|
* enableForceReasoning: false, // Don't force reasoning on non-reasoning models |
|
* sanitizeToolSyntaxInReasoning: true, // Clean up reasoning content |
|
* sanitizeToolSyntaxInContent: false, // Keep content as-is |
|
* temperature: 0.8, // Set temperature |
|
* max_tokens: 2000, // Set max tokens |
|
* top_p: 0.9, // Set top_p |
|
* frequency_penalty: 0.1, // Set frequency penalty |
|
* presence_penalty: 0.1, // Set presence penalty |
|
* parallel_tool_calls: true, // Enable parallel tools |
|
* top_k: 50, // Set top_k |
|
* repetition_penalty: 1.1, // Set repetition penalty |
|
* reasoning_effort: "medium", // Set reasoning effort level |
|
* cache_control: { // Enable cache control |
|
* enabled: true, |
|
* ttl: "10m" |
|
* }, |
|
* extra: { // Custom parameters |
|
* custom_param_1: "value1", |
|
* custom_param_2: "value2" |
|
* } |
|
* }); |
|
* ``` |
|
* |
|
* ### Parameter Omission Example: |
|
* ```javascript |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* temperature: -99, // Omit temperature from request |
|
* top_p: -99, // Omit top_p from request |
|
* frequency_penalty: -99 // Omit frequency_penalty from request |
|
* }); |
|
* ``` |
|
* |
|
* ### Range Parameter Example: |
|
* ```javascript |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* temperature: "0.1-0.8", // Random temperature between 0.1 and 0.8 for each request |
|
* max_tokens: "2000-8192", // Random token count between 2000 and 8192 |
|
* top_p: "0.5-0.9", // Random top_p between 0.5 and 0.9 |
|
* frequency_penalty: "0-0.3", // Random frequency penalty between 0 and 0.3 |
|
* presence_penalty: "0-0.2", // Random presence penalty between 0 and 0.2 |
|
* top_k: "20-60", // Random top_k between 20 and 60 |
|
* repetition_penalty: "1.0-1.3" // Random repetition penalty between 1.0 and 1.3 |
|
* }); |
|
* ``` |
|
* |
|
* ### Mixed Exact and Range Parameters: |
|
* ```javascript |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* temperature: "0.2-0.7", // Range: random between 0.2 and 0.7 |
|
* max_tokens: 2048, // Exact: always 2048 |
|
* top_p: "0.8-0.92", // Range: random between 0.8 and 0.92 |
|
* frequency_penalty: 0, // Exact: always 0 |
|
* presence_penalty: "-0.1-0.2" // Range: random between -0.1 and 0.2 |
|
* }); |
|
* ``` |
|
* |
|
* ### Reasoning Effort Examples: |
|
* ```javascript |
|
* // High reasoning effort for complex problems |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* reasoning_effort: "high" // Maximum reasoning (80% of tokens) |
|
* }); |
|
* |
|
* // Fast responses for simple tasks |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* reasoning_effort: "minimal" // Minimal reasoning (10% of tokens) |
|
* }); |
|
* |
|
* // Disable reasoning entirely |
|
* const transformer = new NanoGPTProductionTransformer({ |
|
* enable: true, |
|
* reasoning_effort: "none" // No reasoning (fastest) |
|
* }); |
|
* ``` |
|
* |
|
* ## Configuration in config.json: |
|
* ```json |
|
* { |
|
* "transformers": [ |
|
* { |
|
* "options": { |
|
* // === Core Feature Toggles === |
|
* "enable": true, |
|
* "enableStreamOptions": true, |
|
* "enableReasoningToThinking": true, |
|
* "enableFakeResponse": true, |
|
* "enableForceReasoning": false, |
|
* "sanitizeToolSyntaxInReasoning": false, |
|
* "sanitizeToolSyntaxInContent": false, |
|
* |
|
* |
|
* // === Default Sampling Parameters (Only these 5 parameters will be included in API requests) === |
|
* "temperature": 0.1, |
|
* "max_tokens": -99, |
|
* "top_p": 0.95, |
|
* "frequency_penalty": 0, |
|
* "presence_penalty": 0, |
|
* |
|
* // === Optional Parameters (Only included in API requests when explicitly set by user) === |
|
* "parallel_tool_calls": true, |
|
* "top_k": 40, |
|
* "repetition_penalty": 1.15, |
|
* "reasoning_effort": "none", |
|
* "cache_control": { |
|
* "enabled": false, |
|
* "ttl": "5m" |
|
* }, |
|
* "extra": { |
|
* "custom_param_1": "value1", |
|
* "custom_param_2": "value2" |
|
* } |
|
* }, |
|
* "path": "Your\\Path\\To\\The\\Transformer\\File\\nanogpt.js" |
|
* } |
|
* ] |
|
* } |
|
* ``` |
|
* |
|
* ## Key Features: |
|
* |
|
* ### Request Transformation: |
|
* - Automatically adds stream_options: { include_usage: true } to all requests (if enabled) |
|
* - Sets sampling parameters (temperature, max_tokens, top_p, etc.) |
|
* - Sets reasoning_effort parameter for NanoGPT models (none, minimal, low, medium, high) |
|
* - Handles Claude Code's reasoning format: {"reasoning": {"effort": "high", "enabled": true}} |
|
* - Injects force reasoning prompt for non-reasoning models (if enabled) |
|
* - Merges custom parameters from `extra` object |
|
 * - Preserves an explicit max_tokens=1 (model switching); the default of -99 leaves max_tokens untouched so Claude Code can configure it
|
* |
|
* ### Response Transformation: |
|
* - Converts NanoGPT `reasoning` deltas to Claude Code `thinking` format (if enabled) |
|
* - Buffers reasoning content and emits as structured thinking blocks |
|
* - Handles both streaming and non-streaming responses |
|
* - Processes force reasoning tags (<reasoning_content>) in responses |
|
* - Sanitizes pseudo-tool syntax from reasoning and content (if enabled) |
|
* |
|
* ### Special Features: |
|
* - Enables statusline to properly count inputTokens and outputTokens via include_usage |
|
* - Enables "Thinking" mode display in Claude Code by converting reasoning to thinking format |
|
* - Generates fake responses when max_tokens=1 to handle models that don't respond correctly to Claude Code's /model command (if enabled) |
|
* - Detects and skips reasoning models for force reasoning to avoid double reasoning |
|
* - Removes pseudo-tool blocks that may leak from certain providers |
|
* - **Comprehensive Error Handling**: Intercepts common HTTP errors (401, 403, 404, 409, 422, 429, 500) and returns user-friendly messages instead of letting errors propagate |
|
* |
|
* ### Error Handling: |
|
* - **401 Unauthorized**: Session required - API key invalid or expired |
|
* - **403 Forbidden**: Insufficient permissions for the requested resource |
|
* - **404 Not Found**: Requested resource does not exist |
|
* - **409 Conflict**: Resource conflict (duplicate creation, wrong state) |
|
* - **422 Invalid Input**: Validation failed for request parameters |
|
* - **429 Rate Limited**: Too many requests - please wait and retry |
|
* - **500 Internal Error**: Server encountered unexpected error |
|
* - Works for both streaming and non-streaming responses |
|
* - Provides clear, actionable error messages to users |
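 *
 * A simplified sketch of how a status code maps to a friendly message (illustrative only;
 * the real mapping lives in switch statements inside the response handling, and the exact
 * wording differs):
 * ```javascript
 * // hypothetical helper name and messages, for illustration
 * function friendlyErrorMessage(status) {
 *   switch (status) {
 *     case 401: return "Unauthorized: your API key is invalid or expired.";
 *     case 429: return "Rate limited: too many requests, please wait and retry.";
 *     case 500: return "Internal error: the server hit an unexpected problem.";
 *     default:  return `Request failed with status ${status}.`;
 *   }
 * }
 * ```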
|
* |
|
* ## Supported Models: |
|
* |
|
* ### Built-in Reasoning Models (Force Reasoning Skipped): |
|
* The transformer automatically detects these models and skips force reasoning: |
|
* - DeepSeek: deepseek-reasoner, deepseek-v3.2:thinking, etc. |
|
* - GLM: GLM-4.6:thinking, GLM-4.5-Air:thinking, etc. |
|
* - Qwen: Qwen3-235B-A22B-Thinking-2507, qwq-32b, qvq-max, etc. |
|
* - Hermes: Hermes-4-70B:thinking, etc. |
|
* - Moonshot: kimi-k2-thinking, etc. |
|
* - And many more... |
|
* |
|
* ## Utility Functions: |
|
* |
|
* ### Model Detection: |
|
* - `isReasoningModel(modelName)` - Checks if a model has built-in reasoning capabilities |
|
* |
|
* ### Content Processing: |
|
* - `stripPseudoToolBlocks(text)` - Removes complete pseudo tool sections |
|
* - `stripPseudoToolSyntax(text)` - Removes individual pseudo tool markers |
|
* |
|
* ### Response Generation: |
|
* - `shouldOmitParameter(value)` - Checks if parameter should be omitted (-99) |
|
* - `generateId()` - Generates realistic OpenAI-style IDs |
|
* - `createFakeResponse(model)` - Creates fake responses for max_tokens=1 |
|
* |
|
* ### Stream Processing: |
|
* - `processStreamLine()` - Processes individual SSE stream lines with reasoning-to-thinking conversion |
|
* - `processForceReasoningContent()` - Handles force reasoning tag parsing using state machine |
|
* - `handleNonStreamingResponse()` - Processes JSON responses (async function) |
|
* - `handleStreamingResponse()` - Processes streaming responses (async function) |
|
* - `processStream()` - Main stream processing loop (async function, invoked by handleStreamingResponse) |
|
* |
|
* ## Error Handling: |
|
* |
|
* ### Stream Processing: |
|
* - Graceful handling of malformed JSON in SSE streams |
|
* - Buffer size limits (1MB) to prevent memory issues |
|
* - Proper stream cleanup and error propagation |
|
* |
|
* ### Response Processing: |
|
* - Fallback content when pseudo-tool removal empties the response |
|
* - Preservation of original response structure |
|
* - Safe handling of missing or malformed fields |
|
* |
|
* ## Performance Considerations: |
|
* |
|
* ### Memory Usage: |
|
* - Stream buffering with size limits |
|
* - Reasoning content buffering with cleanup |
|
* - Efficient string processing for content sanitization |
|
* |
|
* ### Processing Efficiency: |
|
* - Line-by-line stream processing to minimize latency |
|
* - Conditional feature execution based on configuration |
|
* - Early termination for disabled features |
|
* |
|
* ## Compatibility: |
|
* |
|
* ### API Standards: |
|
* - OpenAI-compatible request/response format |
|
* - Server-Sent Events (SSE) streaming support |
|
* - Standard JSON response handling |
|
* |
|
* ### Claude Code Integration: |
|
* - Thinking mode format compatibility |
|
* - Token count reporting via stream_options |
|
* - Model switching support via max_tokens=1 handling |
|
* |
|
* ## Troubleshooting: |
|
* |
|
* ### Common Issues: |
|
* 1. **Token counts not showing**: Enable enableStreamOptions |
|
* 2. **Reasoning not visible**: Enable enableReasoningToThinking |
|
* 3. **Model switching hangs**: Enable enableFakeResponse |
|
* 4. **Pseudo-tool syntax appears**: Enable sanitizeToolSyntaxInReasoning/Content |
|
* 5. **Double reasoning on reasoning models**: Check REASONING_MODELS list |
|
* 6. **Reasoning effort not working**: Verify reasoning_effort value is one of: "none", "minimal", "low", "medium", "high" |
|
* 7. **Error messages not appearing**: Check that error interception is enabled (built-in, always active) |
|
* 8. **Custom error handling needed**: Modify error messages in the switch statements in transformResponseOut and fetch interceptor |
|
* |
|
*/ |
|
|
|
const fs = require('fs'); |
|
const path = require('path'); |
|
const os = require('os'); |
|
|
|
|
|
// ============================================================================ |
|
// FORCE REASONING CONFIGURATION |
|
// ============================================================================ |
|
|
|
/** |
|
* List of reasoning models that already have built-in reasoning capabilities. |
|
* The forceReasoning feature should NOT be applied to these models. |
|
* This list is used to filter out models that don't need prompt injection. |
|
*/ |
|
const REASONING_MODELS = [ |
|
// DeepSeek Reasoning Models |
|
"deepseek/deepseek-v3.2:thinking", |
|
"deepseek-reasoner", |
|
"deepseek-reasoner-cheaper", |
|
"deepseek-r1", |
|
"deepseek-ai/deepseek-v3.2-exp-thinking", |
|
"deepseek-ai/DeepSeek-V3.1:thinking", |
|
"deepseek-ai/DeepSeek-V3.1-Terminus:thinking", |
|
"deepseek/deepseek-v3.2-speciale", |
|
|
|
// Moonshot Reasoning Models |
|
"moonshotai/kimi-k2-thinking", |
|
|
|
// GLM Reasoning Models |
|
"GLM-4.5-Air-Iceblink:thinking", |
|
"GLM-4.5-Air-Steam-v1:thinking", |
|
"z-ai/glm-4.6:thinking", |
|
"zai-org/GLM-4.5-Air:thinking", |
|
"THUDM/GLM-Z1-32B-0414", |
|
"THUDM/GLM-Z1-9B-0414", |
|
"zai-org/GLM-4.5:thinking", |
|
|
|
// Hermes Reasoning Models |
|
"NousResearch/Hermes-4-70B:thinking", |
|
"nousresearch/hermes-4-405b:thinking", |
|
|
|
// Qwen Reasoning Models |
|
"Qwen/Qwen3-235B-A22B-Thinking-2507", |
|
"qwen3-vl-235b-a22b-thinking", |
|
"qwq-32b", |
|
"qwen/qwq-32b-preview", |
|
"qvq-max", |
|
|
|
// Other Reasoning Models |
|
"pamanseau/OpenReasoning-Nemotron-32B", |
|
"LLM360/K2-Think", |
|
"tngtech/DeepSeek-TNG-R1T2-Chimera", |
|
"tngtech/DeepSeek-R1T-Chimera", |
|
"Steelskull/L3.3-Nevoria-R1-70b", |
|
"Steelskull/L3.3-Electra-R1-70b", |
|
"Steelskull/L3.3-Damascus-R1", |
|
"inflatebot/MN-12B-Mag-Mell-R1", |
|
"Steelskull/L3.3-Cu-Mai-R1-70b", |
|
"Llama-3.3-70B-Electra-R1", |
|
"Llama-3.3-70B-Vulpecula-R1", |
|
"Llama-3.3-70B-Fallen-R1-v1", |
|
"Llama-3.3-70B-Cu-Mai-R1", |
|
"Llama-3.3-70B-Mokume-Gane-R1", |
|
"huihui-ai/DeepSeek-R1-Distill-Qwen-32B-abliterated", |
|
"huihui-ai/DeepSeek-R1-Distill-Llama-70B-abliterated", |
|
|
|
// Newly added models |
|
"Alibaba-NLP/Tongyi-DeepResearch-30B-A3B", |
|
"Envoid/Llama-3.05-Nemotron-Tenyxchat-Storybreaker-70B", |
|
"Ling-Flash-2.0", |
|
"MiniMax-M2", |
|
"Salesforce/Llama-xLAM-2-70b-fc-r", |
|
"deepcogito/cogito-v1-preview-qwen-32B", |
|
"huihui-ai/Llama-3.1-Nemotron-70B-Instruct-HF-abliterated", |
|
"inclusionai/ling-1t", |
|
"meituan-longcat/LongCat-Flash-Chat-FP8", |
|
"microsoft/MAI-DS-R1-FP8", |
|
"minimax/minimax-01", |
|
"nvidia/Llama-3.1-Nemotron-70B-Instruct-HF", |
|
"nvidia/Llama-3.1-Nemotron-Ultra-253B-v1", |
|
"nvidia/Llama-3.3-Nemotron-Super-49B-v1", |
|
"nvidia/Llama-3_3-Nemotron-Super-49B-v1_5", |
|
"nvidia/nvidia-nemotron-nano-9b-v2" |
|
]; |
|
|
|
/** |
|
* The reasoning prompt that forces non-reasoning models to think step-by-step. |
|
* This prompt is injected into requests when enableForceReasoning is true and |
|
* the model is NOT in the REASONING_MODELS list. |
|
*/ |
|
const FORCE_REASONING_PROMPT = `You are an expert reasoning model. |
|
|
|
Always think step by step before answering. Even if the problem seems simple, always write down your reasoning process explicitly. |
|
|
|
Output format: |
|
<reasoning_content> |
|
Your detailed thinking process goes here |
|
</reasoning_content> |
|
Your final answer must follow after the closing tag above.`; |
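
// With the prompt above, a compliant model is expected to answer in this shape, which the
// force-reasoning parsing below then converts into a thinking block plus a final answer:
//
//   <reasoning_content>
//   First I compare the two options, then pick the cheaper one...
//   </reasoning_content>
//   The cheaper option is A.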
|
|
|
/** |
|
* Check if a model is a reasoning model that already has built-in reasoning capabilities. |
|
* Uses a three-tier matching strategy: |
|
* 1. Exact match against REASONING_MODELS list (highest priority) |
|
* 2. Check against known non-reasoning variants to avoid false positives |
|
* 3. Fallback pattern matching for unknown models (looks for :thinking, -r1, reasoner, etc.) |
|
* @param {string} modelName - The model name to check (case-insensitive) |
|
* @returns {boolean} True if the model is a reasoning model, false otherwise |
|
* @example |
|
* isReasoningModel("deepseek-reasoner") // true (exact match) |
|
* isReasoningModel("z-ai/glm-4.6:thinking") // true (exact match) |
|
* isReasoningModel("z-ai/glm-4.6") // false (non-reasoning variant) |
|
* isReasoningModel("some-model:thinking") // true (pattern match) |
|
*/ |
|
function isReasoningModel(modelName) { |
|
if (!modelName) return false; |
|
|
|
const normalizedModel = modelName.toLowerCase(); |
|
|
|
// First check for exact matches (highest priority) |
|
if (REASONING_MODELS.some(reasoningModel => |
|
reasoningModel.toLowerCase() === normalizedModel)) { |
|
return true; |
|
} |
|
|
|
// Check if the model is explicitly a non-reasoning variant |
|
// This prevents false positives where non-reasoning models match reasoning models |
|
const nonReasoningVariants = [ |
|
"z-ai/glm-4.6", |
|
"GLM-4.5-Air-Iceblink", |
|
"GLM-4.5-Air-Steam-v1", |
|
"deepseek/deepseek-v3.2", |
|
"deepseek-ai/DeepSeek-V3.1", |
|
"moonshotai/kimi-k2-instruct", |
|
"NousResearch/Hermes-4-70B", |
|
"nousresearch/hermes-4-405b", |
|
"qwen3-vl-235b-a22b-instruct", |
|
"Qwen/Qwen3-235B-A22B" |
|
]; |
|
|
|
if (nonReasoningVariants.some(variant => |
|
variant.toLowerCase() === normalizedModel)) { |
|
return false; |
|
} |
|
|
|
// For models not in exact lists, check for reasoning indicators |
|
// Only apply to unknown models to avoid false positives |
|
const hasReasoningIndicators = [ |
|
":thinking", |
|
"-thinking", |
|
"-r1", |
|
"reasoner", |
|
"think", |
|
"qwq", |
|
"qvq" |
|
]; |
|
|
|
return hasReasoningIndicators.some(indicator => |
|
normalizedModel.includes(indicator)); |
|
} |
|
|
|
// ============================================================================ |
|
// PSEUDO-TOOL SYNTAX CLEANUP UTILITIES |
|
// ============================================================================ |
|
|
|
/** |
|
* CHAOS TEST CASES FOR PSEUDO-TOOL SYNTAX |
|
* |
|
* Test Case 1 - Intentional pseudo-markers in reasoning: |
|
* Input: "Let me use the tool: <|tool_call_begin|>\n{\"name\": \"bash\", \"arguments\": \"{\\\"command\\\": \\\"ls -la\\\"}\"}\n<|tool_call_end|>\n\nNow I'll execute it." |
|
* Expected: "Let me use the tool: \n\nNow I'll execute it." |
|
* |
|
* Test Case 2 - Accidental marker-like strings in user content: |
|
* Input: "To use Kimi's tools, you write: <|tool_call_begin|> followed by JSON" |
|
* Expected: "To use Kimi's tools, you write: followed by JSON" (sanitized if sanitizeToolSyntaxInContent=true) |
|
* |
|
* Test Case 3 - Broken/malformed markers: |
|
* Input: "Here's my command: <|tool_call_begin\n{\"command\": \"test\"}\n<|tool_call_end|>" |
|
* Expected: "Here's my command: \n{\"command\": \"test\"}\n" |
|
* |
|
* Test Case 4 - Attempted tool injection in reasoning: |
|
* Input: "Reasoning: Let's call tool_search({'query': 'malicious'}) to get data" |
|
* Expected: "Reasoning: Let's call tool_search({'query': 'malicious'}) to get data" (no tool_calls created) |
|
* |
|
* Test Case 5 - Nested markers: |
|
* Input: "Outer: <|tool_call_begin|> Inner: <|tool_call_argument_begin|> data <|tool_call_argument_end|> <|tool_call_end|>" |
|
* Expected: "Outer: Inner: data " |
|
*/ |
|
|
|
/** |
|
* Remove complete pseudo tool-call sections that span multiple lines. |
|
* Handles the full section markers: <|tool_calls_section_begin|> ... <|tool_calls_section_end|> |
|
* @param {string} text - The text to sanitize |
|
* @returns {string} Text with pseudo tool blocks removed, or original if not a string |
|
*/ |
|
function stripPseudoToolBlocks(text) { |
|
if (typeof text !== "string" || !text) return text; |
|
return text.replace( |
|
/<\|tool_calls_section_begin\|>[\s\S]*?<\|tool_calls_section_end\|>/g, |
|
"" |
|
); |
|
} |
|
|
|
/** |
|
* Strip provider-specific pseudo tool-call markers that sometimes leak into |
|
* reasoning/content (e.g. Kimi <|tool_call_*|> tags). |
|
* This does NOT touch real OpenAI tool_calls fields. |
|
* |
|
* @param {string} text - The text to sanitize |
|
* @returns {string} - The sanitized text |
|
*/ |
|
function stripPseudoToolSyntax(text) { |
|
if (typeof text !== "string" || !text) return text; |
|
|
|
let cleaned = text; |
|
const originalLength = cleaned.length; |
|
|
|
// Remove known Kimi / NanoGPT tool markers |
|
cleaned = cleaned.replace(/<\|tool_calls_section_begin\|>/g, ""); |
|
cleaned = cleaned.replace(/<\|tool_calls_section_end\|>/g, ""); |
|
cleaned = cleaned.replace(/<\|tool_call_begin\|>/g, ""); |
|
cleaned = cleaned.replace(/<\|tool_call_end\|>/g, ""); |
|
cleaned = cleaned.replace(/<\|tool_call_argument_begin\|>/g, ""); |
|
cleaned = cleaned.replace(/<\|tool_call_argument_end\|>/g, ""); |
|
|
|
// Optional: normalize excessive whitespace created by removals |
|
cleaned = cleaned.replace(/\s{2,}/g, " "); |
|
|
|
return cleaned; |
|
} |
|
|
|
// ============================================================================ |
|
// RANGE PARAMETER UTILITIES |
|
// ============================================================================ |
|
|
|
/** |
|
* RANGE PARAMETER SUPPORT - Random Value Generation from Ranges |
|
* |
|
* This section provides utilities for supporting parameter ranges in the format |
|
* "min-max". When a parameter is specified as a range string, the system will |
|
* generate a random value within that range for each request, enabling dynamic |
|
* parameter variation while maintaining backward compatibility with exact values. |
|
*/ |
|
|
|
/** |
|
* Parses a range string in the format "min-max" and returns range information. |
|
* Supports negative numbers and floating point values. |
|
* @param {string|number} value - The value to parse (can be a range string like "0.1-0.8" or exact number) |
|
* @returns {Object|null} Range object with min, max, and isRange properties, or null if invalid |
|
* @returns {number} return.min - Minimum value (equals max for non-ranges) |
|
* @returns {number} return.max - Maximum value (equals min for non-ranges) |
|
* @returns {boolean} return.isRange - True if input was a range string, false for exact values |
|
* @example |
|
* parseRange(0.5) // { min: 0.5, max: 0.5, isRange: false } |
|
* parseRange("0.2-0.8") // { min: 0.2, max: 0.8, isRange: true } |
|
* parseRange("-0.5-0.5") // { min: -0.5, max: 0.5, isRange: true } |
|
* parseRange("invalid") // null |
|
*/ |
|
function parseRange(value) { |
|
// If it's already a number, return as-is with isRange: false |
|
if (typeof value === 'number') { |
|
return { |
|
min: value, |
|
max: value, |
|
isRange: false |
|
}; |
|
} |
|
|
|
// If it's not a string, return null (invalid) |
|
if (typeof value !== 'string') { |
|
return null; |
|
} |
|
|
|
// Trim whitespace |
|
const trimmed = value.trim(); |
|
|
|
// Check if it matches the range pattern "min-max" |
|
const rangeMatch = trimmed.match(/^(-?\d+\.?\d*)\s*-\s*(-?\d+\.?\d*)$/); |
|
|
|
if (!rangeMatch) { |
|
// Not a range, try to parse as single number |
|
const numValue = parseFloat(trimmed); |
|
if (!isNaN(numValue)) { |
|
return { |
|
min: numValue, |
|
max: numValue, |
|
isRange: false |
|
}; |
|
} |
|
return null; // Invalid format |
|
} |
|
|
|
const min = parseFloat(rangeMatch[1]); |
|
const max = parseFloat(rangeMatch[2]); |
|
|
|
// Validate the range |
|
if (isNaN(min) || isNaN(max)) { |
|
return null; |
|
} |
|
|
|
// Ensure min <= max |
|
const rangeMin = Math.min(min, max); |
|
const rangeMax = Math.max(min, max); |
|
|
|
return { |
|
min: rangeMin, |
|
max: rangeMax, |
|
isRange: true |
|
}; |
|
} |
|
|
|
/** |
|
* Generates a random number within the specified range using Math.random(). |
|
* @param {number} min - Minimum value (inclusive) |
|
 * @param {number} max - Maximum value (exclusive, since Math.random() never returns 1)
 * @returns {number} Random floating-point number in the range [min, max)
|
*/ |
|
function getRandomInRange(min, max) { |
|
const randomValue = Math.random() * (max - min) + min; |
|
return randomValue; |
|
} |
|
|
|
/** |
|
* Resolves a parameter value to a concrete number for the current request. |
|
* If the parameter is a range string, generates a random value within that range. |
|
* If the parameter is an exact number, returns it unchanged. |
|
* @param {string|number} value - The parameter value (range string like "0.1-0.8" or exact number) |
|
* @returns {number|null} Resolved numeric value, or null if input is invalid |
|
* @example |
|
 * resolveParameterValue(0.5)       // 0.5 (exact value)
|
* resolveParameterValue("0.2-0.8", 'temperature') // random value between 0.2 and 0.8 |
|
*/ |
|
function resolveParameterValue(value) { |
|
const parsed = parseRange(value); |
|
|
|
if (!parsed) { |
|
return null; // Invalid parameter |
|
} |
|
|
|
if (parsed.isRange) { |
|
// Generate random value within range for this request |
|
    return getRandomInRange(parsed.min, parsed.max);
|
} else { |
|
// Return exact value |
|
return parsed.min; // min == max for non-ranges |
|
} |
|
} |
|
|
|
// ============================================================================ |
|
// FAKE RESPONSE GENERATION UTILITIES |
|
// ============================================================================ |
|
|
|
/** |
|
* FAKE RESPONSE GENERATION - Workaround for Model Switch Issues |
|
* |
|
* This section provides a workaround for models that don't respond correctly to |
|
* max_tokens=1 requests. Claude Code uses max_tokens=1 when switching models |
|
* via the `/model` command, but some models don't follow OpenAI API standards |
|
* for this edge case and return malformed responses. Instead of letting these |
|
* problematic responses break the model switching flow, we generate a fake |
|
* response that follows proper OpenAI API format. |
|
*/ |
|
|
|
/** |
|
* Checks if a parameter value should be omitted (when value equals -99) |
|
* @param {*} value - The parameter value to check |
|
* @returns {boolean} True if the parameter should be omitted, false otherwise |
|
*/ |
|
function shouldOmitParameter(value) { |
|
return value === -99; |
|
} |
|
|
|
/** |
|
* Generates a realistic OpenAI-style ID for fake responses |
|
* @returns {string} Generated ID in the format chatcmpl-XXXXXXXXXXXXXXaaaa |
|
*/ |
|
function generateId() { |
|
let result = 'chatcmpl-'; |
|
// 14 random digits |
|
for (let i = 0; i < 14; i++) { |
|
result += Math.floor(Math.random() * 10); |
|
} |
|
// 4 random lowercase letters |
|
const letters = 'abcdefghijklmnopqrstuvwxyz'; |
|
for (let i = 0; i < 4; i++) { |
|
result += letters.charAt(Math.floor(Math.random() * letters.length)); |
|
} |
|
return result; |
|
} |
|
|
|
/** |
|
* Creates a minimal fake response when max_tokens=1 |
|
* Used as workaround for models that don't handle max_tokens=1 correctly |
|
* @param {string} model - The model name to include in the response |
|
* @returns {Object} Fake response object following OpenAI API format |
|
*/ |
|
function createFakeResponse(model) { |
|
return { |
|
"id": generateId(), |
|
"object": "chat.completion", |
|
"created": Math.floor(Date.now() / 1000), |
|
"model": model, |
|
"choices": [ |
|
{ |
|
"index": 0, |
|
"finish_reason": "length", |
|
"message": { |
|
"role": "assistant", |
|
"content": "\n" |
|
} |
|
} |
|
] |
|
}; |
|
} |
|
|
|
// ============================================================================ |
|
// STREAM PROCESSING UTILITIES |
|
// ============================================================================ |
|
|
|
/** |
|
* Processes a single line from the SSE stream with reasoning-to-thinking conversion |
|
* @param {string} line - The line to process |
|
* @param {TransformStreamDefaultController} controller - Stream controller |
|
* @param {TextEncoder} encoder - Text encoder |
|
* @param {Object} context - Stream state context for reasoning tracking |
|
* @param {boolean} context.hasTextContent - Whether non-reasoning content has started |
|
* @param {string} context.reasoningBuffer - Buffer for accumulated reasoning content |
|
* @param {boolean} context.isReasoningFinished - Whether reasoning phase has completed |
|
* @param {boolean} context.enableReasoningToThinking - Whether to convert reasoning to thinking format |
|
* @param {boolean} context.sanitizeToolSyntaxInReasoning - Whether to sanitize pseudo-tool syntax in reasoning |
|
* @param {boolean} context.sanitizeToolSyntaxInContent - Whether to sanitize pseudo-tool syntax in content |
|
* @param {boolean} context.insidePseudoToolBlock - Whether currently inside a pseudo-tool block |
|
* @param {boolean} context.forceReasoningApplied - Whether force reasoning prompt was injected |
|
* @param {string} context.forceReasoningState - State machine state: "SEARCHING", "REASONING", or "FINAL" |
|
* @param {string} context.forceReasoningBuffer - Buffer for force reasoning content |
|
* @param {string} context.forceReasoningPartialMatch - Partial tag match buffer |
|
* @param {string} context.forceReasoningAccumulatedWhitespace - Accumulated whitespace after reasoning |
|
*/ |
|
function processStreamLine(line, controller, encoder, context) { |
|
// Skip empty lines |
|
if (!line.trim()) { |
|
return; |
|
} |
|
|
|
// Handle [DONE] marker - streaming footer |
|
if (line.trim() === "data: [DONE]") { |
|
controller.enqueue(encoder.encode(line + '\n')); |
|
return; |
|
} |
|
|
|
// Process SSE data lines with reasoning-to-thinking transformation |
|
if (line.startsWith("data: ")) { |
|
try { |
|
const jsonData = JSON.parse(line.slice(6)); |
|
|
|
// Handle pseudo-tool block detection and content dropping |
|
const choice = jsonData.choices?.[0]; |
|
const delta = choice?.delta || {}; |
|
const rawContent = delta.content; |
|
|
|
// 1. Detect tool section markers on the raw content |
|
if (typeof rawContent === "string") { |
|
if (rawContent.includes("<|tool_calls_section_begin|>")) { |
|
context.insidePseudoToolBlock = true; |
|
} |
|
if (rawContent.includes("<|tool_calls_section_end|>")) { |
|
context.insidePseudoToolBlock = false; |
|
} |
|
|
|
// 2. Now sanitize, but only if we will forward it |
|
if (context.sanitizeToolSyntaxInContent) { |
|
delta.content = stripPseudoToolSyntax(rawContent); |
|
} else { |
|
delta.content = rawContent; |
|
} |
|
|
|
// 3. If we are inside the pseudo tool block, DROP the content entirely |
|
if (context.insidePseudoToolBlock) { |
|
// Do not let this contribute to user-visible content |
|
delete delta.content; |
|
} else { |
|
// Outside pseudo tool block, normal behavior |
|
// Track if normal content has started |
|
if (delta.content && !context.hasTextContent) { |
|
context.hasTextContent = true; |
|
} |
|
|
|
// 4. Add fallback if we have reasoning but no content yet |
|
if (!delta.content && !context.hasTextContent && context.reasoningBuffer && |
|
rawContent && rawContent.includes("<|tool_calls_section_end|>")) { |
|
delta.content = "[Internal tool call removed; see Thinking for details.]"; |
|
context.hasTextContent = true; |
|
} |
|
|
|
jsonData.choices[0].delta = delta; |
|
} |
|
} |
|
|
|
// --- Reasoning / Thinking Logic --- |
|
// NanoGPT v1 endpoint returns `reasoning` or `reasoning_content` field. Transform to `thinking` object (if enabled). |
|
const rawReasoning = jsonData.choices?.[0]?.delta?.reasoning || jsonData.choices?.[0]?.delta?.reasoning_content; |
|
|
|
if (rawReasoning && context.enableReasoningToThinking) { |
|
const sanitizedReasoning = context.sanitizeToolSyntaxInReasoning |
|
? stripPseudoToolSyntax(rawReasoning) |
|
: rawReasoning; |
|
|
|
context.reasoningBuffer += sanitizedReasoning; |
|
|
|
// Create a modified data packet containing 'thinking' |
|
const modifiedData = { |
|
...jsonData, |
|
choices: [{ |
|
...jsonData.choices?.[0], |
|
delta: { |
|
...jsonData.choices[0].delta, |
|
thinking: { content: sanitizedReasoning } |
|
} |
|
}] |
|
}; |
|
|
|
// Clean up the original reasoning fields |
|
if (modifiedData.choices?.[0]?.delta) { |
|
delete modifiedData.choices[0].delta.reasoning; |
|
delete modifiedData.choices[0].delta.reasoning_content; |
|
} |
|
|
|
const output = `data: ${JSON.stringify(modifiedData)}\n\n`; |
|
controller.enqueue(encoder.encode(output)); |
|
return; |
|
} |
|
|
|
// Check if reasoning just finished (content appeared, reasoning buffered, but not marked complete) |
|
if (context.enableReasoningToThinking && jsonData.choices?.[0]?.delta?.content && context.reasoningBuffer && !context.isReasoningFinished) { |
|
context.isReasoningFinished = true; |
|
const signature = Date.now().toString(); |
|
|
|
// Send a special packet summarizing the full thinking content |
|
const thinkingSummary = { |
|
...jsonData, |
|
choices: [{ |
|
...jsonData.choices?.[0], |
|
delta: { |
|
...jsonData.choices[0].delta, |
|
content: null, // Clear content for this specific thinking packet |
|
thinking: { |
|
content: context.reasoningBuffer, |
|
signature: signature |
|
} |
|
} |
|
}] |
|
}; |
|
|
|
if (thinkingSummary.choices?.[0]?.delta) { |
|
delete thinkingSummary.choices[0].delta.reasoning; |
|
delete thinkingSummary.choices[0].delta.reasoning_content; |
|
} |
|
|
|
const thinkingOutput = `data: ${JSON.stringify(thinkingSummary)}\n\n`; |
|
controller.enqueue(encoder.encode(thinkingOutput)); |
|
} |
|
|
|
      // Remove any leftover reasoning fields not handled above (only if conversion is enabled)
|
if (context.enableReasoningToThinking) { |
|
if (jsonData.choices?.[0]?.delta?.reasoning) { |
|
delete jsonData.choices[0].delta.reasoning; |
|
} |
|
if (jsonData.choices?.[0]?.delta?.reasoning_content) { |
|
delete jsonData.choices[0].delta.reasoning_content; |
|
} |
|
} |
|
|
|
// --- Force Reasoning Tag Parsing (for <reasoning_content> tags in content) --- |
|
// This handles responses from models that were injected with FORCE_REASONING_PROMPT |
|
if (context.forceReasoningApplied && context.enableReasoningToThinking) { |
|
const contentToProcess = jsonData.choices?.[0]?.delta?.content; |
|
|
|
if (typeof contentToProcess === "string") { |
|
// Process force reasoning tags using state machine |
|
const result = processForceReasoningContent(contentToProcess, jsonData, context, encoder); |
|
|
|
if (result.chunks.length > 0) { |
|
// Emit all generated chunks |
|
for (const chunk of result.chunks) { |
|
controller.enqueue(encoder.encode(chunk)); |
|
} |
|
return; // Don't forward the original, we've handled it |
|
} |
|
|
|
// If we're currently in REASONING state or SEARCHING, don't forward content yet |
|
if (context.forceReasoningState === "REASONING" || |
|
(context.forceReasoningState === "SEARCHING" && context.forceReasoningPartialMatch)) { |
|
return; |
|
} |
|
} |
|
} |
|
|
|
// Forward the processed chunk |
|
const output = `data: ${JSON.stringify(jsonData)}\n\n`; |
|
controller.enqueue(encoder.encode(output)); |
|
} catch (error) { |
|
// If parsing fails, still pass through the original line |
|
controller.enqueue(encoder.encode(line + '\n')); |
|
} |
|
} else { |
|
// Pass through non-data lines (event: lines, etc.) unchanged |
|
controller.enqueue(encoder.encode(line + '\n')); |
|
} |
|
} |
|
|
|
/** |
|
* Processes force reasoning content with <reasoning_content> tags using a state machine. |
|
* This parses content that contains reasoning wrapped in special tags and converts it to thinking format. |
|
* |
|
* @param {string} content - The content to process |
|
* @param {Object} chunk - The original SSE chunk data (OpenAI streaming format) |
|
* @param {Object} context - The stream context with state tracking (modified in-place) |
|
* @param {TextEncoder} encoder - Text encoder (unused, kept for API consistency) |
|
* @returns {{chunks: string[]}} Object containing array of SSE-formatted chunk strings to emit |
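 * @example
 * // With context.forceReasoningState === "SEARCHING" and content
 * // "<reasoning_content>step 1</reasoning_content>Answer", the returned chunks are:
 * //   1) a thinking chunk carrying "step 1",
 * //   2) a thinking chunk carrying only a signature,
 * //   3) a content chunk carrying "Answer",
 * // and context.forceReasoningState is left in "FINAL".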
|
*/ |
|
function processForceReasoningContent(content, chunk, context, encoder) { |
|
const chunks = []; |
|
let workingContent = context.forceReasoningPartialMatch + content; |
|
context.forceReasoningPartialMatch = ""; |
|
|
|
while (workingContent.length > 0) { |
|
if (context.forceReasoningState === "SEARCHING") { |
|
// Look for reasoning start tag |
|
const startIndex = workingContent.indexOf(FORCE_REASONING_START_TAG); |
|
if (startIndex !== -1) { |
|
// Found start tag, switch to REASONING state |
|
workingContent = workingContent.substring(startIndex + FORCE_REASONING_START_TAG.length); |
|
context.forceReasoningState = "REASONING"; |
|
} else { |
|
// Check for partial match at the end (tag might be split across chunks) |
|
for (let i = FORCE_REASONING_START_TAG.length - 1; i > 0; i--) { |
|
if (workingContent.endsWith(FORCE_REASONING_START_TAG.substring(0, i))) { |
|
context.forceReasoningPartialMatch = workingContent.substring(workingContent.length - i); |
|
break; |
|
} |
|
} |
|
workingContent = ""; |
|
} |
|
} else if (context.forceReasoningState === "REASONING") { |
|
// Look for reasoning end tag |
|
const endIndex = workingContent.indexOf(FORCE_REASONING_END_TAG); |
|
if (endIndex !== -1) { |
|
// Found end tag, extract reasoning content |
|
const reasoningContent = workingContent.substring(0, endIndex); |
|
|
|
if (reasoningContent.length > 0) { |
|
context.forceReasoningBuffer += reasoningContent; |
|
|
|
// Create thinking delta with reasoning content |
|
const thinkingDelta = { |
|
...chunk.choices[0].delta, |
|
thinking: { content: reasoningContent } |
|
}; |
|
delete thinkingDelta.content; |
|
|
|
const thinkingChunk = { |
|
...chunk, |
|
choices: [{ ...chunk.choices[0], delta: thinkingDelta }] |
|
}; |
|
|
|
chunks.push(`data: ${JSON.stringify(thinkingChunk)}\n\n`); |
|
} |
|
|
|
// Add signature to mark end of reasoning |
|
const signatureDelta = { |
|
...chunk.choices[0].delta, |
|
thinking: { signature: Date.now().toString() } |
|
}; |
|
delete signatureDelta.content; |
|
|
|
const signatureChunk = { |
|
...chunk, |
|
choices: [{ ...chunk.choices[0], delta: signatureDelta }] |
|
}; |
|
|
|
chunks.push(`data: ${JSON.stringify(signatureChunk)}\n\n`); |
|
|
|
workingContent = workingContent.substring(endIndex + FORCE_REASONING_END_TAG.length); |
|
context.forceReasoningState = "FINAL"; |
|
} else { |
|
// Check for partial end tag match |
|
let contentToProcess = workingContent; |
|
for (let i = FORCE_REASONING_END_TAG.length - 1; i > 0; i--) { |
|
if (workingContent.endsWith(FORCE_REASONING_END_TAG.substring(0, i))) { |
|
context.forceReasoningPartialMatch = workingContent.substring(workingContent.length - i); |
|
contentToProcess = workingContent.substring(0, workingContent.length - i); |
|
break; |
|
} |
|
} |
|
|
|
if (contentToProcess.length > 0) { |
|
context.forceReasoningBuffer += contentToProcess; |
|
|
|
// Create thinking delta |
|
const thinkingDelta = { |
|
...chunk.choices[0].delta, |
|
thinking: { content: contentToProcess } |
|
}; |
|
delete thinkingDelta.content; |
|
|
|
const thinkingChunk = { |
|
...chunk, |
|
choices: [{ ...chunk.choices[0], delta: thinkingDelta }] |
|
}; |
|
|
|
chunks.push(`data: ${JSON.stringify(thinkingChunk)}\n\n`); |
|
} |
|
workingContent = ""; |
|
} |
|
} else if (context.forceReasoningState === "FINAL") { |
|
// Handle final content after reasoning (the actual answer) |
|
if (workingContent.length > 0) { |
|
if (/^\s*$/.test(workingContent)) { |
|
// Accumulate whitespace |
|
context.forceReasoningAccumulatedWhitespace += workingContent; |
|
} else { |
|
// Non-whitespace content, emit with accumulated whitespace |
|
const finalContent = context.forceReasoningAccumulatedWhitespace + workingContent; |
|
const finalDelta = { |
|
...chunk.choices[0].delta, |
|
content: finalContent |
|
}; |
|
|
|
// Remove thinking if present |
|
if (finalDelta.thinking) { |
|
delete finalDelta.thinking; |
|
} |
|
|
|
const finalChunk = { |
|
...chunk, |
|
choices: [{ ...chunk.choices[0], delta: finalDelta }] |
|
}; |
|
|
|
chunks.push(`data: ${JSON.stringify(finalChunk)}\n\n`); |
|
context.forceReasoningAccumulatedWhitespace = ""; |
|
} |
|
} |
|
workingContent = ""; |
|
} |
|
} |
|
|
|
return { chunks }; |
|
} |
|
|
|
// ============================================================================ |
|
// RESPONSE HANDLING UTILITIES |
|
// ============================================================================ |
|
|
|
/** |
|
* Constants for force reasoning tag parsing |
|
*/ |
|
const FORCE_REASONING_START_TAG = "<reasoning_content>"; |
|
const FORCE_REASONING_END_TAG = "</reasoning_content>"; |
|
|
|
/** |
|
* Handles non-streaming JSON responses with reasoning-to-thinking transformation |
|
* @async |
|
* @param {Response} response - The Fetch API Response object containing JSON body |
|
* @param {boolean} [enableReasoningToThinking=true] - Whether to convert reasoning to thinking format |
|
* @param {boolean} [sanitizeToolSyntaxInReasoning=false] - Whether to sanitize pseudo-tool syntax in reasoning |
|
* @param {boolean} [sanitizeToolSyntaxInContent=false] - Whether to sanitize pseudo-tool syntax in content |
|
* @param {boolean} [forceReasoningApplied=false] - Whether force reasoning prompt was injected |
|
* @returns {Promise<Response>} New Response object with transformed JSON body |
|
*/ |
|
async function handleNonStreamingResponse(response, enableReasoningToThinking = true, sanitizeToolSyntaxInReasoning = false, sanitizeToolSyntaxInContent = false, forceReasoningApplied = false) { |
|
const data = await response.json(); |
|
|
|
// Transform reasoning to thinking format for non-streaming responses (if enabled) |
|
// NanoGPT non-streaming may have reasoning or reasoning_content in message object |
|
if (enableReasoningToThinking && data.choices && Array.isArray(data.choices)) { |
|
for (const choice of data.choices) { |
|
if (!choice.message) continue; |
|
|
|
// 1) sanitize content if needed |
|
if (sanitizeToolSyntaxInContent && typeof choice.message.content === "string") { |
|
let cleaned = choice.message.content; |
|
const originalLength = cleaned.length; |
|
cleaned = stripPseudoToolBlocks(cleaned); |
|
cleaned = stripPseudoToolSyntax(cleaned); |
|
|
|
// Add fallback if content became empty after removing pseudo-tools |
|
if (cleaned.trim() === "" && originalLength > 0) { |
|
cleaned = "[Internal tool call removed; see Thinking for details.]"; |
|
} |
|
|
|
choice.message.content = cleaned; |
|
} |
|
|
|
// 2) reasoning -> thinking with sanitation (from reasoning/reasoning_content fields) |
|
const rawReasoning = choice.message.reasoning || choice.message.reasoning_content; |
|
if (rawReasoning) { |
|
const sanitizedReasoning = sanitizeToolSyntaxInReasoning |
|
? stripPseudoToolSyntax(rawReasoning) |
|
: rawReasoning; |
|
|
|
choice.message.thinking = { |
|
content: sanitizedReasoning, |
|
signature: Date.now().toString() |
|
}; |
|
|
|
delete choice.message.reasoning; |
|
delete choice.message.reasoning_content; |
|
} |
|
|
|
// 3) Parse <reasoning_content> tags from content (for forceReasoning responses) |
|
if (forceReasoningApplied && typeof choice.message.content === "string" && !choice.message.thinking) { |
|
const reasoningRegex = /<reasoning_content>([\s\S]*?)<\/reasoning_content>/; |
|
const reasoningMatch = choice.message.content.match(reasoningRegex); |
|
|
|
if (reasoningMatch && reasoningMatch[1]) { |
|
const reasoningContent = reasoningMatch[1].trim(); |
|
const sanitizedReasoning = sanitizeToolSyntaxInReasoning |
|
? stripPseudoToolSyntax(reasoningContent) |
|
: reasoningContent; |
|
|
|
choice.message.thinking = { |
|
content: sanitizedReasoning, |
|
signature: Date.now().toString() |
|
}; |
|
|
|
// Remove the reasoning tags from content, keeping only the final answer |
|
choice.message.content = choice.message.content |
|
.replace(/<reasoning_content>[\s\S]*?<\/reasoning_content>/, "") |
|
.trim(); |
|
} |
|
} |
|
} |
|
} |
|
|
|
// Return the transformed data |
|
return new Response(JSON.stringify(data), { |
|
status: response.status, |
|
statusText: response.statusText, |
|
headers: response.headers |
|
}); |
|
} |
|
|
|
/** |
|
* Handles streaming responses with reasoning-to-thinking transformation |
|
* @async |
|
* @param {Response} response - The Fetch API Response object containing SSE stream body |
|
* @param {boolean} [enableReasoningToThinking=true] - Whether to convert reasoning to thinking format |
|
* @param {boolean} [sanitizeToolSyntaxInReasoning=false] - Whether to sanitize pseudo-tool syntax in reasoning |
|
* @param {boolean} [sanitizeToolSyntaxInContent=false] - Whether to sanitize pseudo-tool syntax in content |
|
* @param {boolean} [forceReasoningApplied=false] - Whether force reasoning prompt was injected |
|
* @returns {Promise<Response>} New Response object with transformed SSE stream body |
|
*/ |
|
async function handleStreamingResponse(response, enableReasoningToThinking = true, sanitizeToolSyntaxInReasoning = false, sanitizeToolSyntaxInContent = false, forceReasoningApplied = false) { |
|
if (!response.body) return response; |
|
|
|
const decoder = new TextDecoder(); |
|
const encoder = new TextEncoder(); |
|
|
|
// Stream State Tracking for reasoning-to-thinking conversion |
|
const streamContext = { |
|
hasTextContent: false, |
|
reasoningBuffer: "", |
|
isReasoningFinished: false, |
|
enableReasoningToThinking: enableReasoningToThinking, |
|
sanitizeToolSyntaxInReasoning: sanitizeToolSyntaxInReasoning, |
|
sanitizeToolSyntaxInContent: sanitizeToolSyntaxInContent, |
|
insidePseudoToolBlock: false, |
|
// Force reasoning state tracking |
|
forceReasoningApplied: forceReasoningApplied, |
|
forceReasoningState: "SEARCHING", // SEARCHING, REASONING, FINAL |
|
forceReasoningBuffer: "", |
|
forceReasoningPartialMatch: "", |
|
forceReasoningAccumulatedWhitespace: "" |
|
}; |
|
|
|
// Create a new readable stream for processing with transformation |
|
const transformedStream = new ReadableStream({ |
|
start: (controller) => processStream(response.body, controller, decoder, encoder, streamContext) |
|
}); |
|
|
|
// Return the transformed response with proper headers |
|
return new Response(transformedStream, { |
|
status: response.status, |
|
statusText: response.statusText, |
|
headers: { |
|
"Content-Type": "text/event-stream", |
|
"Cache-Control": "no-cache", |
|
"Connection": "keep-alive" |
|
} |
|
}); |
|
} |
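// Illustrative usage sketch (the URL and request body are placeholders, not real NanoGPT
// endpoints). handleStreamingResponse() wraps the upstream SSE body in a new ReadableStream so
// each line can be rewritten before it reaches Claude Code:
//
//   const upstream = await fetch("https://example.invalid/v1/chat/completions", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify({ model: "some-model", stream: true, messages: [] })
//   });
//   const transformed = await handleStreamingResponse(upstream, true, false, false, false);
//   // transformed.body now yields the rewritten SSE stream, with reasoning deltas converted
//   // to thinking deltas by processStreamLine().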
|
|
|
/** |
|
* Processes the streaming response with reasoning-to-thinking transformation |
|
* @async |
|
* @param {ReadableStream<Uint8Array>} body - The response body stream from Fetch API |
|
* @param {ReadableStreamDefaultController} controller - ReadableStream controller for output |
|
* @param {TextDecoder} decoder - TextDecoder for converting Uint8Array to string |
|
* @param {TextEncoder} encoder - TextEncoder for converting string to Uint8Array |
|
* @param {Object} context - Stream state context for reasoning tracking (see processStreamLine for properties) |
|
* @returns {Promise<void>} Resolves when stream processing is complete |
|
*/ |
|
async function processStream(body, controller, decoder, encoder, context) { |
|
const reader = body.getReader(); |
|
|
|
try { |
|
let buffer = ""; |
|
|
|
// Read and process the stream |
|
while (true) { |
|
const { done, value } = await reader.read(); |
|
|
|
if (done) { |
|
// Process any remaining buffer |
|
if (buffer.trim()) { |
|
const lines = buffer.split('\n'); |
|
for (const line of lines) { |
|
if (line.trim()) { |
|
processStreamLine(line, controller, encoder, context); |
|
} |
|
} |
|
} |
|
break; |
|
} |
|
|
|
// Decode the chunk and add to buffer |
|
const chunk = decoder.decode(value, { stream: true }); |
|
buffer += chunk; |
|
|
|
// Safety buffer limit (1MB) |
|
if (buffer.length > 1e6) { |
|
const lines = buffer.split('\n'); |
|
buffer = lines.pop() || ""; // Keep the last partial line |
|
for (const line of lines) { |
|
if (line.trim()) { |
|
processStreamLine(line, controller, encoder, context); |
|
} |
|
} |
|
continue; |
|
} |
|
|
|
// Process complete lines |
|
const lines = buffer.split('\n'); |
|
buffer = lines.pop() || ""; |
|
|
|
for (const line of lines) { |
|
if (line.trim()) { |
|
processStreamLine(line, controller, encoder, context); |
|
} |
|
} |
|
} |
|
} catch (error) { |
|
controller.error(error); |
|
} finally { |
|
try { |
|
reader.releaseLock(); |
|
} catch (error) { |
|
// Ignore lock release errors |
|
} |
|
controller.close(); |
|
} |
|
} |
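// Illustrative sketch of the buffering behaviour above (chunk contents are hypothetical): if a
// chunk ends mid-line, e.g. 'data: {"choices":[{"del', that fragment stays in `buffer` and the
// completed SSE line is only handed to processStreamLine() once the rest ('ta": ... }\n') arrives.
// The 1MB guard flushes complete lines early so the buffer cannot grow without bound.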
|
|
|
// ============================================================================ |
|
// NANO GPT PRODUCTION TRANSFORMER CLASS |
|
// ============================================================================ |
|
|
|
class NanoGPTProductionTransformer { |
|
/** |
|
* Creates a new NanoGPTProductionTransformer instance |
|
* @param {Object} options - Configuration options for the transformer |
|
* @param {boolean} [options.enable=true] - Whether the transformer is enabled |
|
* @param {boolean} [options.enableStreamOptions=true] - Add stream_options: { include_usage: true } to requests. |
|
* WHY: Required for statusline to show token counts properly. |
|
* RISK if disabled: Token usage stats won't be available in Claude Code UI. |
|
* @param {boolean} [options.enableReasoningToThinking=true] - Convert NanoGPT reasoning format to Claude Code thinking format. |
|
* WHY: Makes reasoning/thinking visible in Claude Code's "Thinking" mode. |
|
* RISK if disabled: Reasoning content will be lost or malformed in Claude Code. |
|
* @param {boolean} [options.enableFakeResponse=true] - Generate fake response when max_tokens=1. |
|
* WHY: Workaround for models that don't handle max_tokens=1 correctly (Claude Code uses this for /model switching). |
|
* RISK if disabled: Model switching via /model command may hang or fail if the model doesn't respond properly. |
|
* SAFE to disable if: Your model correctly handles max_tokens=1 OR you implement timeout/retry logic. |
|
* @param {boolean} [options.enableForceReasoning=false] - Inject reasoning prompt for non-reasoning models. |
|
* WHY: Forces models without built-in reasoning to think step-by-step using <reasoning_content> tags. |
|
* RISK if enabled on reasoning models: Double reasoning, wasted tokens, confused output. |
|
* NOTE: Automatically skipped for models in REASONING_MODELS list. |
|
* @param {number|string} [options.temperature=0.1] - Controls randomness in output (0-2). Supports range strings like "0.1-0.8". |
|
* @param {number|string} [options.max_tokens=-99] - Maximum tokens to generate. Supports range strings. A request value of 1 is preserved (used by Claude Code for model switching); the default -99 leaves the request's value untouched so Claude Code stays in control.
|
* @param {number|string} [options.top_p=0.95] - Nucleus sampling parameter (0-1). Supports range strings. |
|
* @param {number|string} [options.frequency_penalty=0] - Reduces token repetition (-2 to 2). Supports range strings. |
|
* @param {number|string} [options.presence_penalty=0] - Reduces topic repetition (-2 to 2). Supports range strings. |
|
* @param {boolean|number|string} [options.parallel_tool_calls=true] - Whether to enable parallel tool execution. Supports range for randomization. |
|
* @param {number|string|null} [options.top_k=40] - Limits vocabulary to top K tokens (1-100). Set to null to omit. Supports range strings. |
|
* @param {number|string} [options.repetition_penalty=1.15] - Alternative repetition control (0.1-2.0). Supports range strings. |
|
* @param {Object} [options.cache_control] - Cache control settings (default: omitted) |
|
* @param {boolean} options.cache_control.enabled - Whether to enable caching |
|
* @param {string} options.cache_control.ttl - Cache TTL (e.g., "5m", "1h") |
|
* @param {boolean} [options.sanitizeToolSyntaxInReasoning=false] - Whether to sanitize pseudo-tool syntax in reasoning content |
|
* @param {boolean} [options.sanitizeToolSyntaxInContent=false] - Whether to sanitize pseudo-tool syntax in regular content |
|
* @param {string} [options.reasoning_effort="none"] - Reasoning effort level for NanoGPT models.
* Valid values: "none", "minimal", "low", "medium", "high".
* Controls computational effort for reasoning (none=fastest, high=slowest but most thorough).
* **Note**: Interacts with Claude Code's thinking mode (see transformRequestIn):
* - thinking enabled  → the configured effort is used, falling back to "high" when the config value is "none"
* - thinking disabled → "none"
* A reasoning_effort set explicitly on the request always takes precedence.
|
* @param {Object} [options.extra={}] - Custom parameters to merge into the request |
|
* |
|
* @description **Parameter Omission Feature**: Set any numeric parameter to -99 to omit it from the request. |
|
* @description **Range Parameter Support**: Numeric parameters accept range strings like "0.1-0.8" to generate random values per request. |
|
* @description **Custom Parameters**: Add custom parameters via the 'extra' object, which will be merged into the request. |
|
* @description **Reasoning Effort Levels**: none (no reasoning), minimal (~10% tokens), low (~20% tokens), medium (~50% tokens), high (~80% tokens). |
|
*/ |
|
constructor(options) { |
|
this.name = "nanogpt"; |
|
this.options = options; |
|
this.enable = this.options?.enable ?? true; |
|
|
|
|
|
// Feature toggles with safe defaults (all enabled) |
|
this.enableStreamOptions = this.options?.enableStreamOptions ?? true; |
|
this.enableReasoningToThinking = this.options?.enableReasoningToThinking ?? true; |
|
this.enableFakeResponse = this.options?.enableFakeResponse ?? true; |
|
this.enableForceReasoning = this.options?.enableForceReasoning ?? false; |
|
|
|
// Sampling parameters (support both exact numbers and range strings)
// Defaults below are applied only when no option is provided
|
this.temperature = this.options?.temperature ?? 0.1; |
|
this.max_tokens = this.options?.max_tokens ?? -99; |
|
this.top_p = this.options?.top_p ?? 0.95; |
|
this.frequency_penalty = this.options?.frequency_penalty ?? 0; |
|
this.presence_penalty = this.options?.presence_penalty ?? 0; |
|
this.parallel_tool_calls = this.options?.parallel_tool_calls ?? true; |
|
this.top_k = this.options?.top_k ?? 40; // Balances exploration and exploitation for code tokens

this.repetition_penalty = this.options?.repetition_penalty ?? 1.15; // Helps prevent structural repetition in code
|
this.cache_control = this.options?.cache_control ?? null; |
|
|
|
// Reasoning effort parameter for NanoGPT models |
|
this.reasoning_effort = this.options?.reasoning_effort ?? "none"; |
|
|
|
// Store range configurations for parameters that support ranges |
|
this._rangeConfigs = { |
|
temperature: parseRange(this.temperature), |
|
max_tokens: parseRange(this.max_tokens), |
|
top_p: parseRange(this.top_p), |
|
frequency_penalty: parseRange(this.frequency_penalty), |
|
presence_penalty: parseRange(this.presence_penalty), |
|
parallel_tool_calls: parseRange(this.parallel_tool_calls), |
|
top_k: parseRange(this.top_k), |
|
repetition_penalty: parseRange(this.repetition_penalty) |
|
}; |
|
|
|
// pseudo-tool cleanup flags |
|
this.sanitizeToolSyntaxInReasoning = this.options?.sanitizeToolSyntaxInReasoning ?? false; |
|
this.sanitizeToolSyntaxInContent = this.options?.sanitizeToolSyntaxInContent ?? false; |
|
|
|
// Custom parameters support |
|
this.extra = this.options?.extra ?? {}; |
|
|
|
this.lastRequest = null; |
|
} |
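// Illustrative construction sketch (values are hypothetical; option names come from the JSDoc
// above, and `seed` is just an example of a custom key passed through `extra`):
//
//   const transformer = new NanoGPTProductionTransformer({
//     enableStreamOptions: true,
//     enableForceReasoning: false,
//     temperature: "0.1-0.3",   // range string → a random value is drawn per request
//     max_tokens: -99,          // -99 → leave Claude Code's max_tokens untouched
//     top_k: null,              // null → never sent
//     reasoning_effort: "medium",
//     extra: { seed: 42 }       // custom params merged into the request (see transformRequestIn)
//   });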
|
|
|
/** |
|
* Transforms incoming requests to include stream_options parameter and sampling parameters |
|
* @async |
|
* @param {Object} request - The incoming request object (OpenAI-compatible format) |
|
* @param {string} [request.model] - The model name for reasoning model detection |
|
* @param {number} [request.max_tokens] - Original max_tokens. A value of 1 is preserved for model switching; otherwise it is left unchanged unless the transformer config sets an explicit max_tokens (the config default -99 means "don't touch")
|
* @param {Array} [request.messages] - Messages array for force reasoning injection |
|
* @returns {Promise<Object>} The modified request object with stream_options and parameters applied |
|
*/ |
|
async transformRequestIn(request) { |
|
if (!this.enable) { |
|
return request; |
|
} |
|
|
|
|
|
// Store the request for later inspection in transformResponseOut |
|
this.lastRequest = { ...request }; |
|
|
|
// Create a copy of the request to avoid mutating the original |
|
const modifiedRequest = { ...request }; |
|
|
|
// Merge in stream_options with include_usage: true as the default
// (only when enableStreamOptions is true); any stream_options keys already on the
// request take precedence because the spread runs after the default
|
if (this.enableStreamOptions) { |
|
modifiedRequest.stream_options = { |
|
include_usage: true, |
|
...modifiedRequest.stream_options |
|
}; |
|
} |
|
|
|
// Set sampling parameters (resolve ranges to actual values for each request)
// Only the core parameters below (temperature, max_tokens, top_p, frequency_penalty, presence_penalty)
// are applied from the config unconditionally; the optional ones further down are only
// applied when the incoming request already contains them
|
if (!shouldOmitParameter(this.temperature)) { |
|
const resolvedTemp = resolveParameterValue(this.temperature, 'temperature'); |
|
if (resolvedTemp !== null) { |
|
modifiedRequest.temperature = resolvedTemp; |
|
} |
|
} |
|
|
|
// Handle max_tokens with three distinct cases: |
|
// Case 1: max_tokens=1 - preserve for fake response generation (model switching) |
|
// Case 2: this.max_tokens=-99 (config default) - don't touch, leave CC's value unchanged |
|
// Case 3: Use configured value for other cases (user explicitly set max_tokens in config) |
|
const shouldPreserveMaxTokens = request.max_tokens === 1; |
|
const configOmitsMaxTokens = shouldOmitParameter(this.max_tokens); // -99 means leave CC's value unchanged (default) |
|
|
|
if (shouldPreserveMaxTokens) { |
|
// Case 1: Preserve max_tokens=1 for fake response |
|
modifiedRequest.max_tokens = request.max_tokens; |
|
} else if (configOmitsMaxTokens) { |
|
// Case 2: Config says -99 (default), don't touch max_tokens - leave CC's original value |
|
// Do nothing, let request.max_tokens pass through unchanged |
|
} else { |
|
// Case 3: Use configured value (user explicitly set max_tokens in config) |
|
const resolvedMaxTokens = resolveParameterValue(this.max_tokens, 'max_tokens'); |
|
if (resolvedMaxTokens !== null) { |
|
modifiedRequest.max_tokens = Math.round(resolvedMaxTokens); // max_tokens should be integer |
|
} |
|
} |
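// Illustrative examples of the three cases above (config values are hypothetical):
//   request.max_tokens === 1, any config      → kept at 1 (fake-response / model-switch path)
//   config max_tokens = -99 (default)         → request.max_tokens passes through untouched
//   config max_tokens = "2000-4000"           → a random integer from that range is set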
|
|
|
if (!shouldOmitParameter(this.top_p)) { |
|
const resolvedTopP = resolveParameterValue(this.top_p, 'top_p'); |
|
if (resolvedTopP !== null) { |
|
modifiedRequest.top_p = resolvedTopP; |
|
} |
|
} |
|
if (!shouldOmitParameter(this.frequency_penalty)) { |
|
const resolvedFreqPenalty = resolveParameterValue(this.frequency_penalty, 'frequency_penalty'); |
|
if (resolvedFreqPenalty !== null) { |
|
modifiedRequest.frequency_penalty = resolvedFreqPenalty; |
|
} |
|
} |
|
if (!shouldOmitParameter(this.presence_penalty)) { |
|
const resolvedPresencePenalty = resolveParameterValue(this.presence_penalty, 'presence_penalty'); |
|
if (resolvedPresencePenalty !== null) { |
|
modifiedRequest.presence_penalty = resolvedPresencePenalty; |
|
} |
|
} |
|
|
|
// Optional parameters below are only applied when the incoming request already contains them
|
if (request.parallel_tool_calls !== undefined) { |
|
const resolvedParallelTools = resolveParameterValue(this.parallel_tool_calls, 'parallel_tool_calls'); |
|
if (resolvedParallelTools !== null) { |
|
modifiedRequest.parallel_tool_calls = Boolean(resolvedParallelTools); // Ensure boolean |
|
} |
|
} |
|
|
|
// Only set top_k when the request already carries one and the configured value is neither null nor -99
|
if (request.top_k !== undefined && this.top_k !== null && !shouldOmitParameter(this.top_k)) { |
|
const resolvedTopK = resolveParameterValue(this.top_k, 'top_k'); |
|
if (resolvedTopK !== null) { |
|
modifiedRequest.top_k = Math.round(resolvedTopK); // top_k should be integer |
|
} |
|
} |
|
|
|
if (request.repetition_penalty !== undefined) { |
|
const resolvedRepetitionPenalty = resolveParameterValue(this.repetition_penalty, 'repetition_penalty'); |
|
if (resolvedRepetitionPenalty !== null) { |
|
modifiedRequest.repetition_penalty = resolvedRepetitionPenalty; |
|
} |
|
} |
|
|
|
// Apply cache_control only when the request already carries one and caching is enabled in the config
|
if (request.cache_control !== undefined && this.cache_control && this.cache_control.enabled) { |
|
modifiedRequest.cache_control = this.cache_control; |
|
} |
|
|
|
// ============================================================================ |
|
// CLAUDE CODE REASONING FORMAT INTEGRATION |
|
// ============================================================================ |
|
|
|
// Handle Claude Code's reasoning format as a simple on/off switch |
|
// When CC Thinking is ON: set reasoning.enabled=true, exclude=false with configured effort |
|
// When CC Thinking is OFF: set reasoning.enabled=false, exclude=true, effort="none" |
|
// This applies to BOTH reasoning and non-reasoning models consistently |
|
// |
|
// Precedence: user's explicit reasoning_effort > CC thinking toggle > config default |
|
|
|
let finalReasoningEffort = this.reasoning_effort; // Start with the configured value
|
let reasoningSource = "configuration"; |
|
let ccThinkingEnabled = false; // Track CC Thinking state for Force Reasoning logic |
|
|
|
// Check if user explicitly set reasoning_effort in the original request (highest precedence) |
|
const userSetReasoningEffort = 'reasoning_effort' in request; |
|
|
|
if (userSetReasoningEffort) { |
|
// User explicitly set reasoning_effort - use their value |
|
finalReasoningEffort = request.reasoning_effort; |
|
reasoningSource = "user-explicit"; |
|
// If user explicitly set reasoning_effort to something other than "none", consider thinking enabled |
|
ccThinkingEnabled = finalReasoningEffort !== "none"; |
|
|
|
|
|
// Create reasoning object based on user's explicit setting |
|
if (finalReasoningEffort === "none") { |
|
modifiedRequest.reasoning = { |
|
effort: "none", |
|
enabled: false, |
|
exclude: true |
|
}; |
|
} else { |
|
modifiedRequest.reasoning = { |
|
effort: finalReasoningEffort, |
|
enabled: true, |
|
exclude: false |
|
}; |
|
} |
|
modifiedRequest.reasoning_effort = finalReasoningEffort; |
|
} else if (request.reasoning && typeof request.reasoning === 'object') { |
|
// CC sent reasoning parameter - check if enabled or disabled |
|
const ccReasoning = request.reasoning; |
|
|
|
if (ccReasoning.enabled !== false) { |
|
// CC Thinking is ON → use the configured reasoning_effort, falling back to "high" when the config value is "none"
finalReasoningEffort = this.reasoning_effort !== "none" ? this.reasoning_effort : "high";
|
reasoningSource = "claude-code-enabled"; |
|
ccThinkingEnabled = true; |
|
|
|
// Create both formats for NanoGPT compatibility |
|
modifiedRequest.reasoning = { |
|
effort: finalReasoningEffort, |
|
enabled: true, |
|
exclude: false |
|
}; |
|
modifiedRequest.reasoning_effort = finalReasoningEffort; |
|
|
|
} else { |
|
// CC Thinking is OFF (enabled=false) → set reasoning to disabled state |
|
finalReasoningEffort = "none"; |
|
reasoningSource = "claude-code-disabled"; |
|
ccThinkingEnabled = false; |
|
|
|
// Create reasoning object with explicit disable |
|
modifiedRequest.reasoning = { |
|
effort: "none", |
|
enabled: false, |
|
exclude: true |
|
}; |
|
modifiedRequest.reasoning_effort = "none"; |
|
|
|
} |
|
} else { |
|
// No reasoning parameter from CC → CC Thinking is OFF (implicit) |
|
finalReasoningEffort = "none"; |
|
reasoningSource = "claude-code-off"; |
|
ccThinkingEnabled = false; |
|
|
|
// Create reasoning object with explicit disable |
|
modifiedRequest.reasoning = { |
|
effort: "none", |
|
enabled: false, |
|
exclude: true |
|
}; |
|
modifiedRequest.reasoning_effort = "none"; |
|
|
|
} |
|
|
|
// Validate reasoning_effort value |
|
const validEfforts = ["none", "minimal", "low", "medium", "high"]; |
|
if (!validEfforts.includes(modifiedRequest.reasoning_effort)) { |
|
modifiedRequest.reasoning_effort = "none"; |
|
modifiedRequest.reasoning = { |
|
effort: "none", |
|
enabled: false, |
|
exclude: true |
|
}; |
|
} |
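// Illustrative summary of the precedence handled above (shapes come from the branches, values
// are hypothetical):
//   request.reasoning_effort = "low"        → reasoning: { effort: "low", enabled: true, exclude: false }
//   request.reasoning = { enabled: true }   → reasoning: { effort: <config, or "high" if config is "none">, enabled: true, exclude: false }
//   request.reasoning = { enabled: false }  → reasoning: { effort: "none", enabled: false, exclude: true }
//   no reasoning field on the request       → reasoning: { effort: "none", enabled: false, exclude: true }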
|
|
|
// Merge custom parameters from the `extra` object, but only when the incoming request also
// carries an `extra` field. Note: these keys are applied last and can override anything set
// above, including the reasoning object derived from CC Thinking.
if (request.extra !== undefined && this.extra && typeof this.extra === 'object' && Object.keys(this.extra).length > 0) {
Object.assign(modifiedRequest, this.extra);
}
|
|
|
// Force reasoning injection for non-reasoning models |
|
// This injects the FORCE_REASONING_PROMPT to make models think step-by-step |
|
// Skip if the model is already a reasoning model (has built-in reasoning capabilities) |
|
// Also skip if CC Thinking is OFF (ccThinkingEnabled is false) |
|
if (this.enableForceReasoning && modifiedRequest.messages) { |
|
const modelName = modifiedRequest.model || ""; |
|
|
|
|
|
if (isReasoningModel(modelName)) { |
|
// Skip force reasoning for built-in reasoning models |
|
} else if (!ccThinkingEnabled) { |
|
// Skip force reasoning when CC Thinking is OFF |
|
} else { |
|
// Apply force reasoning - CC Thinking is ON and model is not a reasoning model |
|
// Deep copy messages to avoid mutating the original |
|
modifiedRequest.messages = JSON.parse(JSON.stringify(modifiedRequest.messages)); |
|
|
|
// Find existing system message |
|
let systemMessage = modifiedRequest.messages.find((msg) => msg.role === "system"); |
|
|
|
// If system message exists and has array content, add reasoning prompt |
|
if (Array.isArray(systemMessage?.content)) { |
|
systemMessage.content.push({ type: "text", text: FORCE_REASONING_PROMPT }); |
|
} else if (systemMessage && typeof systemMessage.content === "string") { |
|
// If system message has string content, convert to array and add prompt |
|
systemMessage.content = [ |
|
{ type: "text", text: systemMessage.content }, |
|
{ type: "text", text: FORCE_REASONING_PROMPT } |
|
]; |
|
} |
|
|
|
// Get the last message in the conversation |
|
let lastMessage = modifiedRequest.messages[modifiedRequest.messages.length - 1]; |
|
|
|
// If last message is from user and has array content, add reasoning prompt |
|
if (lastMessage && lastMessage.role === "user" && Array.isArray(lastMessage.content)) { |
|
lastMessage.content.push({ type: "text", text: FORCE_REASONING_PROMPT }); |
|
} else if (lastMessage && lastMessage.role === "user" && typeof lastMessage.content === "string") { |
|
// If user message has string content, convert to array and add prompt |
|
lastMessage.content = [ |
|
{ type: "text", text: lastMessage.content }, |
|
{ type: "text", text: FORCE_REASONING_PROMPT } |
|
]; |
|
} |
|
|
|
// If last message is from tool, add a new user message with reasoning prompt |
|
if (lastMessage && lastMessage.role === "tool") { |
|
modifiedRequest.messages.push({ |
|
role: "user", |
|
content: [{ type: "text", text: FORCE_REASONING_PROMPT }] |
|
}); |
|
} |
|
|
|
// Mark that forceReasoning was applied for response processing |
|
modifiedRequest.forceReasoningApplied = true; |
|
} |
|
} |
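// Illustrative sketch of the injection above (message text is hypothetical; FORCE_REASONING_PROMPT
// is the constant referenced in the code):
//
//   Before: [{ role: "system", content: "You are helpful." },
//            { role: "user",   content: "Refactor this loop." }]
//   After:  [{ role: "system", content: [{ type: "text", text: "You are helpful." },
//                                        { type: "text", text: FORCE_REASONING_PROMPT }] },
//            { role: "user",   content: [{ type: "text", text: "Refactor this loop." },
//                                        { type: "text", text: FORCE_REASONING_PROMPT }] }]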
|
|
|
// Update lastRequest with the modified version |
|
this.lastRequest = modifiedRequest; |
|
|
|
|
|
return modifiedRequest; |
|
} |
|
|
|
/** |
|
* Transforms outgoing responses to convert reasoning to thinking format for Claude Code |
|
* @async |
|
* @param {Response} response - The Fetch API Response object from the LLM provider |
|
* @returns {Promise<Response>} The transformed Response object with reasoning converted to thinking format |
|
*/ |
|
async transformResponseOut(response) { |
|
// If disabled, return response as-is |
|
if (!this.enable) return response; |
|
|
|
let responseToProcess = response; |
|
|
|
// === ERROR INTERCEPTION (401/403/404/409/422/429/500) === |
|
// Intercept common HTTP errors and return a friendly fake response instead of letting the error propagate
|
if (response.status === 401 || response.status === 403 || response.status === 404 || |
|
response.status === 409 || response.status === 422 || response.status === 429 || |
|
response.status === 500) { |
|
const isStream = !!this.lastRequest?.stream; |
|
const model = this.lastRequest?.model || "nanogpt-model"; |
|
let errorMessage = ""; |
|
|
|
// Generate appropriate error message based on status code |
|
switch (response.status) { |
|
case 401: |
|
errorMessage = "⚠️ **Authentication Error**: Session required. Your API key is invalid or expired. Please check your configuration."; |
|
break; |
|
case 403: |
|
errorMessage = "⚠️ **Forbidden**: Insufficient permissions. You don't have access to this resource or operation."; |
|
break; |
|
case 404: |
|
errorMessage = "⚠️ **Not Found**: The requested resource was not found. Please check your request and try again."; |
|
break; |
|
case 409: |
|
errorMessage = "⚠️ **Conflict**: Resource conflict detected. This could be due to duplicate creation or wrong state."; |
|
break; |
|
case 422: |
|
errorMessage = "⚠️ **Invalid Input**: Validation failed. Please check your request parameters and format."; |
|
break; |
|
case 429: |
|
errorMessage = "⚠️ **Rate Limited**: Too many requests. Please wait and try again later."; |
|
break; |
|
case 500: |
|
errorMessage = "⚠️ **Internal Server Error**: The server encountered an unexpected error. Please try again later."; |
|
break; |
|
default: |
|
errorMessage = `⚠️ **Error ${response.status}**: An unexpected error occurred. Please try again.`; |
|
} |
|
|
|
|
|
const id = generateId(); |
|
const created = Math.floor(Date.now() / 1000); |
|
|
|
if (isStream) { |
|
// Create SSE stream for error message |
|
const streamLines = [ |
|
`data: ${JSON.stringify({ |
|
id: id, |
|
object: "chat.completion.chunk", |
|
created: created, |
|
model: model, |
|
choices: [{ index: 0, delta: { role: "assistant" }, finish_reason: null }] |
|
})}`, |
|
`data: ${JSON.stringify({ |
|
id: id, |
|
object: "chat.completion.chunk", |
|
created: created, |
|
model: model, |
|
choices: [{ index: 0, delta: { content: errorMessage }, finish_reason: null }] |
|
})}`, |
|
`data: ${JSON.stringify({ |
|
id: id, |
|
object: "chat.completion.chunk", |
|
created: created, |
|
model: model, |
|
choices: [{ index: 0, delta: {}, finish_reason: "stop" }] |
|
})}`, |
|
"data: [DONE]" |
|
]; |
|
|
|
return new Response(streamLines.join('\n\n'), { |
|
status: 200, |
|
headers: { |
|
"Content-Type": "text/event-stream", |
|
"Cache-Control": "no-cache", |
|
"Connection": "keep-alive" |
|
} |
|
}); |
|
} else { |
|
// Create JSON response for error message |
|
const fakeResponse = { |
|
"id": id, |
|
"object": "chat.completion", |
|
"created": created, |
|
"model": model, |
|
"choices": [ |
|
{ |
|
"index": 0, |
|
"finish_reason": "stop", |
|
"message": { |
|
"role": "assistant", |
|
"content": errorMessage |
|
} |
|
} |
|
] |
|
}; |
|
|
|
return new Response(JSON.stringify(fakeResponse), { |
|
status: 200, |
|
headers: { |
|
'Content-Type': 'application/json' |
|
} |
|
}); |
|
} |
|
} |
|
|
|
// Check if we should return fake response for max_tokens=1 |
|
// This is a workaround for models that don't handle max_tokens=1 correctly |
|
// Claude Code uses max_tokens=1 when switching models via /model command |
|
if (this.enableFakeResponse && this.lastRequest?.max_tokens === 1) { |
|
const fakeResponse = createFakeResponse(this.lastRequest.model); |
|
const finalResponse = new Response(JSON.stringify(fakeResponse), { |
|
status: 200, |
|
headers: { |
|
'Content-Type': 'application/json' |
|
} |
|
}); |
|
|
|
|
|
return finalResponse; |
|
} |
|
|
|
let transformedResponse = responseToProcess; |
|
|
|
// Handle JSON responses (non-streaming) - TRANSFORM reasoning to thinking (if enabled) |
|
if (responseToProcess.headers.get("Content-Type")?.includes("application/json")) { |
|
transformedResponse = await handleNonStreamingResponse( |
|
responseToProcess, |
|
this.enableReasoningToThinking, |
|
this.sanitizeToolSyntaxInReasoning, |
|
this.sanitizeToolSyntaxInContent, |
|
this.lastRequest?.forceReasoningApplied || false |
|
); |
|
} |
|
// Handle streaming responses - TRANSFORM reasoning to thinking (if enabled) |
|
else if (responseToProcess.headers.get("Content-Type")?.includes("stream")) { |
|
transformedResponse = await handleStreamingResponse( |
|
responseToProcess, |
|
this.enableReasoningToThinking, |
|
this.sanitizeToolSyntaxInReasoning, |
|
this.sanitizeToolSyntaxInContent, |
|
this.lastRequest?.forceReasoningApplied || false |
|
); |
|
} |
|
|
|
|
|
// Return the transformed response, or the original response for other content types
|
return transformedResponse; |
|
} |
|
} |
|
|
|
// ============================================================================ |
|
// GLOBAL FETCH INTERCEPTOR |
|
// ============================================================================ |
|
|
|
/** |
|
* Monkey-patch global fetch to intercept common HTTP errors (401/403/404/409/422/429/500) from NanoGPT.
* This is necessary because the router throws on non-2xx responses before the transformer can handle them.
|
*/ |
|
(function patchFetch() { |
|
if (globalThis._nanogpt_fetch_patched) return; |
|
|
|
const originalFetch = globalThis.fetch; |
|
globalThis.fetch = async function (url, options) { |
|
const response = await originalFetch(url, options); |
|
|
|
// Only intercept common HTTP errors |
|
if (response.status !== 401 && response.status !== 403 && response.status !== 404 && |
|
response.status !== 409 && response.status !== 422 && response.status !== 429 && |
|
response.status !== 500) { |
|
return response; |
|
} |
|
|
|
try { |
|
// Clone response to inspect body without consuming the original |
|
const clone = response.clone(); |
|
const text = await clone.text(); |
|
let errorData; |
|
try { |
|
errorData = JSON.parse(text); |
|
} catch (e) { |
|
return response; // Not JSON, ignore |
|
} |
|
|
|
// Check for specific error signatures |
|
const isRateLimit = response.status === 429 && |
|
errorData?.error?.code === 'rate_limit_exceeded' && |
|
errorData?.error?.message?.includes('Too many authentication failures'); |
|
|
|
const isInvalidKey = response.status === 401 && |
|
errorData?.error?.code === 'invalid_api_key' && |
|
errorData?.error?.message === 'Invalid session'; |
|
|
|
const isForbidden = response.status === 403 && |
|
(errorData?.error?.code === 'insufficient_permissions' || |
|
errorData?.error?.code === 'forbidden' || |
|
errorData?.error?.message?.includes('permission')); |
|
|
|
const isNotFound = response.status === 404 && |
|
(errorData?.error?.code === 'not_found' || |
|
errorData?.error?.message?.includes('not found')); |
|
|
|
const isConflict = response.status === 409 && |
|
(errorData?.error?.code === 'conflict' || |
|
errorData?.error?.code === 'resource_conflict' || |
|
errorData?.error?.message?.includes('conflict')); |
|
|
|
const isInvalidInput = response.status === 422 && |
|
(errorData?.error?.code === 'invalid_input' || |
|
errorData?.error?.code === 'validation_failed' || |
|
errorData?.error?.message?.includes('validation')); |
|
|
|
const isInternalError = response.status === 500 && |
|
(errorData?.error?.code === 'internal_error' || |
|
errorData?.error?.code === 'server_error' || |
|
errorData?.error?.message?.includes('internal')); |
|
|
|
if (isRateLimit || isInvalidKey || isForbidden || isNotFound || |
|
isConflict || isInvalidInput || isInternalError) { |
|
// Extract request details to construct fake response |
|
let model = "nanogpt-model"; |
|
let isStream = false; |
|
|
|
if (options && options.body) { |
|
try { |
|
const body = JSON.parse(options.body); |
|
if (body.model) model = body.model; |
|
if (body.stream) isStream = body.stream; |
|
} catch (e) { |
|
// Ignore parsing error |
|
} |
|
} |
|
|
|
// Generate appropriate error message based on error type |
|
let errorMessage; |
|
if (isInvalidKey) { |
|
errorMessage = "⚠️ **Authentication Error**: Session required. Your API key is invalid or expired. Please check your configuration."; |
|
} else if (isRateLimit) { |
|
errorMessage = "⚠️ **Rate Limited**: Too many requests. Please wait and try again later."; |
|
} else if (isForbidden) { |
|
errorMessage = "⚠️ **Forbidden**: Insufficient permissions. You don't have access to this resource or operation."; |
|
} else if (isNotFound) { |
|
errorMessage = "⚠️ **Not Found**: The requested resource was not found. Please check your request and try again."; |
|
} else if (isConflict) { |
|
errorMessage = "⚠️ **Conflict**: Resource conflict detected. This could be due to duplicate creation or wrong state."; |
|
} else if (isInvalidInput) { |
|
errorMessage = "⚠️ **Invalid Input**: Validation failed. Please check your request parameters and format."; |
|
} else if (isInternalError) { |
|
errorMessage = "⚠️ **Internal Server Error**: The server encountered an unexpected error. Please try again later."; |
|
} else { |
|
errorMessage = `⚠️ **Error ${response.status}**: An unexpected error occurred. Please try again.`; |
|
} |
|
|
|
const id = "chatcmpl-" + Math.random().toString(36).substring(2, 15); |
|
const created = Math.floor(Date.now() / 1000); |
|
|
|
if (isStream) { |
|
const streamLines = [ |
|
`data: ${JSON.stringify({ |
|
id: id, |
|
object: "chat.completion.chunk", |
|
created: created, |
|
model: model, |
|
choices: [{ index: 0, delta: { role: "assistant" }, finish_reason: null }] |
|
})}`, |
|
`data: ${JSON.stringify({ |
|
id: id, |
|
object: "chat.completion.chunk", |
|
created: created, |
|
model: model, |
|
choices: [{ index: 0, delta: { content: errorMessage }, finish_reason: null }] |
|
})}`, |
|
`data: ${JSON.stringify({ |
|
id: id, |
|
object: "chat.completion.chunk", |
|
created: created, |
|
model: model, |
|
choices: [{ index: 0, delta: {}, finish_reason: "stop" }] |
|
})}`, |
|
"data: [DONE]" |
|
]; |
|
|
|
return new Response(streamLines.join('\n\n'), { |
|
status: 200, |
|
headers: { |
|
"Content-Type": "text/event-stream", |
|
"Cache-Control": "no-cache", |
|
"Connection": "keep-alive" |
|
} |
|
}); |
|
} else { |
|
const fakeResponse = { |
|
"id": id, |
|
"object": "chat.completion", |
|
"created": created, |
|
"model": model, |
|
"choices": [ |
|
{ |
|
"index": 0, |
|
"finish_reason": "stop", |
|
"message": { |
|
"role": "assistant", |
|
"content": errorMessage |
|
} |
|
} |
|
] |
|
}; |
|
|
|
return new Response(JSON.stringify(fakeResponse), { |
|
status: 200, |
|
headers: { |
|
'Content-Type': 'application/json' |
|
} |
|
}); |
|
} |
|
} |
|
} catch (e) { |
|
// If anything goes wrong in interception, return original response |
|
} |
|
|
|
return response; |
|
}; |
|
globalThis._nanogpt_fetch_patched = true; |
|
})(); |
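// Illustrative sketch (hypothetical upstream payload): a 401 response whose body is
//   { "error": { "code": "invalid_api_key", "message": "Invalid session" } }
// matches the isInvalidKey signature above, so the patched fetch swallows the error and returns
// a synthetic 200 response whose assistant message carries the friendly warning text instead.
// Any other body shape falls through and the original error response is returned unchanged.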
|
|
|
// Export the transformer for use in the router |
|
module.exports = NanoGPTProductionTransformer; |