This document provides the definitive setup for a high-performance, local AI development environment on an M4 Max (36GB RAM). It leverages Speculative Decoding, AST-based Indexing, and Kernel-level Memory Tuning.
On an M4 Max with 36GB, the "Sweet Spot" is running a 35B-parameter model (The Architect) and a 1.5B-parameter model (The Draft Model).
- The Junior Guess (1.5B): Sprints ahead to "guess" common code blocks (loops, imports).
- The Senior Check (35B): Verifies those blocks in a single GPU pass.
- The Result: You get 35B-level reasoning at a significant speedup over running the 35B model alone.
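The arithmetic behind that speedup can be sketched with a back-of-envelope model. All numbers below (draft length, acceptance rate, per-token timings) are illustrative assumptions, not benchmarks of these specific models:

```python
# Back-of-envelope model of speculative decoding throughput.
# All inputs are illustrative assumptions, not measurements.

def expected_speedup(draft_len: int, accept_rate: float,
                     t_draft: float, t_verify: float) -> float:
    """Estimate speedup vs. running the large model alone.

    draft_len:   tokens the 1.5B drafts per round
    accept_rate: probability the 35B accepts each drafted token
    t_draft:     seconds per token for the draft model
    t_verify:    seconds per verification pass of the large model
                 (roughly one normal decode step, since the whole
                 draft is checked in a single batched forward pass)
    """
    # Expected accepted tokens per round: truncated geometric sum,
    # plus the one token the verifier always contributes itself.
    expected_accepted = sum(accept_rate ** k for k in range(1, draft_len + 1))
    tokens_per_round = expected_accepted + 1
    time_per_round = draft_len * t_draft + t_verify
    baseline_rate = 1 / t_verify  # large model alone: 1 token per pass
    return (tokens_per_round / time_per_round) / baseline_rate

# e.g. 4-token drafts, 70% acceptance, draft ~10x faster than verify
print(round(expected_speedup(4, 0.7, 0.005, 0.05), 2))  # ~2x
```

Note the speedup collapses below 1.0 if the acceptance rate is poor, which is why the draft model should share a tokenizer and training lineage with the large model.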
Run these commands in your terminal to unlock the GPU's full potential. By default, macOS caps the GPU's wired-memory allocation at roughly 75% of RAM; the override below raises that cap. Note that sysctl changes made this way do not persist across reboots.
```bash
# Allow the GPU to claim up to 32GB of your 36GB of unified RAM
sudo sysctl iogpu.wired_limit_mb=32768
```
Next, optimize Ollama for high-speed attention and compressed memory. Run the following, then restart the Ollama app so the settings take effect.
```bash
# Enable flash attention (high-speed attention kernel)
launchctl setenv OLLAMA_FLASH_ATTENTION 1
# Quantize the KV cache ("working memory"); saves ~4GB RAM at 32k context
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
# Focus GPU power on one request at a time
launchctl setenv OLLAMA_NUM_PARALLEL 1
# Set a stable 32k context window
launchctl setenv OLLAMA_NUM_CTX 32768
```
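The "~4GB saved" figure can be sanity-checked with a rough KV-cache sizing sketch. The layer and head counts below are assumed values for a 35B-class model, not published figures; the q8_0 overhead follows the llama.cpp block layout (32 int8 values plus one fp16 scale per block):

```python
# Rough KV-cache sizing to sanity-check the "~4GB saved with q8_0" claim.
# Layer/head counts are assumptions for a 35B-class model.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: float) -> float:
    # K and V each store layers x ctx x kv_heads x head_dim elements
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem

ctx = 32_768
f16 = kv_cache_bytes(64, 8, 128, ctx, 2.0)     # fp16: 2 bytes/element
q8 = kv_cache_bytes(64, 8, 128, ctx, 1.0625)   # q8_0: 34 bytes per 32 elements

gb = 1024 ** 3
print(f"f16: {f16/gb:.1f} GB, q8_0: {q8/gb:.1f} GB, saved: {(f16-q8)/gb:.1f} GB")
```

Under these assumptions the fp16 cache is about 8GB and q8_0 about 4.3GB, which lines up with the comment above.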
The proxy manages the handshake between your Junior and Senior models. Save the following as ~/llama-swap/config.yaml.
```yaml
# ~/llama-swap/config.yaml
healthCheckTimeout: 600
logToStdout: "both"

models:
  "fast-coder":
    cmd: "sh -c 'OLLAMA_HOST=127.0.0.1:${PORT} /Applications/Ollama.app/Contents/Resources/ollama serve'"
    checkEndpoint: "/"
    healthCheckEndpoint: "/"
    env:
      - "OLLAMA_NUM_CTX=65536"
      - "OLLAMA_FLASH_ATTENTION=true"
      - "OLLAMA_KEEP_ALIVE=-1"  # Keeps the junior loaded and ready to sprint
    useModelName: "qwen-fast-agent"
    ttl: 600

  "think-brain":
    # Standard serve command; speculative decoding is triggered via the
    # environment variable below rather than a --draft CLI flag.
    cmd: "sh -c 'OLLAMA_HOST=127.0.0.1:${PORT} /Applications/Ollama.app/Contents/Resources/ollama serve'"
    checkEndpoint: "/"
    healthCheckEndpoint: "/"
    env:
      - "OLLAMA_NUM_CTX=65536"
      - "OLLAMA_FLASH_ATTENTION=true"
      - "OLLAMA_NUM_PARALLEL=1"
      - "OLLAMA_KEEP_ALIVE=-1"
      # Attach the 1.5B draft model for speculative decoding
      - "OLLAMA_DRAFT_MODEL=qwen2.5:1.5b"
    useModelName: "qwen3.5:35b-a3b"
    ttl: 3600
```
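With this config, clients talk to a single OpenAI-compatible endpoint and llama-swap decides which backend to start from the `model` field of the request. A minimal sketch of the request shape (built locally here, not actually sent; the port and API key are the values used later in this guide):

```python
# Sketch of the OpenAI-style request an editor sends to the llama-swap
# proxy. llama-swap routes on the "model" field ("fast-coder" vs
# "think-brain" from config.yaml). Built but not transmitted here.

import json

def chat_request(model: str, prompt: str) -> dict:
    return {
        "url": "http://127.0.0.1:8080/v1/chat/completions",
        "payload": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            # Streaming stays on for the fast model, off for the 35B
            "stream": model == "fast-coder",
        },
    }

req = chat_request("think-brain", "Explain CharacterRepository.")
print(json.dumps(req["payload"], indent=2))
```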
Standard AI agents "brute-force" files by reading everything, which leads to Context Bloat and hallucinations. jCodeMunch turns your project into a structured database.
- AST-Parsing: Instead of reading text, it parses your code into an Abstract Syntax Tree. It knows exactly where a class begins and ends.
- Surgical Retrieval: When the Architect asks "How is CharacterRepository implemented?", jCodeMunch returns only that class, not the 500 lines of unrelated imports and boilerplate around it.
- Local Intelligence: It allows a local 35B model to "navigate" a 100k-line codebase without actually loading 100k lines into memory.
- Precision: Find symbols (functions/classes) by logic, not just keyword matches.
- Efficiency: Reduces token waste by ~90%.
- Context Control: Keeps your context window "clean" for actual reasoning.
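The surgical-retrieval idea can be sketched with Python's stdlib `ast` module. This is an illustration of the technique, not jCodeMunch's actual implementation:

```python
# Minimal sketch of AST-based "surgical retrieval": return only the
# source of one named symbol, not the whole file. Illustrative only;
# not jCodeMunch's actual code.

import ast

SOURCE = '''\
import os
import sys

class CharacterRepository:
    def find(self, name):
        return name

class Unrelated:
    pass
'''

def get_symbol(source: str, name: str) -> str:
    """Return just the source segment for one class or function."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef)) and node.name == name:
            # The AST knows the exact line/column span of the definition
            return ast.get_source_segment(source, node)
    raise KeyError(name)

print(get_symbol(SOURCE, "CharacterRepository"))
```

The caller receives the class body and nothing else: no imports, no neighboring symbols, no boilerplate.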
To fully utilize this stack, follow these steps to integrate the Librarian and the Architect.
Configure Kilo Code to automatically use the 35B Architect (Kilo-Think) for high-level tasks and the 1.5B Junior (Kilo-Fast) for rapid coding. Save the following as JSON and import it into Kilo Code (Settings -> About Kilo Code -> Import).
```json
{
  "providerProfiles": {
    "currentApiConfigName": "Kilo-Fast",
    "apiConfigs": {
      "Kilo-Fast": {
        "diffEnabled": true,
        "fuzzyMatchThreshold": 1,
        "openAiBaseUrl": "http://127.0.0.1:8080/v1",
        "openAiApiKey": "sk-1234",
        "openAiModelId": "fast-coder",
        "openAiStreamingEnabled": true,
        "apiProvider": "openai"
      },
      "Kilo-Think": {
        "diffEnabled": true,
        "fuzzyMatchThreshold": 1,
        "openAiBaseUrl": "http://127.0.0.1:8080/v1",
        "openAiApiKey": "sk-1234",
        "openAiModelId": "think-brain",
        "openAiStreamingEnabled": false,
        "apiProvider": "openai"
      }
    },
    "modeApiConfigs": {
      "architect": "d305g8e9odi",
      "code": "oampj108si",
      "ask": "oampj108si",
      "debug": "d305g8e9odi",
      "orchestrator": "d305g8e9odi",
      "review": "d305g8e9odi"
    }
  },
  "globalSettings": {
    "autoCondenseContext": true,
    "autoCondenseContextPercent": 100,
    "maxConcurrentFileReads": 5,
    "allowVeryLargeReads": false,
    "maxOpenTabsContext": 20,
    "diffEnabled": true,
    "experiments": {
      "powerSteering": true,
      "multiFileApplyDiff": true,
      "speechToText": true
    }
  }
}
```
- Profiles: Use Kilo-Think for Architect/Review modes and Kilo-Fast for Code/Ask modes. Both profiles point at the llama-swap proxy on port 8080, which routes to the correct backend by model name ("think-brain" vs "fast-coder").
- Streaming: OFF for the Architect profile. This is the primary stability fix for local 35B models.
| Metric | Target | Notes |
|---|---|---|
| Baseline RAM | ~2.5GB | No models loaded. |
| Wired Memory | ~24.8GB | Model Weights + Draft + KV Cache active. |
| The Danger Zone | 32.0GB+ | macOS begins swapping; speed drops to ~10 TPS. |
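For monitoring scripts, the table can be mirrored as a tiny classifier. The 32GB threshold comes from the table above; the 24GB boundary for "fully loaded" is an assumption based on the ~24.8GB wired figure:

```python
# Classify a wired-memory reading (GB) against the zones in the table.
# Thresholds mirror the table above; the 24GB boundary is an assumption.

def memory_zone(wired_gb: float) -> str:
    if wired_gb >= 32.0:
        return "danger: macOS begins swapping, expect ~10 TPS"
    if wired_gb >= 24.0:
        return "loaded: weights + draft + KV cache resident"
    return "idle/partial"

print(memory_zone(24.8))
print(memory_zone(33.0))
```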
Mission: intelligence without latency. Privacy without compromise.