@adamteale
Last active March 11, 2026

πŸš€ The 2026 M4 Max Local-First AI Guide

Architecting the 110+ TPS "Senior-Junior" Local Brain

This document provides the definitive setup for a high-performance, local AI development environment on an M4 Max (36GB RAM). It leverages Speculative Decoding, AST-based Indexing, and Kernel-level Memory Tuning.

πŸ—οΈ 1. The Core Architecture (Senior-Junior)

On an M4 Max with 36GB, the "Sweet Spot" is running a 35B-parameter model (The Architect) and a 1.5B-parameter model (The Draft Model).

Speculative Decoding: Intelligence at 110+ TPS

  • The Junior Guess (1.5B): Sprints ahead to "guess" common code blocks (loops, imports).
  • The Senior Check (35B): Verifies those blocks in a single GPU pass.
  • The Result: You get 35B-level reasoning at nearly the speed of a 1.5B model.
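The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration of the mechanism only, not Ollama's implementation: `target` and `draft` are stand-in callables, and real decoders accept draft tokens probabilistically rather than by exact match.

```python
def speculative_generate(target, draft, prompt, n_tokens, k=4):
    """Toy speculative decoding: the draft proposes k tokens cheaply,
    the target verifies them in one pass and keeps the agreed prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Junior (draft): guess the next k tokens, one at a time.
        ctx = out[:]
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Senior (target): one verification pass over the whole guess.
        verified = target(out, k + 1)
        accepted = 0
        for guess, truth in zip(proposal, verified):
            if guess != truth:
                break
            accepted += 1
        # Keep the agreed prefix plus one token the target produced itself,
        # so progress is made even when every guess is wrong.
        out.extend(verified[:accepted + 1])
    return out[len(prompt):len(prompt) + n_tokens]
```

Note that the output is always exactly what the target alone would have produced; the draft model only changes how many expensive target passes are needed, which is where the 110+ TPS comes from.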

πŸ› οΈ 2. Hardware & OS Tuning (Crucial)

Run these commands in your terminal to unlock the GPU's full potential. By default, macOS throttles GPU access to ~75% of RAM; these overrides fix that.

A. Unlock Wired Memory (VRAM Override)

Allows the GPU to claim up to 32GB of your 36GB of unified memory. Note that `sysctl` changes do not survive a reboot, so re-run this command (or install a LaunchDaemon) after each restart.

sudo sysctl iogpu.wired_limit_mb=32768

B. Global Environment Variables

Optimize Ollama for high-speed attention and compressed memory. Run these and restart the Ollama app. Note that `launchctl setenv` affects apps launched by launchd (such as the Ollama menu-bar app), not already-running terminal sessions, and these values also reset on reboot.

# Enable High-Speed Attention
launchctl setenv OLLAMA_FLASH_ATTENTION 1

# Compresses the "Working Memory" (Saves ~4GB RAM on 32k context)
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

# Focus GPU power on one request at a time
launchctl setenv OLLAMA_NUM_PARALLEL 1

# Set stable 32k context window
launchctl setenv OLLAMA_NUM_CTX 32768
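The "~4GB saved" claim for the q8_0 cache can be sanity-checked with the standard KV-cache formula: 2 tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. The dimensions below are hypothetical stand-ins for a 35B-class model; the real architecture may differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Total KV-cache footprint: a K and a V tensor for every layer."""
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem)

# Hypothetical dims for a 35B-class model: 48 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(48, 8, 128, 32768, 2.0)      # f16: 2 bytes/element
q8   = kv_cache_bytes(48, 8, 128, 32768, 1.0625)   # q8_0: ~34 bytes per 32-element block
print(f"f16: {fp16 / 2**30:.1f} GiB, q8_0: {q8 / 2**30:.1f} GiB, "
      f"saved: {(fp16 - q8) / 2**30:.1f} GiB")
```

With these stand-in dimensions the saving at a 32k context is closer to 3 GiB; the exact figure depends on the model's actual layer and head counts.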

πŸ“„ 3. Proxy Configuration (llama-swap)

The proxy manages the handshake between your Junior and Senior models. Save the following as ~/llama-swap/config.yaml.

# ~/llama-swap/config.yaml
healthCheckTimeout: 600
logToStdout: "both"

models:
  "fast-coder":
    cmd: "sh -c 'OLLAMA_HOST=127.0.0.1:${PORT} /Applications/Ollama.app/Contents/Resources/ollama serve'"
    checkEndpoint: "/"
    healthCheckEndpoint: "/"
    env:
      - "OLLAMA_NUM_CTX=65536"
      - "OLLAMA_FLASH_ATTENTION=true"
      - "OLLAMA_KEEP_ALIVE=-1" # Keeps the junior ready to sprint
    useModelName: "qwen-fast-agent"
    ttl: 600

  "think-brain":
    # Standard serve command: speculative decoding is triggered via the env var below, not a --draft flag
    cmd: "sh -c 'OLLAMA_HOST=127.0.0.1:${PORT} /Applications/Ollama.app/Contents/Resources/ollama serve'"
    checkEndpoint: "/"
    healthCheckEndpoint: "/"
    env:
      - "OLLAMA_NUM_CTX=65536"
      - "OLLAMA_FLASH_ATTENTION=true"
      - "OLLAMA_NUM_PARALLEL=1"
      - "OLLAMA_KEEP_ALIVE=-1"
      # USE ENVIRONMENT VARIABLE: This is the stable way to trigger speculative decoding in 2026
      - "OLLAMA_DRAFT_MODEL=qwen2.5:1.5b"
    useModelName: "qwen3.5:35b-a3b"
    ttl: 3600
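A quick way to exercise the proxy from the Python standard library, to confirm routing works before wiring up the editor. The base URL, the `sk-1234` placeholder key, and the OpenAI-style `/chat/completions` path are assumptions carried over from the settings used elsewhere in this guide.

```python
import json
import urllib.request

def build_request(model: str, prompt: str,
                  base: str = "http://127.0.0.1:8080/v1") -> urllib.request.Request:
    """Build an OpenAI-style chat request against the llama-swap proxy.
    The proxy swaps in the right backend based on the "model" field."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer sk-1234"},
    )

# With the proxy running, send it and print the reply:
# resp = urllib.request.urlopen(build_request("think-brain", "Say hello"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```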

🧩 4. The Librarian (jCodeMunch-MCP)

Standard AI agents "brute-force" files by reading everything, which leads to Context Bloat and hallucinations. jCodeMunch turns your project into a structured database.

How it works:

  1. AST-Parsing: Instead of reading text, it parses your code into an Abstract Syntax Tree. It knows exactly where a class begins and ends.
  2. Surgical Retrieval: When the Architect asks "How is CharacterRepository implemented?", jCodeMunch returns only that class, not the 500 lines of unrelated imports and boilerplate around it.
  3. Local Intelligence: It allows a local 35B model to "navigate" a 100k-line codebase without actually loading 100k lines into memory.
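The surgical-retrieval idea is easy to demonstrate with Python's built-in `ast` module. This is a minimal sketch of the technique, not jCodeMunch's actual API:

```python
import ast

def extract_symbol(source: str, name: str) -> str:
    """Return just the source of one top-level class or function,
    instead of handing the model the entire file."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
                and node.name == name):
            return ast.get_source_segment(source, node)
    raise KeyError(f"symbol not found: {name}")

code = '''import os  # boilerplate the model never needs to see

class CharacterRepository:
    def load(self):
        return []

def unrelated_helper():
    pass
'''
print(extract_symbol(code, "CharacterRepository"))
```

The model receives only the `CharacterRepository` definition; the import and the unrelated helper never enter the context window.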

Key Benefits:

  • Precision: Find symbols (functions/classes) by logic, not just keyword matches.
  • Efficiency: Reduces token waste by ~90%.
  • Context Control: Keeps your context window "clean" for actual reasoning.

πŸ’» 5. VS Code Setup (Kilo Code)

To fully utilize this stack, follow these steps to integrate the Librarian and the Architect.

A. Global Settings & Mode Mapping

Configure Kilo Code to automatically use the 35B Architect (Kilo-Think) for high-level tasks and the 1.5B Junior (Kilo-Fast) for rapid coding. Save the following as JSON and import it into Kilo Code (Settings → About Kilo Code → Import).

{  
  "providerProfiles": {  
    "currentApiConfigName": "Kilo-Fast",  
    "apiConfigs": {  
      "Kilo-Fast": {  
        "diffEnabled": true,  
        "fuzzyMatchThreshold": 1,  
        "openAiBaseUrl": "http://127.0.0.1:8080/v1",
        "openAiApiKey": "sk-1234",  
        "openAiModelId": "fast-coder",
        "openAiStreamingEnabled": true,  
        "apiProvider": "openai"  
      },  
      "Kilo-Think": {  
        "diffEnabled": true,  
        "fuzzyMatchThreshold": 1,  
        "openAiBaseUrl": "http://127.0.0.1:8080/v1",
        "openAiApiKey": "sk-1234",  
        "openAiModelId": "think-brain",  
        "openAiStreamingEnabled": false,  
        "apiProvider": "openai"  
      }  
    },  
    "modeApiConfigs": {  
      "architect": "d305g8e9odi",  
      "code": "oampj108si",  
      "ask": "oampj108si",  
      "debug": "d305g8e9odi",  
      "orchestrator": "d305g8e9odi",  
      "review": "d305g8e9odi"  
    }  
  },  
  "globalSettings": {  
    "autoCondenseContext": true,  
    "autoCondenseContextPercent": 100,  
    "maxConcurrentFileReads": 5,  
    "allowVeryLargeReads": false,  
    "maxOpenTabsContext": 20,  
    "diffEnabled": true,  
    "experiments": {  
      "powerSteering": true,  
      "multiFileApplyDiff": true,  
      "speechToText": true  
    }  
  }  
}

B. Agent Behavior Optimization

  1. Profiles: Use Kilo-Think for Architect/Review modes and Kilo-Fast for Code/Ask modes. Both point at the llama-swap proxy on port 8080, which selects the backend model by name.
  2. Streaming: OFF for the Architect profile. This is the primary stability fix for local 35B models.

πŸ“Š 6. Monitoring Thresholds

| Metric | Target | Notes |
| --- | --- | --- |
| Baseline RAM | ~2.5GB | No models loaded. |
| Wired Memory | ~24.8GB | Model weights + draft + KV cache active. |
| The Danger Zone | 32.0GB+ | macOS begins swapping; speed drops to ~10 TPS. |
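For scripting alerts off these thresholds, a tiny helper like the following works; the cutoffs come from the table above, while the function itself is just an illustrative sketch.

```python
def memory_zone(wired_gb: float) -> str:
    """Classify a wired-memory reading against the thresholds above."""
    if wired_gb >= 32.0:
        return "danger: macOS is swapping, expect ~10 TPS"
    if wired_gb >= 24.0:
        return "loaded: weights + draft + KV cache resident"
    return "baseline: no models loaded"

print(memory_zone(24.8))  # → loaded: weights + draft + KV cache resident
```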

Mission: Intelligence without latency. Privacy without compromise.
