@adamteale
Last active March 11, 2026

πŸš€ The 2026 M4 Max Local-First AI Guide

Architecting the 110+ TPS "Senior-Junior" Local Brain

This document provides the definitive setup for a high-performance, local AI development environment on an M4 Max (36GB RAM). It leverages Speculative Decoding, AST-based Indexing, and Kernel-level Memory Tuning.

πŸ—οΈ 1. The Core Architecture (Senior-Junior)

On an M4 Max with 36GB, the "Sweet Spot" is running a 35B-parameter model (The Architect) and a 1.5B-parameter model (The Draft Model).

Speculative Decoding: Intelligence at 110+ TPS

  • The Junior Guess (1.5B): Sprints ahead to "guess" common code blocks (loops, imports).
  • The Senior Check (35B): Verifies those blocks in a single GPU pass.
  • The Result: You get 35B-level reasoning at nearly the speed of a 1.5B model.
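The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration of the mechanism only, not Ollama's implementation: `target` and `draft` are stand-in callables, and real decoders accept draft tokens probabilistically rather than by exact match.

```python
def speculative_generate(target, draft, prompt, n_tokens, k=4):
    """Toy speculative decoding: the draft proposes k tokens cheaply,
    the target verifies them in one pass and keeps the agreed prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Junior (draft): guess the next k tokens, one at a time.
        ctx = out[:]
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Senior (target): one verification pass over the whole guess.
        verified = target(out, k + 1)
        accepted = 0
        for guess, truth in zip(proposal, verified):
            if guess != truth:
                break
            accepted += 1
        # Keep the agreed prefix plus one token the target produced itself,
        # so progress is made even when every guess is wrong.
        out.extend(verified[:accepted + 1])
    return out[len(prompt):len(prompt) + n_tokens]
```

Note that the output is always exactly what the target alone would have produced; the draft model only changes how many expensive target passes are needed, which is where the 110+ TPS comes from.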

πŸ› οΈ 2. Hardware & OS Tuning (Crucial)

Run these commands in your terminal to unlock the GPU's full potential. By default, macOS throttles GPU access to ~75% of RAM; these overrides fix that.

A. Unlock Wired Memory (VRAM Override)

Allows the GPU to claim up to 32GB of your 36GB of unified memory. Note that `sysctl` changes do not survive a reboot, so re-run this command (or install a LaunchDaemon) after each restart.

sudo sysctl iogpu.wired_limit_mb=32768

B. Global Environment Variables

Optimize Ollama for high-speed attention and compressed memory. Run these and restart the Ollama app. Note that `launchctl setenv` affects apps launched by launchd (such as the Ollama menu-bar app), not already-running terminal sessions, and these values also reset on reboot.

# Enable High-Speed Attention
launchctl setenv OLLAMA_FLASH_ATTENTION 1

# Compresses the "Working Memory" (Saves ~4GB RAM on 32k context)
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

# Focus GPU power on one request at a time
launchctl setenv OLLAMA_NUM_PARALLEL 1

# Set stable 32k context window
launchctl setenv OLLAMA_NUM_CTX 32768
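The "~4GB saved" claim for the q8_0 cache can be sanity-checked with the standard KV-cache formula: 2 tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. The dimensions below are hypothetical stand-ins for a 35B-class model; the real architecture may differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Total KV-cache footprint: a K and a V tensor for every layer."""
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem)

# Hypothetical dims for a 35B-class model: 48 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(48, 8, 128, 32768, 2.0)      # f16: 2 bytes/element
q8   = kv_cache_bytes(48, 8, 128, 32768, 1.0625)   # q8_0: ~34 bytes per 32-element block
print(f"f16: {fp16 / 2**30:.1f} GiB, q8_0: {q8 / 2**30:.1f} GiB, "
      f"saved: {(fp16 - q8) / 2**30:.1f} GiB")
```

With these stand-in dimensions the saving at a 32k context is closer to 3 GiB; the exact figure depends on the model's actual layer and head counts.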

πŸ“„ 3. Proxy Configuration (llama-swap)

The proxy manages the handshake between your Junior and Senior models. Save the following as ~/llama-swap/config.yaml.

# ~/llama-swap/config.yaml
healthCheckTimeout: 600
logToStdout: "both"

models:
  "fast-coder":
    cmd: "sh -c 'OLLAMA_HOST=127.0.0.1:${PORT} /Applications/Ollama.app/Contents/Resources/ollama serve'"
    checkEndpoint: "/"
    healthCheckEndpoint: "/"
    env:
      - "OLLAMA_NUM_CTX=65536"
      - "OLLAMA_FLASH_ATTENTION=true"
      - "OLLAMA_KEEP_ALIVE=-1" # Keeps the junior ready to sprint
    useModelName: "qwen-fast-agent"
    ttl: 600

  "think-brain":
    # Standard serve command: speculative decoding is triggered via the env var below, not a --draft flag
    cmd: "sh -c 'OLLAMA_HOST=127.0.0.1:${PORT} /Applications/Ollama.app/Contents/Resources/ollama serve'"
    checkEndpoint: "/"
    healthCheckEndpoint: "/"
    env:
      - "OLLAMA_NUM_CTX=65536"
      - "OLLAMA_FLASH_ATTENTION=true"
      - "OLLAMA_NUM_PARALLEL=1"
      - "OLLAMA_KEEP_ALIVE=-1"
      # USE ENVIRONMENT VARIABLE: This is the stable way to trigger speculative decoding in 2026
      - "OLLAMA_DRAFT_MODEL=qwen2.5:1.5b"
    useModelName: "qwen3.5:35b-a3b"
    ttl: 3600
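A quick way to exercise the proxy from the Python standard library, to confirm routing works before wiring up the editor. The base URL, the `sk-1234` placeholder key, and the OpenAI-style `/chat/completions` path are assumptions carried over from the settings used elsewhere in this guide.

```python
import json
import urllib.request

def build_request(model: str, prompt: str,
                  base: str = "http://127.0.0.1:8080/v1") -> urllib.request.Request:
    """Build an OpenAI-style chat request against the llama-swap proxy.
    The proxy swaps in the right backend based on the "model" field."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer sk-1234"},
    )

# With the proxy running, send it and print the reply:
# resp = urllib.request.urlopen(build_request("think-brain", "Say hello"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```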

🧩 4. The Librarian (jCodeMunch-MCP)

Standard AI agents "brute-force" files by reading everything, which leads to Context Bloat and hallucinations. jCodeMunch turns your project into a structured database.

How it works:

  1. AST-Parsing: Instead of reading text, it parses your code into an Abstract Syntax Tree. It knows exactly where a class begins and ends.
  2. Surgical Retrieval: When the Architect asks "How is CharacterRepository implemented?", jCodeMunch returns only that class, not the 500 lines of unrelated imports and boilerplate around it.
  3. Local Intelligence: It allows a local 35B model to "navigate" a 100k-line codebase without actually loading 100k lines into memory.
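The surgical-retrieval idea is easy to demonstrate with Python's built-in `ast` module. This is a minimal sketch of the technique, not jCodeMunch's actual API:

```python
import ast

def extract_symbol(source: str, name: str) -> str:
    """Return just the source of one top-level class or function,
    instead of handing the model the entire file."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
                and node.name == name):
            return ast.get_source_segment(source, node)
    raise KeyError(f"symbol not found: {name}")

code = '''import os  # boilerplate the model never needs to see

class CharacterRepository:
    def load(self):
        return []

def unrelated_helper():
    pass
'''
print(extract_symbol(code, "CharacterRepository"))
```

The model receives only the `CharacterRepository` definition; the import and the unrelated helper never enter the context window.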

Key Benefits:

  • Precision: Find symbols (functions/classes) by logic, not just keyword matches.
  • Efficiency: Reduces token waste by ~90%.
  • Context Control: Keeps your context window "clean" for actual reasoning.

πŸ’» 5. VS Code Setup (Kilo Code)

To fully utilize this stack, follow these steps to integrate the Librarian and the Architect.

A. Global Settings & Mode Mapping

Configure Kilo Code to automatically use the 35B Architect (Kilo-Think) for high-level tasks and the 1.5B Junior (Kilo-Fast) for rapid coding. Save the following as JSON and import it into Kilo Code (Settings → About Kilo Code → Import).

{  
  "providerProfiles": {  
    "currentApiConfigName": "Kilo-Fast",  
    "apiConfigs": {  
      "Kilo-Fast": {  
        "diffEnabled": true,  
        "fuzzyMatchThreshold": 1,  
        "openAiBaseUrl": "http://127.0.0.1:8080/v1",
        "openAiApiKey": "sk-1234",  
        "openAiModelId": "fast-coder",
        "openAiStreamingEnabled": true,  
        "apiProvider": "openai"  
      },  
      "Kilo-Think": {  
        "diffEnabled": true,  
        "fuzzyMatchThreshold": 1,  
        "openAiBaseUrl": "http://127.0.0.1:8080/v1",
        "openAiApiKey": "sk-1234",  
        "openAiModelId": "think-brain",  
        "openAiStreamingEnabled": false,  
        "apiProvider": "openai"  
      }  
    },  
    "modeApiConfigs": {  
      "architect": "d305g8e9odi",  
      "code": "oampj108si",  
      "ask": "oampj108si",  
      "debug": "d305g8e9odi",  
      "orchestrator": "d305g8e9odi",  
      "review": "d305g8e9odi"  
    }  
  },  
  "globalSettings": {  
    "autoCondenseContext": true,  
    "autoCondenseContextPercent": 100,  
    "maxConcurrentFileReads": 5,  
    "allowVeryLargeReads": false,  
    "maxOpenTabsContext": 20,  
    "diffEnabled": true,  
    "experiments": {  
      "powerSteering": true,  
      "multiFileApplyDiff": true,  
      "speechToText": true  
    }  
  }  
}

B. Agent Behavior Optimization

  1. Profiles: Use Kilo-Think for Architect/Review modes and Kilo-Fast for Code/Ask modes. Both point at the llama-swap proxy on port 8080, which selects the backend model by name.
  2. Streaming: OFF for the Architect profile. This is the primary stability fix for local 35B models.

πŸ“Š 6. Monitoring Thresholds

| Metric | Target | Notes |
| --- | --- | --- |
| Baseline RAM | ~2.5GB | No models loaded. |
| Wired Memory | ~24.8GB | Model weights + draft + KV cache active. |
| The Danger Zone | 32.0GB+ | macOS begins swapping; speed drops to ~10 TPS. |
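For scripting alerts off these thresholds, a tiny helper like the following works; the cutoffs come from the table above, while the function itself is just an illustrative sketch.

```python
def memory_zone(wired_gb: float) -> str:
    """Classify a wired-memory reading against the thresholds above."""
    if wired_gb >= 32.0:
        return "danger: macOS is swapping, expect ~10 TPS"
    if wired_gb >= 24.0:
        return "loaded: weights + draft + KV cache resident"
    return "baseline: no models loaded"

print(memory_zone(24.8))  # → loaded: weights + draft + KV cache resident
```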

Mission: Intelligence without latency. Privacy without compromise.
