@greenstevester
Last active April 3, 2026 23:07
April 2026 TLDR setup for Ollama + Gemma 4 on a Mac mini (Apple Silicon): auto-start, preload, and keep-alive


Prerequisites

  • Mac mini with Apple Silicon (M1/M2/M3/M4/M5)
  • At least 16GB unified memory for Gemma 4 (default 8B)
  • macOS with Homebrew installed

Step 1: Install Ollama

Install the Ollama macOS app via Homebrew cask (includes auto-updates and MLX backend):

brew install --cask ollama-app

This installs:

  • Ollama.app in /Applications/
  • ollama CLI at /opt/homebrew/bin/ollama

Step 2: Start Ollama

open -a Ollama

The Ollama icon will appear in the menu bar. Wait a few seconds for the server to initialize.

Verify it's running:

ollama list
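
Because the server takes a few seconds to come up, a scripted setup may prefer to poll for it rather than sleep for a fixed time. The following is a minimal sketch, assuming the default address http://localhost:11434 and Ollama's /api/version endpoint. The function name wait_for_ollama and the 30-second default are illustrative choices, not part of Ollama.

```shell
# Poll the local Ollama API until it answers, or give up after a timeout.
# wait_for_ollama is a hypothetical helper; the timeout (seconds) is its first argument.
wait_for_ollama() {
  local timeout="${1:-30}"
  local waited=0
  # -f makes curl return non-zero on HTTP errors, -s keeps it quiet.
  until curl -sf "http://localhost:11434/api/version" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Ollama did not respond within ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Ollama is up after ${waited}s"
}
```

Typical use would be `open -a Ollama && wait_for_ollama 60 && ollama list`.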

Step 3: Pull Gemma 4

ollama pull gemma4

This downloads ~9.6GB. Verify:

ollama list
# NAME             ID              SIZE      MODIFIED
# gemma4:latest    ...             9.6 GB    ...
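
If this step is scripted, the ~9.6GB pull can be skipped when the model is already present. A small sketch, assuming `ollama list` prints a header row followed by one model per line with the name in the first column, as in the output above. The helper name ensure_model is hypothetical.

```shell
# Pull a model only when `ollama list` does not already show it.
# ensure_model is a hypothetical helper name, not an Ollama command.
ensure_model() {
  local model="$1"
  # Skip the header row, take the NAME column, and look for an exact match.
  if ollama list | awk 'NR > 1 { print $1 }' | grep -qxF "$model"; then
    echo "$model already present"
  else
    ollama pull "$model"
  fi
}
```

For example: `ensure_model gemma4:latest`.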

Note on model sizing: We originally ran gemma4:26b, but it consumed about 17GB of the Mac mini's 24GB unified memory, leaving the system barely responsive and swapping frequently under concurrent requests. We downgraded to the default gemma4:latest (8B, Q4_K_M quantization, ~9.6GB), which runs comfortably with headroom to spare.

Step 4: Test the Model

ollama run gemma4:latest "Hello, what model are you?"

Check that it's using GPU acceleration:

ollama ps
# Should show CPU/GPU split, e.g. 14%/86% CPU/GPU
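
To check the GPU share from a script rather than by eye, the PROCESSOR column can be parsed. This is a sketch only: it assumes the column reads either "N%/M% CPU/GPU" (as in the example above), "N% GPU", or "N% CPU", and the helper name gpu_share is hypothetical.

```shell
# Read `ollama ps` output on stdin and print the GPU percentage
# of the first loaded model. Prints nothing if no model is loaded.
gpu_share() {
  awk 'NR > 1 {
    if (match($0, "[0-9]+%/[0-9]+% CPU/GPU")) {
      # Split "14%/86% CPU/GPU" on "%" and "/"; the second field is the GPU share.
      split(substr($0, RSTART, RLENGTH), parts, "[%/]+")
      print parts[2]
    } else if (match($0, "[0-9]+% GPU")) {
      # Fully on GPU: drop the trailing "% GPU" (5 characters).
      print substr($0, RSTART, RLENGTH - 5)
    } else {
      # No GPU figure found, so treat it as CPU-only.
      print 0
    }
    exit
  }'
}
```

Usage: `ollama ps | gpu_share`. A low number suggests part of the model has spilled to CPU.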

Step 5: Configure Auto-Start on Login

5a. Ollama App — Launch at Login

Click the Ollama icon in the menu bar > Launch at Login (enable it).

Alternatively, go to System Settings > General > Login Items and add Ollama.

5b. Auto-Preload Gemma 4 on Startup

Create a launch agent that loads the model into memory after Ollama starts and keeps it warm:

cat << 'EOF' > ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.preload-gemma4</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>run</string>
        <string>gemma4:latest</string>
        <string></string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>StartInterval</key>
    <integer>300</integer>
    <key>StandardOutPath</key>
    <string>/tmp/ollama-preload.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama-preload.log</string>
</dict>
</plist>
EOF

Load the agent:

launchctl load ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist

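The agent runs ollama run with an empty prompt once at login (RunAtLoad) and then every 300 seconds (StartInterval). Each run loads the model if it is not already in memory and resets its idle timer, keeping it warm.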
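
The same preload can also be done over the HTTP API instead of the CLI. The sketch below assumes Ollama's /api/generate endpoint, which loads a model when called without a prompt, and its keep_alive request field, where -1 asks the server not to unload the model. The helper names preload_payload and preload_model are hypothetical.

```shell
# Build the JSON body for a prompt-less request that pins the model in memory.
# "keep_alive": -1 asks the server to keep the model loaded indefinitely.
preload_payload() {
  printf '{"model": "%s", "keep_alive": -1}' "$1"
}

# Send the request to the local server; with no prompt, the model is only loaded.
preload_model() {
  curl -s http://localhost:11434/api/generate -d "$(preload_payload "$1")"
}
```

Calling `preload_model gemma4:latest` once after the server is up should have the same effect as the empty `ollama run` prompt. Since the keep-alive travels with the request, no periodic re-run is needed in that case.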

5c. Keep Models Loaded Indefinitely

By default, Ollama unloads models after 5 minutes of inactivity. To keep them loaded forever:

launchctl setenv OLLAMA_KEEP_ALIVE "-1"

Then restart Ollama for the change to take effect.

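Note: A value set with launchctl setenv lasts only until the next reboot. Adding export OLLAMA_KEEP_ALIVE="-1" to ~/.zshrc affects only processes started from a terminal (for example, ollama serve); the menu bar app does not read shell startup files. To persist the setting for the app, re-apply it at login with a dedicated launch agent.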
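
One way to build such a launch agent is sketched below. The label com.ollama.keepalive-env and the helper name write_keepalive_agent are hypothetical; the agent simply runs launchctl setenv at login. Ordering is an assumption to verify on your machine: if Ollama starts before this agent has run, it will not see the variable until it is restarted.

```shell
# Write a launch agent that re-applies OLLAMA_KEEP_ALIVE at every login.
# The target path can be overridden with the first argument.
write_keepalive_agent() {
  local plist="${1:-$HOME/Library/LaunchAgents/com.ollama.keepalive-env.plist}"
  cat << 'EOF' > "$plist"
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.keepalive-env</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/launchctl</string>
        <string>setenv</string>
        <string>OLLAMA_KEEP_ALIVE</string>
        <string>-1</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
EOF
}
```

After `write_keepalive_agent`, load it the same way as the preload agent: `launchctl load ~/Library/LaunchAgents/com.ollama.keepalive-env.plist`.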

Step 6: Verify Everything Works

# Check Ollama server is running
ollama list

# Check model is loaded in memory
ollama ps

# Check launch agent is registered
launchctl list | grep ollama

Expected output from ollama ps:

NAME             ID              SIZE      PROCESSOR          CONTEXT    UNTIL
gemma4:latest    ...             9.6 GB    14%/86% CPU/GPU    4096       Forever

API Access

Ollama exposes a local API at http://localhost:11434. Use it with coding agents:

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
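
Many coding agents and SDKs that speak the OpenAI API can be pointed at this endpoint through environment variables. The sketch below uses OPENAI_BASE_URL and OPENAI_API_KEY, which the official OpenAI SDKs read. Whether a particular agent honors them is an assumption to check in its own documentation.

```shell
# Point OpenAI-compatible clients at the local Ollama server.
export OPENAI_BASE_URL="http://localhost:11434/v1"
# Ollama does not check the key, but the OpenAI SDKs require one to be set.
export OPENAI_API_KEY="ollama"
```

With these set, the client still has to be told which model to use, for example gemma4:latest.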

Useful Commands

Command                      Description
ollama list                  List downloaded models
ollama ps                    Show running models & memory usage
ollama run gemma4:latest     Interactive chat
ollama stop gemma4:latest    Unload model from memory
ollama pull gemma4:latest    Update model to latest version
ollama rm gemma4:latest      Delete model

Uninstall / Remove Auto-Start

# Remove the preload agent
launchctl unload ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
rm ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist

# Uninstall Ollama
brew uninstall --cask ollama-app

What's New in Ollama v0.19+ (March 31, 2026)

MLX Backend on Apple Silicon

On Apple Silicon, Ollama automatically uses Apple's MLX framework for faster inference — no manual configuration needed. M5/M5 Pro/M5 Max chips get additional acceleration via GPU Neural Accelerators. M4 and earlier still benefit from general MLX speedups.

NVFP4 Support (NVIDIA)

Ollama now uses NVIDIA's NVFP4 format, which maintains model accuracy while reducing the memory bandwidth and storage needed for inference. As more inference providers adopt NVFP4, Ollama users get the same results they would see in a production environment. It also lets Ollama run models optimized by NVIDIA's model optimizer.

Improved Caching for Coding and Agentic Tasks

  • Lower memory utilization: Ollama reuses its cache across conversations, meaning less memory utilization and more cache hits when branching with a shared system prompt — especially useful with tools like Claude Code.
  • Intelligent checkpoints: Ollama stores snapshots of its cache at intelligent locations in the prompt, resulting in less prompt processing and faster responses.
  • Smarter eviction: Shared prefixes survive longer even when older branches are dropped.

Notes

  • Memory: Gemma 4 (default 8B) uses ~9.6GB when loaded. On a 24GB Mac mini, this leaves ~14GB for the system — comfortable for concurrent requests.
  • Why not 26B? The 26B variant consumed ~17GB, leaving only ~7GB for macOS and other processes. Under concurrent Ollama requests the system would swap heavily, become unresponsive, and occasionally kill processes. The 8B default offers a much better experience on 24GB machines.
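
The headroom figures above are simple subtraction, which can be reproduced for other memory sizes or models. A small sketch, where headroom is a hypothetical helper and awk is used because POSIX shell arithmetic is integer-only:

```shell
# Memory left for macOS and other processes after loading a model, in GB.
# Arguments: total unified memory, memory used by the loaded model.
headroom() {
  awk -v total="$1" -v used="$2" 'BEGIN { printf "%.1f\n", total - used }'
}

headroom 24 9.6   # default gemma4 on a 24GB Mac mini -> 14.4
headroom 24 17    # gemma4:26b on the same machine   -> 7.0
```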

@kylehotchkiss

26B killed the mac mini.

Please elaborate on killed 😂
