Ollama: Practical Reference (macOS · Linux · Windows)

Local LLMs with a Docker-like CLI. This doc shows how to install, run, manage RAM/disk usage, and clean up.

TL;DR (cheat-sheet)

# install (mac)
brew install --cask ollama-app   # or: brew install ollama

# pull + chat
ollama pull mistral-nemo:latest
ollama run mistral-nemo "Explain diffusion models in 1 paragraph."

# list/running/stop
ollama list
ollama ps
ollama stop mistral-nemo         # unload it from RAM

# delete from disk
ollama rm mistral-nemo           # removes model files from ~/.ollama/models

# disk location + keep-alive
echo $OLLAMA_MODELS              # custom model dir (optional)
ollama run mistral-nemo --keepalive 60s "Hello"

1) Install

macOS

brew install --cask ollama-app   # Mac app + CLI (starts background server)
# or, CLI formula only:
brew install ollama

Notes:

  • Uses Apple Silicon's Metal by default—no flags needed.
  • Models live under ~/.ollama/models unless you set OLLAMA_MODELS.

Linux

One-liner (official script):

curl -fsSL https://ollama.com/install.sh | sh

The script installs a system-level systemd service; verify:

sudo systemctl status ollama
# (re)enable and start it if needed:
sudo systemctl enable --now ollama

NVIDIA (Linux): If CUDA drivers are present, Ollama uses them automatically. Keep your NVIDIA driver + CUDA runtime up to date.
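
A quick way to confirm the GPU is actually in use (a sketch; assumes an NVIDIA card and that mistral-nemo is already pulled):

# load a model in the background, then see what the driver and Ollama report
ollama run mistral-nemo "Say hi." > /dev/null &
sleep 5
nvidia-smi     # the ollama process should appear with VRAM allocated
ollama ps      # the PROCESSOR column should read "100% GPU" (or a CPU/GPU split)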

Docker (optional, Linux/macOS):

docker run -d --name ollama --restart=unless-stopped \
  -p 11434:11434 \
  -v ~/.ollama:/root/.ollama \
  --gpus=all \
  ollama/ollama

Keep the volume mount so model files persist. --gpus=all assumes an NVIDIA GPU plus the NVIDIA Container Toolkit on Linux; drop it on CPU-only hosts, and note that Docker on macOS has no GPU passthrough, so containers run CPU-only there.
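
The CLI then lives inside the container; a minimal sketch using the container name from the command above:

docker exec -it ollama ollama pull mistral-nemo:latest
docker exec -it ollama ollama run mistral-nemo "Hello from the container"
# the published port still works from the host:
curl http://localhost:11434/api/tags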

Windows

  • Install the official Ollama for Windows installer (includes a background service and ollama.exe).
  • WSL2 users can also install via the Linux script inside WSL.

2) Daily workflow (it's Docker-ish)

# pull
ollama pull mistral-nemo:latest

# interactive chat (REPL)
ollama run mistral-nemo

# one-shot prompt
ollama run mistral-nemo "Summarize k-means vs k-medoids."

# list downloaded models
ollama list

# show running/loaded models (in RAM)
ollama ps

# stop a running model (free its RAM)
ollama stop mistral-nemo         # repeat for each model shown in `ollama ps`

# delete from disk
ollama rm mistral-nemo

Where models live

  • Default: ~/.ollama/models/

  • Change it (e.g., to an external SSD):

    export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
    ollama serve &  # or restart the app/service

3) RAM usage & performance control

  • Idle server uses little RAM. Big RAM usage happens only while a model is loaded.
  • Apple Silicon uses unified memory; expect RAM to rise during prompts and with large contexts.

Key knobs (per run or via API):

# keep the model warm or unload quickly
ollama run mistral-nemo --keepalive 30s "hello"   # unload 30s after last use
# 0 = unload immediately; -1 = keep forever

# reduce context if you don't need huge windows (saves KV cache memory);
# num_ctx isn't a run flag; set it inside the REPL or via the API "options" (section 5):
ollama run mistral-nemo
>>> /set parameter num_ctx 4096

# CPU thread count (when CPU is used alongside GPU) works the same way:
>>> /set parameter num_thread 10

Quantization: Smaller quant tags → smaller memory, faster loads, slightly less quality.

  • Examples: :q4_K_M, :q5_K_M

  • Pull explicitly (exact quant tags vary per model; check the Tags tab on ollama.com/library/<model>):

    ollama pull mistral-nemo:q4_K_M
    ollama run mistral-nemo:q4_K_M
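
Once two quants are pulled, their on-disk sizes are easy to compare (tag names above are examples; the SIZE column comes from ollama list):

ollama list | grep mistral-nemo   # pick the smallest quant that still answers well for your tasks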

4) Disk space management

See what's installed (with sizes):

ollama list

Remove models you don't use:

ollama rm mistral-nemo
ollama rm gemma3:12b

Move model store to a bigger disk:

# 1) Unload any running models (ollama stop <name> for each entry in `ollama ps`), then quit the server/app
# 2) Move files
mv ~/.ollama /Volumes/ExtSSD/ollama
# 3) Point OLLAMA_MODELS there (shell or service env)
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama/models"
# 4) Restart server
ollama serve &

On Linux with systemd, set Environment=OLLAMA_MODELS=/mnt/ssd/ollama/models in a drop-in override (see the service section below), then sudo systemctl daemon-reload && sudo systemctl restart ollama.

Inspect model folder size:

du -sh ~/.ollama/models
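
Model weights are stored as content-addressed blobs, so plain du shows where the space actually went (a disk-usage sketch, not an Ollama command):

du -sh ~/.ollama/models/blobs/* | sort -h | tail -n 10   # ten largest blobs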

5) API usage

The local server listens on http://localhost:11434.

Generate (one pass):

curl http://localhost:11434/api/generate -d '{
  "model": "mistral-nemo",
  "prompt": "Give me three bullet points on SIMD.",
  "options": { "num_ctx": 4096, "num_thread": 10, "temperature": 0.2 },
  "keep_alive": "60s"
}'
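
By default /api/generate streams NDJSON chunks. If you only want the final text as one JSON object, set "stream": false and extract the response field (jq assumed installed):

curl -s http://localhost:11434/api/generate -d '{
  "model": "mistral-nemo",
  "prompt": "Give me three bullet points on SIMD.",
  "stream": false
}' | jq -r .response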

Chat (multi-turn; resend the message history on each call):

curl http://localhost:11434/api/chat -d '{
  "model": "mistral-nemo",
  "messages": [
    {"role":"user","content":"You are a senior dev. Briefly explain CAP theorem."}
  ]
}'

Show model info / defaults:

curl -s http://localhost:11434/api/show -d '{"model":"mistral-nemo"}' | jq
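
Two other read-only endpoints that are handy for scripting (neither loads a model):

curl -s http://localhost:11434/api/tags | jq -r '.models[].name'   # local models, same data as `ollama list`
curl -s http://localhost:11434/api/version                         # server version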

6) Linux service (systemd)

The install script creates a system-level service (/etc/systemd/system/ollama.service). To customize its environment (e.g., model path or a default keep-alive), add a drop-in override:

sudo systemctl edit ollama

Paste:

[Service]
Environment=OLLAMA_MODELS=/mnt/ssd/ollama/models
Environment=OLLAMA_KEEP_ALIVE=30s

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama
systemctl status ollama
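
To confirm the override took effect (paths match the example above):

sudo systemctl cat ollama                  # prints the unit plus your drop-in override
systemctl show ollama -p Environment       # the merged Environment= values the service sees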

7) Choosing a model (12B sweet spot)

Good general 12B options for an M-series MBP (32 GB):

ollama pull mistral-nemo:latest     # strong general model
ollama pull gemma3:12b              # compact, multilingual
# coding-focused (larger, if you want):
ollama pull qwen2.5-coder:14b

Pick a q4 quant for fast chat and smaller RAM; bump up if you need more quality and can spare the memory.


8) Troubleshooting

"Why is RAM still high after I close the REPL?" Model is kept warm. Use a shorter keep-alive or stop it:

ollama ps
ollama stop <name>   # repeat for each loaded model

"I need more disk space." ollama listollama rm <model>. Consider moving OLLAMA_MODELS to a larger drive.

"GPU isn't used on Linux/NVIDIA." Update NVIDIA driver + CUDA runtime; restart. Container users must pass --gpus=all.

"Server not responding on 11434." Start/enable it:

# macOS: open the Ollama app, or:
ollama serve &

# Linux:
sudo systemctl enable --now ollama
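
A quick reachability check once something should be running:

curl -s http://localhost:11434/ && echo   # prints "Ollama is running" when the server is up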

9) Uninstall / reset

macOS (Homebrew)

brew uninstall --cask ollama-app || brew uninstall ollama
rm -rf ~/.ollama

Linux

sudo systemctl disable --now ollama || true
rm -rf ~/.ollama
# remove installed binary (depends on the installer's path)
which ollama && sudo rm -f $(which ollama)

Windows

  • Uninstall via Apps & Features; delete %UserProfile%\.ollama if you want to reclaim space.

10) Opinionated defaults (what I use)

  • Use a q4 quant for 12B models.

  • Set a short keep-alive when you're tight on RAM:

    export OLLAMA_KEEP_ALIVE=30s
  • Keep models on a fast external SSD:

    export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
  • Don't run multiple big models concurrently unless you intend to.
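
A sketch of making those defaults stick on macOS (assumes zsh; the menu-bar app doesn't read shell profiles, so it needs launchctl setenv plus an app restart):

# CLI / `ollama serve` from a terminal:
cat >> ~/.zshrc <<'EOF'
export OLLAMA_KEEP_ALIVE=30s
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
EOF

# macOS menu-bar app (GUI apps don't see shell exports); restart the app afterwards:
launchctl setenv OLLAMA_KEEP_ALIVE 30s
launchctl setenv OLLAMA_MODELS "/Volumes/ExtSSD/ollama-models"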


Run or Pipe Prompts into Ollama

1) Put the prompt in a file and pipe via stdin (recommended)

# create the file
cat > prompt.txt <<'EOF'
[ paste your full etymology/cognate prompt here ]
EOF

# run it
ollama run mistral-nemo:12b < prompt.txt

# or
cat prompt.txt | ollama run mistral-nemo:12b

2) Pass the prompt as a direct argument (fine for small prompts)

ollama run mistral-nemo:12b "Explain event loop vs threads in Node.js"

3) If you need a system prompt too, use the HTTP API (lets you send both system and prompt)

# (A) just prompt:
curl -s http://localhost:11434/api/generate \
  -d @<(jq -n --arg m "mistral-nemo:12b" --arg p "$(cat prompt.txt)" \
        '{model:$m, prompt:$p}')

# (B) with a system prompt
cat > system.txt <<'SYS'
You are a precise assistant that outputs valid JSON when asked and follows explicit formatting rules.
SYS

curl -s http://localhost:11434/api/generate \
  -d @<(jq -n \
        --arg m "mistral-nemo:12b" \
        --arg sys "$(cat system.txt)" \
        --arg p   "$(cat prompt.txt)" \
        '{model:$m, system:$sys, prompt:$p}')

Tip: jq -n --arg p "$(cat prompt.txt)" '{prompt:$p}' safely JSON-escapes your multiline prompt.

4) Simple timing for TPS comparisons

/usr/bin/time -l ollama run mistral-nemo:12b --verbose < prompt.txt > /dev/null

With --verbose, Ollama prints eval stats at the end (prompt/eval token counts and rates); the eval rate is your tokens/sec. /usr/bin/time -l is the macOS/BSD form; use /usr/bin/time -v on Linux. Run a few times and average if you want stability.
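
A small loop to do the repetition for you (stats print per run; eyeball the average or paste into a spreadsheet):

for i in 1 2 3; do
  echo "--- run $i ---"
  ollama run mistral-nemo:12b --verbose < prompt.txt > /dev/null   # stats go to stderr, so they still show
done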

That's it. For your big etymology JSON job, method (1) is the cleanest.
