Ollama: local LLMs with a Docker-like CLI. This doc shows how to install it, run models, manage RAM/disk usage, and clean up.
# install (mac)
brew install --cask ollama-app # or: brew install ollama
# pull + chat
ollama pull mistral-nemo:latest
ollama run mistral-nemo "Explain diffusion models in 1 paragraph."
# list/running/stop
ollama list
ollama ps
ollama stop mistral-nemo # once per model; ollama ps shows what's loaded
# delete from disk
ollama rm mistral-nemo # removes model files from ~/.ollama/models
# disk location + keep-alive
echo $OLLAMA_MODELS # custom model dir (optional)
ollama run mistral-nemo --keepalive 60s "Hello"
macOS (Homebrew):
brew install --cask ollama-app # Mac app + CLI (starts background server)
# or, CLI formula only:
brew install ollama
Notes:
- Uses Apple Silicon's Metal by default—no flags needed.
- Models live under ~/.ollama/models unless you set OLLAMA_MODELS.
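A quick sanity check after install (assumes the menu-bar app's background server is already running):
ollama --version                          # CLI is on PATH
curl -s http://localhost:11434/api/tags   # server answers; lists installed models as JSON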
Linux:
One-liner (official script):
curl -fsSL https://ollama.com/install.sh | sh
A systemd service gets created; verify:
sudo systemctl enable --now ollama
sudo systemctl status ollama
NVIDIA (Linux): If CUDA drivers are present, Ollama uses them automatically. Keep your NVIDIA driver + CUDA runtime up to date.
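To confirm the GPU was actually picked up (assumes the script-installed system service; the exact log wording varies by version):
nvidia-smi                              # driver sees the card
journalctl -u ollama -n 100 --no-pager  # recent server logs; look for the GPU detection lines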
Docker (optional, Linux/macOS):
docker run -d --name ollama --restart=unless-stopped \
  -p 11434:11434 \
  -v ~/.ollama:/root/.ollama \
  --gpus=all \
  ollama/ollama
# (--gpus=all needs the NVIDIA container toolkit; drop it on macOS/CPU-only hosts)
Keep the volume mount so model files persist.
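With the container running, drive the CLI through docker exec, e.g.:
docker exec -it ollama ollama pull mistral-nemo
docker exec -it ollama ollama run mistral-nemo "Hello"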
Windows:
- Install the official Ollama for Windows installer (includes a background service and ollama.exe).
- WSL2 users can also install via the Linux script inside WSL.
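A quick check that the Windows service is reachable (run from a shell on the same machine; in PowerShell use curl.exe):
curl -s http://localhost:11434/api/version   # returns the server version as JSON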
Basic usage:
# pull
ollama pull mistral-nemo:latest
# interactive chat (REPL)
ollama run mistral-nemo
# one-shot prompt
ollama run mistral-nemo "Summarize k-means vs k-medoids."
# list downloaded models
ollama list
# show running/loaded models (in RAM)
ollama ps
# stop a running model (free its RAM)
ollama stop mistral-nemo # once per model; ollama ps shows what's loaded
# delete from disk
ollama rm mistral-nemo
Where models live:
- Default: ~/.ollama/models/
- Change it (e.g., to an external SSD):
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
ollama serve & # or restart the app/service
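To make the custom path stick for new shells (a sketch assuming zsh, the macOS default; it only affects a server you start yourself with ollama serve, not the menu-bar app):
echo 'export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"' >> ~/.zshrc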
RAM behavior:
- Idle server uses little RAM. Big RAM usage happens only while a model is loaded.
- Apple Silicon uses unified memory; expect RAM to rise during prompts and with large contexts.
Key knobs (per run or via API):
# keep the model warm or unload quickly
ollama run mistral-nemo --keepalive 30s "hello" # unload 30s after last use
# 0 = unload immediately; -1 = keep forever
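# unload right now over the API instead: per the API docs, a request with no prompt
# and "keep_alive": 0 just evicts the model without generating anything
curl http://localhost:11434/api/generate -d '{"model": "mistral-nemo", "keep_alive": 0}'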
# reduce context if you don't need huge windows (saves KV cache memory), and cap
# CPU threads when the CPU does part of the work. num_ctx / num_thread are not
# run flags; set them in the REPL or via the API options:
#   REPL: /set parameter num_ctx 4096
#   API:  "options": { "num_ctx": 4096, "num_thread": 10 }
Quantization: Smaller quant tags → smaller memory, faster loads, slightly less quality.
- Examples: :q4_K_M, :q5_K_M
- Pull explicitly:
ollama pull mistral-nemo:q4_K_M
ollama run mistral-nemo:q4_K_M
See what's installed (with sizes):
ollama list
Remove models you don't use:
ollama rm mistral-nemo
ollama rm gemma3:12b
Move the model store to a bigger disk:
# 1) Stop the server (quit the menu-bar app on macOS; sudo systemctl stop ollama on Linux)
# 2) Move files
mv ~/.ollama /Volumes/ExtSSD/ollama
# 3) Point OLLAMA_MODELS there (shell or service env)
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama/models"
# 4) Restart server
ollama serve &
On Linux with systemd, set Environment=OLLAMA_MODELS=/mnt/ssd/ollama/models (see the service section below), then sudo systemctl daemon-reload && sudo systemctl restart ollama.
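An alternative some setups prefer is a symlink instead of the env var (a sketch; assumes the server/app is fully stopped while you move the files):
mv ~/.ollama/models /Volumes/ExtSSD/ollama-models
ln -s /Volumes/ExtSSD/ollama-models ~/.ollama/models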
Inspect model folder size:
du -sh ~/.ollama/models
The local server listens on http://localhost:11434.
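There's also a ps endpoint that mirrors ollama ps, handy for scripting (per the API docs it reports loaded models and their memory use):
curl -s http://localhost:11434/api/ps | jq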
Generate (one pass):
curl http://localhost:11434/api/generate -d '{
"model": "mistral-nemo",
"prompt": "Give me three bullet points on SIMD.",
"options": { "num_ctx": 4096, "num_thread": 10, "temperature": 0.2 },
"keep_alive": "60s"
}'
Chat (multi-turn; you send the full message history each call):
curl http://localhost:11434/api/chat -d '{
"model": "mistral-nemo",
"messages": [
{"role":"user","content":"You are a senior dev. Briefly explain CAP theorem."}
]
}'
Show model info / defaults:
curl -s http://localhost:11434/api/show -d '{"model":"mistral-nemo"}' | jq
Systemd service (Linux): created by the install script. To customize its environment (e.g., model path, keep-alive default):
sudo systemctl edit ollama
Paste:
[Service]
Environment=OLLAMA_MODELS=/mnt/ssd/ollama/models
Environment=OLLAMA_KEEP_ALIVE=30s
Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama
Good general 12B options for an M-series MBP (32 GB):
ollama pull mistral-nemo:latest # strong general model
ollama pull gemma3:12b # compact, multilingual
# coding-focused (larger, if you want):
ollama pull qwen2.5-coder:14b
Pick a q4 quant for fast chat + smaller RAM; bump up if you need quality and can spare memory.
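To check what a tag actually resolved to (parameter count, quantization, context length), ollama show prints the model's metadata:
ollama show mistral-nemo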
"Why is RAM still high after I close the REPL?" Model is kept warm. Use a shorter keep-alive or stop it:
ollama ps
ollama stop <name> # once per model listed by ollama ps
"I need more disk space."
ollama list → ollama rm <model>. Consider moving OLLAMA_MODELS to a larger drive.
"GPU isn't used on Linux/NVIDIA."
Update NVIDIA driver + CUDA runtime; restart. Container users must pass --gpus=all.
"Server not responding on 11434." Start/enable it:
# macOS: open the Ollama app, or:
ollama serve &
# Linux:
sudo systemctl enable --now ollama
Uninstall / clean up:
macOS (Homebrew):
brew uninstall --cask ollama-app || brew uninstall ollama
rm -rf ~/.ollama
Linux:
sudo systemctl disable --now ollama || true
rm -rf ~/.ollama
# remove installed binary (depends on the installer's path)
which ollama && sudo rm -f "$(which ollama)"
Windows:
- Uninstall via Apps & Features; delete %UserProfile%\.ollama if you want to reclaim space.
Tips:
- Use a q4 quant for 12B models.
- Set a short keep-alive when you're tight on RAM:
export OLLAMA_KEEP_ALIVE=30s
- Keep models on a fast external SSD:
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
- Don't run multiple big models concurrently unless you intend to (a way to enforce this is sketched below).
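To enforce that last point rather than rely on habit, the server honors OLLAMA_MAX_LOADED_MODELS; set it wherever you set the other env vars (shell, or the systemd override above). A shell sketch:
export OLLAMA_MAX_LOADED_MODELS=1   # at most one model resident; requesting another swaps it in
ollama serve &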
(1) Redirect a prompt file into the model:
# create the file
cat > prompt.txt <<'EOF'
[ paste your full etymology/cognate prompt here ]
EOF
# run it
ollama run mistral-nemo:12b < prompt.txt
# or
cat prompt.txt | ollama run mistral-nemo:12b
(2) Inline one-shot prompt:
ollama run mistral-nemo:12b "Explain event loop vs threads in Node.js"
(3) Via the API (jq builds the JSON body):
# (A) just prompt:
curl -s http://localhost:11434/api/generate \
-d @<(jq -n --arg m "mistral-nemo:12b" --arg p "$(cat prompt.txt)" \
'{model:$m, prompt:$p}')
# (B) with a system prompt
cat > system.txt <<'SYS'
You are a precise assistant that outputs valid JSON when asked and follows explicit formatting rules.
SYS
curl -s http://localhost:11434/api/generate \
-d @<(jq -n \
--arg m "mistral-nemo:12b" \
--arg sys "$(cat system.txt)" \
--arg p "$(cat prompt.txt)" \
'{model:$m, system:$sys, prompt:$p}')
Tip: jq -n --arg p "$(cat prompt.txt)" '{prompt:$p}' safely JSON-escapes your multiline prompt.
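Since the system prompt asks for JSON, you can also constrain the response with the API's format field and disable streaming to get one complete body (the prompt should still explicitly ask for JSON):
curl -s http://localhost:11434/api/generate \
  -d @<(jq -n \
    --arg m "mistral-nemo:12b" \
    --arg sys "$(cat system.txt)" \
    --arg p "$(cat prompt.txt)" \
    '{model:$m, system:$sys, prompt:$p, format:"json", stream:false}')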
Rough benchmark:
/usr/bin/time -l ollama run mistral-nemo:12b --verbose < prompt.txt > /dev/null
With --verbose, Ollama prints eval stats at the end; that's your tokens/sec. Run a few times and average if you want stability.
That's it. For your big etymology JSON job, method (1) is the cleanest.