Ollama: local LLMs with a Docker-like CLI. This doc shows how to install it, run models, manage RAM/disk usage, and clean up.
# install (mac)
brew install --cask ollama-app # or: brew install ollama
# pull + chat
ollama pull mistral-nemo:latest
ollama run mistral-nemo "Explain diffusion models in 1 paragraph."
# list/running/stop
ollama list
ollama ps
ollama stop mistral-nemo # once per model; ollama ps shows what's loaded
# delete from disk
ollama rm mistral-nemo # removes model files from ~/.ollama/models
# disk location + keep-alive
echo $OLLAMA_MODELS # custom model dir (optional)
ollama run mistral-nemo --keepalive 60s "Hello"
macOS (Homebrew):
brew install --cask ollama-app # Mac app + CLI (starts background server)
# or, CLI formula only:
brew install ollama
Notes:
- Uses Apple Silicon's Metal by default—no flags needed.
- Models live under ~/.ollama/models unless you set OLLAMA_MODELS.
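A quick sanity check after install (assumes the menu-bar app's background server is already running):
ollama --version                          # CLI is on PATH
curl -s http://localhost:11434/api/tags   # server answers; lists installed models as JSON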
Linux:
One-liner (official script):
curl -fsSL https://ollama.com/install.sh | sh
A systemd service gets created; verify:
sudo systemctl enable --now ollama
sudo systemctl status ollama
NVIDIA (Linux): If CUDA drivers are present, Ollama uses them automatically. Keep your NVIDIA driver + CUDA runtime up to date.
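To confirm the GPU was actually picked up (assumes the script-installed system service; the exact log wording varies by version):
nvidia-smi                              # driver sees the card
journalctl -u ollama -n 100 --no-pager  # recent server logs; look for the GPU detection lines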
Docker (optional, Linux/macOS):
docker run -d --name ollama --restart=unless-stopped \
  -p 11434:11434 \
  -v ~/.ollama:/root/.ollama \
  --gpus=all \
  ollama/ollama
# (--gpus=all needs the NVIDIA container toolkit; drop it on macOS/CPU-only hosts)
Keep the volume mount so model files persist.
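With the container running, drive the CLI through docker exec, e.g.:
docker exec -it ollama ollama pull mistral-nemo
docker exec -it ollama ollama run mistral-nemo "Hello"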
Windows:
- Install the official Ollama for Windows installer (includes a background service and ollama.exe).
- WSL2 users can also install via the Linux script inside WSL.
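A quick check that the Windows service is reachable (run from a shell on the same machine; in PowerShell use curl.exe):
curl -s http://localhost:11434/api/version   # returns the server version as JSON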
Basic usage:
# pull
ollama pull mistral-nemo:latest
# interactive chat (REPL)
ollama run mistral-nemo
# one-shot prompt
ollama run mistral-nemo "Summarize k-means vs k-medoids."
# list downloaded models
ollama list
# show running/loaded models (in RAM)
ollama ps
# stop a running model (free its RAM)
ollama stop mistral-nemo # once per model; ollama ps shows what's loaded
# delete from disk
ollama rm mistral-nemo
Where models live:
- Default: ~/.ollama/models/
- Change it (e.g., to an external SSD):
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
ollama serve & # or restart the app/service
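To make the custom path stick for new shells (a sketch assuming zsh, the macOS default; it only affects a server you start yourself with ollama serve, not the menu-bar app):
echo 'export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"' >> ~/.zshrc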
RAM behavior:
- Idle server uses little RAM. Big RAM usage happens only while a model is loaded.
- Apple Silicon uses unified memory; expect RAM to rise during prompts and with large contexts.
Key knobs (per run or via API):
# keep the model warm or unload quickly
ollama run mistral-nemo --keepalive 30s "hello" # unload 30s after last use
# 0 = unload immediately; -1 = keep forever
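# unload right now over the API instead: per the API docs, a request with no prompt
# and "keep_alive": 0 just evicts the model without generating anything
curl http://localhost:11434/api/generate -d '{"model": "mistral-nemo", "keep_alive": 0}'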
# reduce context if you don't need huge windows (saves KV cache memory), and cap
# CPU threads when the CPU does part of the work. num_ctx / num_thread are not
# run flags; set them in the REPL or via the API options:
#   REPL: /set parameter num_ctx 4096
#   API:  "options": { "num_ctx": 4096, "num_thread": 10 }
Quantization: Smaller quant tags → smaller memory, faster loads, slightly less quality.
- Examples: :q4_K_M, :q5_K_M
- Pull explicitly:
ollama pull mistral-nemo:q4_K_M
ollama run mistral-nemo:q4_K_M
See what's installed (with sizes):
ollama list
Remove models you don't use:
ollama rm mistral-nemo
ollama rm gemma3:12b
Move the model store to a bigger disk:
# 1) Stop the server (quit the menu-bar app on macOS; sudo systemctl stop ollama on Linux)
# 2) Move files
mv ~/.ollama /Volumes/ExtSSD/ollama
# 3) Point OLLAMA_MODELS there (shell or service env)
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama/models"
# 4) Restart server
ollama serve &
On Linux with systemd, set Environment=OLLAMA_MODELS=/mnt/ssd/ollama/models (see the service section below), then sudo systemctl daemon-reload && sudo systemctl restart ollama.
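An alternative some setups prefer is a symlink instead of the env var (a sketch; assumes the server/app is fully stopped while you move the files):
mv ~/.ollama/models /Volumes/ExtSSD/ollama-models
ln -s /Volumes/ExtSSD/ollama-models ~/.ollama/models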
Inspect model folder size:
du -sh ~/.ollama/models
The local server listens on http://localhost:11434.
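There's also a ps endpoint that mirrors ollama ps, handy for scripting (per the API docs it reports loaded models and their memory use):
curl -s http://localhost:11434/api/ps | jq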
Generate (one pass):
curl http://localhost:11434/api/generate -d '{
"model": "mistral-nemo",
"prompt": "Give me three bullet points on SIMD.",
"options": { "num_ctx": 4096, "num_thread": 10, "temperature": 0.2 },
"keep_alive": "60s"
}'
Chat (multi-turn; you send the full message history each call):
curl http://localhost:11434/api/chat -d '{
"model": "mistral-nemo",
"messages": [
{"role":"user","content":"You are a senior dev. Briefly explain CAP theorem."}
]
}'
Show model info / defaults:
curl -s http://localhost:11434/api/show -d '{"model":"mistral-nemo"}' | jq
Systemd service (Linux): created by the install script. To customize its environment (e.g., model path, keep-alive default):
sudo systemctl edit ollama
Paste:
[Service]
Environment=OLLAMA_MODELS=/mnt/ssd/ollama/models
Environment=OLLAMA_KEEP_ALIVE=30s
Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama
Good general 12B options for an M-series MBP (32 GB):
ollama pull mistral-nemo:latest # strong general model
ollama pull gemma3:12b # compact, multilingual
# coding-focused (larger, if you want):
ollama pull qwen2.5-coder:14b
Pick a q4 quant for fast chat + smaller RAM; bump up if you need quality and can spare memory.
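To check what a tag actually resolved to (parameter count, quantization, context length), ollama show prints the model's metadata:
ollama show mistral-nemo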
"Why is RAM still high after I close the REPL?" Model is kept warm. Use a shorter keep-alive or stop it:
ollama ps
ollama stop <name> # once per model listed by ollama ps
"I need more disk space."
ollama list → ollama rm <model>. Consider moving OLLAMA_MODELS to a larger drive.
"GPU isn't used on Linux/NVIDIA."
Update NVIDIA driver + CUDA runtime; restart. Container users must pass --gpus=all.
"Server not responding on 11434." Start/enable it:
# macOS: open the Ollama app, or:
ollama serve &
# Linux:
sudo systemctl enable --now ollama
Uninstall / clean up:
macOS (Homebrew):
brew uninstall --cask ollama-app || brew uninstall ollama
rm -rf ~/.ollama
Linux:
sudo systemctl disable --now ollama || true
rm -rf ~/.ollama
# remove installed binary (depends on the installer's path)
which ollama && sudo rm -f "$(which ollama)"
Windows:
- Uninstall via Apps & Features; delete %UserProfile%\.ollama if you want to reclaim space.
Tips:
- Use a q4 quant for 12B models.
- Set a short keep-alive when you're tight on RAM:
export OLLAMA_KEEP_ALIVE=30s
- Keep models on a fast external SSD:
export OLLAMA_MODELS="/Volumes/ExtSSD/ollama-models"
- Don't run multiple big models concurrently unless you intend to (a way to enforce this is sketched below).
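To enforce that last point rather than rely on habit, the server honors OLLAMA_MAX_LOADED_MODELS; set it wherever you set the other env vars (shell, or the systemd override above). A shell sketch:
export OLLAMA_MAX_LOADED_MODELS=1   # at most one model resident; requesting another swaps it in
ollama serve &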
(1) Redirect a prompt file into the model:
# create the file
cat > prompt.txt <<'EOF'
[ paste your full etymology/cognate prompt here ]
EOF
# run it
ollama run mistral-nemo:12b < prompt.txt
# or
cat prompt.txt | ollama run mistral-nemo:12b
(2) Inline one-shot prompt:
ollama run mistral-nemo:12b "Explain event loop vs threads in Node.js"
(3) Via the API (jq builds the JSON body):
# (A) just prompt:
curl -s http://localhost:11434/api/generate \
-d @<(jq -n --arg m "mistral-nemo:12b" --arg p "$(cat prompt.txt)" \
'{model:$m, prompt:$p}')
# (B) with a system prompt
cat > system.txt <<'SYS'
You are a precise assistant that outputs valid JSON when asked and follows explicit formatting rules.
SYS
curl -s http://localhost:11434/api/generate \
-d @<(jq -n \
--arg m "mistral-nemo:12b" \
--arg sys "$(cat system.txt)" \
--arg p "$(cat prompt.txt)" \
'{model:$m, system:$sys, prompt:$p}')
Tip: jq -n --arg p "$(cat prompt.txt)" '{prompt:$p}' safely JSON-escapes your multiline prompt.
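Since the system prompt asks for JSON, you can also constrain the response with the API's format field and disable streaming to get one complete body (the prompt should still explicitly ask for JSON):
curl -s http://localhost:11434/api/generate \
  -d @<(jq -n \
    --arg m "mistral-nemo:12b" \
    --arg sys "$(cat system.txt)" \
    --arg p "$(cat prompt.txt)" \
    '{model:$m, system:$sys, prompt:$p, format:"json", stream:false}')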
Rough benchmark:
/usr/bin/time -l ollama run mistral-nemo:12b --verbose < prompt.txt > /dev/null
With --verbose, Ollama prints eval stats at the end; that's your tokens/sec. Run a few times and average if you want stability.
That's it. For your big etymology JSON job, method (1) is the cleanest.