This tutorial shows how to run Large Language Models locally on your laptop using llama.cpp and GGUF models.
It works on:
- macOS
- Linux
- Windows
No GPU is required. Models run on CPU (and Apple Metal on Mac automatically).
Steps:
- Install llama.cpp
- Download a GGUF model
- Run the model locally
- Chat with the model from the terminal
- Start an LLM server and access it in the browser
For smooth performance on most laptops:
Model:
Qwen2.5-7B-Instruct (Q4_K_M)
Advantages:
- ~4–5GB size
- Good reasoning
- Fast inference
- Works well on CPU
Install llama.cpp by following the official install guide:
https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md
Create a directory for GGUF models.
macOS / Linux:
mkdir ~/llm-models
cd ~/llm-models
Windows (PowerShell):
mkdir C:\llm-models
cd C:\llm-models
Install Python first.
macOS:
brew install python
Linux:
sudo apt install python3-pip
Windows:
Download and install Python from:
https://python.org
Then install HuggingFace Hub:
pip install huggingface_hub
Verify CLI:
hf --help
Download the Qwen model:
hf download bartowski/Qwen2.5-7B-Instruct-GGUF \
--include "Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
--local-dir .
Verify download:
ls -lh
Expected file:
Qwen2.5-7B-Instruct-Q4_K_M.gguf
File size is approximately 4–5GB.
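The same check can be scripted. The sketch below (a hypothetical helper, not part of llama.cpp) verifies the model file exists and reports its size; the path assumes the macOS/Linux directory created earlier.

```python
from pathlib import Path

def check_model(path_str):
    """Return the model file's size in GB, or None if it is missing."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        return None
    return path.stat().st_size / 1e9

size_gb = check_model("~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf")
if size_gb is None:
    print("Model file not found - re-run the download step.")
else:
    print(f"Model found: {size_gb:.1f} GB")
```

A healthy download of the Q4_K_M quant should report roughly 4-5 GB.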
Run the model:
macOS:
llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Linux:
./llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Windows:
llama-cli.exe -m C:\llm-models\Qwen2.5-7B-Instruct-Q4_K_M.gguf
Once the model loads you will see:
>
You can now type prompts:
Example:
Explain Kubernetes in simple terms
Example response:
Kubernetes is a container orchestration system that helps manage
containerized applications across multiple machines.
Press Ctrl + C to exit.
You can speed up inference by using more CPU threads.
Check CPU cores.
macOS:
sysctl -n hw.ncpu
Linux:
nproc
Windows (PowerShell):
echo $env:NUMBER_OF_PROCESSORS
Run with threads:
macOS:
llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8
Linux:
./llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8
Windows:
llama-cli.exe -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8
Adjust the number based on your CPU cores.
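Rather than hard-coding the thread count, you can derive it from the machine. This is a small sketch (the `build_llama_command` helper is illustrative, not part of llama.cpp) that defaults `-t` to the number of CPU cores reported by Python:

```python
import os
import shlex

# Model filename from the download step above.
MODEL = "Qwen2.5-7B-Instruct-Q4_K_M.gguf"

def build_llama_command(model=MODEL, threads=None):
    """Build a llama-cli invocation, defaulting -t to the CPU core count."""
    if threads is None:
        # os.cpu_count() can return None on some platforms; fall back to 4.
        threads = os.cpu_count() or 4
    return ["llama-cli", "-m", model, "-t", str(threads)]

cmd = build_llama_command()
print(shlex.join(cmd))
```

On laptops with efficiency cores, using slightly fewer threads than the reported core count is sometimes faster; benchmark a few values.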
You can also run the model as a local web server.
macOS:
llama-server -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Linux:
./llama-server -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Windows:
llama-server.exe -m C:\llm-models\Qwen2.5-7B-Instruct-Q4_K_M.gguf
After starting the server you will see something like:
server listening on http://127.0.0.1:8080
Open your browser and visit:
http://localhost:8080
You will see the llama.cpp web chat interface.
Now you can interact with the LLM directly in the browser.
The server also exposes an API.
Example request:
curl http://localhost:8080/completion \
-d '{
"prompt": "Explain Docker",
"n_predict": 200
}'
This makes it easy to integrate with:
- Python scripts
- AI agents
- DevOps automation
- local tools
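As a starting point for such integrations, here is a minimal Python sketch that sends the same `/completion` request as the curl example above, using only the standard library. It assumes llama-server is running locally on the default port; the helper names are illustrative.

```python
import json
from urllib import request

# Assumes llama-server is running locally on the default port.
SERVER_URL = "http://localhost:8080/completion"

def build_payload(prompt, n_predict=200):
    """JSON body for llama.cpp's /completion endpoint."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def complete(prompt, n_predict=200):
    """Send a completion request and return the generated text."""
    req = request.Request(
        SERVER_URL,
        data=build_payload(prompt, n_predict),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Example (requires the server to be running):
# print(complete("Explain Docker"))
```

From here, wrapping `complete()` in a script or agent loop is straightforward, since the server speaks plain HTTP + JSON.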
With llama.cpp + GGUF models, you can run powerful LLMs:
- locally
- privately
- without GPUs
- without cloud APIs
All from your laptop.