@iam-veeramalla
Created March 8, 2026 19:08

Run LLMs Locally Using llama.cpp

This tutorial shows how to run Large Language Models locally on your laptop using llama.cpp and GGUF models.

It works on:

  • macOS
  • Linux
  • Windows

No GPU is required. Models run on the CPU, and on Macs llama.cpp automatically uses Apple Metal for acceleration.


What You Will Learn

  1. Install llama.cpp
  2. Download a GGUF model
  3. Run the model locally
  4. Chat with the model from the terminal
  5. Start an LLM server and access it in the browser

Recommended Model for Laptops

For smooth performance on most laptops:

Model:

Qwen2.5-7B-Instruct (Q4_K_M)

Advantages:

  • ~4–5GB size
  • Good reasoning
  • Fast inference
  • Works well on CPU

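The ~4–5GB figure follows from the quantization level. A rough back-of-envelope sketch (approximate numbers: Qwen2.5-7B has about 7.6B parameters, and Q4_K_M averages roughly 4.8 bits per weight):

```python
# Back-of-envelope size estimate for a 4-bit-quantized 7B model.
# Both numbers below are approximations, not exact figures.
params = 7.6e9          # ~7.6B parameters in Qwen2.5-7B
bits_per_weight = 4.8   # rough average for Q4_K_M quantization
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB")  # prints "~4.6 GB"
```

This is why a 7B model that needs ~15GB in 16-bit form fits comfortably in laptop RAM once quantized.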
1. Install llama.cpp

Follow the official installation guide for your platform:

https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md


2. Create a Folder for Models

Create a directory for GGUF models.

macOS / Linux:

mkdir ~/llm-models
cd ~/llm-models

Windows (PowerShell):

mkdir C:\llm-models
cd C:\llm-models

3. Install HuggingFace CLI

Install Python first.

macOS

brew install python

Linux (Debian/Ubuntu)

sudo apt install python3-pip

Windows

Install Python from:

https://python.org

Then install HuggingFace Hub:

pip install huggingface_hub

Verify CLI:

hf --help

4. Download a GGUF Model

Download the Qwen model:

hf download bartowski/Qwen2.5-7B-Instruct-GGUF \
--include "Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
--local-dir .

Verify download:

ls -lh

Expected file:

Qwen2.5-7B-Instruct-Q4_K_M.gguf

File size is approximately 4–5GB.
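If you want a stronger check than file size, every GGUF file starts with the 4-byte magic `GGUF`. A small sketch that verifies this:

```python
# Sanity-check that a downloaded file is really a GGUF file:
# the GGUF format begins with the 4-byte magic b"GGUF".
def looks_like_gguf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage:
#   looks_like_gguf("Qwen2.5-7B-Instruct-Q4_K_M.gguf")
```

A truncated or interrupted download will usually fail either this check or the size check above.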


5. Run the Model

macOS

llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf

Linux

./llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf

Windows (WSL)

./llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf

Windows Native

llama-cli.exe -m C:\llm-models\Qwen2.5-7B-Instruct-Q4_K_M.gguf

6. Chat with the Model

Once the model loads you will see:

>

You can now type prompts:

Example:

Explain Kubernetes in simple terms

Example response:

Kubernetes is a container orchestration system that helps manage
containerized applications across multiple machines.

Press Ctrl + C to exit.


7. Improve Performance

You can speed up inference by using more CPU threads.

Check CPU cores.

macOS:

sysctl -n hw.ncpu

Linux:

nproc

Windows (PowerShell):

echo $env:NUMBER_OF_PROCESSORS

Run with threads:

macOS:

llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8

Linux:

./llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8

Windows:

llama-cli.exe -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8

Set -t to roughly the number of physical cores; using more threads than you have cores usually slows inference down.
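A portable way to pick the thread count is to query the core count programmatically. A sketch (leaving one core free is a heuristic to keep the machine responsive, not a hard rule):

```python
import os

# Pick a thread count for llama-cli's -t flag: use all cores but one,
# so the rest of the system stays responsive while the model runs.
cores = os.cpu_count() or 1
threads = max(1, cores - 1)
print(f"llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t {threads}")
```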


8. Start the LLM Server (Browser Access)

You can also run the model as a local web server.

macOS

llama-server -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf

Linux

./llama-server -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf

Windows

llama-server.exe -m C:\llm-models\Qwen2.5-7B-Instruct-Q4_K_M.gguf

9. Open the Web Interface

After starting the server you will see something like:

server listening on http://127.0.0.1:8080

Open your browser and visit:

http://localhost:8080

You will see the llama.cpp web chat interface.

Now you can interact with the LLM directly in the browser.


10. API Access

The server also exposes an API.

Example request:

curl http://localhost:8080/completion \
-d '{
  "prompt": "Explain Docker",
  "n_predict": 200
}'

This makes it easy to integrate with:

  • Python scripts
  • AI agents
  • DevOps automation
  • local tools
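For example, the curl request above can be issued from Python with only the standard library. A minimal sketch (the URL, field names, and the "content" key in the response mirror the llama-server /completion endpoint shown above):

```python
import json
import urllib.request

# Minimal client for llama-server's /completion endpoint,
# sending the same "prompt" / "n_predict" fields as the curl example.
def complete(prompt, n_predict=200, url="http://localhost:8080/completion"):
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With llama-server running:
#   print(complete("Explain Docker")["content"])
```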

Summary

With llama.cpp + GGUF models, you can run powerful LLMs:

  • locally
  • privately
  • without GPUs
  • without cloud APIs

All from your laptop.
