This tutorial shows how to run Large Language Models locally on your laptop using llama.cpp and GGUF models.
It works on:
- macOS
- Linux
- Windows
No GPU is required. Models run on CPU (and Apple Metal on Mac automatically).
Steps:
- Install llama.cpp
- Download a GGUF model
- Run the model locally
- Chat with the model from the terminal
- Start an LLM server and access it in the browser
For smooth performance on most laptops:
Model:
Qwen2.5-7B-Instruct (Q4_K_M)
Advantages:
- ~4–5GB size
- Good reasoning
- Fast inference
- Works well on CPU
Install llama.cpp by following the official install guide:
https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md
Create a directory for GGUF models.
macOS / Linux:
mkdir ~/llm-models
cd ~/llm-models
Windows (PowerShell):
mkdir C:\llm-models
cd C:\llm-models
Install Python first.
macOS:
brew install python
Linux:
sudo apt install python3-pip
Windows:
Download and install Python from:
https://python.org
Then install HuggingFace Hub:
pip install huggingface_hub
Verify CLI:
hf --help
Download the Qwen model:
hf download bartowski/Qwen2.5-7B-Instruct-GGUF \
--include "Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
--local-dir .
Verify download:
ls -lh
Expected file:
Qwen2.5-7B-Instruct-Q4_K_M.gguf
File size is approximately 4–5GB.
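The same check can be scripted. The sketch below (a hypothetical helper, not part of llama.cpp) verifies the model file exists and reports its size; the path assumes the macOS/Linux directory created earlier.

```python
from pathlib import Path

def check_model(path_str):
    """Return the model file's size in GB, or None if it is missing."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        return None
    return path.stat().st_size / 1e9

size_gb = check_model("~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf")
if size_gb is None:
    print("Model file not found - re-run the download step.")
else:
    print(f"Model found: {size_gb:.1f} GB")
```

A healthy download of the Q4_K_M quant should report roughly 4-5 GB.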
Run the model:
macOS:
llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Linux:
./llama-cli -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Windows:
llama-cli.exe -m C:\llm-models\Qwen2.5-7B-Instruct-Q4_K_M.gguf
Once the model loads you will see:
>
You can now type prompts:
Example:
Explain Kubernetes in simple terms
Example response:
Kubernetes is a container orchestration system that helps manage
containerized applications across multiple machines.
Press Ctrl + C to exit.
You can speed up inference by using more CPU threads.
Check CPU cores.
macOS:
sysctl -n hw.ncpu
Linux:
nproc
Windows (PowerShell):
echo $env:NUMBER_OF_PROCESSORS
Run with threads:
macOS:
llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8
Linux:
./llama-cli -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8
Windows:
llama-cli.exe -m Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 8
Adjust the number based on your CPU cores.
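Rather than hard-coding the thread count, you can derive it from the machine. This is a small sketch (the `build_llama_command` helper is illustrative, not part of llama.cpp) that defaults `-t` to the number of CPU cores reported by Python:

```python
import os
import shlex

# Model filename from the download step above.
MODEL = "Qwen2.5-7B-Instruct-Q4_K_M.gguf"

def build_llama_command(model=MODEL, threads=None):
    """Build a llama-cli invocation, defaulting -t to the CPU core count."""
    if threads is None:
        # os.cpu_count() can return None on some platforms; fall back to 4.
        threads = os.cpu_count() or 4
    return ["llama-cli", "-m", model, "-t", str(threads)]

cmd = build_llama_command()
print(shlex.join(cmd))
```

On laptops with efficiency cores, using slightly fewer threads than the reported core count is sometimes faster; benchmark a few values.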
You can also run the model as a local web server.
macOS:
llama-server -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Linux:
./llama-server -m ~/llm-models/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Windows:
llama-server.exe -m C:\llm-models\Qwen2.5-7B-Instruct-Q4_K_M.gguf
After starting the server you will see something like:
server listening on http://127.0.0.1:8080
Open your browser and visit:
http://localhost:8080
You will see the llama.cpp web chat interface.
Now you can interact with the LLM directly in the browser.
The server also exposes an API.
Example request:
curl http://localhost:8080/completion \
-d '{
"prompt": "Explain Docker",
"n_predict": 200
}'
This makes it easy to integrate with:
- Python scripts
- AI agents
- DevOps automation
- local tools
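As a starting point for such integrations, here is a minimal Python sketch that sends the same `/completion` request as the curl example above, using only the standard library. It assumes llama-server is running locally on the default port; the helper names are illustrative.

```python
import json
from urllib import request

# Assumes llama-server is running locally on the default port.
SERVER_URL = "http://localhost:8080/completion"

def build_payload(prompt, n_predict=200):
    """JSON body for llama.cpp's /completion endpoint."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()

def complete(prompt, n_predict=200):
    """Send a completion request and return the generated text."""
    req = request.Request(
        SERVER_URL,
        data=build_payload(prompt, n_predict),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Example (requires the server to be running):
# print(complete("Explain Docker"))
```

From here, wrapping `complete()` in a script or agent loop is straightforward, since the server speaks plain HTTP + JSON.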
With llama.cpp + GGUF models, you can run powerful LLMs:
- locally
- privately
- without GPUs
- without cloud APIs
All from your laptop.