
@ckandoth
Created March 9, 2026 15:58
Use llama.cpp to run LLMs locally on Windows

This guide details how to set up a llama.cpp HTTP server with GPU acceleration on a fresh install of Windows 11 (25H2). With the smartest frontier AI models available exclusively online, and the hardware needed to run them locally out of reach for most of us, there are few reasons to run local LLMs. But I have found that tinkering with the runtime config of local LLMs is the best way to learn how these models work. They also serve as a tool that a smarter AI agent can operate, reducing token usage of more expensive models. Finally, it puts my GeForce RTX 5090 GPU to work when it's not running Rocket League.

Press Windows Key + R, type cmd, and press Enter to open a black window running Windows CMD, a command-line interface (CLI) that has existed in Windows since 1987. It will probably never die. Copy and paste the following command into the CLI and press Enter to install the tools we'll need. If this is your first time using WinGet, you'll need to type Y (yes) to agree to the source terms and conditions, which ask for your consent to use the Microsoft Store and community repository sources.

winget install Microsoft.PowerShell ggml.llamacpp

Now that we have the latest version of PowerShell installed, we will use that as our CLI moving forward. Type exit to get out of Windows CMD. Press Windows Key + R, type pwsh, and press Enter to open PowerShell.

Copy-paste these commands to replace the llama.cpp binaries installed by WinGet with the latest CUDA build and runtime from GitHub.

# Locate the directory where WinGet installed the llama.cpp binaries
$llamaServerPath = (Get-Command llama-server -ErrorAction Stop).Source
$llamaInstallDir = Split-Path $llamaServerPath
# Look up the tag of the latest llama.cpp release on GitHub
$latestTag = (Invoke-RestMethod https://api.github.com/repos/ggml-org/llama.cpp/releases/latest).tag_name
# Download the CUDA build and the CUDA runtime, then extract both over the install directory
mkdir downloads
Invoke-WebRequest -Uri https://github.com/ggml-org/llama.cpp/releases/download/$latestTag/llama-$latestTag-bin-win-cuda-13.1-x64.zip -OutFile downloads/llama-cuda.zip
Invoke-WebRequest -Uri https://github.com/ggml-org/llama.cpp/releases/download/$latestTag/cudart-llama-bin-win-cuda-13.1-x64.zip -OutFile downloads/cudart.zip
tar -xf downloads/llama-cuda.zip -C $llamaInstallDir
tar -xf downloads/cudart.zip -C $llamaInstallDir

The command below will download a model (gpt-oss-20b from OpenAI) and serve it at http://localhost:8080. Copy-paste that URL into a browser and test some queries. You can even upload files and ask the model to analyze them.

llama-server --hf-repo unsloth/gpt-oss-20b-GGUF:F16 --jinja --chat-template-kwargs '{"reasoning_effort": "high"}' -dio -cb -np 4 -fa on -fit on -c 65536 -ngl -1 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0
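Besides the browser UI, llama-server exposes an OpenAI-compatible API, so a smarter AI agent (or your own script) can operate it programmatically. Here's a minimal Python sketch using only the standard library; the prompt and helper names are my own, and the `ask` call at the bottom is left as a comment since it only works while the server is running:

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # default llama-server address used above

def build_chat_request(prompt: str, temperature: float = 1.0) -> dict:
    """Build an OpenAI-style chat completion payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt: str) -> str:
    """POST a prompt to the server's /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The reply text lives in the first choice's message content
    return body["choices"][0]["message"]["content"]

# With the server running, try:
#   print(ask("In one sentence, what is a GGUF file?"))
```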

Now go forth and tinker. A good place to start is https://unsloth.ai/docs/models/tutorials.
