This guide details how to set up a llama.cpp HTTP server with GPU acceleration on a fresh install of Windows 11 (25H2). With the smartest frontier AI models available only online, and the hardware needed to run them locally out of reach for most of us, there are few practical reasons to run local LLMs. But I have found that tinkering with the runtime configuration of local models is the best way to learn how they work. A local model also serves as a tool that a smarter AI agent can operate, reducing token usage on more expensive models. And finally, it puts my GeForce RTX 5090 GPU to work when it isn't running Rocket League.
Press Windows Key + R, type cmd, and press Enter to open a black window running Windows CMD, a command-line interface (CLI) whose lineage dates back to 1987. It will probably never die. Copy and paste the following command into the CLI and press Enter to install the tools we'll need. If this is your first time using WinGet, you'll need to type Y (yes) to agree to the source terms and conditions, which ask for your consent to use the Microsoft Store and community repository sources.
winget install Microsoft.PowerShell ggml.llamacpp
Now that we have the latest version of PowerShell installed, we will use that as our CLI moving forward. Type exit to get out of
Windows CMD. Press Windows Key + R, type pwsh, and press Enter to open PowerShell.
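Before going further, it's worth confirming that the window you just opened is the WinGet-installed PowerShell 7, not the older Windows PowerShell 5.1 that ships with Windows. A quick check (the 7.x version number is what matters; your exact patch version will differ):

```powershell
# Print the running PowerShell version; a fresh WinGet install should report 7.x
$PSVersionTable.PSVersion

# Confirm llama-server from the WinGet install is on PATH
Get-Command llama-server
```

If `Get-Command llama-server` errors out, close and reopen the terminal so the PATH changes from the WinGet install are picked up.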
Copy-paste these commands to replace the llama.cpp binaries installed by WinGet with the latest CUDA binaries and runtime from GitHub.
$llamaServerPath = (Get-Command llama-server -ErrorAction Stop).Source
$llamaInstallDir = Split-Path $llamaServerPath
$latestTag = (Invoke-RestMethod https://api.github.com/repos/ggml-org/llama.cpp/releases/latest).tag_name
mkdir downloads
Invoke-WebRequest -Uri https://github.com/ggml-org/llama.cpp/releases/download/$latestTag/llama-$latestTag-bin-win-cuda-13.1-x64.zip -OutFile downloads/llama-cuda.zip
Invoke-WebRequest -Uri https://github.com/ggml-org/llama.cpp/releases/download/$latestTag/cudart-llama-bin-win-cuda-13.1-x64.zip -OutFile downloads/cudart.zip
tar -xf downloads/llama-cuda.zip -C $llamaInstallDir
tar -xf downloads/cudart.zip -C $llamaInstallDir
The command below will download a model (gpt-oss-20b from OpenAI) and serve it at http://localhost:8080. You can copy-paste that URL into a browser and test some queries. You can even upload files and ask the model to analyze them.
llama-server --hf-repo unsloth/gpt-oss-20b-GGUF:F16 --jinja --chat-template-kwargs '{"reasoning_effort": "high"}' -dio -cb -np 4 -fa on -fit on -c 65536 -ngl -1 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0
Now go forth and tinker. A good place to start is https://unsloth.ai/docs/models/tutorials.
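Beyond the browser UI, llama-server also exposes an OpenAI-compatible API, which is what lets other agents and tools drive it. A minimal sketch of querying it from a second PowerShell window, assuming the server above is still running on port 8080 (the model name in the request body is illustrative; a single-model server will use whatever it has loaded):

```powershell
# Build an OpenAI-style chat completion request as JSON
$body = @{
    model    = "gpt-oss-20b"   # illustrative; the server serves its loaded model regardless
    messages = @(
        @{ role = "user"; content = "Explain the KV cache in one sentence." }
    )
} | ConvertTo-Json -Depth 5

# POST it to the OpenAI-compatible chat completions endpoint
$response = Invoke-RestMethod -Uri http://localhost:8080/v1/chat/completions `
    -Method Post -ContentType 'application/json' -Body $body

# Print the assistant's reply
$response.choices[0].message.content
```

The same request shape works from any OpenAI-compatible client library by pointing its base URL at http://localhost:8080/v1.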