Here is how I run code completion, a chat assistant and a coding agent locally!
I am using Linux as my operating system and Neovim as my editor. My coding agent is OpenHands. In Neovim, I rely on lazy.nvim to manage plugins, and for this particular application I use llama.vim for code completion and codecompanion.nvim for chat. On a different PC, I run llama.cpp to download, manage and run the LLMs. I have 32GB of VRAM, so if you have less, you will probably need to adjust things like the models you employ and the context sizes. Note that agentic AI needs large context sizes! Of course, if you prefer, you can do everything on the same PC, or run the models on two different PCs (e.g. code completion locally and chat/agent on a server with more VRAM).
- You should install Neovim through the package manager of your distribution.
- Follow the lazy.nvim installation instructions.
- In my case, I had to compile llama.cpp from source, but there are other options, like using your distribution's package manager, using Docker or downloading a prebuilt binary; follow the instructions in the llama.cpp repository. Building from source was not particularly difficult, and it has the advantage that you can sync the repo with the latest version and enjoy new features as soon as they are available. Building from source was mandatory for me because of special compile options that I need for my GPUs; a build sketch follows this list.
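As an illustration, a from-source build usually boils down to the following; the -DGGML_CUDA flag is just an example for NVIDIA cards, so substitute the compile options that match your GPUs:
# clone llama.cpp and build it with the CUDA backend (adjust the flags for your hardware)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j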
Llama.cpp can run an HTTP server (llama-server) that exposes LLMs through a web interface and an OpenAI-compatible API. It is technically possible to configure the server to handle more than one session in parallel, for instance to let the coding agent run in the background while you keep using code completion. However, with this approach the context window is split evenly between the sessions, which is a problem because the coding agent needs a large context.
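For instance, a single shared instance with two parallel slots would look roughly like this (model path and sizes are placeholders); with -np 2, llama-server divides the -c value evenly, leaving each session only half of the 65536 tokens:
# hypothetical shared server: two slots, each gets 32768 tokens of context
llama-server --host 0.0.0.0 --port 11434 -m model.gguf -c 65536 -np 2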
My approach is to run two independent llama.cpp server instances: the first with a relatively small model for basic code completion and a limited context size, and the second with a more powerful model and the largest context size that fits in the remaining VRAM. The rationale is to have a simple code completion engine that speeds up everyday coding, and to use the chat function or the coding agent for more involved tasks like debugging, refactoring or creating tests.
More precisely, I use:
- Qwen2.5-Coder-3B-Q8_0-GGUF for code completion, with a context size of 4k.
- Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf for the chat and coding agent, and I let llama.cpp decide the context size (about 50k tokens in my case).
It is also possible to use just one model for everything and keep VRAM usage reasonable if only one concurrent session is allowed. However, if the agent is executing a task in the background, other features like code completion and chat will not work. (NB: with the setup I suggest, code completion keeps working alongside the chat/agent, but the chat and the agent share the same instance, so while the agent is running a task the chat will not be available.)
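If you want to see how much VRAM is left for the second instance (and therefore how large its context can be), check the GPU memory after the first server has loaded; this assumes NVIDIA GPUs with nvidia-smi installed:
# show used vs. total GPU memory (NVIDIA only)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv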
To download the models, run
llama-cli -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF
and
llama-cli -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M
This places the models in the local cache (~/.cache/llama.cpp/) for later use; you can quit llama-cli once the download has finished.
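You can check that the downloads landed where the server script below expects them; llama.cpp names the cached files after the Hugging Face organization, repository and file name:
# list the cached GGUF files
ls -lh ~/.cache/llama.cpp/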
Finally, here is a script to launch the two llama.cpp servers (tmux is required). After running the script, you can attach to and detach from the tmux session to check the servers' output. I use port 11434 for the chat/coding agent and 11436 for code completion, but feel free to change that.
#!/bin/bash
# Create a new tmux session
SESSION_NAME="llama-servers"
tmux new-session -d -s "$SESSION_NAME"
# Split the window horizontally
tmux split-window -h
# Set the first pane to run the 4k context server
tmux select-pane -t 0
tmux send-keys "llama-server --host 0.0.0.0 --port 11436 -m ~/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -np 1 -c 4096 --flash-attn on" Enter
# Give the first server time to load before the second one claims the remaining VRAM
sleep 10s
# Set the second pane to run the server with remaining memory
tmux select-pane -t 1
tmux send-keys "llama-server --host 0.0.0.0 --port 11434 -m ~/.cache/llama.cpp/unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -np 1" Enter
# Attach to the session
tmux attach -t "$SESSION_NAME"

In Neovim, under the plugins directory (typically ~/.config/nvim/lua/plugins/), create a Lua file for code completion and the chat function.
return {
  {
    "ggml-org/llama.vim",
    init = function()
      vim.g.llama_config = {
        -- replace with your llama.cpp instance
        endpoint = "http://{ip address of the llama.cpp server for code completion}:11436/infill",
        -- local context (line based)
        n_prefix = 64, -- ~640 tokens
        n_suffix = 32, -- ~320 tokens
        n_predict = 128, -- space for the AI's answer
        -- ring buffer (global context from other files), reduced to keep the current file prioritized
        ring_n_chunks = 4,
        ring_chunk_size = 16, -- smaller chunks to save space
      }
    end,
  },
  {
    "olimorris/codecompanion.nvim",
    version = "^18.0.0",
    dependencies = {
      "nvim-lua/plenary.nvim",
      "nvim-treesitter/nvim-treesitter",
    },
    config = function()
      require("codecompanion").setup({
        strategies = {
          chat = { adapter = "llama.cpp" },
          inline = { adapter = "llama.cpp" },
        },
        adapters = {
          http = {
            ["llama.cpp"] = function()
              return require("codecompanion.adapters").extend("openai_compatible", {
                env = {
                  -- replace with your llama.cpp instance
                  url = "http://{ip address of the llama.cpp server instance for chat}:11434",
                  api_key = "TERM", -- dummy value; llama.cpp does not require authentication
                  chat_url = "/v1/chat/completions",
                },
              })
            end,
          },
        },
      })
    end,
    keys = {
      {
        "<leader>cc",
        function()
          require("codecompanion").toggle()
        end,
        desc = "Toggle CodeCompanion chat",
        mode = { "n", "v" },
      },
      {
        "<leader>cp",
        "<cmd>CodeCompanionActions<CR>",
        desc = "CodeCompanion Action Palette",
      },
    },
  },
  {
    "MeanderingProgrammer/render-markdown.nvim",
    ft = { "markdown", "codecompanion" },
  },
}
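Before moving on to the agent, you can sanity-check that both servers respond from the editor machine by hitting llama-server's health and OpenAI-compatible endpoints (replace the placeholders with your server addresses):
# code completion instance
curl http://{ip address of the llama.cpp server for code completion}:11436/health
# chat/agent instance
curl http://{ip address of the llama.cpp server instance for chat}:11434/v1/models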
OpenHands with Qwen3-Coder 30B has been working great for me. I gave it a go on a not-so-small Python project, and OpenHands was able to solve a complex problem I had where the application was hanging on quit, as well as improve test coverage and documentation. It should be possible to integrate OpenHands into Neovim/CodeCompanion.nvim via ACP, but OpenHands is not supported out of the box yet. There might be a hacky way to make it work, but so far running OpenHands in a separate terminal has been fine for me.
Initially I tried the Docker Compose route, accessing OpenHands over HTTP, but the recommended method with uv, which also gives you the CLI interface, is simply better and easier.