Doctor Shotgun (DocShotgun)
@DocShotgun
DocShotgun / llamacpp-moe-offload-guide.md
Last active January 20, 2026 10:17
Guide to optimizing inference performance of large MoE models across CPU+GPU using llama.cpp and its derivatives

Mixture-of-experts CPUmaxxing with GPU acceleration in llama.cpp

Introduction

So you want to try one of those fancy huge mixture-of-experts (MoE) models locally? Well, whether you've got a gaming PC or a large multi-GPU workstation, we've got you covered. As long as you've downloaded enough RAM beforehand.

Anatomy of a MoE Model

MoE models are described in terms of their total parameters and active parameters - e.g. DeepSeek V3 671B A37B has 671B total parameters, but only about 37B of them are active during each forward pass through the model.
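To make the total-vs-active distinction concrete, here's a rough back-of-the-envelope sketch. The function names and the bytes-per-parameter figure are illustrative assumptions, not values from the guide:

```python
def moe_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory (GB) for a model at a given quantization.
    All weights must be resident, so this scales with TOTAL parameters.
    bytes_per_param is illustrative, e.g. ~0.5 for a ~4-bit quant."""
    return total_params_b * bytes_per_param

def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Fraction of weights actually touched per token: compute (and per-token
    memory bandwidth) scales with ACTIVE parameters, not total."""
    return active_params_b / total_params_b

# DeepSeek V3: 671B total, ~37B active per token
print(round(moe_memory_gb(671, 0.5), 1))    # ~335.5 GB of weights at ~4 bits/param
print(round(active_fraction(37, 671), 3))   # only ~5.5% of weights used per token
```

This asymmetry is why CPU+GPU hybrid inference works well for MoE: the full weight set needs capacity (cheap system RAM), but each token only streams a small fraction of it.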

@DocShotgun
DocShotgun / disable-numa-balancing.sh
Created January 14, 2026 18:32
Temporarily disable system-wide NUMA balancing for the duration of execution of a command
#!/usr/bin/env bash
set -euo pipefail
NUMA_KEY="kernel.numa_balancing"
if [[ $# -eq 0 ]]; then
  echo "Usage: $0 <command> [args...]" >&2
  exit 1
fi
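The preview above cuts off before the save/set/restore logic, but the pattern the description implies can be sketched in Python as a context manager. This is a hypothetical helper, not part of the gist; writing to `/proc/sys/kernel/numa_balancing` requires root, so the path is parameterized here to allow a dry run against a scratch file:

```python
import contextlib
from pathlib import Path

@contextlib.contextmanager
def sysctl_override(path: str, value: str):
    """Temporarily write `value` to a sysctl file, restoring the original
    contents on exit - the same save/set/restore pattern a shell script
    would implement with a trap. Real use: path='/proc/sys/kernel/numa_balancing',
    value='0' (needs root)."""
    p = Path(path)
    original = p.read_text()
    p.write_text(value)
    try:
        yield
    finally:
        # Restore even if the wrapped command fails
        p.write_text(original)
```

Usage would look like `with sysctl_override("/proc/sys/kernel/numa_balancing", "0"): subprocess.run(cmd)`, mirroring the gist's "for the duration of execution of a command" behavior.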
@DocShotgun
DocShotgun / numactl-bind-socket.sh
Created January 14, 2026 18:31
Run a command bound to a specific socket's CPUs and memory
#!/bin/bash
# numactl-bind-socket.sh — Run a command bound to a specific socket's CPUs and memory
# Usage:
# ./numactl-bind-socket.sh --socket <id> --mode <physical|all> [--interleave <on|off>] <command> [args...]
set -euo pipefail
usage() {
  echo "Usage: $0 --socket <id> --mode <physical|all> [--interleave <on|off>] <command> [args...]"
  exit 1
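The numactl invocation this script wraps can be sketched as a small argv builder. This is a hypothetical helper for illustration; `--cpunodebind`, `--membind`, and `--interleave=all` are standard numactl flags, but the full script's `physical|all` core-selection logic is omitted:

```python
def numactl_argv(socket: int, interleave: bool = False) -> list[str]:
    """Build a numactl command line pinning a process to one NUMA node.
    --cpunodebind pins CPU scheduling to the socket's cores; memory is
    either bound to the same node (--membind, best locality) or spread
    across all nodes (--interleave=all, more usable bandwidth/capacity)."""
    argv = ["numactl", f"--cpunodebind={socket}"]
    if interleave:
        argv.append("--interleave=all")
    else:
        argv.append(f"--membind={socket}")
    return argv

print(numactl_argv(1))  # ['numactl', '--cpunodebind=1', '--membind=1']
```

The resulting list would be prepended to the wrapped command, e.g. `subprocess.run(numactl_argv(1) + ["llama-server", ...])`.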
@DocShotgun
DocShotgun / axolotl_ROCm_setup_v2.sh
Created September 16, 2024 20:11
Bash script to setup axolotl+FA2+BnB+liger-kernel on Runpod MI300X
#!/bin/bash
# Setup Axolotl with FA2 and BnB ROCm - doctorshotgun Sept 16, 2024
# Runpod image: RunPod Pytorch 2.1.2 ROCm 6.1 runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04
# Install torch and flash-attn
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/rocm6.1
#pip install https://download.pytorch.org/whl/nightly/pytorch_triton_rocm-3.0.0%2Bdafe145982-cp310-cp310-linux_x86_64.whl
pip install https://github.com/DocShotgun/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+rocm6.1+torch2.4.0-cp310-cp310-linux_x86_64.whl
@DocShotgun
DocShotgun / jank_convert_novel_to_markdown.py
Last active September 1, 2023 03:22
Scuffed script to convert novel style roleplay text into markdown format
import sys
read_file = open(sys.argv[1],"r")
roleplay_text = read_file.read()
read_file.close()
def convert_novel_to_md(text):
    md = ""
    if '"' in text:
        parts = text.split('"')
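The preview truncates before the conversion logic. One plausible completion (hypothetical, not the gist's actual code): novel-style roleplay keeps dialogue in double quotes with plain narration, while markdown style wraps narration in *asterisks*. Splitting on `"` puts narration at even indices and dialogue at odd ones:

```python
def convert_novel_to_md(text: str) -> str:
    """Hypothetical sketch: wrap narration in *asterisks*, keep dialogue
    quoted. Assumes well-paired double quotes in the input."""
    if '"' not in text:
        stripped = text.strip()
        return f"*{stripped}*" if stripped else text
    out = []
    for i, part in enumerate(text.split('"')):
        part = part.strip()
        if not part:
            continue
        # even index = narration, odd index = dialogue
        out.append(f"*{part}*" if i % 2 == 0 else f'"{part}"')
    return " ".join(out)

print(convert_novel_to_md('She waved. "Hello there."'))  # *She waved.* "Hello there."
```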
@DocShotgun
DocShotgun / jank_convert_markdown_to_novel.py
Last active September 1, 2023 03:21
Scuffed script to convert markdown roleplay text into novel format
import sys
read_file = open(sys.argv[1],"r")
roleplay_text = read_file.read()
read_file.close()
def convert_md_to_novel(text):
    novel = ""
    if '*' in text:
        parts = text.split('*')
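As with its companion gist, the preview ends before the conversion logic. A hypothetical sketch of the inverse direction: markdown style wraps narration in *asterisks*, and novel style simply drops them, leaving plain narration next to quoted dialogue. Splitting on `*` puts the emphasized narration at odd indices, but since every part is kept, stripping the delimiters suffices:

```python
def convert_md_to_novel(text: str) -> str:
    """Hypothetical sketch: drop *asterisk* emphasis so narration reads as
    plain novel prose. Assumes well-paired asterisks in the input."""
    if '*' not in text:
        return text
    # Splitting on '*' removes the emphasis markers; rejoin the remaining
    # narration and dialogue fragments with single spaces.
    parts = [p.strip() for p in text.split('*')]
    return " ".join(p for p in parts if p)

print(convert_md_to_novel('*She waved.* "Hello there."'))  # She waved. "Hello there."
```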