
@AdamIsrael
Last active November 1, 2025 00:44
Run multi-part model as a service via ramalama

Here's how I set up a multi-part model (gpt-oss-120b) as a service via ramalama:

# Pull the model (this will take a while)
ramalama pull hf://ggml-org/gpt-oss-120b-GGUF

mkdir -p ~/.config/containers/systemd
cd ~/.config/containers/systemd

# Generate the quadlet
ramalama serve --generate quadlet --image=docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk gpt-oss:120b

# Manually edit gpt-oss-120b-GGUF.container so each Mount= points at the right blob.
# You can find the sha-256 digests for the blobs in:
# ~/.local/share/ramalama/store/huggingface/ggml-org/gpt-oss-120b-GGUF/refs/latest.json
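To map each model part to its blob, you can parse refs/latest.json instead of eyeballing it. The JSON layout below is a guess for illustration (the `files`/`name`/`hash` keys are hypothetical; the real layout may differ between ramalama versions, so inspect your copy first):

```python
import json

# Hypothetical sample of what refs/latest.json might contain; the real
# layout may differ between ramalama versions -- inspect yours first.
sample = """
{
  "files": [
    {"name": "gpt-oss-120b-mxfp4-00001-of-00003.gguf",
     "hash": "sha256-e2865eb6c1df7b2ffbebf305cd5d9074d5ccc0fe3b862f98d343a46dad1606f9"}
  ]
}
"""

def blob_map(ref_json: str) -> dict:
    """Map each model-part filename to its blob under blobs/."""
    data = json.loads(ref_json)
    return {f["name"]: f["hash"] for f in data["files"]}

for name, digest in blob_map(sample).items():
    print(f"{name} -> blobs/{digest}")
```

Each printed pair gives you the `src=` blob path and the `target=` filename for one Mount= line in the quadlet.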

# Reload systemd
systemctl --user daemon-reload

# Start the service (this may take a few minutes)
systemctl --user start gpt-oss-120b-GGUF.service

# Check the logs of the service
journalctl --user -xeu gpt-oss-120b-GGUF.service
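Besides the logs, you can poll llama-server's /health endpoint to see when the model has finished loading. A minimal sketch (the port matches the quadlet's PublishPort; adjust if you changed it):

```python
import urllib.request
import urllib.error

def service_up(url: str = "http://localhost:8080/health", timeout: float = 2.0) -> bool:
    """Return True if llama-server answers its /health endpoint with 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("service up" if service_up() else "service not reachable yet")
```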
For reference, here's the edited gpt-oss-120b-GGUF.container:

[Unit]
Description=RamaLama gpt-oss-120b-GGUF AI Model Service
After=local-fs.target
[Container]
AddDevice=-/dev/accel
AddDevice=-/dev/dri
AddDevice=-/dev/kfd
Image=docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk
RunInit=true
Environment=HOME=/tmp
Environment=HIP_VISIBLE_DEVICES=0
Exec=llama-server --host 0.0.0.0 --port 8080 --model /mnt/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf --chat-template-file /mnt/models/chat_template_extracted --jinja --no-warmup --alias ggml-org/gpt-oss-120b-GGUF --temp 0.8 --cache-reuse 256 --flash-attn on -ngl 999 --threads 16 --log-colors on
SecurityLabelDisable=true
DropCapability=all
NoNewPrivileges=true
Mount=type=bind,src=/var/home/user/.local/share/ramalama/store/huggingface/ggml-org/gpt-oss-120b-GGUF/blobs/sha256-a4c9919cbbd4acdd51ccffe22da049264b1b73e59055fa58811a99efbd7c8146,target=/mnt/models/chat_template_extracted,ro,Z
# mount 1/3
Mount=type=bind,src=/var/home/user/.local/share/ramalama/store/huggingface/ggml-org/gpt-oss-120b-GGUF/blobs/sha256-e2865eb6c1df7b2ffbebf305cd5d9074d5ccc0fe3b862f98d343a46dad1606f9,target=/mnt/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf,ro,Z
# mount 2/3
Mount=type=bind,src=/var/home/user/.local/share/ramalama/store/huggingface/ggml-org/gpt-oss-120b-GGUF/blobs/sha256-81856b5b996da9c9fd68397d49671264ead380a8355b3c83284eae5e21e998ed,target=/mnt/models/gpt-oss-120b-mxfp4-00002-of-00003.gguf,ro,Z
# mount 3/3
Mount=type=bind,src=/var/home/user/.local/share/ramalama/store/huggingface/ggml-org/gpt-oss-120b-GGUF/blobs/sha256-38b087fffe4b5ba5d62fa7761ed7278a07fef7a6145b9744b11205b851021dce,target=/mnt/models/gpt-oss-120b-mxfp4-00003-of-00003.gguf,ro,Z
PublishPort=0.0.0.0:8080:8080
[Install]
WantedBy=multi-user.target default.target
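Once the service is up, llama-server speaks the OpenAI-compatible chat-completions API. A minimal client sketch, assuming the port and `--alias` from the quadlet above:

```python
import json
import urllib.request

# URL and model alias match the quadlet above; adjust if you changed them.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str, model: str = "ggml-org/gpt-oss-120b-GGUF") -> urllib.request.Request:
    """Build a chat-completion request for llama-server's OpenAI-compatible API."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.8,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_request("Say hello in one sentence.")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```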