@possibilities
Created March 9, 2026 18:04
RTX 3060 Ti for QMD Inference

For the QMD stack, the 3060 Ti is a great fit:

QMD models (~3GB VRAM total):

| Model | Size | What it does | Expected speed |
| --- | --- | --- | --- |
| embeddinggemma-300M | Tiny | Document embedding | Near-instant, batch hundreds/sec |
| Qwen3-Reranker-0.6B | Small | Reranking results | Very fast, sub-100ms per query |
| qmd-query-expansion-1.7B | Medium | Query expansion | Fast, probably 80–100+ tok/s |

All three models fit in VRAM simultaneously with 5GB to spare. These are small models — the 3060 Ti's 448 GB/s memory bandwidth absolutely crushes them. You'll go from "unusably slow" on CPU to real-time.
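A quick back-of-envelope check of that fit (the per-model footprints below are rough assumptions about typical quantized weights plus overhead, not measured numbers):

```python
# Rough VRAM budget for the three QMD models on an 8 GB card.
# Per-model sizes are ballpark estimates, not measurements.
models_gb = {
    "embeddinggemma-300M": 0.6,
    "Qwen3-Reranker-0.6B": 1.0,
    "qmd-query-expansion-1.7B": 1.4,
}

total = sum(models_gb.values())
vram = 8.0  # RTX 3060 Ti

print(f"total ≈ {total:.1f} GB, headroom ≈ {vram - total:.1f} GB")
# → total ≈ 3.0 GB, headroom ≈ 5.0 GB
```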

Bigger picture inference benchmarks:

  • Llama 2 7B (4-bit): 73 tok/s
  • Mistral 7B (4-bit): 71 tok/s
  • Llama 3.1 8B (4-bit): 57 tok/s

Your models are 5-25x smaller than those, so they'll fly.
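The reason smaller models fly is that single-stream decoding is memory-bandwidth-bound: every generated token has to read the full set of weights, so an upper bound on tokens/sec is roughly bandwidth divided by the weight bytes. A sketch of that estimate (the weight sizes are assumptions for common quant formats):

```python
# Bandwidth-bound ceiling for decode throughput:
#   tok/s <= memory_bandwidth / bytes_of_weights_read_per_token
BANDWIDTH_GB_S = 448.0  # RTX 3060 Ti memory bandwidth

def ceiling_tok_s(weights_gb: float) -> float:
    """Theoretical max decode speed if every token reads all weights."""
    return BANDWIDTH_GB_S / weights_gb

# Llama 2 7B at 4-bit is ~3.9 GB of weights -> ceiling ~115 tok/s;
# the measured 73 tok/s above is a plausible fraction of that.
print(f"7B 4-bit ceiling:   {ceiling_tok_s(3.9):.0f} tok/s")
# A ~1.7B model at 8-bit (~1.8 GB of weights) has a far higher ceiling.
print(f"1.7B 8-bit ceiling: {ceiling_tok_s(1.8):.0f} tok/s")
```

Real throughput lands well below the ceiling (compute, kernel launch, and KV-cache reads eat into it), but the ratio between models holds.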

What about future models? The 8GB limit is the ceiling. You can comfortably run anything up to ~7B quantized. Beyond that (13B+), performance falls off a cliff as it spills to system RAM. If you ever wanted to run a bigger local LLM for other tasks, the 2080 Ti's 11GB gives you more room — but for QMD as it stands today, the 3060 Ti is more than enough.
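A rough sense of where that ceiling bites, assuming ~0.55 GB of weights per billion parameters for common 4-bit quants and a guessed 1 GB for KV cache and overhead:

```python
# Approximate 4-bit weight footprint: ~0.55 GB per billion params
# (≈4.5 effective bits/weight for common 4-bit quants), plus a
# guessed 1 GB for KV cache and runtime overhead.
def fits_in_vram(params_b: float, vram_gb: float = 8.0,
                 kv_cache_gb: float = 1.0) -> bool:
    weights_gb = params_b * 0.55
    return weights_gb + kv_cache_gb <= vram_gb

print(fits_in_vram(7))   # 7B: ~3.9 GB weights -> True, fits comfortably
print(fits_in_vram(13))  # 13B: ~7.2 GB weights -> False, spills to system RAM
```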

TLDR: The ZOTAC 3060 Ti will make QMD search feel instant. Your current bottleneck is CPU inference on models designed for GPU — even the cheapest 3060 Ti will be a night-and-day difference.
