For the QMD stack, it's a great fit:
QMD models (~3GB VRAM total):
| Model | Size | What it does | Expected speed |
|---|---|---|---|
| embeddinggemma-300M | Tiny | Document embedding | Near-instant, batch hundreds/sec |
| Qwen3-Reranker-0.6B | Small | Reranking results | Very fast, sub-100ms per query |
| qmd-query-expansion-1.7B | Medium | Query expansion | Fast, probably 80-100+ tok/s |
All three models fit in VRAM simultaneously with 5GB to spare. These are small models — the 3060 Ti's 448 GB/s memory bandwidth absolutely crushes them. You'll go from "unusably slow" on CPU to real-time.
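As a sanity check, here's the VRAM math. The per-model footprints below are ballpark assumptions (weights plus runtime overhead), not measured numbers:

```python
# Rough VRAM budget for the three QMD models on an 8 GB card.
# Per-model sizes are ballpark estimates, not measurements.
models_gb = {
    "embeddinggemma-300M": 0.6,
    "Qwen3-Reranker-0.6B": 1.2,
    "qmd-query-expansion-1.7B": 1.2,  # assuming a 4-bit quant
}

total = sum(models_gb.values())
headroom = 8.0 - total  # 3060 Ti has 8 GB VRAM
print(f"total: {total:.1f} GB, headroom: {headroom:.1f} GB")
```

Even if each estimate is off by a few hundred MB, the conclusion holds: all three fit at once with room left for KV cache and CUDA context.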
For perspective, typical 3060 Ti benchmarks on much larger models:
- Llama 2 7B (4-bit): 73 tok/s
- Mistral 7B (4-bit): 71 tok/s
- Llama 3.1 8B (4-bit): 57 tok/s
Your models are roughly 4-25x smaller than those, so they'll fly.
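A back-of-envelope way to see why: single-stream decoding is memory-bandwidth-bound, so an upper bound on tok/s is bandwidth divided by the bytes read per token (roughly the quantized weight size). A rough sketch, with the weight sizes as assumptions:

```python
# Ceiling on decode speed: tokens/sec <= memory bandwidth / weight bytes.
# Real throughput lands well below this (kernel launch overhead, KV cache
# reads, attention compute), but the scaling with model size is the point.
BANDWIDTH_GBS = 448  # RTX 3060 Ti memory bandwidth

def ceiling_tok_s(weight_gb: float) -> float:
    return BANDWIDTH_GBS / weight_gb

for name, gb in [("7B @ 4-bit", 3.5), ("1.7B @ 4-bit", 0.9)]:
    print(f"{name}: <= {ceiling_tok_s(gb):.0f} tok/s")
```

The 7B ceiling of ~128 tok/s against the measured 73 tok/s suggests real throughput runs at roughly half the theoretical bound, which makes the "80-100+ tok/s" guess for the 1.7B model (ceiling ~500) look conservative.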
What about future models? The 8GB limit is the ceiling. You can comfortably run anything up to ~7B quantized. Beyond that (13B+), performance falls off a cliff as it spills to system RAM. If you ever wanted to run a bigger local LLM for other tasks, the 2080 Ti's 11GB gives you more room — but for QMD as it stands today, the 3060 Ti is more than enough.
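The 13B cliff follows from the same weight-size math: at 4-bit, the weights alone approach the 8GB limit before KV cache and CUDA overhead, so layers spill into system RAM. A rough illustration (the bits-per-weight and overhead figures are assumptions):

```python
# Approximate 4-bit quantized weight size: params * ~4.5 bits/weight
# (the extra ~0.5 bit covers quantization scales/metadata).
def quant_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    return params_b * bits_per_weight / 8

for params in (7, 13):
    weights = quant_gb(params)
    # add ~1.5 GB for KV cache + CUDA context (assumption)
    print(f"{params}B: {weights:.1f} GB weights, ~{weights + 1.5:.1f} GB total")
```

A 7B quant lands around 5-6GB total and fits comfortably; a 13B quant needs close to 9GB, which is why it falls off the 8GB cliff.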
TLDR: The ZOTAC 3060 Ti will make QMD search feel instant. Your current bottleneck is CPU inference on models designed for GPU — even the cheapest 3060 Ti will be a night-and-day difference.