Short answer
gpt-oss-120b typically requires a workstation/server-class GPU with roughly 60–80 GB of VRAM for single‑GPU local inference; 80 GB is the common “safe” target. Multi‑GPU setups (model/tensor parallelism) can reduce the per‑GPU VRAM requirement but add complexity. (ollama.com)
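For intuition on where the 80 GB figure comes from, here is a back‑of‑the‑envelope sketch in Python. It assumes roughly 117B total parameters stored at about 4.25 bits each (the MXFP4 figure discussed in the notes below); the real checkpoint keeps some layers in higher precision, so treat this as a lower bound on the weight footprint.

```python
# Back-of-the-envelope VRAM estimate.
# Assumptions: ~117B total parameters, ~4.25 bits per weight (MXFP4-style);
# real layouts keep attention/embedding layers in higher precision.
total_params = 117e9
bits_per_weight = 4.25

weight_bytes = total_params * bits_per_weight / 8
print(f"weights alone: ~{weight_bytes / 1e9:.0f} GB")  # ~62 GB

# KV cache, activations, and runtime overhead sit on top of the weights,
# which is why an 80 GB card is the comfortable single-GPU target.
```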
Recommended hardware (concise)
- GPU VRAM: 80 GB on a single card is the recommended target (NVIDIA A100/H100 class). LM Studio notes the model runs “best with ≥60 GB VRAM,” but most vendor docs and deployment guides target an 80 GB card for the full 120B. Multiple GPUs linked with NVLink/NVSwitch can be used instead of one 80 GB card. (xlxm.cn) (A quick spec‑check sketch follows this list.)
- System RAM: 64–128 GB, with 128 GB recommended if you’ll also run other services, caching, or multi‑GPU setups; some production guides recommend 128 GB+ for headroom. (gptoss.net)
- CPU: A modern server/workstation CPU with many cores for throughput (16+ cores recommended; 32+ for heavy multi‑client or production use). High PCIe bandwidth and plenty of PCIe lanes help when attaching large GPUs. (gptoss.net)
- Storage: Fast NVMe SSD. Plan for roughly 100 GB for the model and working files; production/dev setups often provision 500 GB–1 TB NVMe for swap, caches, and multiple models. Fast NVMe keeps model load times reasonable. (gptoss.net)
- Network: Not required for pure local inference, but high bandwidth helps for downloads, multi‑node setups, or remote serving. (gptoss.net)
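If you want to read these numbers off your own machine, here is a minimal sketch, assuming Linux and an NVIDIA GPU with nvidia-smi on the PATH; adapt it for other platforms.

```python
import os
import shutil
import subprocess

# Per-GPU VRAM in MiB, as reported by nvidia-smi (NVIDIA-only assumption).
vram_mib = [
    int(line)
    for line in subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    ).split()
]
# Total physical RAM via sysconf (Linux assumption).
ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
free_disk_gib = shutil.disk_usage("/").free / 1024**3

print("per-GPU VRAM (GiB):", [round(m / 1024, 1) for m in vram_mib])
print("system RAM   (GiB):", round(ram_gib, 1))
print("CPU cores         :", os.cpu_count())
print("free disk    (GiB):", round(free_disk_gib, 1))
```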
Notes about software/quantization and platform differences
- Ollama: Ollama supports the model in its native MXFP4 quantized format; the OpenAI/Ollama documentation notes that quantizing the MoE weights to MXFP4 (~4.25 bits per weight) is what lets the larger model fit on a single 80 GB GPU. Ollama’s engine supports MXFP4 natively. (ollama.com) (A minimal call sketch follows this list.)
- LM Studio: LM Studio supports gpt-oss-120b but describes it as “best with ≥60 GB VRAM” and recommends multi‑GPU or a beefy workstation for the 120B model. LM Studio runs models locally (GGUF and other formats) and provides guidance for splitting a model across GPUs. Without an 80 GB card, you will likely need multiple GPUs with enough aggregate VRAM and an inference engine that supports sharding. (xlxm.cn)
- Possible mitigations: sharding across multiple GPUs (model parallelism), slower CPU/host offloading, or alternative quantization formats may allow running on less VRAM at the cost of complexity and/or latency. Production guides recommend vLLM or similar enterprise runtimes for multi‑GPU deployments. (gptoss.net)
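As a concrete example of local serving, here is a minimal sketch of calling Ollama’s local HTTP API from Python once the model is pulled. It assumes Ollama is running on its default port (11434) and that the model tag is gpt-oss:120b as listed on the Ollama model page; verify the exact tag before running.

```python
import json
import urllib.request

# Assumes `ollama pull gpt-oss:120b` has completed and the Ollama server
# is listening at its default address (http://localhost:11434).
payload = {
    "model": "gpt-oss:120b",
    "prompt": "In one sentence, how much VRAM do you need to run locally?",
    "stream": False,  # request a single JSON response instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

LM Studio’s local server exposes an OpenAI-compatible API (by default at http://localhost:1234/v1), so a similar request pattern applies there via its chat-completions route.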
Rule‑of‑thumb checklist to decide if your machine is sufficient
- GPU VRAM >= 80 GB? Yes: you can run single‑GPU inference comfortably (use Linux + CUDA and a supported runtime). If 60–79 GB: some engines (LM Studio) say “best with ≥60 GB,” but fit and performance depend on the exact quantization/format. If <60 GB: you’ll likely need multi‑GPU sharding or special quantization/engine support. (ollama.com) (A small helper encoding these rules follows the list.)
- System RAM >= 64 GB? Strongly recommended; 128 GB is safer for production or multi‑model use. (gptoss.net)
- NVMe SSD with enough space (100–500+ GB)? Required for model files, swap/caches, and fast loading. (nutstudio.imyfone.com)
- CPU with many cores and good PCIe bandwidth? Recommended for throughput and for keeping the GPU(s) fed. (gptoss.net)
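The same checklist, expressed as a small hypothetical helper; the thresholds are the rule‑of‑thumb numbers above, not hard limits.

```python
def can_run_gpt_oss_120b(vram_gb: float, ram_gb: float, free_disk_gb: float) -> str:
    """Rough verdict based on the rule-of-thumb thresholds in the checklist."""
    if free_disk_gb < 100:
        return "no: free up NVMe space first (~100 GB minimum for the model files)"
    if ram_gb < 64:
        return "risky: 64 GB system RAM is the practical floor; 128 GB is safer"
    if vram_gb >= 80:
        return "yes: single-GPU inference should be comfortable"
    if vram_gb >= 60:
        return "maybe: depends on the engine and quantization (LM Studio's >=60 GB case)"
    return "not single-GPU: consider multi-GPU sharding, heavier quantization, or gpt-oss-20b"

print(can_run_gpt_oss_120b(vram_gb=80, ram_gb=128, free_disk_gb=500))
```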
If you want a precise “yes/no” for your machine
Tell me your GPU model and VRAM, total system RAM, CPU model and core count, and storage type plus free space. I’ll evaluate whether you can run gpt-oss-120b locally with Ollama or LM Studio and, if not, what options would work (multi‑GPU sharding, quantized variants, or using the 20B model instead).
Sources used (high‑level)
- Ollama model page / MXFP4 quantization notes. (ollama.com)
- LM Studio / OpenAI guide for running gpt‑oss locally (notes about ≥60 GB VRAM and multi‑GPU). (xlxm.cn)
- Deployment how‑tos and community write‑ups summarizing recommended VRAM, RAM, CPU, and storage for gpt‑oss-120b. (nutstudio.imyfone.com)

Would you like me to evaluate your actual machine specs? If so, paste your GPU model and VRAM, total RAM, CPU (model + core count), and storage type/space, and I’ll give a direct yes/no with recommended adjustments.