How to Run OpenAI GPT-OSS-120B Locally: A Detailed Guide
OpenAI's GPT-OSS-120B is a groundbreaking open-weight large language model with approximately 117 billion parameters (5.1 billion active), designed to deliver powerful reasoning and agentic capabilities, including code execution and structured outputs. Unlike massive models requiring multiple GPUs, GPT-OSS-120B can run efficiently on a single Nvidia H100 GPU, making local deployment more accessible for organizations and advanced users seeking privacy, low latency, and control.
This article synthesizes the latest knowledge and practical steps as of August 2025 to help you run GPT-OSS-120B locally, including hardware requirements, installation options, containerized deployment, and optimization techniques.
Why Run GPT-OSS-120B Locally?
- Full data sovereignty: Data never leaves your local environment, critical for sensitive applications.
- Cost control: Avoids ongoing cloud API costs and rate limits.
- High performance: Optimized architecture enables high reasoning quality on a single datacenter-class GPU.
- Customization: Fine-tune the model or build advanced autonomous agents with full control.
Hardware and Software Requirements
Component | Minimum | Recommended |
---|---|---|
GPU | Single Nvidia H100 (80 GB) or comparable datacenter GPU | One or more Nvidia H100 GPUs |
System RAM | ≥ 32 GB | 64 GB+ for smooth multitasking |
Storage | ≥ 200 GB NVMe SSD | Fast NVMe SSD to cache model weights |
CPU | Modern multi-core | 8+ cores |
OS | Linux (preferred) | Linux for best driver & Docker support |
Due to the large model size, consumer GPUs such as the RTX 3090 or 4090 (24 GB VRAM) generally cannot run GPT-OSS-120B locally without significant offloading or multi-GPU model parallelism. The model was explicitly designed to fit within the memory of a single H100-class (80 GB) GPU.
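Before installing anything, it helps to confirm that the target GPU actually exposes enough memory. A minimal check with PyTorch (assuming `torch` is installed with CUDA support) might look like this:

```python
import torch

# Quick sanity check: report each detected GPU and its total memory.
# GPT-OSS-120B is intended for a single ~80 GB datacenter GPU (e.g., H100).
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected.")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {total_gb:.1f} GB VRAM")
```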
Official Model Characteristics
- Model size: 117 billion parameters, with 5.1 billion active parameters enabled by Mixture-of-Experts (MoE) sparsity.
- Quantization: The MoE weights ship in MXFP4 precision, which provides most of the memory and compute savings.
- Software compatibility: Compatible with Hugging Face Transformers, vLLM, and OpenAI Harmony API format.
- License: Permissive Apache 2.0, suitable for experimentation, customization, and commercial use.
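To see why a 117-billion-parameter model can fit on a single 80 GB card, a rough back-of-the-envelope estimate helps. The split between MoE and dense parameters in the sketch below is an assumption chosen purely for illustration, so treat the output as an order-of-magnitude figure rather than the official checkpoint size.

```python
# Rough, hedged memory estimate for the GPT-OSS-120B weights.
# The 95/5 MoE vs. dense split is an assumption for illustration only.
total_params = 117e9
moe_fraction = 0.95       # assumed share of parameters in MoE expert layers
moe_bits = 4.25           # MXFP4 (~4 bits per weight) plus per-block scales
dense_bits = 16           # BF16 for embeddings, attention, etc.

moe_bytes = total_params * moe_fraction * moe_bits / 8
dense_bytes = total_params * (1 - moe_fraction) * dense_bits / 8
print(f"~{(moe_bytes + dense_bytes) / 1024**3:.0f} GiB of weights (rough estimate)")
```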
Step-by-Step Guide to Running GPT-OSS-120B Locally
1. Deploy Using Northflank Cloud GPU Containers
Northflank offers a reliable way to self-host GPT-OSS-120B in GPU-enabled containers, especially if you have access to Nvidia H100 GPUs.
Procedure:
- Create a Northflank account and start a GPU-enabled project, selecting H100 GPUs in a supported region.
- Create a new service using the external Docker image `vllm/vllm-openai:gptoss`.
- Set a runtime environment variable `OPENAI_API_KEY` to a secure random string (length ≥ 128).
- Expose port 8000 with the HTTP protocol for API access.
- Select a hardware plan with 2 Nvidia H100 GPUs for optimal inference.
- Attach a persistent storage volume of ≥ 200 GB mounted at `/root/.cache/huggingface` to cache model downloads and avoid re-fetching on redeploy.
- Deploy the service; initially run a sleep command (`sleep 1d`) to bring the container up without loading the model immediately.
This setup supports OpenAI-compatible endpoints and handles the heavy model loading on optimized GPUs.
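Once the service is up and vLLM has loaded the model, any OpenAI-compatible client can talk to it. The sketch below uses the official `openai` Python package; the base URL is a placeholder for whatever public endpoint Northflank assigns to your service, and the API key must match the `OPENAI_API_KEY` value configured above.

```python
from openai import OpenAI

# Placeholder endpoint and key: substitute the URL Northflank assigns to your
# service and the OPENAI_API_KEY value you set on the container.
client = OpenAI(
    base_url="https://YOUR-SERVICE-URL/v1",
    api_key="your-long-random-key",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize what GPT-OSS-120B is."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```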
2. Running Locally on Enterprise-Class GPU Machine
If you have a physical server or workstation equipped with Nvidia H100 GPU(s), you can run GPT-OSS-120B using official OpenAI codebases and Hugging Face tooling.
- Install dependencies:

```bash
pip install torch transformers vllm accelerate
```

- Download or cache the model weights:

```bash
git lfs install
git clone https://huggingface.co/openai/gpt-oss-120b
```

- Run inference via vLLM or custom code:

```bash
vllm serve openai/gpt-oss-120b
```
OR in Python:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
# device_map="auto" places the weights on available GPUs (requires accelerate);
# torch_dtype="auto" keeps the checkpoint's native precision.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b", torch_dtype="auto", device_map="auto"
)

prompt = "Explain how to run GPT-OSS-120B locally"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
- Use `torchrun` or the `accelerate` launcher for multi-GPU parallelism if needed (see the sketch below).
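If a single GPU is not enough, vLLM can also shard the model across devices directly from Python. The following is a minimal sketch assuming two GPUs are visible on the machine; `tensor_parallel_size` is vLLM's option for splitting the weights across GPUs.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: shard GPT-OSS-120B across two visible GPUs with vLLM.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)

params = SamplingParams(max_tokens=150, temperature=0.7)
outputs = llm.generate(["Explain how to run GPT-OSS-120B locally"], params)
print(outputs[0].outputs[0].text)
```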
3. Running via Azure AI Foundry
Microsoft Azure AI Foundry supports GPT-OSS-120B on their managed enterprise GPU platform.
- Provides CLI tools and UI to instantiate GPU-backed endpoints.
- Enables running GPT-OSS-120B on a single enterprise GPU with low-latency and bandwidth-optimized deployment.
- Supports Windows devices via Foundry Local, with macOS support coming soon.
This is a good hybrid approach for organizations requiring managed infrastructure alongside local on-prem usage.
Optimization Best Practices
- Use mixed precision (BF16/FP16) for the non-quantized layers on GPUs such as the Nvidia H100 to reduce memory consumption and increase throughput.
- Use persistent storage volumes to cache models and avoid repeated downloads when using containers.
- Adjust inference parameters like configurable reasoning effort (low, medium, high) to balance latency vs. output quality.
- Leverage batch inference and OpenAI-compatible API endpoints to serve multiple concurrent requests efficiently (see the sketch after this list).
- Keep drivers (e.g., Nvidia CUDA 12.8+) and libraries up-to-date for compatibility and performance.
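As a concrete example of concurrent usage, the sketch below fires several requests at an OpenAI-compatible endpoint (such as a local vLLM server on port 8000) using the async `openai` client. The "Reasoning: medium" system hint follows the reasoning-effort convention described in the model documentation, but treat the exact wording as an assumption to verify against the model card.

```python
import asyncio
from openai import AsyncOpenAI

# Assumes a local OpenAI-compatible server (e.g. `vllm serve openai/gpt-oss-120b`)
# listening on port 8000; replace the key if the server enforces one.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(question: str) -> str:
    # "Reasoning: medium" is a hedged example of the configurable reasoning-effort
    # convention; check the model card for the exact supported wording.
    response = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {"role": "system", "content": "Reasoning: medium"},
            {"role": "user", "content": question},
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content

async def main() -> None:
    questions = ["What is MXFP4?", "What is a Mixture-of-Experts model?"]
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for q, a in zip(questions, answers):
        print(q, "->", a)

asyncio.run(main())
```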
Conclusion
Running OpenAI GPT-OSS-120B locally is feasible today, primarily on single Nvidia H100 GPUs or equivalent enterprise hardware, and is supported by mature software ecosystems such as vLLM, Hugging Face Transformers, and container platforms like Northflank. For organizations or enthusiasts with access to such resources, GPT-OSS-120B delivers strong reasoning and agentic capabilities in a fully self-hosted environment.
If you do not have H100-class GPUs, the smaller GPT-OSS-20B might be a more practical alternative for local runs on consumer-level GPUs.
For cloud-assisted or hybrid workflows, Azure AI Foundry offers an excellent managed platform to deploy GPT-OSS-120B with ease.
For those interested in API and infrastructure solutions complementing local deployment, services like LightNode offer scalable cloud-based interfaces to open models.