In August 2025 OpenAI released two open‑weight models: gpt‑oss‑120b with 117 billion parameters and gpt‑oss‑20b with 21 billion. Both use a mixture‑of‑experts (MoE) architecture, so only a subset of experts is active for each token during inference: roughly 5.1 billion parameters for the larger model and 3.6 billion for the smaller one. Combined with MXFP4 4‑bit quantization of the expert weights, this lets gpt‑oss‑120b run on a single NVIDIA H100 GPU with 80 GB of memory, while gpt‑oss‑20b can be deployed on consumer‑grade GPUs with just 16 GB of VRAM.
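A rough sanity check of why these models fit on that hardware can be done with back‑of‑the‑envelope arithmetic. The sketch below assumes MXFP4 costs about 4.25 bits per parameter (4‑bit values plus a shared scale per small block) and, as a simplification, treats all weights as quantized; in the real models only the expert weights are MXFP4, so actual footprints differ somewhat.

```python
MXFP4_BITS = 4.25  # assumption: 4-bit values + shared per-block scale

def vram_gb(total_params_billions: float, bits_per_param: float = MXFP4_BITS) -> float:
    """Approximate weight storage in GB for a quantized model."""
    return total_params_billions * 1e9 * bits_per_param / 8 / 1e9

print(f"gpt-oss-120b: ~{vram_gb(117):.0f} GB")  # ~62 GB, under an 80 GB H100
print(f"gpt-oss-20b:  ~{vram_gb(21):.0f} GB")   # ~11 GB, under a 16 GB consumer GPU
```

The point of the estimate is only that 4‑bit expert weights are what bring a 117B‑parameter model under the 80 GB line; at 16‑bit precision the same weights would need well over 200 GB.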
OpenAI claims that deploying these models cuts infrastructure and licensing expenses by about 70 percent compared with using their hosted APIs. Companies that currently pay per token through the API can instead run the models on GPUs they already own or rent from cloud providers, turning a usage‑based bill into a largely fixed hardware cost.
Generation quality holds up thanks to MoE layers with SwiGLU activations and a softmax‑after‑top‑k routing mechanism, in which the router first selects the top‑k experts and only then normalizes their scores with a softmax. The 4‑bit quantization applies only to the expert weights; all other layers run at higher precision. In benchmark tests the models are competitive with proprietary equivalents on agentic tasks and complex reasoning challenges.
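The softmax‑after‑top‑k idea is easiest to see in code. A minimal sketch of such a router (function and variable names are illustrative, not from the gpt‑oss codebase):

```python
import numpy as np

def route_softmax_after_topk(router_logits: np.ndarray, k: int):
    """Select the top-k experts FIRST, then softmax over only those k logits,
    so the chosen experts' mixing weights sum to 1."""
    topk_idx = np.argsort(router_logits)[-k:]       # indices of the k largest logits
    topk_logits = router_logits[topk_idx]
    exp = np.exp(topk_logits - topk_logits.max())   # numerically stable softmax
    weights = exp / exp.sum()
    return topk_idx, weights

# One token's router logits over 4 experts; pick 2 of them
idx, w = route_softmax_after_topk(np.array([0.1, 2.0, -1.0, 1.5]), k=2)
# experts 3 and 1 are selected, and their weights sum to exactly 1
```

The contrast with the more common softmax‑before‑top‑k is that normalization happens over the selected experts only, so no probability mass is wasted on experts that never run.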
The Apache 2.0 license shifts the power balance. Enterprises gain complete control over the model and its data, can fine‑tune it for specific domains, and can host it in a private environment without fearing API price hikes or data leaks. Effective customization, however, requires expertise in MoE architectures and 4‑bit quantization, which may raise initial operational expenditures.
Why this matters: CEOs can rerun the CAPEX/OPEX math, since self‑hosted inference may land at roughly 30 percent of current API spend, and start planning an internal LLM deployment that is independent of external providers. The first step is to assess whether existing infrastructure can host an H100 for gpt‑oss‑120b or 16 GB consumer GPUs for gpt‑oss‑20b, and to assemble a small team with MoE and quantization expertise for rapid model adaptation.
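The "roughly 30 percent of current API spend" framing can be sketched as a one‑line cost ratio. All dollar figures below are hypothetical placeholders for illustration, not OpenAI or cloud pricing:

```python
def self_host_share(api_monthly_usd: float, gpu_monthly_usd: float,
                    ops_monthly_usd: float) -> float:
    """Self-hosting cost as a fraction of the current monthly API bill."""
    return (gpu_monthly_usd + ops_monthly_usd) / api_monthly_usd

# e.g. a $50k/month API bill vs $10k of rented H100 capacity
# plus $5k for the MoE ops team (all numbers assumed)
share = self_host_share(50_000, 10_000, 5_000)
print(f"self-hosting at ~{share:.0%} of API spend")  # → ~30%
```

Plugging in a company's real API bill, GPU costs, and staffing numbers turns the article's headline percentage into a concrete go/no‑go threshold.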