I played with GPT OSS
I spent the last 24 hours playing with OpenAI's GPT-OSS models. Downloaded them, broke them, fixed them, and tried to understand why everyone's losing their minds over this release.
First, the numbers everyone's throwing around: 120 billion parameters running on a single GPU.
Here's why that's insane: Previous models of this size needed a small server farm. This runs on one graphics card. It's like fitting a commercial jet engine into a Honda City and somehow it works.
Metaphorically,
Imagine you're running a massive company with 120 billion employees (stay with me). Traditional AI models are like making EVERY employee come to EVERY meeting. Chaos. Expensive. Slow.
GPT-OSS uses something called Mixture of Experts (MoE). It's like having specialized departments:
Marketing experts
Legal experts
Engineering experts
Customer service experts
When someone asks about refund policy, why wake up the engineering team? Only customer service experts activate. Out of 120 billion "employees" (parameters), only 5 billion actually work on any given question.
Technically,
Both gpt-oss:20b and gpt-oss:120b use mixture-of-experts (MoE). Think of it as having specialized sub-models where only the relevant experts wake up for each token.
The 120B model has 117B total parameters but only activates 5.1B per token. The 20B? Just 3.6B active parameters. This is clever engineering. It's what makes a 120-billion-parameter model fit on a single H100.
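To make the routing idea concrete, here's a toy sketch in Python. This is not the actual GPT-OSS router; the expert count, dimensions, and scoring are made up purely to show the shape of the trick: a small router scores the experts, only the top few run, and the rest stay idle.

import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    # x: hidden state for one token; experts: one weight matrix per expert.
    scores = x @ router_w                      # router gives each expert a score for this token
    top = np.argsort(scores)[-top_k:]          # keep only the k best-matching experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                   # softmax over the chosen experts only
    # Only the selected experts do any work; everyone else stays idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts)) * 0.1
out = moe_layer(rng.standard_normal(d), experts, router_w)  # 2 of 8 experts ran, 6 never touched

Scale the same trick up and you get 117B total parameters with only about 5B doing work on any given token.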
The witchcraft
Made it small
They shrank the model to roughly a quarter of its original 16-bit size using something called MXFP4 quantization, which stores each weight in about 4 bits.
Think of it like JPEG compression for AI:
Original photo: 100MB, perfect quality
JPEG: 2MB, 95% as good
You can't tell the difference unless you zoom way in. They compressed the 120B model down to about 80GB, small enough to fit in a single H100's memory.
The quality loss? About 2%. That's the difference between getting a 98% vs 100% on a test. Nobody cares.
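Here's a simplified sketch of the idea. This is generic block-scaled 4-bit quantization, not the exact MXFP4 format (which uses its own 4-bit floating-point encoding), but the principle is the same: weights are grouped into small blocks, each block shares one scale, and each individual weight gets squeezed into 4 bits.

import numpy as np

def quantize_4bit_blocks(w, block=32):
    # Group weights into blocks, store one scale per block
    # and a 4-bit value (here: an integer in [-8, 7]) per weight.
    w = w.reshape(-1, block)
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-12)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, scale = quantize_4bit_blocks(w)
print(np.abs(w - dequantize(q, scale)).max())  # small but not zero: that's the "JPEG" loss
# Storage drops from 16 bits per weight to ~4 bits plus one shared scale per block.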
The Attention Revolution (Why It Handles 128K Tokens)
If you don't already know: in transformer models like these, "attention" is how the model decides which words to focus on while processing each word. When the model reads "The cat sat on the mat" and is processing "sat," it needs to figure out that "cat" (the subject) matters more than "the" (just an article). Mathematically, it computes relevance scores between every token and every other token. For 1,000 tokens, that's 1,000,000 comparisons. For 128K tokens? Roughly 16 billion comparisons. Your GPU just cried.
This part broke my brain until I understood it.
The GPT OSS models alternate between two ways of reading text:
Dense attention: Read every single word carefully (slow but thorough)
Sparse attention: Skim for important parts (fast but might miss details)
Some layers look at everything, others focus locally. It's like having both a microscope and a telescope and switching between them depending on what needs to be seen. This is why it can handle 128,000 tokens (about 200 pages of text).
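A rough way to see the trade-off with NumPy boolean masks. The window size below is illustrative, not necessarily what GPT-OSS uses; the point is how much work a local window saves compared to looking at everything.

import numpy as np

def dense_causal_mask(n):
    # Dense attention: every token compares itself against every earlier token.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window=128):
    # Sparse/local attention: each token only looks back at the last `window` tokens.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

n = 4096
print(dense_causal_mask(n).sum())    # ~8.4 million comparisons
print(sliding_window_mask(n).sum())  # ~0.5 million comparisons
# At 128K tokens the dense count explodes into the billions, while the windowed
# count only grows linearly. Alternating the two layer types keeps both worlds.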
What I actually experienced running it
The 20B model on my MacBook:
Downloaded in 20 minutes (13GB file)
First run: Worked faster than I expected. No config hell.
Speed: About 15 tokens/second (fast enough for real-time chat). It was much faster on the 32GB system.
Memory usage on first token: 13GB
Quality: Indistinguishable from GPT-4 for generic tasks
The Hardware Reality Check
Tested on two systems:
M3 Pro with 18GB RAM: Sluggish. Constant memory swapping. Basically unusable. Perhaps because it had only 2GB of headroom above the model's minimum requirements
M2 Pro with 32GB RAM: Smooth as butter.
Ironically, my newer M3 Pro is the more capable chip on paper: better Neural Engine performance, more efficient memory bandwidth, and improved GPU cores. For this workload, RAM capacity mattered more.
The "Holy Shit" Moments:
It runs offline. Airplane mode. No internet. Still works. Your data never leaves your machine.
It's uncensored: No "I can't do that, Harsh" responses. There are, however, some filters baked into the model itself.
Custom tools work: You can teach it to use your internal APIs (see the sketch after this list)
You can see it thinking. With the harmony format, you actually see the reasoning steps. It's like watching someone work through a problem on a whiteboard.
Fine-tuning is surgical. You can teach the coding expert new languages while the writing expert stays unchanged. It's like upgrading individual apps instead of the whole OS.
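For instance, here's roughly what the custom-tools point looks like with the ollama Python client's tool-calling support. The refund-lookup function and its schema are invented for illustration; the flow is what matters.

import ollama

# A made-up internal API you want the model to be able to call.
def lookup_refund_status(order_id: str) -> str:
    return f"Order {order_id}: refund approved, arriving in 3-5 business days."

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_refund_status",
        "description": "Look up the refund status for an order",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Where is the refund for order 4412?"}],
    tools=tools,
)
# If the model decides to call the tool, the call appears in the returned message;
# you run the function yourself and feed the result back as a follow-up message.
print(response["message"])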
The good, the bad and the ugly
The Good:
1. Setup is surprisingly painless
2. Performance is legitimately impressive
3. The Apache 2.0 license means you can do whatever
The Bad:
1. When it hallucinates, you can't blame OpenAI
2. You need to buy or rent real hardware to run it in production
3. Running it is one thing; operating it in production is a different ball game
4. Fine-tuning comes with a reasonable learning curve
The Ugly:
1. GPU availability is about to get worse
2. Your competitors already have it running
3. The model occasionally gets... creative
4. No safety rails unless you build them :)
The Fastest Path to Running It
MacBook Users
brew install ollama
ollama run gpt-oss:20b
NVIDIA GPU Users (RTX 4090/3090)
curl -fsSL https://ollama.com/install.sh | sh
ollama run gpt-oss:20b
Cloud Option
Rent an H100 instance on Lambda Labs ($2/hour)
Run the 120b model at full speed
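Once it's running locally, hitting it from Python takes a few lines (assuming the default Ollama server on localhost and the ollama package installed via pip):

import ollama

# Talks to the local Ollama server; nothing leaves your machine.
response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
)
print(response["message"]["content"])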
There's more that I tried with GPT-OSS: comparisons with Qwen, fine-tuning, and more. Will share soon.
The uncomfortable truth
This changes everything. Not in the "revolutionary app" way that VCs love, but in the "electricity just became free" way that reshapes industries.