RTX 5090 32GB AI LLM Performance Guide: 2026 Benchmarks

The RTX 5090 has been on shelves for over a year now, and the local LLM scene around it has changed more in the last six weeks than in the six months before. Qwen 3.6 dropped on April 22. DeepSeek V4 followed two days later. Gemma 4 had already shipped at the start of the month. So if you’re reading 2025 benchmarks of this card and trying to map them to today, you’re working with a stale picture.
We’ve spent the last two weeks running these new model families through our 5090 to see what actually changes. Some of it confirms what you’d expect from the bandwidth-bound math. Some of it doesn’t – particularly around DeepSeek V4, which a lot of buyer’s guides quietly avoid mentioning when they recommend a 32 GB card.
This is what the RTX 5090 looks like in May 2026 – what it can run, what it can’t, and where the 32 GB ceiling actually starts to bite.
TL;DR
The RTX 5090 remains the strongest single-card option for local inference under $5,000 in May 2026, and it handles all four Qwen 3.6 and Gemma 4 open-weight models we tested comfortably at Q4. It will not run any DeepSeek V4 variant, full stop – even V4-Flash needs ~142 GB at Q4. For coding agents and long-context work on a single GPU, it’s still the right buy. For frontier open-weight reasoning, you need something else.
The rig and how we tested
Our test bench for everything that follows:
- CPU: AMD Ryzen 9 9950X3D
- RAM: 64 GB DDR5-6000
- GPU: NVIDIA RTX 5090 (32 GB GDDR7, 1,792 GB/s, 575W TDP)
- OS: Tested on both Windows 11 24H2 and CachyOS (Arch-based, kernel 6.14)
- Driver: NVIDIA 580.x branch, CUDA 13.0
- Inference stack: llama.cpp built from master (early May 2026), Ollama 0.7.0, vLLM 0.9.x for batched tests
All tokens-per-second numbers below are single-stream, batch size 1, with full GPU offload (-ngl 99) unless noted. We ran llama-bench with -fa 1 (flash attention on) and prompt-processing depths from 4K up to whatever the card could fit. Where we cite published numbers from other testers, we mark them clearly – we didn’t pretend to redo every benchmark from scratch in two weeks.
Two honest caveats. First, Qwen 3.6 GGUFs are roughly two weeks old as we write this, and llama.cpp’s kernels for the architecture are still being tuned. Numbers should improve over the next month or two as the ecosystem catches up. Second, we ran the long-context tests on Linux – Windows results are typically 3-7% slower in our experience due to driver overhead and WDDM scheduling, so if you’re on Windows, shave a bit off everything.
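For anyone who wants to reproduce a sweep like this, here is a minimal sketch that assembles a llama-bench command line with the settings described above (full offload, flash attention on, several context depths). The model path is a placeholder, the prompt and generation lengths are our assumption rather than the exact values used for the tables, and the -d depth flag is assumed to exist in your build, so check llama-bench --help before relying on it.

```python
import shlex

def llama_bench_argv(model_path, depths, n_prompt=512, n_gen=128):
    """Build a llama-bench argument list for a sweep over context depths."""
    return [
        "llama-bench",
        "-m", model_path,                        # GGUF file to benchmark (placeholder path)
        "-ngl", "99",                            # offload all layers to the GPU
        "-fa", "1",                              # flash attention on
        "-p", str(n_prompt),                     # prompt-processing test length (assumed)
        "-n", str(n_gen),                        # token-generation test length (assumed)
        "-d", ",".join(str(d) for d in depths),  # context depths to pre-fill (assumed flag)
    ]

argv = llama_bench_argv("models/example-q4_k_m.gguf", [4096, 16384, 32768])
print(shlex.join(argv))  # copy-pasteable shell command
```

Building the argument list in code rather than by hand makes it easy to loop over models and keep every run on identical flags.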
Qwen 3.6 27B Dense – the new single-card sweet spot
Qwen 3.6-27B is the dense flagship of the open-weight 3.6 family. It’s positioned as a coding-focused model with strong agentic behavior, 262K native context (extensible to 1M with YaRN), and the new “thinking preservation” feature for multi-turn tool use. Qwen’s published Terminal-Bench 2.0 score for this size class is genuinely impressive – they claim it edges out Claude Opus 4.5 on agentic coding, though we’ll take that with the usual lab-self-report grain of salt.
At Q4_K_M, the model weights come in around 16.5 GB. With a 32K context KV-cache (about 2.5 GB) and llama.cpp overhead, we sat at 21-22 GB of VRAM used – leaving ample headroom for KV cache extension or a draft model.
| Context depth | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
| 4,096 | ~5,200 | 87.4 |
| 16,384 | ~3,800 | 81.2 |
| 32,768 | ~2,600 | 74.6 |
| 65,536 | ~1,650 | 62.1 |
| 131,072 | ~890 | 48.3 |
For reference: Hardware Corner’s November 2025 test of the older Qwen3 32B dense reported around 52 t/s peak on the same card. The new 27B is smaller and architecturally tweaked, so the ~75-87 t/s range we saw at moderate context is roughly where the math says it should land. The theoretical bandwidth ceiling for a Q4 27B dense is ~99 t/s (1,792 GB/s ÷ ~18 GB read per token), so our 4K result of 87.4 t/s is about 88% of that ceiling and the 32K result about 75%.
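The bandwidth-ceiling arithmetic is easy to redo for any card or model. A minimal sketch, using the 1,792 GB/s and ~18 GB-per-token figures quoted above; the function name is ours, and the model treats decoding as purely memory-bound, which is a simplification:

```python
def decode_ceiling_tps(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tokens/s if every token must stream this many GB from VRAM."""
    return bandwidth_gb_s / gb_read_per_token

ceiling = decode_ceiling_tps(1792, 18)  # RTX 5090 bandwidth, ~18 GB read per token
measured_4k = 87.4                      # token generation at 4K depth, from the table
efficiency = measured_4k / ceiling

print(f"ceiling ~{ceiling:.1f} t/s, measured is {efficiency:.0%} of it")
```

Swap in another card’s bandwidth or another model’s per-token read size to get a quick sanity check on any published number.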
What this means in practice: 87 t/s feels great for chat. For agentic loops where you’re chewing through tool calls, prompt processing matters more than generation – and 5,200 t/s prefill at 4K context is what makes Qwen 3.6 usable as a local code agent. We’ve been running it as a Claude Code substitute on smaller projects and the responsiveness is the difference between “this works” and “I’ll just open the API tab.”
Qwen 3.6 35B-A3B MoE – the model that flies
If the 27B dense is the practical pick, the 35B-A3B MoE is the model that makes the 5090 feel exotic. 35 billion total parameters, but only ~3 billion active per token. The architecture is a sparse MoE with 128 experts, of which a small subset fires on each forward pass. The result: token generation that runs as if you were on a 3B model, with quality that lands much closer to the 27B dense.
At Q4_K_XL, weights come in around 19.8 GB. Add KV cache and you’re using 22-24 GB at 32K context – still comfortable headroom on a 32 GB card. And the speed is what you’d expect from the bandwidth math: only ~2 GB of weights are read per token, so the ceiling is enormous.
| Context depth | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
| 4,096 | ~7,800 | 198.5 |
| 16,384 | ~5,100 | 164.0 |
| 32,768 | ~3,400 | 121.7 |
| 65,536 | ~2,000 | 89.4 |
| 131,072 | ~1,050 | 62.0 |
198 t/s at 4K context is genuinely silly for what is, in capability terms, a 30B-class model. For comparison, the older Qwen3 30B-A3B was clocked at 234 t/s on the same card back in November 2025 – the roughly 15% slowdown on 3.6 is partly because llama.cpp kernels haven’t fully caught up to the new MoE routing yet. We expect this to climb back toward 220+ t/s within a month.
Honestly, this is the model we’ve been keeping loaded by default. The MoE pattern is well-suited to the way local LLM use actually looks: bursty, conversational, lots of short turns. You’re rarely waiting on it. And when you do hit a long-context query – say, dumping 80K tokens of a codebase – the 5090’s bandwidth keeps generation above 80 t/s, which is faster than most cloud APIs feel.
Gemma 4 31B Dense – Google’s quiet contender
Gemma 4 didn’t get the launch hype Qwen 3.6 or DeepSeek V4 did, but it’s the open model release that surprised us most this spring. Google shipped four sizes under Apache 2.0 – a meaningful change from earlier Gemma’s custom license that kept it out of plenty of enterprise stacks. The 31B dense variant is the one that lands squarely in 5090 territory.
The architectural pitch: dense, multimodal (text + image + video), 256K context, native function-calling and structured output. The benchmark numbers are aggressive – Artificial Analysis ranks Gemma 4 31B at 39 on its Intelligence Index, just below Qwen 3.5 27B Reasoning at 42. AIME 2026 jumped from Gemma 3 27B’s 20.8% to 89.2% on Gemma 4 31B, which is a generational leap on that math benchmark.
At Q4_K_M the model weights are about 18.5 GB. KV cache scaling on Gemma 4 is well-behaved thanks to the GQA implementation – we hit 24K context at well under 30 GB total VRAM.
| Context depth | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
| 4,096 | ~4,700 | 78.9 |
| 16,384 | ~3,400 | 72.4 |
| 32,768 | ~2,300 | 65.2 |
| 65,536 | ~1,400 | 54.1 |
| 131,072 | ~720 | 40.3 |
About 10% slower than Qwen 3.6 27B dense at short context, widening to roughly 17% at 128K, which makes sense – it’s a slightly larger model. But the quality-per-token feels different. Gemma 4 is markedly better at instruction following and at structured outputs in our testing. We’ve been routing data extraction tasks through it and seeing fewer JSON parsing errors than with Qwen. For pure code, Qwen 3.6 still wins. For agentic glue work where reliability matters more than raw smarts, Gemma 4 31B is our pick.
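To see how the gap between the two dense models moves with context, the token-generation columns of the two tables can be compared directly. A small sketch, with the numbers copied from the tables in this article and a helper name of our own:

```python
# Token-generation rates (t/s) copied from the two dense-model tables above.
depths = [4096, 16384, 32768, 65536, 131072]
qwen_27b = [87.4, 81.2, 74.6, 62.1, 48.3]
gemma_31b = [78.9, 72.4, 65.2, 54.1, 40.3]

def slowdown_pct(baseline, other):
    """How much slower `other` is than `baseline`, in percent."""
    return (1 - other / baseline) * 100

gaps = [slowdown_pct(q, g) for q, g in zip(qwen_27b, gemma_31b)]
for depth, gap in zip(depths, gaps):
    print(f"{depth:>7} tokens: Gemma 4 31B is {gap:.1f}% slower")
```

The gap is not constant: it grows with depth, which is worth keeping in mind if most of your work happens at long context.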
Multimodal works on the 5090 with the right llama.cpp build (you need the --mmproj flag and the corresponding vision tower file from Hugging Face). Image understanding adds maybe 2-3 GB of VRAM transient and slows prefill modestly. We didn’t test video heavily – that’s a separate piece.
Gemma 4 26B-A4B – Apache 2.0 and absurdly fast
The 26B-A4B is Gemma 4’s MoE variant: 26 billion total parameters, only 3.8 billion active per token. It’s a direct architectural answer to Qwen’s MoE, and the design point is the same – get most of the quality of the dense model at a fraction of the inference cost.
At Q4 it’s about 14.5 GB on disk. With KV cache room to spare, you could run this on a 16 GB card if you wanted – but on the 5090 you can crank the context to absurd lengths without breaking a sweat.
| Context depth | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
| 4,096 | ~9,200 | 241.6 |
| 16,384 | ~6,400 | 205.3 |
| 32,768 | ~4,300 | 168.0 |
| 65,536 | ~2,500 | 119.2 |
| 131,072 | ~1,300 | 78.5 |
241 t/s at 4K. That’s not a typo. The MoE routing keeps active weight reads tiny, and the 5090’s bandwidth does the rest. For interactive use this feels instantaneous – you’re typing slower than the model is generating. We’ve stress-tested it for hours of agentic loops and the card stays at around 78°C with the stock fan curve, which is comfortably below the thermal throttle point. More on power and heat later.
If you’re building a local coding setup and want raw speed for completions and refactoring, this is the configuration to start with. The quality gap to Qwen 3.6 35B-A3B is real but smaller than you’d think, and 26B-A4B’s instruction-following advantage shows up in tool-calling reliability.
DeepSeek V4 – the reality check
Here’s where we have to be direct, because most of the launch coverage of DeepSeek V4 didn’t bother to be.
DeepSeek V4 dropped on April 24 in two flavors: V4-Pro at 1.6T total parameters / 49B active, and V4-Flash at 284B total / 13B active. Both are MIT-licensed open weights. Both have 1M context windows. The benchmark numbers are real – V4-Pro scored within a couple of points of Claude Opus 4.6 on SWE-bench Verified. Genuinely frontier-class for an open release.
Neither will run on your RTX 5090.
The math is pretty unforgiving. V4-Flash at Q4 is roughly 142 GB on disk before you account for KV cache or activation memory. V4-Pro at Q4 is around 800 GB. To run V4-Flash you need either a multi-card NVIDIA workstation in the RTX PRO 6000 / H200 class, or aggressive CPU offload at speeds that will make you miserable. To run V4-Pro you need an actual server.
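The size figures follow from simple arithmetic on the parameter counts. A rough sketch, assuming exactly 4 bits per weight (real Q4 GGUF quants carry some extra scale data, so actual files run a little larger); the function name is ours:

```python
def q4_weight_gb(total_params_billions, bits_per_weight=4.0):
    """Approximate weight size in GB: parameters x bits / 8, ignoring metadata."""
    return total_params_billions * bits_per_weight / 8

VRAM_GB = 32  # RTX 5090
for name, params_b in [("V4-Flash", 284), ("V4-Pro", 1600)]:
    size = q4_weight_gb(params_b)
    print(f"{name}: ~{size:.0f} GB of weights, {size / VRAM_GB:.1f}x the card's VRAM")
```

Note that total parameters set the memory bill here, not active parameters: an MoE model only reads a few experts per token, but all of them have to live somewhere.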
We tried. We loaded V4-Flash with -ngl 0 (no layers offloaded to the GPU) and let llama.cpp page the weights through the 64 GB of system RAM and the SSD. Got 0.6 tokens per second. Not a typo, not the wrong unit – point six. It works in the technical sense that text comes out. It works in no useful sense.
So if you saw a “best GPUs for DeepSeek V4” post that recommended an RTX 5090 – that author either didn’t try it or didn’t care to mention the result. The honest answer in May 2026 is: if you want frontier open-weight reasoning locally, the 5090 isn’t the card for it. You either step up to RTX PRO 6000-class hardware (96 GB per card, so V4-Flash at Q4 still needs two), build some other multi-card rig, or rent. For everything else – Qwen 3.6, Gemma 4, the smaller DeepSeek distills, and whatever 30-40B-class model lands next – the 5090 is still the right card. Just be clear about which problem you’re solving.
How far does 32 GB take you on context length?
One of the things the 5090 does that earlier 24 GB cards don’t is sustain genuinely long contexts in pure VRAM, without falling back to system memory. We pushed each of our four runnable models to its practical limit on the card:
| Model | Max context fit (Q4) | VRAM at max | TG at max |
|---|---|---|---|
| Qwen 3.6 27B Dense | ~180K | 30.8 GB | 38 t/s |
| Qwen 3.6 35B-A3B MoE | ~210K | 31.2 GB | 51 t/s |
| Gemma 4 31B Dense | ~140K | 30.5 GB | 34 t/s |
| Gemma 4 26B-A4B MoE | ~262K (full native) | 29.7 GB | 62 t/s |
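The MoE models stretch further, but not because of their small active-parameter count: every expert still has to sit in VRAM, so sparsity helps generation speed rather than memory. What matters is how much of the 32 GB is left for KV cache, which dominates the budget at these depths. Gemma 4 26B-A4B gets there with a small weight file, and Qwen 3.6 35B-A3B, judging by our VRAM numbers, with a lighter per-token cache. The Gemma 4 26B-A4B was the only model where we could fit the full 262K native context with room left over. For most realistic agent workloads we don’t recommend pushing past 100-130K – quality degradation past that is real even when the card can hold the cache.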
Worth noting: KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0) cuts KV memory roughly in half at minor quality cost and lets you push these numbers another 30-50%. We didn’t include those runs in the table to keep apples to apples, but it’s a cheap win if you need it.
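Where the “roughly in half” comes from can be sketched with the standard KV cache size formula for a grouped-query-attention transformer. The layer count, KV head count, and head dimension below are hypothetical values chosen for illustration, not the real shape of any model in this article; the q8_0 cost per element reflects its block layout of 32 int8 values plus one fp16 scale.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 block: 32 int8 values + one fp16 scale = 34 bytes

# Hypothetical GQA model shape, for illustration only.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)

f16 = kv_bytes_per_token(**shape, bytes_per_elem=F16_BYTES)
q8 = kv_bytes_per_token(**shape, bytes_per_elem=Q8_0_BYTES)

print(f"f16 : {f16 / 1024:.0f} KiB per token")
print(f"q8_0: {q8 / 1024:.0f} KiB per token ({q8 / f16:.0%} of f16)")
```

The cache itself shrinks to a bit over half, but compute buffers and weights do not, and models with non-standard attention layouts will deviate from this formula, so the usable context gain is smaller than the raw ratio suggests.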
FP4 native – useful today, or hype?
One of Blackwell’s marquee features is native FP4 – 4-bit floating point support in the 5th-gen Tensor Cores. NVIDIA quotes 3,352 TOPS at FP4 with sparsity for the 5090, roughly double the FP8 figure. On paper, this should mean inference at FP4 runs significantly faster than at typical Q4 GGUF quantization.
In practice, in May 2026, FP4 is mostly potential. vLLM and TensorRT-LLM have FP4 paths for some Llama and Qwen variants but the GGUF ecosystem most local users live in (llama.cpp, Ollama, LM Studio) hasn’t broadly integrated FP4 kernels yet. Where we tested FP4 – through TensorRT-LLM with a custom Qwen 3.6 27B build – we saw about a 1.4x throughput improvement over Q4_K_M at small batch sizes. Real, but not the 2x the marketing numbers might suggest, and the toolchain friction was significant.
We expect this to change in the next 3-6 months as kernel support lands more broadly. If you’re buying a 5090 today specifically for FP4 throughput, set your expectations accordingly: the hardware is ready, the software still has miles to go.
Power, heat, and the 575W reality
The 5090 pulls a lot. NVIDIA rates it at 575W TDP, and under sustained inference loads we measured 510-540W at the wall (less than the rated peak – autoregressive decoding doesn’t max the GPU compute the way training does). Prompt processing spikes higher; we’ve seen brief 565W transients at 4K prefill on Qwen 3.6 35B-A3B.
Thermals on our build (Lian Li O11 Vision, three top exhaust 140s, two intake 120s) settled around 76-79°C on the GPU core during continuous inference. Hotspot was 86-89°C. The Founders Edition cooler is genuinely fine for this – we tested both an FE and an MSI Suprim and the temperature delta in our case was 4-5°C, with the Suprim winning marginally.
One thing worth saying clearly: if you’re using this card while gaming on the same machine, you’ll need to be mindful about loading inference models in the background. We’ve covered our local AI infrastructure setup with Ollama and Docker in another piece – VRAM swap-out behavior matters when you alt-tab from Path of Exile 2 to a code agent. The 5090’s 32 GB makes this less painful than a 4090’s 24 GB, but it’s not zero pain.
Should you buy one in May 2026?
Pricing on the 5090 has not improved much since launch. Founders Edition is intermittently available around the $1,999 MSRP. AIB cards are at $2,400-$3,500 depending on model and how patient you are. The DRAM shortage that hit late 2025 has eased somewhat but supply is still tighter than NVIDIA’s 4090 generation was at the same point in its life. In the EU we’re seeing roughly €2,300-€3,200 for AIB cards as of early May.
The buy/skip decision comes down to which model class you actually want to run. Here’s our honest take:
Buy the 5090 if: you want a single-card setup that runs every relevant open-weight model up to 35-40B comfortably, you do agentic coding work locally, you care about long contexts (130K+), and you can absorb the $2,500-3,000 reality of getting one. The MoE models specifically – Qwen 3.6 35B-A3B and Gemma 4 26B-A4B – make this card feel substantially future-proof in a way the 4090 didn’t a year ago. Our two weeks with these models have been the happiest we’ve been with local inference, full stop.
Skip it if: your real goal is running DeepSeek V4 or other 200B+ open models locally. The 5090 doesn’t get you there. You either need an RTX PRO 6000 (96 GB, ~$8,500 retail), a dual-card setup with the PCIe-bottleneck caveats that implies, or a Mac Studio M3 Ultra with 256 GB unified memory. None of those are the same conversation.
Wait if: you’re not in a rush and can stomach reading benchmarks for another six months or more. NVIDIA’s RTX 6090 – if it follows the typical two-year cadence – would land in early 2027, and given the momentum around FP4 and Blackwell-Next, the next consumer card is likely to bring 48 GB at the high end. If you’re sitting on a 4090 that’s working for you, it’s working for you. The 4090’s bandwidth wall is real but it’s not catastrophic for everything below 30B.
For our use case – running Qwen 3.6, Gemma 4, switching between coding agent and chat, occasional long-document summarization – the 5090 is the card we’d buy again. It’s not the card we’d recommend to someone whose primary interest is the absolute frontier of open-weight quality. That distinction matters more in May 2026 than it did six months ago, because the gap between “the best 32 GB can run” and “what a state-of-the-art open-weight model actually is” has gotten genuinely large.
We’ll be revisiting these numbers as llama.cpp’s Qwen 3.6 and Gemma 4 kernels mature, and as FP4 support broadens. Expect another data point in July.
By quantized.fyi editorial · Tested May 2026