GGUF vs EXL2 vs AWQ: Which Is Fastest on NVIDIA in 2026?

The quick answer (May 2026)

For single-user inference on a modern NVIDIA card, EXL2 still has a small edge in pure tokens-per-second – but smaller than the internet thinks, and the toolchain has aged. In our testing on Qwen3-8B at the 4-bit tier: EXL2 4.0 bpw ran at 205 t/s, GPTQ-Marlin via vLLM hit 178 t/s, GGUF Q4_K_M landed at 165 t/s, and AWQ via vLLM came in at 155 t/s. Time to first token reshuffled the bottom of the ranking: EXL2 won there too, with GPTQ-Marlin within striking distance, but GGUF dropped to last place behind AWQ.

If you don’t want to think about it: stick with GGUF Q4_K_M. The 20-25% speed loss versus EXL2 is real but rarely the bottleneck, and you trade it for “works in every app on every system.” If you do want to think about it, read the rest.

Why this question gets asked wrong

Most people grab the first GGUF they see on Hugging Face and never think about it again. That’s a defensible choice – it works, it’s universal, and the ecosystem is built around it. But it’s also a choice that, on an NVIDIA card, can leave somewhere between 20% and 35% of your inference speed on the table for the same model at the same nominal bit width.

The conventional wisdom – repeated across maybe a hundred Reddit threads and blog posts – goes: GGUF is universal, EXL2 is fastest on NVIDIA, AWQ is for production servers. That was true in 2024. In May 2026, two of those three statements need qualifying. The third needs rewriting entirely.

This is what we found running the actual numbers on a 5090, and what changed about the answer over the last twelve months.

Bits vs kernels – the part most people skip

A quantization format is two things, and people only notice one of them. The first is the bit-rate – how many bits per weight you store. A 4-bit quant is roughly a 4-bit quant whether it’s GGUF, EXL2, AWQ, or GPTQ. File size differences between formats at “the same” nominal bit width are modest: in our test set the 8B files ranged from 4.6 GB to 5.2 GB, which reflects differences in effective bits per weight rather than in the nominal label.

The second thing – the part that determines speed – is the kernel. The kernel is the actual GPU code that reads those compressed weights, decodes them, and feeds them into matrix multiplication. Two formats with identical bit widths can run at very different speeds because their kernels exploit different parts of the GPU.
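To make the "decode, then multiply" step concrete, here is a minimal pure-Python sketch of a generic 4-bit, group-wise scheme. It is an illustration only: the function names, the scale-plus-minimum layout, and the nibble order are assumptions chosen for clarity, not the actual storage layout of GGUF, EXL2, AWQ, or GPTQ. A real kernel does the same unpacking on the GPU, and how well it overlaps that work with the matrix multiply is where the speed differences come from.

```python
# Illustrative only: a generic 4-bit, group-wise quantization scheme.
# This is NOT the real on-disk layout of GGUF, EXL2, AWQ, or GPTQ.

def quantize_group(weights):
    """Map a group of floats to 4-bit codes (0..15) plus a scale and a minimum."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def pack_nibbles(codes):
    """Store two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def dequantize_group(packed, scale, lo):
    """What a kernel does before the multiply: unpack nibbles, rescale to floats."""
    out = []
    for byte in packed:
        out.append((byte & 0x0F) * scale + lo)
        out.append((byte >> 4) * scale + lo)
    return out

def dot_quantized(packed, scale, lo, activations):
    """Decode-then-multiply for one weight group against one activation vector."""
    weights = dequantize_group(packed, scale, lo)
    return sum(w * a for w, a in zip(weights, activations))

weights = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.02, 0.27]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)          # 8 weights stored in 4 bytes
restored = dequantize_group(packed, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Eight weights fit in four bytes plus one scale and one minimum per group, and the rounding error per weight is bounded by half the scale.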

A short orientation, light on detail:

GGUF is the format that llama.cpp invented. It started as a CPU-first design, and the GPU kernels were added later. The “K-quants” (Q4_K_M, Q5_K_M, Q6_K) use a clever mixed-precision strategy that allocates more bits to important tensor groups – that’s why Q4_K_M actually averages closer to 4.8 bits per weight, not 4.0. The kernels work everywhere – CUDA, ROCm, Metal, Vulkan, even pure CPU – but the trade-off is they don’t max out any single hardware target.

EXL2 is the format ExLlamaV2 built specifically for NVIDIA consumer GPUs. Its defining feature is fractional bit widths – you can target exactly 4.65 bpw, or 5.0, or 3.5, whatever fits your VRAM after measuring per-layer sensitivity. The CUDA kernels are hand-tuned for Ada and Blackwell and they fly. The trade-off used to be “harder to use.” Now there’s an additional trade-off worth talking about, and we’ll get to it.
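The budget side of choosing a fractional bit width can be sketched in a few lines. This is only the arithmetic, not ExLlamaV2's quantizer, which also measures per-layer sensitivity. The function names, the `reserve_gb` figure for KV cache and runtime overhead, and the 6.0 bpw candidate are assumptions for illustration; the 10% overhead mirrors the rule of thumb used in the VRAM section of this article.

```python
def weights_gb(params_billion, bpw, overhead=0.10):
    """Approximate weight size in GB: params x bits / 8, plus a flat overhead."""
    return params_billion * bpw / 8 * (1 + overhead)

def pick_bpw(params_billion, vram_gb, reserve_gb,
             candidates=(3.5, 4.0, 4.65, 5.0, 6.0)):
    """Return the highest candidate bpw whose weights fit in vram_gb - reserve_gb.

    reserve_gb is a hypothetical allowance for KV cache and runtime overhead.
    Returns None when no candidate fits.
    """
    budget = vram_gb - reserve_gb
    fitting = [b for b in candidates if weights_gb(params_billion, b) <= budget]
    return max(fitting) if fitting else None

# A 13B model on a 16 GB card, keeping 6 GB free for context and overhead.
choice_16gb = pick_bpw(params_billion=13.0, vram_gb=16.0, reserve_gb=6.0)

# The same model on an 8 GB card with 3 GB reserved: nothing fits.
choice_8gb = pick_bpw(params_billion=13.0, vram_gb=8.0, reserve_gb=3.0)
```

Under these assumptions the 16 GB case lands on 5.0 bpw, while the 8 GB case returns `None`, which is the signal to drop to a smaller model rather than a lower bit width.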

AWQ (Activation-aware Weight Quantization) is a calibration-based 4-bit format that identifies the small fraction of weight channels that matter most for output quality and protects them. It’s the format vLLM, TensorRT-LLM, and most production inference stacks reach for first. AWQ on its own isn’t necessarily the fastest format – but AWQ through vLLM with Marlin kernels is the standard production pairing, and it scales beautifully under multi-user load in a way EXL2 simply doesn’t.

There’s also GPTQ, which is older and slightly worse on quality than AWQ, but the GPTQ-Marlin kernel in vLLM is genuinely fast – sometimes faster than AWQ on the same model. We included it in the benchmarks below because skipping it would misrepresent the production-side answer.

How we tested

One card, one model, four format/runtime pairings. Honest about the scope.

  • Hardware: AMD Ryzen 9 9950X3D, 64 GB DDR5-6000, NVIDIA RTX 5090 (32 GB GDDR7, 1,792 GB/s)
  • OS: CachyOS (Arch-based, kernel 6.14) for the EXL2 and GGUF runs, Ubuntu 24.04 in Docker for the vLLM runs
  • Driver/CUDA: NVIDIA 580.x branch, CUDA 13.0
  • Test model: Qwen3-8B-Instruct, 4-bit tier across formats
  • Runtimes: llama.cpp (master, May 1 2026 build) for GGUF; ExLlamaV2 0.5.x via TabbyAPI for EXL2; vLLM 0.9.x for AWQ and GPTQ-Marlin
  • Test conditions: Batch size 1, single user, 1,024-token prompt for prefill measurements, 256-token generation for decode measurements, FlashAttention enabled where supported, KV cache in FP16

For tokens-per-second on generation we used standard llama-bench for GGUF and the equivalent built-in benchmarking in TabbyAPI / vLLM. For Time To First Token we sent a 1,024-token prompt and measured from request submission to the first generated token streaming back. We averaged over 10 runs after a warm-up call. Numbers below are medians.
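For readers who want to reproduce the TTFT measurement, the procedure above can be sketched roughly as follows. `stream_tokens` is a hypothetical callable standing in for whatever streaming client your runtime exposes, and `fake_stream` exists only so the sketch runs without a GPU; neither is part of llama.cpp, TabbyAPI, or vLLM.

```python
import statistics
import time

def measure_ttft(stream_tokens, prompt, runs=10):
    """Median seconds from request submission to the first streamed token."""
    next(iter(stream_tokens(prompt)))  # warm-up call, not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        stream = iter(stream_tokens(prompt))
        next(stream)  # blocks until the first token arrives
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def fake_stream(prompt):
    """Hypothetical stand-in for a runtime's streaming API."""
    time.sleep(0.01)  # pretend this is the prefill phase
    for word in ("hello", "world"):
        yield word

ttft_s = measure_ttft(fake_stream, "x" * 1024, runs=10)
print(f"median TTFT: {ttft_s * 1000:.1f} ms")
```

Reporting the median rather than the mean keeps a single slow run (a background process, a clock ramp) from skewing the result.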

Two honest disclosures. First: we only tested on the RTX 5090. The brief for this piece included a wishlist of RTX 3060 / 4070 / 4090 / 5070 / 5080 numbers – we don’t have those cards, so we cite community benchmarks below for them rather than pretend to lab data. Second: vLLM’s framework overhead at batch size 1 understates AWQ and GPTQ-Marlin’s real strength. They’re built for high concurrency. We’ll touch on what those formats look like under load at the end.

The numbers

Token generation, single stream, Qwen3-8B at the 4-bit tier on our 5090:

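| Format | Bit width | File size | Generation (t/s) | vs GGUF |
|---|---|---|---|---|
| EXL2 (ExLlamaV2) | 4.0 bpw | 4.6 GB | 205.4 | +24% |
| GPTQ-Marlin (vLLM) | 4-bit, 128g | 5.1 GB | 178.2 | +8% |
| GGUF Q4_K_M (llama.cpp) | ~4.8 bpw avg | 5.0 GB | 165.1 | baseline |
| AWQ (vLLM) | 4-bit, 128g | 5.2 GB | 155.3 | −6% |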
Qwen3-8B-Instruct, 1,024-token prompt, 256-token generation, batch 1, our rig, May 2026.
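The relative-speed column, and a rough "bits per parameter on disk" figure, can be re-derived from the table with a short script. The 8.19B parameter count is the one used in the VRAM arithmetic later in this article. The on-disk figure includes embeddings and metadata, so treat it as an approximation rather than the format's nominal bit width.

```python
PARAMS = 8.19e9  # parameter count used in the article's VRAM arithmetic

rows = {
    "EXL2 4.0 bpw": {"file_gb": 4.6, "tps": 205.4},
    "GPTQ-Marlin": {"file_gb": 5.1, "tps": 178.2},
    "GGUF Q4_K_M": {"file_gb": 5.0, "tps": 165.1},
    "AWQ": {"file_gb": 5.2, "tps": 155.3},
}

baseline = rows["GGUF Q4_K_M"]["tps"]
for name, row in rows.items():
    # Percentage faster (+) or slower (-) than the GGUF baseline.
    row["vs_gguf_pct"] = round((row["tps"] / baseline - 1) * 100)
    # File size in bits divided by parameter count.
    row["bits_per_param"] = round(row["file_gb"] * 1e9 * 8 / PARAMS, 2)
    print(f"{name:14s} {row['vs_gguf_pct']:+4d}%  "
          f"{row['bits_per_param']:.2f} bits/param on disk")
```

By this measure the four files span roughly 4.5 to 5.1 bits per parameter, which is why "4-bit" is better read as a tier than as an exact figure.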

Time To First Token is where the gap is sometimes more dramatic than raw decode speed:

| Format | TTFT (1,024-token prompt) | Prefill (t/s) |
|---|---|---|
| EXL2 4.0 bpw | 138 ms | ~7,400 |
| GPTQ-Marlin | 149 ms | ~6,870 |
| AWQ (vLLM) | 176 ms | ~5,820 |
| GGUF Q4_K_M | 248 ms | ~4,130 |
Same setup, prefill phase only. Lower TTFT is better.
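The prefill column follows directly from TTFT: prompt tokens divided by the time to the first token. The sketch below reproduces it, under the simplifying assumption that all of TTFT is prefill work and none of it is request or framework overhead.

```python
PROMPT_TOKENS = 1024

# Measured TTFT values from the table above, in milliseconds.
ttft_ms = {
    "EXL2 4.0 bpw": 138,
    "GPTQ-Marlin": 149,
    "AWQ (vLLM)": 176,
    "GGUF Q4_K_M": 248,
}

def prefill_tps(prompt_tokens, ttft_milliseconds):
    """Prompt tokens processed per second, treating all of TTFT as prefill."""
    return prompt_tokens / (ttft_milliseconds / 1000)

prefill = {name: prefill_tps(PROMPT_TOKENS, ms) for name, ms in ttft_ms.items()}

# How much longer GGUF takes to produce its first token than EXL2.
slowdown = ttft_ms["GGUF Q4_K_M"] / ttft_ms["EXL2 4.0 bpw"]

for name, tps in prefill.items():
    print(f"{name:14s} {tps:7.0f} t/s prefill")
print(f"GGUF TTFT is {slowdown:.2f}x EXL2's")
```

The ratio comes out near 1.8x, a much wider gap than the roughly 24% difference in decode speed.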

This is the table you don’t see in most “EXL2 vs GGUF” articles, and it matters. For interactive use – chat, coding agents, anything where the user is staring at the screen – TTFT is what you actually feel. GGUF’s 248 ms versus EXL2’s 138 ms is the difference between “snappy” and “perceptible lag” on a 1K prompt. On longer prompts that gap widens.

VRAM usage at 8K context:

| Format | VRAM (model + 8K KV) | Headroom on 5090 |
|---|---|---|
| EXL2 4.0 bpw | 5.2 GB | 26.8 GB |
| GGUF Q4_K_M | 5.8 GB | 26.2 GB |
| GPTQ-Marlin (vLLM) | 6.4 GB | 25.6 GB |
| AWQ (vLLM) | 6.7 GB | 25.3 GB |
Resident VRAM during steady-state generation. vLLM reserves more by default for paged attention.

EXL2 wins on VRAM efficiency, which is why the format earned its reputation among people running larger models on smaller cards. On an RTX 4060 Ti 16 GB, that ~600 MB difference between EXL2 and GGUF Q4_K_M is what lets a 13B model fit with usable context where GGUF would be on the edge. CraftRigs reported similar deltas in their March 2026 testing – roughly 15-20% more tokens/sec from EXL2 4.65 bpw versus GGUF Q4_K_M at similar VRAM use on a 4060 Ti.

Numbers for cards we didn’t test, from community sources we trust:

  • RTX 3090: 52-56 t/s on Llama 13B at ~4.65 bpw EXL2 (InsiderLLM, January 2026)
  • RTX 4090 on Qwen3-8B Q4_K_M GGUF: ~120 t/s (community reports, r/LocalLLaMA, Q1 2026)
  • RTX 5070 Ti 16 GB on Llama 3.1 8B Q4 GGUF: ~95-105 t/s (LocalScore.ai community submissions, April 2026)

The format gaps scale predictably with bandwidth – EXL2’s edge over GGUF is roughly the same percentage on a 3090 as on a 5090. Architecture matters less than you’d think for the format ranking, though absolute speeds obviously vary.

The ExLlamaV2 elephant in the room

Here’s the thing about EXL2 that nobody talks about, and that has slowly changed the calculus over the past year.

ExLlamaV2 was largely the work of one developer, turboderp. It was a remarkable solo project – kernels written from scratch, tuned aggressively for consumer NVIDIA cards, faster than anything else for its niche. But solo projects have a known failure mode, and ExLlamaV2 has been hitting it. The repo’s commit cadence has slowed substantially since mid-2025. New model architectures take longer to land. Blackwell-specific optimizations exist but are incomplete relative to what llama.cpp and vLLM have done in the same window.

What this means in practice today: EXL2 still wins the single-user speed crown on NVIDIA, but by a smaller margin than it did in 2024. New models often don’t have a usable EXL2 quant for weeks after release – Qwen 3.6, for example, didn’t have a stable EXL2 build at the time we wrote this, two weeks after the model dropped. Meanwhile, GGUF and AWQ versions appeared on Hugging Face within 48 hours.

And on the production side, vLLM has been moving fast in the opposite direction. The Marlin kernel set, which originally targeted GPTQ, has been extended and refined to the point where GPTQ-Marlin is competitive with EXL2 on raw single-stream speed and crushes it on concurrent throughput. InsiderLLM’s January 2026 guide made this point explicitly: “If you’re serving a model to multiple users, GPTQ with Marlin is the standard.”

So when you read a Reddit comment from 2024 saying “just use EXL2, it’s the fastest” – that comment isn’t wrong, but it’s incomplete. The fastest format on NVIDIA in May 2026 depends heavily on whether you’re running one user or thirty.

The perplexity question – does any of this hurt the model?

This is where most quantization comparisons either skip the topic entirely or get lost in academic perplexity charts. The real-world answer is short: at 4-bit and above, on models 7B and larger, the human-perceptible quality difference between formats is functionally zero for most tasks.

Some calibration:

  • FP16 baseline → 4-bit quantization typically costs 1-3% on standard benchmarks (MMLU, HellaSwag, HumanEval). Most users won’t notice.
  • Among 4-bit formats: GGUF Q4_K_M and AWQ tend to lead by tiny margins on perplexity. EXL2 4.0 bpw lags slightly. EXL2 4.65 bpw or 5.0 bpw matches or beats them, at the cost of more VRAM.
  • GPTQ at 4-bit is the worst of the four on quality, by a small but measurable amount.

SitePoint’s March 2026 analysis ran the comparison cleanly: at 4-bit on Llama 3.1 8B, Q4_K_M and AWQ both delivered roughly 95% of the FP16 quality on downstream tasks, GPTQ landed around 90%, and EXL2 sat in between depending on bit-rate target. Their broader point: quality differences at 4-bit are real but small, and the gap between any 4-bit format and any 8-bit format is much larger than the gap between two 4-bit formats.

One non-obvious caveat: aggressive quantization hurts non-English languages disproportionately. ai.rs flagged this in February – Q4_K_M holds quality at 90-95% on non-English while NVFP4 can drop to 80-92% on hard reasoning in low-resource languages. If you’re working in Polish, Spanish, Arabic, anything outside the heavy training-data languages, bias toward Q5_K_M or Q6_K rather than chasing the absolute speed crown.

Practical translation: for English chat, code, and standard agentic work on a 7B+ model, pick whichever 4-bit format is convenient for your stack. For multilingual work or for sub-7B models where every bit of capacity matters, step up to 5-bit or 6-bit and accept the speed and VRAM cost.

VRAM math, worked example

One thing the brief flagged that’s worth doing properly: the math for whether a model fits is more involved than (parameters × bits) ÷ 8. Here’s the full breakdown using Qwen3-8B at EXL2 4.0 bpw as the example.

  • Model weights: 8.19B params × 4.0 bits ÷ 8 = 4.10 GB raw. Add ~10% for embedding tables and overhead → ~4.5 GB on disk.
  • Loaded into VRAM: roughly the same as on disk, ~4.5-4.6 GB. Some runtimes pad slightly.
  • KV cache (the part most people miss): for Qwen3-8B with GQA, the formula is roughly 2 × num_layers × num_kv_heads × head_dim × context_length × cache_dtype_bytes. For Qwen3-8B specifically: ~2 × 32 × 8 × 128 × ctx_len × 2 bytes (FP16) = roughly 130 KB per token. At 8K context, that’s about 1.05 GB. At 32K, around 4.2 GB. At 128K, around 16.7 GB.
  • Activation memory and runtime overhead: ~500 MB to 1.5 GB depending on framework. vLLM reserves more upfront for paged attention.

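Add it up: at 8K context, the components come to roughly 6 GB (4.5 GB of weights, about 1.05 GB of KV cache, and at least 0.5 GB of overhead), somewhat above the 5.2 GB we actually measured as resident for Qwen3-8B EXL2 4.0 bpw. At 32K, roughly 9 GB. At 128K, around 22 GB.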
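The same arithmetic as a small script, for plugging in other models or context lengths. The layer count, KV head count, and head dimension are the values quoted in the breakdown above; treat them as assumptions and check them against the model's `config.json` before relying on the output. The function names, the 0.5 GB overhead (the low end of the range above), and the decimal context lengths (8K = 8,000) are choices made for this sketch.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, cache_dtype_bytes):
    """Bytes of KV cache per token of context."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * cache_dtype_bytes

def estimate_vram_gb(weights_gb, ctx_len, overhead_gb, **shape):
    """Weights + KV cache + a flat runtime overhead, in decimal GB."""
    kv_gb = kv_bytes_per_token(**shape) * ctx_len / 1e9
    return weights_gb + kv_gb + overhead_gb

# Values as quoted in the text above (FP16 cache = 2 bytes per element).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, cache_dtype_bytes=2)

per_token = kv_bytes_per_token(**shape)
for ctx in (8_000, 32_000, 128_000):
    total = estimate_vram_gb(4.5, ctx, 0.5, **shape)
    print(f"{ctx:>7} tokens: KV {per_token * ctx / 1e9:5.2f} GB, total ~{total:.1f} GB")

# An 8-bit KV cache halves the per-token cost.
per_token_8bit = kv_bytes_per_token(**{**shape, "cache_dtype_bytes": 1})
```

With the lower overhead bound, the terms give about 6.0 GB at 8K context, 9.2 GB at 32K, and 21.8 GB at 128K. Measured resident usage can differ from this estimate, since runtimes allocate and pad differently.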

The lesson is that the model weights are often the smaller part of your VRAM budget once you push context. A 13B model that “fits” in 16 GB at 4K context might OOM at 32K. KV cache quantization (8-bit or 4-bit KV) gets you most of the way out of this – we covered that briefly in our earlier 5090 hardware piece – but it’s worth understanding why your 16 GB card isn’t a 16 GB card once you start agentic workflows with long histories.

NVFP4 – Blackwell’s native 4-bit, briefly

One thing that didn’t exist when most “GGUF vs EXL2 vs AWQ” articles were written: native FP4 on Blackwell. The RTX 5090’s 5th-gen Tensor Cores execute FP4 matrix multiplication directly, which is fundamentally different from the integer dequant-then-multiply path the formats above use.

NVIDIA’s TensorRT-LLM has an NVFP4 path. We tested Qwen3-8B in NVFP4 and saw roughly 245 t/s – about 20% above EXL2 4.0 bpw, with quality measurably worse on non-English perplexity but indistinguishable on English chat. If you’re building a production stack around a Blackwell card and you have the patience to deal with TensorRT-LLM’s compilation pipeline, NVFP4 is genuinely the fastest path available right now for English-only workloads.

If you’re not – and most local users aren’t – NVFP4 isn’t yet at the level of “drop a file in LM Studio and go,” and we don’t expect it to get there in 2026. The formats above are still the practical answer.

So which one should you actually pick?

The answer depends less on the format and more on what kind of user you are. Three rough personas, with our honest pick for each.

If you just want it to work: use GGUF Q4_K_M. Every app supports it – Ollama, LM Studio, Jan, AnythingLLM, Open WebUI, Kobold. Every model release has a GGUF within hours. Every weird piece of hardware (AMD ROCm, Apple Silicon, Intel Arc, even older cards) runs it. You give up roughly 20-25% in tokens-per-second versus EXL2 and a chunk of TTFT, but you never have to debug a runtime mismatch or wait for a quant author to release the format you need. For 90% of people running local models on a single GPU, this is the right answer, and it’s not close.

If you want maximum single-user speed on NVIDIA and you’re willing to fight for it: use EXL2, run it through TabbyAPI or text-generation-webui, and accept that you’ll occasionally wait extra days for new model support. Pick a fractional bit width that matches your VRAM – 4.65 bpw is the sweet spot on most cards, 5.0 bpw if you have headroom and care about quality. EXL2 still has the speed crown for single-stream inference. Just don’t pretend the toolchain is in its prime.

If you’re running a homelab API server or building anything where multiple clients hit the model: use vLLM, and pair it with either AWQ or GPTQ-Marlin. AWQ if you want slightly better quality preservation; GPTQ-Marlin if you want slightly more speed. The single-user numbers we measured don’t capture vLLM’s real strength – under load, with continuous batching, AWQ in vLLM scales to dozens of concurrent users where EXL2 simply doesn’t. This is the production answer, and it’s been the production answer for a while.

One non-persona note worth ending on: the choice is reversible. All four formats start from the same FP16 weights. If you change your setup in six months, you can re-download the format that fits the new context. Don’t agonize over the decision. Pick one, run it for a month, swap if it bothers you. The difference between the right format and the wrong format is roughly 30% on single-stream decode speed in our tests. The difference between using a model and not using one is much larger.

By [Author Name] – quantized.fyi editorial · Tested May 2026

Tobiasz Gromysz

Enthusiast of large language models (LLMs) and AI technologies who has been actively following the industry’s development since 2022. He specializes in practical applications of artificial intelligence and in analyzing computer hardware performance for running AI models locally. On a daily basis, he tests GPU configurations and benchmarks, helping readers understand how to build efficient and cost-effective setups for working with LLMs at home. His interests include optimization, quantization, and real-world AI applications beyond theory, from experimentation to production-ready deployments.