Can You Fit a 70B Model on a Single RTX 5090? Google’s TurboQuant Says Yes

Inside Google Research's new quantization method and why the Ollama community should care

BY Clinton Stark — 03.25.2026

TurboQuant compresses AI model vectors from 32 bits down to as few as 3 bits by mapping high-dimensional data onto an efficient quantized grid. (Image: Google Research)
KEY POINTS:
  • Google’s TurboQuant compresses LLM key-value cache by 6x with zero accuracy loss. No retraining required.
  • It’s a research paper (April 2025), not a product. No official code released. Independent developers are already implementing it from the math alone.
  • The real story: the AI efficiency race may matter more than the parameter race. Compression is what puts AI on edge devices, home labs, and small businesses.

The AI industry loves a big number. Trillion-parameter models. Million-token context windows. Massive GPU clusters that cost more than most houses. But some of the most important work happening right now has nothing to do with scale. It’s about compression. Basically: doing more with less. And a quiet Google Research paper, nearly a year old, is about to get its moment in the spotlight.

TurboQuant is a compression algorithm that reduces the memory footprint of large language models by up to 6x. Zero accuracy loss. No retraining required. The paper first appeared on arXiv in April 2025, but Google is featuring it on its Research blog this week, along with some experiments and results, ahead of its formal presentation at ICLR 2026 in late April.

So why should you care about a year-old research paper? Because it attacks a problem that everyone running AI bumps into eventually: the key-value cache bottleneck.

What’s the KV Cache Problem?

Here’s the short version. When you chat with an LLM, the model doesn’t just process your latest message. It keeps a running record of the entire conversation in something called the key-value (KV) cache. Think of it as the model’s short-term memory for your session. A sort of journal tracking all the things that matter in the conversation you’re having with your LLM of choice.

The problem: that memory grows with every turn (prompt/response). The longer the conversation, the more GPU memory it consumes. For long-context tasks like document analysis, code reviews, or multi-step research, the KV cache can balloon to the point where it pushes out the model itself.
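To put rough numbers on it, here's a back-of-envelope sketch in Python. The dimensions below are assumptions, roughly in line with an 8-billion-parameter model using grouped-query attention; check your model's config for the real values.

# Back-of-envelope KV cache sizing. All dimensions here are assumptions,
# roughly matching an 8B-class model with grouped-query attention.
layers      = 32       # transformer layers
kv_heads    = 8        # key/value heads (GQA)
head_dim    = 128      # dimension per head
bytes_fp16  = 2        # 16-bit values
context_len = 32_768   # tokens

per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_fp16   # keys and values
total_gb = per_token_bytes * context_len / 1024**3

print(f"{per_token_bytes / 1024:.0f} KB per token, {total_gb:.1f} GB at {context_len:,} tokens")
# Roughly 128 KB per token and about 4 GB at 32K tokens, on top of the
# model weights themselves. At 3 bits per value instead of 16, the same
# cache would shrink to roughly 0.75 GB.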

If you’ve ever had a long conversation with ChatGPT, Claude, or Gemini and noticed things slowing down, or gotten a message that your context is too long, that’s the KV cache hitting its limit. We’ve all experienced compaction, another way to manage the cache: the system throws away the parts of the conversation it no longer needs. Cloud providers manage this behind the scenes with massive hardware. But the constraint is real, and it costs them (and eventually you) money. For anyone running models locally, on a single GPU in a lab or small office, there’s no hiding from it.

Cloud providers can throw hardware at this. Small labs and small businesses can’t.


What Is TurboQuant? A Beginner's Guide

In plain English: TurboQuant is a way to shrink the data that AI models store during conversations. It compresses vectors (the numbers AI uses to understand language) from 32 bits down to as few as 3 bits per number, without losing accuracy.

Three techniques working together:

1. PolarQuant converts data into a more efficient coordinate system (think: “go 5 miles at 37 degrees” instead of “go 3 miles east, then 4 miles north”). This eliminates the overhead that traditional compression methods carry.

2. QJL (Quantized Johnson-Lindenstrauss) is a 1-bit error corrector. It catches the tiny mistakes left over from compression, with zero additional memory overhead.

3. TurboQuant combines both into a single pipeline. PolarQuant does the heavy lifting. QJL cleans up the residual error. The result: up to 6x memory reduction, zero accuracy loss, and faster inference.

No retraining required. TurboQuant works on existing models out of the box. No fine-tuning, no new training runs. Just apply it and go.
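For the curious, here's a tiny numerical sketch of the general shape of the idea: quantize coordinate pairs in a polar representation, then apply a cheap sign-based corrector to the leftover error. To be clear, this is a toy illustration built on my own simplifying assumptions, not the paper's actual estimator or any Google code.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)   # one toy "key" vector

ANGLE_BITS = 3
LEVELS = 2 ** ANGLE_BITS

# Step 1 (PolarQuant-flavored): store each coordinate pair as a magnitude
# plus a coarse few-bit angle code instead of two full-precision numbers.
pairs = x.reshape(-1, 2)
mag = np.linalg.norm(pairs, axis=1)
ang = np.arctan2(pairs[:, 1], pairs[:, 0])                      # in [-pi, pi]
code = np.round((ang + np.pi) / (2 * np.pi) * (LEVELS - 1)).astype(np.uint8)

# Reconstruct from the coarse codes.
ang_hat = code.astype(np.float32) / (LEVELS - 1) * 2 * np.pi - np.pi
x_hat = np.stack([mag * np.cos(ang_hat), mag * np.sin(ang_hat)], axis=1).ravel()

# Step 2 (QJL-flavored): a 1-bit correction of the residual error.
# Here it is just the sign of the residual times one shared scale --
# far simpler than the paper's asymmetric estimator, but it shows the
# "coarse code plus tiny corrector" structure.
residual = x - x_hat
corrector = np.sign(residual) * np.abs(residual).mean()

print("relative error, coarse only:    ", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
print("relative error, with corrector: ", np.linalg.norm(x - (x_hat + corrector)) / np.linalg.norm(x))

Run it and the corrected reconstruction lands measurably closer to the original than the coarse codes alone, which is the whole point of layering a lightweight error corrector on top of aggressive quantization.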

The Small Lab Reality Check

We run an AI research lab here at Stark Insider (StarkMind). Our main rig is a Threadripper-based system with an RTX 5090 and 32 GB of VRAM. It handles 35-billion-parameter models comfortably for daily work (Qwen 3.5 35B is my current local favorite). But when we fire up our 120B open-source model (GPT OSS 120B, still holding strong) for overnight evaluation and research jobs, things get tight fast. And I should mention that being based here in Silicon Valley, things get very hot; I’ve spent far too much time on cooling and airflow management.


StarkBench 2026 Lab Configuration

The StarkMind Vertigo AI Lab dashboard showing the infrastructure behind StarkBench. Every tool mentioned in the article runs on this single home lab server built on an AMD Threadripper 9970X with an RTX 5090.

Hardware (“Vertigo”):

  • NVIDIA RTX 5090 (32 GB VRAM)
  • AMD Threadripper 9970X
  • 252 GB RAM

Orchestration:

  • LangGraph (graph-based state machine with checkpointing)

Observability:

  • Langfuse (cost tracking, session grouping)
  • Arize Phoenix (span waterfalls, token breakdown)

Search:

  • SearXNG (self-hosted, 16 engines)
  • Brave Search API

Vector Search:

  • Qdrant

Inference:

  • Ollama (local)
  • Ollama Cloud
  • vLLM
  • OpenRouter (cloud frontier models)

Local LLM daily driver:

  • Qwen 3.5 35B (fits comfortably in 32 GB VRAM)

Spillover (offload) models:

  • gpt-oss:120b
  • Qwen 3.5 122B (CPU offload, ~18 min/run)

Evals:

  • memoryscope — multi-turn research synthesis across academic literature (hypothesis testing for our Symbiotic Studio research)
  • cinemascope — film recommendations from a Letterboxd 3,700+ film watch history

The KV cache is usually what kills us, resulting in dreaded OOM (out-of-memory) errors. We can load the model weights fine. But as conversation context grows during long, multi-turn benchmark runs, VRAM fills up. We run an internal eval harness called StarkBench that puts models through multi-turn research synthesis and film recommendation tasks. When we tested gpt-oss:120b and Qwen 3.5 122B this week, we had to dial the num_ctx parameter (the context window size in Ollama) down from 32K to 16K tokens just to avoid those pesky out-of-memory crashes. That’s a direct KV cache constraint, painfully learned. With a 120-billion-parameter model spilling over into system RAM, there’s simply not enough VRAM left for a full-size context window. Each run took about 18 minutes with CPU spillover. It works, sure. But it’s duct tape, and not what I’d consider an ideal solution.
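For anyone reproducing this at home, the knob lives in Ollama's per-request options. Here's a minimal example with the official Python client (model tag from our lab config; swap in your own, and adjust the prompt to taste):

import ollama

response = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Summarize the findings in this report."}],
    options={"num_ctx": 16_384},   # halved from 32K so the KV cache still fits in VRAM
)
print(response["message"]["content"])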

Some of the most important work happening right now has nothing to do with scale.

On paper, a 6x reduction in KV cache memory would change that equation significantly. Instead of capping num_ctx at 16K, we might fit the full 32K. Instead of one long evaluation running overnight, we could potentially run several. For a small lab like ours (and yours), that could make a meaningful difference.

The catch? TurboQuant is a research paper, not a product, as I’ve discovered. You can’t pip install it. It’s not in vLLM, llama.cpp, Ollama, or any of the serving frameworks that AI labs and developers actually use. The paper has been public for nearly a year, and none of those projects have merged it yet.

That’s worth noting, I think. Plenty of impressive research never makes it into the tools people use every day. Then again, this is Google, so I’d expect it to eventually find its way into the mainstream. It may already be feeding into future versions of Gemini.

Peer Review and Early Adopters

That said, the research has real credibility behind it. TurboQuant was accepted at ICLR 2026 (April 23-25), one of the most selective machine learning conferences in the world. Its companion papers also passed peer review at top venues: a quick Google search revealed that QJL was published at AAAI 2025, and PolarQuant was accepted at AISTATS 2026.

And within hours of Google’s blog post going live, independent developers started implementing TurboQuant from scratch. Not using Google’s code, because Google hasn’t released any that I could find. These are people reading the paper and writing their own implementations based on the math alone. Check Reddit and X for the early adopters who are sharing their findings and experiences.

One developer built a PyTorch implementation with a custom Triton kernel, tested it on a Gemma 3 4B model running on an RTX 4090, and got character-identical output to the uncompressed baseline at 2-bit precision (they’ve also made the code available for download if you’re adventurous). Another got it running on Apple Silicon via MLX with a 35B model, scoring 6 out of 6 on needle-in-a-haystack tests at every quantization level. Over in the llama.cpp community, at least three developers are working on C and CUDA implementations, with one reporting 18 out of 18 tests passing and compression ratios matching the paper’s claims.

That’s a good sign. The math is likely reproducible and the results hold up outside Google’s benchmarks.

A few caveats, though, which I discovered when reading more about the research and the announcement. Google’s own experiments only tested on 8-billion-parameter models (Gemma, Mistral, Llama 3.1). Whether TurboQuant scales cleanly to larger models is still undemonstrated, though it’s worth noting that small models are in vogue these days precisely because they run on less compute. The headline “8x speedup” refers specifically to attention computation, not end-to-end inference. And one early implementer found that the QJL error-correction component is tricky to get right. The naive approach produced garbage output. Getting the full pipeline working correctly requires careful adherence to the paper’s asymmetric estimator design.

What the Benchmarks Show

With those caveats noted, the results are hard to ignore. Google tested TurboQuant across standard long-context benchmarks:

  • 3-bit quantization of the KV cache with no training, no fine-tuning, and no measurable accuracy loss
  • Perfect scores on needle-in-a-haystack tests (finding a single fact buried in massive text) across all benchmarks
  • Up to 8x speedup in computing attention on H100 GPUs compared to unquantized 32-bit keys
  • Superior recall ratios in vector search compared to state-of-the-art methods, even those using larger codebooks and dataset-specific tuning

That last point matters for anyone building search. TurboQuant isn’t just about chat. It also speeds up vector search, the technology behind semantic search engines and RAG (retrieval-augmented generation) pipelines. Faster index building with near-zero preprocessing time. Basically: lower memory and better recall. For search infrastructure, that’s a meaningful combination. For StarkMind, I’ll be looking to pilot TurboQuant in some of the RAG pipelines we experimented with last year.
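In the meantime, there's already a taste of this available. Qdrant (which we run in the lab) ships built-in scalar quantization for stored vectors; it's 8-bit, a long way from 3-bit, but it shows where a better quantizer would slot into a RAG stack. A minimal sketch with the Qdrant Python client, using a hypothetical collection name and embedding size:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="starkbench_docs",             # hypothetical collection name
    vectors_config=models.VectorParams(
        size=768,                                   # depends on your embedding model
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,            # Qdrant's built-in 8-bit scalar mode
            quantile=0.99,
            always_ram=True,                        # keep compressed vectors in RAM for speed
        )
    ),
)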

Bigger Than Bigger

Here’s what I think is the real story, and it has nothing to do with TurboQuant specifically.

The AI conversation for the past few years has been dominated by scale. Bigger models. More parameters. Larger context windows. Every morning we wake up to a new Frontier LLM or local model that promises X% improvements across all the usual synthetic benchmarks. And yes, scale matters of course. But the most consequential breakthroughs might not come from building the next trillion-parameter behemoth. They might come from clever tricks like this. Compression. Quantization. Efficient math.

Because compression is what puts AI in places it can’t go today. Edge devices. Phones. Embedded systems. A medical clinic with a single workstation. A law firm that needs to keep client data on-premises. A startup that can’t afford a $50,000/month cloud GPU bill. Our small lab here at Stark Insider.

The parameter race may make headlines, but compression is what makes deployment possible on memory-challenged edge devices, workstations, and even smartphones (check out Enclave on your iPhone!).

The KV cache problem is getting attention from multiple angles, and some workarounds already exist today. Tools like llama.cpp support basic KV cache quantization (q4_0, q8_0). Modern model architectures like Grouped Query Attention reduce KV cache size by design. Sliding window attention caps the cache at a fixed number of recent tokens. Even the compaction I mentioned earlier, the kind you experience in ChatGPT or Claude, is a form of KV cache management: the system summarizes or drops older context to stay within limits.

TurboQuant’s contribution is pushing quantization to extreme bit-widths (3-bit) without the accuracy loss that existing methods introduce. GGUF quantization has already made it possible to run models locally that would have required a data center two years ago. Techniques like speculative decoding, flash attention, and PagedAttention have all chipped away at the compute problem from different angles. The trend is clear. And it’s accelerating.

What to Watch For

TurboQuant will be formally presented at ICLR 2026 in late April (main conference runs April 23-25). Also, its companion paper, PolarQuant, will be presented at AISTATS 2026 around the same time.

If you’re a developer or running a lab, the thing to watch is whether any of the major open-source serving frameworks merge this. The techniques are described as “exceptionally efficient to implement” with “negligible runtime overhead,” and the early independent implementations suggest that’s true. Whether it becomes a checkbox in Ollama or a flag in llama.cpp is the real question.

For everyone else, the takeaway is simpler. The AI industry is learning that you don’t always need a bigger model, or boat. Sometimes you need a smarter one. Or at least, smarter plumbing.

FURTHER READING

  • TurboQuant: Redefining AI Efficiency (Google Research Blog)
  • TurboQuant Paper on arXiv
  • TurboQuant: From Paper to Triton Kernel in One Session (dejan.ai)
  • IPE: How to Build Your Own AI Command Center

Frequently Asked Questions

What is Google TurboQuant?

TurboQuant is a compression algorithm developed by Google Research that reduces the memory footprint of large language models by up to 6x with zero accuracy loss. It compresses the key-value cache (the model's conversational memory) down to as few as 3 bits per number, enabling longer conversations and faster inference on less hardware. The paper was first published in April 2025 and will be formally presented at ICLR 2026.

Can I use TurboQuant today?

Not directly. Google has not released official code or a software library. However, independent developers have already built working implementations from the paper's math, including versions in PyTorch, MLX (Apple Silicon), and C/CUDA for llama.cpp. These are early community efforts, not production-ready tools, but they validate the core claims.

Does TurboQuant require retraining AI models?

No. TurboQuant works on existing models without any training or fine-tuning. It can be applied as a post-processing step to models like Gemma, Mistral, and other open-source LLMs.

What is the key-value (KV) cache in AI models?

The KV cache is a temporary memory store that LLMs use during conversations. It keeps track of all previous turns in a chat so the model can maintain context. When you have a long conversation with ChatGPT, Claude, or Gemini, the KV cache grows with each message. This is why very long conversations can slow down or hit context limits. The KV cache consumes GPU memory, and compressing it means longer conversations on less hardware.

How could TurboQuant help small businesses running local AI?

Small businesses running local LLMs for private data (legal firms, clinics, financial advisors) are limited by GPU memory. TurboQuant's 6x compression could enable longer, more complex conversations on modest hardware, and its vector search improvements could speed up internal knowledge bases and document retrieval. However, these benefits depend on the technology being adopted into mainstream tools first.

Tags: AI Research, Artificial Intelligence (AI), Google

Clinton Stark

Filmmaker and editor at Stark Insider, covering arts, AI & tech, and indie film. Inspired by Bergman, slow cinema and Chipotle. Often found behind the camera or in the edit bay. Peloton: ClintTheMint.
