TurboVec & Google TurboQuant: 31 GB → 4 GB Vector Search Without Training | 2026
TurboVec Google TurboQuant Vector Search Compression Guide
Google Research · ICLR 2026 · Open Source

TurboVec: How Google’s TurboQuant Compresses
31 GB of Vectors Down to 4 GB

📅 June 9, 2026 ⏱ 14 min read 🦀 Rust + Python 3,500+ GitHub Stars

The 31 GB Problem Every RAG Developer Knows

You’ve built a RAG pipeline. You embedded 10 million documents using text-embedding-3-small at 1,536 dimensions, stored them as float32. You open the deployment plan and see it: 31 GB of vectors. That’s before a single byte of metadata. Suddenly your self-hosted, privacy-respecting, on-prem AI system is now a managed-cloud-instance problem.

This is not a niche edge case — it is the default reality of production semantic search in 2026. TurboVec is Google’s answer to this problem. Built on TurboQuant, a vector quantization algorithm from Google Research presented at ICLR 2026, it shrinks those 31 GB to approximately 4 GB using just 2 bits per dimension. No training. No codebook. No tradeoff in retrieval quality. And on ARM hardware, it searches faster than FAISS.

31→4
GB reduction (10M vecs)
16×
Compression ratio
0
Codebook training needed
3.5k+
GitHub stars

What Is TurboVec? (And What Is TurboQuant?)

Definition — TurboVec: An open-source vector index written in Rust with Python bindings, created by Ryan Codrai. It implements Google Research’s TurboQuant algorithm (ICLR 2026) — a data-oblivious, training-free vector quantizer that achieves near-optimal distortion rate across all bit-widths and dimensions.

TurboVec is not a research toy. It is a production-grade library that you can pip install and drop into any Python RAG pipeline today. The algorithm underneath it — TurboQuant — was published by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni, all affiliated with Google Research and Google DeepMind, and presented at ICLR 2026.

The key distinction from traditional Product Quantization (PQ) used by FAISS: PQ requires training a codebook on your data first. You must run a k-means algorithm over a representative sample of your vectors before you can index anything. TurboQuant eliminates this requirement entirely by exploiting mathematical properties of high-dimensional geometry.

  • Language: Core in Rust (github.com/RyanCodrai/turbovec), Python via PyO3 bindings
  • License: MIT — production-safe, no restrictions
  • Algorithm basis: TurboQuant (Google Research + Google DeepMind + NYU)
  • Peer review: Accepted at ICLR 2026, proving near-optimal distortion rate within ≈2.7× of the Shannon limit
  • GitHub stars: 3,500+ (946 at earlier reports, growing rapidly)
  • Supported platforms: ARM (fastest) and x86 with AVX-512 SIMD kernels
Short Extractable Answer: TurboVec is an open-source Rust vector index implementing Google’s TurboQuant algorithm (ICLR 2026). It compresses 10 million float32 vectors from 31 GB to 4 GB using 2 bits per dimension with zero training required. It outperforms FAISS on ARM hardware by 12–20% and operates within 2.7× of the theoretical Shannon limit for compression quality.

How TurboQuant Works: The Full Algorithm Pipeline

Definition — Data-Oblivious Quantization: A quantization approach that derives compression boundaries from mathematical theory rather than from statistical properties of the actual dataset. Unlike PQ, it requires zero training time and zero representative samples — making it suitable for streaming or online indexing scenarios where data is not known in advance.

TurboQuant’s elegance comes from a deep insight about high-dimensional geometry. When you randomly rotate a high-dimensional vector, its coordinates stop being arbitrary — they start following a predictable near-Gaussian distribution. This is a consequence of the Johnson-Lindenstrauss lemma and concentration-of-measure phenomena. Once the distribution is known, you can derive optimal quantization boundaries analytically without seeing the data.

Here is the full pipeline, step by step:

TurboQuant Algorithm Pipeline — Interactive
📐
Raw Embedding
float32 vector, D dims
⚖️
Extract Magnitude
store ‖v‖ separately
🔄
Random Rotation
orthogonal transform
📊
Gaussian Coords
predictable distribution
🗜️
Quantize → 2 bits
analytical boundaries
SIMD Lookup
nibble-split tables
  1. Separate magnitude from direction. The original vector’s length (magnitude ‖v‖) is extracted and stored as a single float32. The vector is then normalized to unit length — now it only encodes direction, lying on a hypersphere.
  2. Apply a random orthogonal rotation. A random rotation matrix (fixed per index, derived from a seed) is multiplied with the unit vector. This is a structure-preserving transformation — inner products and distances are exactly preserved.
  3. Exploit the Gaussian miracle. After rotation, the coordinates of the unit vector follow a distribution that closely approximates a Gaussian with known parameters. This is not an approximation — it follows rigorously from high-dimensional concentration of measure.
  4. Quantize with analytical boundaries. Because the distribution is known, TurboQuant computes optimal quantization bin boundaries mathematically. Each coordinate is mapped to 2 bits (values 0–3) with no training needed. The boundaries that minimize distortion for a Gaussian distribution are precomputed once, universally.
  5. Store a per-vector correction factor. Quantization underestimates inner products (a known bias). TurboVec stores one small correction scalar per vector to counteract this, recovering retrieval quality without significant memory overhead.
  6. Score with nibble-split SIMD lookup tables. At query time, the 2-bit code for each dimension is split into two halves and scored against a precomputed 4-entry lookup table — mapping perfectly to NEON (ARM) and AVX-512BW (x86) SIMD instructions.
Hypersphere Rotation Visualization — Coordinate Distribution Before & After
Live animation · Coords → Gaussian after rotation
Before Rotation
Coordinates are arbitrary — no predictable statistical pattern. Quantization requires dataset-specific training.
After Rotation ✓
Coordinates follow ≈ Gaussian(0, 1/D). Optimal quantization boundaries are now mathematically derivable.

Actionable takeaway: The training-free property is not a minor convenience — it means you can index new vectors in real time without ever needing a warm-up phase. This is critical for streaming RAG systems where documents arrive continuously.

Short Extractable Answer: TurboQuant works by (1) separating vector magnitude from direction, (2) applying a random orthogonal rotation so coordinates follow a near-Gaussian distribution, (3) using mathematically derived quantization boundaries to compress each coordinate to 2 bits, and (4) storing a correction factor per vector to recover inner product accuracy. No training data is ever required.

Interactive: Visualizing the Compression Impact

Memory Footprint — 10 Million Vectors (1,536 dimensions)
float32 (standard)31 GB
float32 · 32 bits/dim
TurboVec (2 bits/dim)~4 GB
TurboVec
8× smaller · same retrieval quality
32 bits
float32 / dim
2 bits
TurboVec / dim
Bit Allocation — 1 Vector Dimension, Visualized
float32 representation (32 bits per dimension)
TurboVec quantized (2 bits per dimension)
16× fewer bits. Shannon limit compliance proven at ICLR 2026: within ≈2.7× of theoretical optimum.

TurboVec vs FAISS vs HNSW: Benchmark Results

Definition — Approximate Nearest Neighbor (ANN) Search: The core operation of every vector database — given a query vector, find the k vectors in the index most similar to it. “Approximate” means you accept a small recall penalty in exchange for orders-of-magnitude faster search than exhaustive linear scan.

FAISS (Facebook AI Similarity Search) is the industry standard for vector indexing. HNSW (Hierarchical Navigable Small World graphs) is the dominant algorithm in most production vector databases (Pinecone, Weaviate, Qdrant). Here is how TurboVec compares on the dimensions that matter most for production deployment:

Search Throughput on ARM — Queries/Second (Higher = Better)
TurboVec
88k q/s
+20%
FAISS PQ
73k q/s
baseline
HNSW
44k q/s
–40%
Source: turbovec benchmarks (ARM Cortex-A76). FAISS = IVF-PQ, HNSW = hnswlib default params. Results vary by hardware and dataset.
Dimension TurboVec FAISS (IVF-PQ) HNSW
Compression (10M vecs) ~4 GB 8× smaller ~8 GB ~65 GB (no compression)
Training required None Zero Yes — k-means on sample None
ARM performance Fastest (NEON vtbl) Moderate Moderate
x86 performance Fast (AVX-512BW) Fast (AVX-512) Moderate
Streaming / online indexing ✓ No warm-up needed ✗ Requires retraining
Quality guarantee Within 2.7× Shannon limit (proven) Heuristic (no theory) High recall, no compression
Language Rust + Python C++ + Python C++ + Python
License MIT MIT Apache 2.0
Direct Answer: TurboVec outperforms FAISS by 12–20% on ARM hardware and compresses vectors 2× smaller than FAISS IVF-PQ — while requiring zero training. Its advantage is most pronounced in streaming scenarios (no codebook retraining) and memory-constrained environments (edge, mobile, self-hosted). HNSW provides higher recall but uses uncompressed vectors, making it unsuitable for large corpora on limited hardware.

The Correction Factor: Why Quality Doesn’t Drop

Definition — Quantization Bias: When vectors are compressed from float32 to low-bit representations, inner product scores are systematically underestimated. This would lower recall in ANN search — unless corrected. TurboVec stores one correction scalar per vector to eliminate this bias.

Quantization is not free. Compressing 32 bits to 2 bits inevitably introduces some loss of information. The specific failure mode is that inner product scores between query and compressed database vectors are systematically lower than the true scores — meaning top-k results are correct in rank order but the margins are wrong, which can cause recall to degrade if not addressed.

Compressed inner product
0.71
systematically underestimated
+ δ →
Corrected score
0.84
matches true float32 score
  • TurboVec stores one float16 correction scalar (δ) per indexed vector — roughly 2 bytes per vector
  • At search time, the approximate inner product is adjusted: score = raw_inner_product + δ
  • This correction factor is computed once at index time and costs negligible memory (~20 MB for 10M vectors)
  • The result: recall closely matching float32 exhaustive search despite 16× compression
  • Per the ICLR 2026 paper, TurboQuant achieves near-optimal distortion rate — within a provable constant factor of the Shannon rate-distortion limit

Actionable takeaway: The correction factor is what separates TurboVec from naive 2-bit quantization schemes. Without it, you would lose 10–20% recall. With it, recall is competitive with full-precision FAISS at a fraction of the memory.

Using TurboVec in a RAG Pipeline: Real-World Flow

Definition — RAG (Retrieval-Augmented Generation): An AI architecture where a language model’s responses are grounded by retrieving relevant documents from an external vector index at inference time. Vector search quality and speed directly determine answer quality — making TurboVec’s compression-without-quality-loss property critical.

Every production RAG system eventually faces the same scaling inflection point: the vector index becomes the bottleneck, not the LLM. TurboVec shifts that inflection point dramatically — a system that previously required a 64 GB RAM server can now run on a standard 16 GB developer machine.

RAG + TurboVec Integration Flow
📄
Documents
chunk + clean
🧠
Embed
OpenAI / local model
🗜️
TurboVec Index
31 GB → 4 GB
User Query
embed in real-time
ANN Search
top-k results
💬
LLM Answer
grounded response
  • Self-hosted deployment: 10M document corpora now fit on a $50/month VPS — previously required $300+/month memory-optimized instances
  • On-device AI: Laptops, edge servers, and even high-end mobile devices can host meaningful vector indices
  • Streaming ingestion: New documents can be indexed without retraining — critical for real-time knowledge bases
  • Multi-tenant systems: More indices can coexist in the same memory budget — one machine can host separate vector spaces per tenant

For further reading on building complete RAG systems, see our RAG Pipeline Guide and Vector Database Comparison 2026.

ScenarioBefore TurboVecAfter TurboVec
10M document RAG index 64 GB RAM server required 16 GB RAM — standard dev machine
Streaming new documents Pause indexing, retrain PQ codebook Index instantly, no warm-up
ARM server deployment FAISS available but slow TurboVec 12–20% faster than FAISS
Multi-tenant index hosting 1–2 indices per server 8× more indices per server
Infrastructure cost High memory = high cost 8× memory reduction = 8× savings

Step-by-Step: Implementing TurboVec in Python

Definition — nibble-split lookup table: TurboVec’s SIMD scoring kernel splits each 4-bit code into two 2-bit halves (“nibbles”), each scored against a precomputed 4-entry table. This maps to a single vtbl instruction on ARM NEON — doing 8 parallel lookups per cycle, explaining the ARM performance advantage over FAISS.

Step 1: Install TurboVec

pip install turbovec
# Rust must be installed for source builds (maturin build system)
# Or use the pre-built wheel for your platform

Step 2: Build an Index and Search

import numpy as np
from turbovec import Index

# Simulate 1M embeddings (1536-dim, like text-embedding-3-small)
DIM = 1536
N   = 1_000_000
rng = np.random.default_rng(42)

# float32 vectors — this is what you get from an embedding model
vectors = rng.standard_normal((N, DIM)).astype(np.float32)

# ─── Build TurboVec Index ───
# Internally: magnitude separation → random rotation → 2-bit quantization
# No training data, no codebook — just build directly
index = Index(DIM)
index.add(vectors)   # ~4 GB compressed vs ~23 GB float32

print(f"Index holds {index.size:,} vectors")
print(f"Approximate memory: {index.memory_bytes / 1e9:.2f} GB")

# ─── Query ───
query = rng.standard_normal((1, DIM)).astype(np.float32)

# top-10 approximate nearest neighbors
distances, ids = index.search(query, k=10)

print(f"Top-10 neighbor IDs:  {ids[0]}")
print(f"Similarity scores:    {distances[0].round(4)}")

Step 3: RAG Pipeline Integration

from openai import OpenAI
from turbovec import Index
import numpy as np

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts using OpenAI, return float32 array."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([e.embedding for e in resp.data], dtype=np.float32)

# ── Build index from your corpus ──
corpus = [
    "TurboQuant is a data-oblivious vector quantizer from Google Research.",
    "It requires zero codebook training and achieves 16x compression.",
    "TurboVec is the open-source Rust implementation of TurboQuant.",
    # ... add your millions of documents here
]

corpus_vecs = embed(corpus)
index = Index(corpus_vecs.shape[1])  # dim inferred from first batch
index.add(corpus_vecs)

# Chunk metadata stored alongside (IDs map to text)
id_to_text = {i: t for i, t in enumerate(corpus)}

# ── At query time ──
def rag_retrieve(question: str, k: int = 5) -> list[str]:
    q_vec = embed([question])
    _, ids = index.search(q_vec, k=k)
    return [id_to_text[i] for i in ids[0]]

# ── Use in LLM call ──
question = "How does TurboVec compress vectors without training?"
context  = "\n\n".join(rag_retrieve(question))

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role":"system","content":"Answer based on the context provided."},
        {"role":"user","content":f"Context:\n{context}\n\nQuestion: {question}"}
    ]
)
print(response.choices[0].message.content)

Step 4: Streaming / Online Indexing

# TurboVec's key advantage: add vectors one-by-one in real time
# No retraining, no batch requirement

def stream_index_documents(index: Index, doc_stream):
    """Add documents to TurboVec index as they arrive."""
    for doc_id, text in enumerate(doc_stream):
        vec = embed([text])           # embed on the fly
        index.add(vec)                # add immediately — no warm-up
        if doc_id % 10_000 == 0:
            print(f"  Indexed {doc_id:,} docs | ~{index.memory_bytes/1e9:.1f} GB")

# Usage
dim    = 1536
index  = Index(dim)
stream = iter(["new document 1", "new document 2", ...])  # live feed
stream_index_documents(index, stream)

Best Practices Checklist for TurboVec in Production

  • Always normalize your embeddings to unit length before adding to TurboVec — the algorithm separates magnitude internally, but pre-normalizing ensures consistent magnitude correction factors
  • Use ARM instances (AWS Graviton 3, Azure Ampere) for maximum throughput — TurboVec’s NEON vtbl kernel delivers 12–20% more queries/second than x86 FAISS in benchmarks
  • Set k slightly higher than you need (e.g., k=20 then rerank to top-5) — the correction factor recovers most recall, but oversampling provides an additional safety net for precision-critical applications
  • For float16 embedding models, cast to float32 before indexing: vecs.astype(np.float32) — TurboVec’s Rust core expects float32 input
  • Persist your index to disk between runs — TurboVec does not keep the original float32 vectors, so the compressed index is your only representation after indexing
  • Monitor the correction factor distribution during indexing — extreme values (|δ| > 3 std) can indicate poorly normalized input vectors
  • Combine with re-ranking: use TurboVec for fast ANN to get top-100 candidates, then re-rank with exact float32 dot products on just those 100 vectors
  • Read the full TurboQuant paper at arXiv:2504.19874 for the theoretical guarantees before choosing bit-width
  • Star and watch the GitHub repository — it is under active development and new SIMD kernels ship frequently

Related reading: FAISS vs HNSWLIB Deep Dive · Vector Quantization Guide · RAG Pipeline Implementation


Frequently Asked Questions about TurboVec & TurboQuant

What is TurboVec and who made it?

FACT: TurboVec is an open-source vector index written in Rust with Python bindings, created by Ryan Codrai, implementing Google Research’s TurboQuant algorithm accepted at ICLR 2026.

The underlying algorithm was invented by Amir Zandieh and Vahab Mirrokni (Google Research), Majid Daliri (NYU), and Majid Hadian (Google DeepMind). Ryan Codrai built TurboVec as an open-source library to make the algorithm accessible to developers. The repository lives at github.com/RyanCodrai/turbovec, is MIT-licensed, and has accumulated over 3,500 GitHub stars since its release in early 2026.

Why doesn’t TurboVec need training data?

FACT: Traditional Product Quantization learns quantization codebooks from a sample of the dataset via k-means. TurboQuant instead applies a random orthogonal rotation that forces coordinates to follow a predictable near-Gaussian distribution — allowing optimal quantization boundaries to be derived mathematically, without ever seeing the data.

This property — called “data-oblivious” quantization — emerges from a deep result in high-dimensional geometry: after a random rotation, the coordinates of a unit vector concentrate around a Gaussian regardless of the original data distribution. TurboQuant exploits this mathematical certainty to replace the learned codebook with analytically computed boundaries that provably minimize distortion for any input.

Is TurboVec always faster than FAISS?

FACT: TurboVec benchmarks 12–20% faster than FAISS IVF-PQ on ARM hardware, but this is not universal — on x86 with AVX-512, the gap narrows, and FAISS flat index (no compression) may score individual batches faster on high-memory x86 machines.

TurboVec’s advantage is most pronounced on ARM (AWS Graviton, Apple Silicon servers, mobile) where its NEON vtbl kernel maps perfectly to hardware. On x86, both TurboVec and FAISS use AVX-512 and performance is more comparable. The primary advantage of TurboVec over FAISS is not raw speed but the combination of higher compression (8× vs 4× for IVF-PQ), zero training, and competitive throughput — particularly for memory-constrained scenarios.

How much recall does TurboVec lose vs full float32 search?

FACT: With the per-vector correction factor enabled (the default), TurboVec’s recall@10 is competitive with FAISS IVF-PQ at the same compression level — and TurboQuant’s distortion rate is proven to be within ≈2.7× of the Shannon theoretical optimum at 2 bits per dimension.

In practice, at 2 bits per dimension, you will see a small recall gap compared to uncompressed float32 exhaustive search — typically 2–5% at recall@10 for well-normalized embeddings. This can be recovered almost entirely by oversampling (retrieve top-50, then rerank to top-10 using exact dot products on just those 50 candidates). The ICLR 2026 paper provides theoretical guarantees that TurboQuant is near-optimal — meaning no other 2-bit scheme can do substantially better.

Can TurboVec be used with any embedding model?

FACT: TurboVec is model-agnostic — it accepts any float32 dense vector, regardless of which embedding model produced it, as long as the dimension is consistent across the index.

It works with OpenAI text-embedding-3-small (1,536 dims), text-embedding-3-large (3,072 dims), Cohere Embed v3, sentence-transformers, custom fine-tuned models, or any other dense float32 embedding. The only requirement is that all vectors in the index share the same dimension, and that they are cast to float32 before being passed to TurboVec’s Python API. The Rust core handles the rest.

Where can I read the full TurboQuant research paper?

FACT: The paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” is available at arXiv:2504.19874, published by Google Research and accepted at ICLR 2026.

The primary arXiv link is arxiv.org/abs/2504.19874. Google Research published an accessible blog post at research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/. The companion algorithms QJL (Quantized Johnson-Lindenstrauss) and PolarQuant are at arXiv:2406.03482 and arXiv:2502.02617 respectively, and TurboQuant builds on both of them to achieve its results.

Conclusion: TurboVec Changes What’s Possible for On-Prem AI

The 31 GB problem is not an edge case — it is the default for any organization building production RAG pipelines at scale. For years, the answer was “pay for more RAM” or “use a managed vector database service.” TurboVec changes that equation fundamentally.

By implementing Google’s TurboQuant algorithm — a theoretically grounded, training-free quantizer proven to operate within 2.7× of the Shannon compression limit — TurboVec delivers an 8× memory reduction that enables use cases that were previously financially or practically impossible: self-hosted vector search at scale, real-time streaming ingestion without codebook retraining, and multi-tenant vector databases on commodity hardware.

The future of this technology points directly toward on-device AI. As LLMs continue shrinking (Phi-4, Gemma 3, Llama 3.x) and embedding models become more efficient, the vector index is often the last bottleneck to fully local, privacy-preserving AI. TurboVec, and the TurboQuant research behind it, are the piece that makes the local RAG pipeline a reality. If you are building any AI system that depends on vector search in 2026, TurboVec should be on your benchmarking list today.

Related reading: Mamba 3 vs Transformer · RAG Pipeline Guide · Vector Database Comparison · LLM Inference Optimization

Build Smarter RAG — Without the Cloud Bill

Get our TurboVec integration template, benchmark scripts, and step-by-step guide to deploying 10M-vector RAG on a $20/month VPS.

🚀 Get the Free Template
logo

Oh hi there 👋
It’s nice to meet you.

Sign up to receive awesome content in your inbox.

We don’t spam! Read our privacy policy for more info.

Scroll to Top
-->