Group
← Back
Part 4 of 4 · Vector search

Vector search - Quantization

May 9, 2026MLAlgorithms

Part 3 ended with two ways to make HNSW fit at billion-vector scale: shrink the vectors so they take a fraction of their raw size, or push them to disk and read them on demand. The on-disk article worked through pushing them to disk. This one works through shrinking them.

The technique is called quantization: replace each high-precision number with a lower-precision approximation, store the approximation, and accept a small loss of fidelity in exchange for a large reduction in size and compute. The same idea recurs across machine learning, audio codecs, image compression, and large language model weights, with different details every time. This article works through the two flavours that matter for vector search: scalar quantization, which compresses each number independently, and product quantization, which groups numbers into short chunks and compresses the chunks against a small dictionary of learned prototypes.

Scalar quantization is the simpler of the two and a good warm-up. Product quantization (PQ) is where most of the depth lives, because it is what every modern billion-scale vector index uses to drive memory and bandwidth costs down by one or two orders of magnitude. We'll spend most of the article on it.

Scalar quantization

The simplest form of compression: take each high-precision number and replace it with a lower-precision approximation, one value at a time. Three formats appear over and over.

  • float16 halves the cost of a float32 value (4 bytes down to 2) while keeping a similar dynamic range. Modern hardware does float16 arithmetic at the same throughput as float32 in its SIMD units (Single Instruction Multiple Data, the CPU and GPU mode that processes several numbers per instruction in parallel), so both speed and accuracy survive the swap. This is what GPUs use by default for inference.
  • int8 stores each value as a single signed byte: one of the 256 integer values from to . The mapping from float to integer needs a scaling factor. Take the largest absolute value in some unit (one vector, one matrix row, one block of weights), call it , and store alongside the encoded data. Each float is then encoded as the integer , clamped to . Decoding multiplies back: . Worked example: a vector whose largest absolute value is produces , and the float encodes as . Decoding gives , off by about from the original. The compression ratio is over float32, and the recall loss in vector search (the fraction of true nearest neighbours that the compressed search misses) is typically around 1%.
  • int4 stores each value in 4 bits, half a byte. Used mostly for aggressive LLM weight compression. Compression doubles to over float32, at the cost of more careful calibration to keep the loss tolerable.

Scalar quantization is a single pass over the data and easy to reason about. It is usually the first technique to reach for. For vector search at billion scale, the savings are real but not enough on their own: a 768-dimensional float32 embedding is 3 KB, and shrinking it to about 750 bytes with int8 still leaves a billion vectors taking 750 GB. Squeezing the per-vector cost down to a few percent of its raw size needs the next technique.

Product quantization

Product quantization (PQ) is the technique that makes 32× to 96× compression workable for high-dimensional vectors. It was published by Hervé Jégou, Matthijs Douze, and Cordelia Schmid in 2010, originally for image retrieval, and it's still the workhorse of every billion-scale vector search system.

The construction

Start with a single 768-dimensional vector . Split it into 96 contiguous chunks of 8 dimensions each: dimensions 1 through 8 form the first chunk, dimensions 9 through 16 the second, and so on through 96. Each chunk is itself a vector, but in a much smaller space (8 dimensions instead of 768). It lives in its own 8-dimensional subspace.

Now do this same split for every vector in the dataset. The dataset has database vectors (a billion in our running example), and every one of them contributes exactly one chunk to each subspace: vector 's dimensions 1-8 join subspace 1, its dimensions 9-16 join subspace 2, and so on. Subspace 1 ends up holding chunks, all of them 8-dimensional, one chunk per database vector. The same is true for every other subspace.

PQ compresses each subspace independently with exactly the same procedure. We'll describe what happens inside one subspace; the full PQ runs the procedure 96 times in parallel, once per subspace, with no interaction between them.

Memory in one subspace. Each chunk is an 8-dimensional vector, which means 8 float32 numbers. At 4 bytes per float, one chunk takes bytes. With chunks in this subspace, the raw storage cost is bytes, the count of chunks times the size of one chunk. (Across all 96 subspaces this multiplies up to bytes, which is exactly the raw cost of the original 768-dimensional vectors.)

The goal is to shrink the per-subspace memory from bytes (raw chunks) to something much smaller.

The trick. Pick 256 "representative" 8-dimensional vectors that approximate the dataset's chunks. Then replace each of the chunks with whichever representative is closest to it in Euclidean distance, throwing away the chunk's exact coordinates. After the replacement, every chunk in the compressed dataset is one of those 256 representatives, no matter how diverse the original chunks were. This is lossy: a chunk that was at, say, position is now stored as "the representative at position ", with a small approximation error baked in. The compressed form of a chunk is just the integer index of its assigned representative.

How much memory this saves. Step through the arithmetic:

  • There are 256 representatives, so each one is identified by an integer between 0 and 255. That integer fits in 8 bits, or 1 byte, because .
  • Each chunk is now stored as that 1-byte index. With chunks in this subspace, the indices total bytes.
  • The 256 representatives themselves still have to be stored somewhere, otherwise the indices are meaningless. Each representative is an 8-dimensional float32 vector, taking bytes. Storing all 256 of them takes bytes, about 8 KB.
  • Total storage for this subspace is bytes (indices) plus 8 KB (the 256 representatives), versus the raw bytes.

For in the millions or billions, the 8 KB constant is negligible and the compression ratio approaches .

The 256 representatives are called centroids, the geometric word for "average point of a group" (think of a centroid as the cluster's balance point if its members were equal weights). They are produced by an algorithm called -means clustering (the in the name is the number of clusters; here ).

The objective. We need to pick the 256 representatives well. The natural goal is to minimise the total error introduced by replacing each chunk with its assigned representative. Quantify error as squared Euclidean distance and sum across all chunks:

where the sum is over all chunks , and each chunk is assigned to whichever of the 256 representatives is closest to it. Lower error means the representatives are doing a better job of standing in for the chunks.

Two optimality facts about this error. They are the engine of the algorithm.

The first fact is almost too obvious to state. If the 256 representative positions are fixed, the best assignment is "each chunk to its closest representative." Suppose chunk is at distance 1 from representative and distance 5 from representative . Assigning to contributes to the total error; assigning it to contributes . Picking the closer one always gives a smaller error contribution, no matter which other chunks exist. So whenever we know the representative positions, the assignment is forced: it has to be "every chunk to its nearest representative."

The second fact takes a moment to absorb but is just as concrete. Given a cluster of chunks (any group of chunks all sharing a representative), the single point that minimises the sum of squared distances to all of them is their arithmetic mean. Take a 1-dimensional toy: three numbers , whose mean is . From the mean, the sum of squared distances is . Try other candidates: from it would be ; from it would be ; from , . Anywhere off the mean is worse. The same holds in 8 dimensions, and in fact in any number of dimensions, coordinate by coordinate:

is the unique point minimising the cluster's total squared error. So whenever we know which chunks are assigned to a cluster, the cluster's best representative position is forced too: it must be the mean of the cluster's chunks.

Why we can't solve this in one shot. The assignment step is one-shot in a precise sense: given the 256 representative positions, a single pass over the data places every chunk in its cluster. The trouble is, we don't know the right positions to begin with. The positions are exactly what we are trying to solve for, jointly with the assignment.

Could we enumerate every possible way of grouping the chunks into 256 clusters, and pick whichever grouping leads to the lowest error (with each cluster's optimal representative position computed as the cluster's mean)? In principle yes, in practice no. Each of the chunks can land in any of 256 clusters independently, so the total number of possible assignments is .

(chunks)Total possible assignments
10
50
1,000,000,000a number with several billion digits

By comparison, the number of atoms in the observable universe is roughly . So even the demo's tiny dataset of 50 points has more possible assignments than every computer cluster on the planet could enumerate before the heat death of the universe. The full billion-vector case is in a different category of impossible.

Putting the two facts together: -means. Each fact lets us improve one half of the answer once the other half is held fixed:

  • If we know the current representative positions, fact 1 hands us the best assignment (each chunk to its nearest representative). The total error goes down or stays the same.
  • If we know the current assignment, fact 2 hands us the best representative positions (each one the mean of its assigned chunks). The total error goes down or stays the same again.

The -means algorithm is what you get when you alternate these two updates:

  1. Initial positions. Start with 256 representative positions somewhere. The simplest choice is 256 chunks picked at random from the dataset.
  2. Assignment step. For every chunk, compute its distance to each of the 256 current representatives, and assign the chunk to the closest one. The chunks are now sorted into 256 groups.
  3. Update step. For each group, compute the average of its chunks (coordinate by coordinate). Move that group's representative to the average.
  4. Repeat steps 2 and 3 until no chunk changes its assigned representative between one iteration and the next.

Each step either strictly reduces the total error or leaves it unchanged, and error can't go below zero, so the algorithm has to stop after a finite number of iterations. On real data, it stops after tens of iterations.

What -means actually costs. Each iteration is dominated by the assignment step: every chunk computes its distance to each of the 256 representatives. A distance in 8 dimensions takes about 16 floating-point operations (eight subtractions, eight squares, seven additions, plus a square root), so an assignment step costs about operations. The update step is a single pass over the chunks accumulating cluster means, roughly more operations. With around 20 iterations to convergence, the total cost is on the order of operations.

Compare with the cost of enumeration:

(chunks)Enumeration ( assignments)-means ( ops)
10
50
1,000,000,000unimaginably large

The shape of the difference is what matters: enumeration is exponential in (each extra chunk multiplies the cost by 256), while -means is linear in (each extra chunk adds a small constant). Even at a billion chunks, -means' operations run in tens of seconds on a modern multi-core machine with SIMD distance kernels. The cost is approximate, not optimal: -means converges to a local minimum of the error, not necessarily the global one. In exchange, the algorithm is fast enough to actually run.

The codebook. When the algorithm stops, every one of the 256 representatives is sitting at the average of its assigned chunks (because the last step we ran was an update). The collection of those 256 final centroids is called the codebook for this subspace, by analogy with the codebook of a cipher: a small fixed list of entries that the rest of the data refers back to by index.

Encoding a chunk. Once the codebook is built, compressing any 8-dimensional chunk (a database chunk, or a chunk from a vector being inserted later) is a single step: compute the Euclidean distance from the chunk to each of the 256 centroids, find the nearest one, and record that centroid's index (a number from 0 to 255, fitting in one byte). The compressed chunk is just that one byte.

Encoding a full 768-dimensional vector. Split the vector into 96 chunks of 8 dimensions each, and for each chunk find the nearest centroid in that subspace's codebook (256 distance computations per chunk). Record the 96 centroid indices. The compressed vector is 96 bytes, one byte per subspace.

For , , :

  • Original: bytes per vector.
  • Compressed: bytes per vector. A 32× ratio.
  • Codebooks (one per subspace): bytes ≈ 768 KB total, fixed regardless of dataset size.

Why "product"?

A compressed vector is described by 96 independent choices: for each subspace, which of the 256 centroids did this vector's chunk get assigned to? The total number of distinguishable compressed vectors is the product of the choice counts:

That is the vocabulary of compressed forms PQ can produce, and it is enormous: more entries than there are atoms in the observable universe (). The actual storage required is only the 96 codebooks themselves, centroids, about 768 KB. So PQ's encoding can effectively distinguish different vectors using stored centroids. The huge vocabulary is implicit, never materialised; each compressed vector picks one element from it by selecting 96 centroid IDs.

The mathematical name for this kind of construction is the Cartesian product. Given several sets, the Cartesian product is the collection of every possible tuple you can form by picking one item from each set. For two sets and , the Cartesian product contains six pairs: . The size of is the size of times the size of : .

PQ does the same construction with 96 sets instead of 2. The 96 input sets are the per-subspace codebooks, each holding 256 centroids; every compressed PQ vector is one tuple from the Cartesian product of those 96 codebooks (a sequence of 96 centroid choices, one per subspace). The total tuple count is (96 times) , the same multiplication we already saw above. That product is where "product quantization" gets its name.

The structural payoff is exponential expressiveness from a linear amount of stored data; you couldn't get the same expressiveness with a single flat 768-dimensional codebook because storing centroids is impossible.

The catch. The vocabulary entries are not freely placed in 768-dimensional space. Each one is implicitly the concatenation of one centroid per subspace, and the subspace centroids were trained independently. If subspaces are correlated in the original data (knowing one subspace's values tells you a lot about another's), many compound encodings land in regions of 768-dimensional space where no real data lives, and the effective vocabulary is smaller than in practice. PQ works well when subspaces are roughly independent of each other. A common refinement called OPQ (Optimized Product Quantization) learns a rotation of the original space and applies it before splitting into chunks, so that the post-rotation subspaces are as independent as possible. OPQ is a drop-in replacement for the splitting step; everything downstream (codebook training, encoding, querying) is identical.

Asymmetric distance computation (ADC)

At search time, every database vector is no longer a list of 768 floats; it is a list of 96 centroid IDs, one per subspace. The query , on the other hand, is the original 768 floats. Computing the distance between a full-precision and a 96-byte ID list looks impossible at first: there are no float coordinates to subtract from on the database side, only ID numbers.

The technique that solves this is called asymmetric distance computation, abbreviated ADC. The "asymmetric" word will become clear at the end of this section. The core idea is to do the precision-heavy work once per query and reuse it across every database vector the search visits.

There are only two precision-rich kinds of objects in the system: the query (one vector per search) and the centroids in the codebooks (a fixed of them, shared across every database vector). So before scanning the database, compute every distance that involves and a centroid up front, and store those distances. After that, each database vector's distance to is reduced to picking the right precomputed numbers and adding them up.

Step 1: build the lookup table

A lookup table (LUT) is just a precomputed array of answers. Instead of running a calculation each time you need a number, you run it once, store the result in a fixed slot, and from then on read the slot instead of recomputing. Multiplication tables that schoolchildren memorise are the everyday example: rather than recomputing each time, you remember it is and look up the entry.

PQ's lookup table is two-dimensional, indexed by (subspace, centroid_id). Each entry holds the squared distance between 's chunk in that subspace and one centroid in that subspace's codebook.

Construction:

  1. Split into its chunks: holds dimensions 1-8, holds 9-16, and so on through .
  2. For each subspace and each of the centroids in that subspace's codebook, compute the squared distance and write it into row , column of the table.

For a tiny example with subspaces and centroids each, the LUT is a 2-by-3 array of floats, with one row per subspace:

centroid 0centroid 1centroid 2
subspace 00.421.070.18
subspace 10.910.050.63

Each cell is "the squared distance from 's chunk in this subspace to this centroid in this subspace's codebook". Cell means 's first chunk is at squared distance from centroid 2 of subspace 0.

In the production setting (, ) the table is 96 rows by 256 columns: float32 entries, taking bytes, about 96 KB. Building it costs small distance computations in 8-dimensional space, all done at the start of the query in a fraction of a millisecond.

Step 2: distance to a compressed database vector

Now take any compressed database vector , stored as the 96 centroid IDs (one byte each). The squared distance from to is approximately:

In words: walk through the 96 subspaces. In subspace , 's chunk was encoded as centroid number . The distance from 's chunk in that subspace to that centroid is sitting in row , column of the LUT. Read it out. Add the 96 numbers together.

Going back to the tiny example: if a database vector is encoded as (centroid 2 in subspace 0, centroid 1 in subspace 1), its squared distance to is .

The total work per database vector is memory reads from the LUT and floating-point additions, roughly 100 nanoseconds for . There is no decompression, no full-precision distance calculation, and nothing about the 768-dimensional ambient space appears in the inner loop. The LUT is built once at the start of each query and reused for every database vector the search evaluates.

Why this works as an approximation: the squared Euclidean distance over 768 dimensions is the sum of squared distances over disjoint chunks of dimensions (because squaring then summing is the same whether you split the dimensions into groups or not). So the true distance from to is exactly the sum of per-subspace squared distances. ADC plugs in the centroid that stood in for 's chunk in each subspace's distance, instead of 's actual chunk. The error is bounded by how far each chunk drifted from its centroid, which is small when -means has done its job.

Symmetric vs asymmetric

The "asymmetric" in the name reflects an asymmetry between and . In the algorithm above, is left at full float32 precision and only is compressed; the LUT entries are exact distances on 's side. There is a less common variant called symmetric distance computation that compresses too: encode against the codebooks like any database vector, store its 96 centroid IDs, and look up centroid-to-centroid distances in a precomputed table per subspace (which can be cached because the codebooks are fixed). The inner loop is slightly faster, but recall drops noticeably because 's coordinates have been snapped to the same coarse grid as the database, doubling the approximation error. Most production systems run ADC.

A subspace, in pictures

The demo below shows what PQ does inside a single subspace. The 50 points are the chunks of the dataset's vectors in this one subspace ( for visibility). -means with clusters them into the four marked centroids (the codebook for this subspace). Each point is then encoded as the index of its nearest centroid. For ADC, the query Q's distance to each of the four centroids is precomputed (the LUT entries shown below the chart), and any compressed point's contribution to in this subspace becomes a single array lookup.

1 / 5
Fifty raw points in the subspace

Each point is the projection of one database vector onto this 2-dimensional subspace. The full PQ would do this for 96 subspaces in parallel; here we look at one of them.

The full PQ does this independently for each of the subspaces. The compressed representation of a 768-dimensional vector is then such cluster indices, and the distance from Q to any compressed vector is the sum of subspace lookups.

Recall

PQ is approximate: the distances it computes are estimates, not exact, so the top- neighbours it returns can differ from the true top- that an exhaustive float32 comparison would have produced. Some queries return all the right answers; others swap one or two correct neighbours for slightly-wrong ones; very rarely a query returns badly wrong results. We need a single number that summarises how often this approximation produces the correct answer.

The standard metric is recall@k. The recipe:

  1. Pick a test query.
  2. Run an exact brute-force search: compare the query against every full-precision database vector, sort by true distance, keep the top . Call this set for "ground truth".
  3. Run the approximate search (PQ in our case) and keep its top . Call this set .
  4. Recall@k for this query is : how many vectors are in both lists, divided by .
  5. Repeat for many queries (a held-out test set of a few thousand is typical) and average.

Concrete example with . Suppose the brute-force top 10 for some query is . The approximate search returns , with a wrong vector that snuck in instead of . The intersection is , size , so recall@10 for this query is . Average that over the test set to get the overall recall@10. A configuration scoring 0.99 means that, on average, the search returns 9.9 of the 10 true neighbours per query.

Recall is a number between (none of the true neighbours found) and (every true neighbour found). Production targets vary: for semantic search on top of an LLM, recall@10 around 0.95-0.99 is the usual band, since the downstream application is forgiving of swapping near-equivalent neighbours but breaks if the search returns something irrelevant.

PQ does not lose much recall in practice, but the loss is not free. Three knobs shape the error in distance estimation, and therefore the recall:

  • Subspace dimension . Smaller subspaces (more chunks for the same ) give a higher compression ratio per vector but introduce more error per chunk, because each centroid has to stand in for a wider variety of inputs in its small subspace.
  • Codebook size . More centroids per subspace mean a finer-grained approximation, so the encoded chunk is closer to its original value. Diminishing returns kick in past , which is why 1-byte indices are the production sweet spot.
  • Data distribution. If the data within each subspace lies on a low-dimensional manifold (most chunks cluster in a thin slice of the 8-dimensional space), -means picks centroids on top of the dense regions and the per-chunk error stays small. If the data fills the subspace uniformly, the same number of centroids has to spread out and the per-chunk error grows.

Typical configuration on 768-dimensional sentence embeddings: , gives around 99% recall@10 compared to brute-force float32 search. The 1% recall loss is barely visible in downstream applications.

Training the codebook, and what to do when the data drifts

The codebook is trained once on a representative sample of the dataset, not on every database vector. A sample of a few hundred thousand to a few million chunks per subspace is enough to capture the distribution; -means on a million chunks runs in seconds with a SIMD distance kernel. Once the codebook is trained, it is used to encode all database vectors and any vectors that arrive later: each new vector is split into 96 chunks, and each chunk is mapped to the nearest centroid in its subspace's codebook. The centroids themselves don't move when new vectors are inserted.

This raises a real problem. If the data distribution shifts over time, the existing centroids stop being a good fit. Newly inserted chunks land far from any centroid; the per-chunk approximation error grows; distance estimates lose accuracy; recall drops. For a stable workload the drift is slow and the codebook stays useful for a long time, but for an evolving one (e.g., user-generated content or a model that gets re-trained) the codebook can become stale within weeks.

Production systems handle this in two ways:

  1. Periodic retraining. On a schedule (weekly, daily, hourly depending on how fast the data moves), draw a fresh sample, re-run -means to get a new codebook, and re-encode the dataset against it. The cost is moderate: -means on the sample plus one re-encoding pass over the database. Done in the background, the index never sees downtime.
  2. Streaming / online -means. Update centroids incrementally as new vectors are inserted, nudging each centroid towards the chunks newly assigned to it. This keeps the codebook fresh without explicit retraining, at the cost of a more complex insertion path. Less common in practice because the periodic-retrain approach is simpler and good enough for most workloads.

Whatever the cadence, the codebook is treated as a slowly-aging artefact: trained once, used for a while, retrained and replaced when the recall numbers say it's time.

The same idea recurs everywhere precision can be traded for size. The clearest case is large language model weight compression: a 70-billion-parameter model with float16 weights takes 140 GB, INT8 halves that to 70 GB, and INT4 halves it again to 35 GB, the difference between needing a small server and fitting on a single consumer GPU. The bookkeeping differs from PQ (no chunking into subspaces, just per-row or per-block scaling factors plus integer storage), but the high-level move is the same: replace each value with a coarse approximation and accept a small accuracy loss in exchange for a large saving in memory and bandwidth. Modern variants (GPTQ, AWQ, the GGUF Q4 family) layer calibration data and per-block scaling on top to push that trade further at 4 bits without breaking model accuracy.

The codebook construction itself recurs in audio and image compression, where a small dictionary of learned prototypes stands in for the raw signal. Different domains, different details, the same trade.

The series so far, and what's next

Four articles, working from the geometry outward:

  1. The problem. Why nearest neighbor search is easy in 1D, manageable in 2D, and breaks in 768D. The curse of dimensionality, derived from first principles.
  2. Small worlds. How to do approximate nearest neighbor search by walking on a graph. Naive proximity graphs and their failures, Watts-Strogatz, NSW, HNSW.
  3. On disk. How to scale HNSW to billions by serving the heavy data from SSD with DiskANN and Vamana, where each graph hop is one disk page read.
  4. Quantization. How to compress the heavy data instead, so the same graph index can hold a much larger dataset in the same RAM budget. Scalar quantization for the easy cases, product quantization for the hard ones, with -means as the engine and ADC as the inner loop.

The arc so far, in one sentence: pairwise distance information collapses in high dimensions, but graph-based search avoids partitioning the ambient space, the graph can be laid out so each hop is one disk page read, and quantization shrinks the per-vector cost so billion-scale indexes fit in budgets that float32 vectors never could.

Upcoming parts will cover GPU-accelerated indexes, hybrid search combining vector NN with metadata filters, and streaming indexes for continuously-changing datasets.