Vector search: Small worlds

The previous article ended with two requirements. Search structures cannot afford to partition the ambient space, because pruning bounds collapse once distances concentrate. And exact answers cost too much in high dimensions to be worth chasing, so we accept approximation in exchange for speed.

This article works through the structure that powers most modern vector databases. The acronym is HNSW, for Hierarchical Navigable Small World. There is a story behind every word in that name, and we'll build the structure piece by piece, starting from a much simpler graph and watching it fail in informative ways.

From cells to neighbors

The kd-tree thinks in cells. It carves space into rectangles and asks "which rectangle is the query in, and which neighboring rectangles could still hold a closer point?" The pruning answer collapses in 768 dimensions because pairwise distances are too tightly concentrated for any cell boundary to discriminate.

The shift is to stop asking about regions of space and start asking about the data itself. Build a graph where nodes are the points in your dataset and edges connect points that are near each other. To answer a query Q, start at some node and walk along edges, always moving to a neighbor that is closer to Q than where you currently stand. Stop when no neighbor improves.

That procedure is greedy walk on a proximity graph. It performs no geometric pruning. It computes no bounding boxes and compares no query coordinates against any hyperplane. The only operations are a distance comparison between Q and a handful of specific points in the dataset, and a decision about which point to look at next.

Why this might work in high dimensions, where partitioning fails: pruning broke down because bounding boxes are a description of the ambient space, and the ambient geometry stopped carrying useful information once $$d$$ grew large. Neighborhood relationships between actual data points don't depend on the ambient geometry at all. If two embeddings sit near each other on whatever low-dimensional manifold the data lives on, an edge between them in the graph encodes that fact directly. The walk never has to reason about the ambient space.

Now to the simplest version of the idea, and where it falls short.

The naive proximity graph

The simplest proximity graph is the k-nearest-neighbor graph (k-NN graph). For each point in the dataset, find its $$k$$ closest neighbors and add an edge to each. Every node ends up with approximately $$k$$ neighbors, all of them genuinely close.

Building this naively costs $$O(N^2)$$ distance computations. There are smarter constructions, but for the moment we just want to study what queries look like once the graph exists.

The greedy walk to find the nearest neighbor of a query Q:

current ← some starting node
loop:
    neighbor ← whichever of current's neighbors is closest to Q
    if dist(Q, neighbor) >= dist(Q, current):
        return current
    current ← neighbor

Each step strictly reduces the distance to Q, so the walk is guaranteed to terminate. The question is whether it terminates at the right answer.

The demo below builds a small k-NN graph in 2D and animates a greedy walk on a query the walk handles cleanly.

When the walk works, it works in roughly $O(\log N)$ hops on well-behaved data. On sixteen points, that is a handful of hops. On a million points, on the order of twenty hops if the graph is well-built.

Local minima

A point at which no neighbor of current is closer to Q than current itself is called a local minimum of the distance-to-Q function on the graph. The walk terminates at every local minimum, by definition. The trouble is that on a k-NN graph, points other than the global nearest neighbor can also be local minima.

The demo below shows this concretely: twelve points split into two well-separated clusters of six (left cluster labelled L1 through L6, right cluster R1 through R6), and a query Q placed inside the right cluster.

The walk starts at L1, hops to L5, then to L6, and stops at L6. The walk's claim is that L6 is the nearest neighbor of Q. The orange ring on R5 says otherwise: R5 is the actual nearest neighbor, sitting in the right cluster at distance 0.20 from Q, while L6 is at distance 5.00. The walk's answer is wrong by a factor of 25.

The mechanical reason the walk stopped is that L6's neighbors are all worse than L6:

Point	Distance to L6	Distance to Q
L6 (current)	0	5.00
L5	1.41	6.05
L4	2.24	7.10
L3	2.83	7.23

L6's only edges go to L5, L4, and L3, all sitting back inside the left cluster, all farther from Q than L6 itself. The greedy rule only steps to a strictly closer neighbor, so the walk has nowhere to go and terminates.

The deeper question is why L6 has no edge to R5, or to any right-cluster point. The answer is structural. L6's three nearest points anywhere in the dataset are L5, L4, and L3 (the table above), all on L6's own side. The closest right-cluster point to L6 is R1 at distance 3, farther than every one of L6's three within-cluster neighbors. With $$k = 3$$ , L6's edges all stop at the cluster boundary. The same reasoning applies to every other point in the left cluster: each one has five same-cluster candidates, and its three closest within-cluster neighbors always beat its closest cross-cluster neighbor. The result is a k-NN graph with two disconnected components, and a greedy walk that cannot leave whichever component it started in.

This is not a special case engineered for the demo. It is the generic behavior of k-NN graphs on data with cluster structure. Whenever a point's $$k$$ closest in-cluster neighbors sit nearer than its closest out-of-cluster neighbor, the k-NN edges stay inside clusters. Real embedding data is more clustered than the demo, not less: sentences about cooking, about astronomy, about basketball; images of cats, of trucks, of mountains. Whatever the data is, the embedding model places semantic neighbors close together and unrelated points far apart, which is precisely the geometry that disconnects k-NN graphs.

A common workaround is restarts: pick several random starting points, run greedy from each, return the best result. It works, in the sense that one of the random starts eventually lands in the right cluster. It is also expensive, and the number of restarts you need grows with the number of clusters in the data, which on real embeddings can be in the thousands.

The structural fix is to add edges that bridge clusters and let the walk make occasional long jumps. That is what the next section is about.

Small-world graphs

In 1998, Duncan Watts and Steven Strogatz published a model of networks that captured a longstanding observation about social graphs. Most of your friends know each other (the local neighborhood of any node is dense), and yet any two people on the planet are connected by a short chain of friendships (the network's average path length is small). Both can be true simultaneously, in a way that neither random graphs nor regular lattices reproduce on their own.

Their construction:

Start with a regular ring lattice. $$N$$ nodes arranged around a circle, each connected to its $$k$$ nearest neighbors along the ring.
For each edge, with probability $$p$$ , rewire one endpoint to a randomly chosen target.

The two endpoints of the parameter $$p$$ are familiar:

$$p = 0$$ : the lattice is unchanged. Locally dense, average path length $$O(N/k)$$ . Walking from one side of the ring to the other takes proportionally many hops.
$$p = 1$$ : every edge is rewired to a random target. The graph is essentially random. Almost no local structure, average path length $O(\log N)$ .

The interesting regime is in the middle. For $$p$$ somewhere between $$0.01$$ and $$0.1$$ , the graph keeps almost all of the lattice's local density, but the few rewired edges drop the average path length close to the random-graph value. The graph is locally dense and globally short. Watts and Strogatz called such a graph a small-world graph, after the "small world" effect from social psychology, popularised by Stanley Milgram's letter experiments and the "six degrees of separation" line.

Two graph-level metrics make the property measurable.

The average path length is the mean shortest-path distance over all pairs of distinct nodes. The shortest-path distance between two nodes is the smallest number of edges in any path connecting them, computed by running breadth-first search from one node and reading off the depth at which the other was reached. A ring lattice with $$N$$ nodes and connectivity $$K$$ has average path length on the order of $$N/(2K)$$ because walking from one side of the ring to the other requires stepping around it. A random graph with the same edge count has average path length on the order of $\log N$ , because random shortcuts collapse the diameter exponentially. On big graphs the two regimes differ by orders of magnitude.

The clustering coefficient measures how interconnected a node's neighbors are. Pick a node $$u$$ with $$k$$ neighbors and count the edges $$e$$ that exist among those $$k$$ neighbors, ignoring $$u$$ itself. There are $$k(k-1)/2$$ possible neighbor pairs, so $$u$$ 's local clustering coefficient is $$2e / (k(k-1))$$ . Concretely: if $$u$$ has three neighbors $$a$$ , $$b$$ , $$c$$ , the possible neighbor-pairs are $$ab$$ , $$ac$$ , $$bc$$ . If edges $$ab$$ and $$bc$$ are present but $$ac$$ is not, then $$e = 2$$ and $$u$$ 's local clustering is $$2/3$$ . The graph's clustering coefficient is the average of this quantity over all nodes with degree at least 2. In plain English: what fraction of my friends know each other? Ring lattices score high, around $$0.5$$ for $$K = 4$$ , because your ring-neighbors are also each other's neighbors. Random graphs score low, on the order of $$K/N$$ , because two random friends rarely happen to be each other's friends as well.

Three graphs on the same 20-node ring make the pattern concrete. The lattice on the left wraps neighbors tightly together (high clustering, around 0.50) but takes many hops to cross (average path length around 2.90). The random graph on the right has shortest paths (around 2.10) but no local structure left (clustering near 0.20). The middle graph rewires roughly 10% of the lattice's edges, the highlighted ones; it keeps almost all of the local clustering but drops the average path length almost halfway to the random graph's value. The asymmetry is the small-world property: a few long-range edges shorten paths drastically while leaving local density nearly intact.

The relevance to nearest-neighbor search is direct. Local edges are useful when the walk is already close to Q: they let it refine one short hop at a time. Long-range edges are useful when the walk is far from Q: a single hop can cover ground that a chain of local edges would take many steps to traverse. A graph with both is navigable: greedy walk can find short paths to any target from any starting point.

Watts and Strogatz proved that small-world graphs exist and are easy to construct. They did not give an algorithm for building one for nearest-neighbor search specifically. The catch is that you can't compute exact nearest neighbors during construction without doing the very thing you are trying to avoid. Yury Malkov and collaborators showed how to sidestep this in 2014, with the Navigable Small World construction.

NSW: build the graph as you go

NSW stands for Navigable Small World. The construction inserts data points one at a time, and gets the long-range edges for free by exploiting the order of insertion.

The algorithm:

The graph starts empty.
To insert a new point $P$:
- Use the existing graph to find the $$M$$ approximate nearest neighbors of $$P$$ among the points already inserted. This is itself a greedy walk (or a small beam search) from a few random starting nodes.
- Add $$P$$ as a new node and connect it bidirectionally to those $$M$$ neighbors.

That is the entire construction rule. The interesting structural fact is what it does to early insertions.

When the graph contains only five points, "the $$M$$ approximate nearest neighbors of the new point among the existing five" can span a wide swath of the data, because there are only five candidates to choose from. None of them is necessarily close in any absolute sense; they are just the closest of what is available. Those edges become long-range edges in the final graph: they connect points that, by the time the graph is full, sit far apart on the underlying manifold.

When the graph contains 100,000 points, "the $$M$$ approximate nearest neighbors of the new point among the existing 100,000" are usually genuinely close in the absolute sense. Those edges are short-range.

The same insertion rule produces long edges early and short edges late. Each node ends up with a mixture of both kinds, weighted toward whichever insertion era it belongs to. The aggregate is a small-world graph: locally dense (short edges everywhere), globally short (the early-era long edges keep the diameter low). Greedy walk navigates it well, because the local neighborhood is tight enough to refine and the long-range edges keep the walk from getting trapped in a single cluster.

There is one practical refinement. Strict greedy still occasionally gets trapped at a local minimum on real data. The fix is beam search, which keeps several candidates in flight at once instead of just one. The width of that frontier is a parameter, conventionally called ef (typically a small integer between 50 and a few hundred in production HNSW). Greedy walk corresponds to ef = 1; raising ef widens the frontier.

Beam search maintains two distinct structures, both seeded with the entry point:

A result set holding the closest ef points to Q whose distance has been computed during the walk so far. Think of it as the running answer: at any moment, these are the top-ef candidates the search has discovered. Capped at size ef: when the set is full and a newly evaluated point beats the current worst entry, that worst entry is evicted.
A candidate queue holding points whose graph neighbors still need to be examined, ordered by distance to Q with the closest at the front. A point enters this queue the first time its distance is computed and leaves it once its neighbors have been examined.

Each step pops the closest unexamined candidate from the queue and evaluates the distance from Q to every one of that point's graph neighbors. Each neighbor that beats the current worst entry in the result set is added to both the result set (evicting the previous worst if needed) and the candidate queue. The loop stops when the closest remaining candidate in the queue is already farther from Q than the worst entry in the result set: past that point, expanding a distant candidate is unlikely to reveal closer points (its graph neighbors live near it on the manifold), so the search bails out.

The slack is what lets beam search escape local minima, and the mechanism is more mechanical than directional. There is no detection of "this direction dead-ended." Each iteration just pops the closest unexpanded candidate from the queue and expands it. When an expansion produces no neighbors close enough to enter the result set, those neighbors fail the threshold check and never enter the queue at all. The next iteration simply pops the next-closest existing candidate, which may be something added several expansions ago that has been sitting in the queue waiting.

Walk through ef = 3 concretely. After expanding the entry point, the queue might be [B, C, D], ordered by distance to Q with B closest. Pop B and expand it. Some of B's neighbors are close enough to enter both the queue and the result set; the rest are discarded. The queue is now, say, [C, D, E, F]. Pop C. C might sit in a completely different region of the graph from B, but the algorithm does not know or care: C is just the next-closest unexpanded candidate by distance to Q. Repeat. Greedy walk is the ef = 1 case where the queue can never accumulate alternatives, so when one expansion produces no improvement the queue empties and the search terminates.

The search does not necessarily expand all ef candidates. The stop condition is the one defined earlier: as soon as the closest remaining candidate's distance to Q exceeds the worst entry in the result set, the search bails out, regardless of how many unexpanded candidates remain in the queue. On well-behaved queries this happens after a handful of expansions per layer. On harder queries more of the queue drains before the bound holds.

The demo below runs ef = 1 and ef = 2 side by side on the same seven-node graph. The query Q sits on the right; the true nearest neighbor is F (orange ring). Greedy (ef = 1) descends from S into A, finds that A's only other neighbor N is farther from Q, and terminates at A. Beam search (ef = 2) keeps C in its candidate queue from the very first expansion and eventually pops it, walking C → D → E → F.

Larger ef explores more of the graph, finds the true nearest more reliably, and costs more time. The same algorithm runs during both construction (where it finds each newly inserted point's M neighbors) and querying (where it answers user queries). The two roles are usually called efConstruction and efSearch and tuned independently.

NSW with beam search is already a strong ANN index. But it has one remaining problem.

HNSW: hierarchy fixes the rest

NSW has an asymmetry that bothered Malkov enough to revisit the design four years later. The long-range edges that make the graph navigable were created during the first few insertions. As the graph grows, those edges stay rare, but they are disproportionately load-bearing: every greedy walk on a large NSW relies heavily on the handful of long-range edges seeded in the early era. Heavily-used edges become routing bottlenecks. And the small-world property depends on the insertion order, so a poorly-ordered stream of insertions can produce an NSW graph with too few long-range edges to navigate well.

The fix, published by Malkov and Yashunin in 2018, is HNSW: Hierarchical Navigable Small World. Stack multiple NSW graphs on top of each other, with progressively fewer points in each upper layer. The thinning ratio between layers is $$1/M$$ , where $$M$$ is the same parameter that bounded each node's neighbor count in NSW (typically a small integer like 16 or 32 in production). Each upper layer holds roughly $$1/M$$ of the layer below it.

The construction:

Layer 0 contains every point. Edges in layer 0 connect each node to its short-range neighbors.
Layer 1 contains a random sample of layer 0, roughly $$1/M$$ of the points. Within layer 1 the same rule applies (each node connects to its short-range neighbors among the layer-1 nodes), which means layer-1 edges connect points that are short-range relative to layer 1's sparser population, and therefore medium-range relative to the underlying data.
Layer 2 contains a random sample of layer 1, roughly $$1/M$$ of that. Layer-2 edges are short-range among layer 2's even sparser population, which makes them long-range relative to the underlying data.
And so on, up to a topmost layer with $O(\log N)$ points total.

When a point is inserted, it draws a max-level value from this distribution. The max level decides how high up the layers the point reaches: a point with max level $$L$$ participates in every layer from 0 up to $$L$$ , inclusive.

The distribution is geometric:

P(\text{level} = L) = \left(\tfrac{1}{M}\right)^{L}\!\!\cdot\!\left(1 - \tfrac{1}{M}\right)

The $$(1 - 1/M)$$ factor is a normaliser. As an analogy, imagine a jar of 12 balls coloured 6 red, 4 blue, 2 green. The weights are $$6, 4, 2$$ , but the probabilities are $$6/12, 4/12, 2/12$$ : you divide each weight by the total to get a valid distribution that sums to 1. Same idea here. The unnormalised weights for the levels are $1, 1/M, 1/M^{2}, \ldots$ , each level $$M$$ times smaller than the one below. These sum to $$M/(M-1)$$ across all $L \ge 0$ , so dividing each weight by that sum (equivalently, multiplying by $$1 - 1/M$$ ) turns them into probabilities. The $$(1/M)^L$$ piece on its own is meaningful too: it's the probability of reaching level $$L$$ or any higher level, which is what you'd compute to count "how many points are at layer $$L$$ or above." Multiplying by $$(1 - 1/M)$$ converts that to "reach level $$L$$ exactly."

Layer 0 always contains every point, regardless of max level, because that's the bottom of the descent where the final search runs over the full dataset. Layer 1 contains the subset whose max level is at least 1 (about $$1/M$$ of all points); layer 2 contains the subset whose max level is at least 2 (about $$1/M^2$$ of all points); and so on. No points are lost in this thinning: a point that reaches max level 2 still participates in layers 0 and 1 below it. With $$M = 16$$ and a million inserted points, the layer populations work out roughly to: 1,000,000 at layer 0, 62,500 at layer 1, 3,900 at layer 2, 240 at layer 3, 15 at layer 4, and a couple at layer 5. The number of layers needed to cover $$N$$ points is on the order of $\log_M N$ , which keeps the top layer at $O(\log N)$ points regardless of dataset size.

Most points end up at level 0 only. A small fraction also exist at level 1; a smaller fraction reach level 2; and so on, with the count of nodes at each level decaying by a factor of $$M$$ as you go up. A point that exists at level $$L$$ participates in the NSW graph at every level from 0 up to $$L$$ , with separate adjacency lists per level but the same underlying point.

The skip-list analogy is exact. Skip lists are sorted linked lists with extra "express lanes": every node lives on the bottom lane, a roughly $$1/2$$ fraction also live on the second lane, $$1/4$$ on the third, and so on. Search starts at the top lane, races along it until it would overshoot the target, drops to the next lane, and continues. The expected cost is $O(\log N)$ , not because of any sophisticated argument, but because each lane has roughly half the nodes of the one below.

HNSW is the same construction in two more dimensions. The "lanes" are now NSW graphs over a 2D (or in production, $$d$$ -D) point set, and "racing along" is greedy walk. Search starts at the topmost layer's entry point (a fixed node, the highest-level point ever inserted), runs greedy walk on that layer until no neighbor improves, then drops to the next layer down using the best node found as the new starting point. Repeat until layer 0, where a final beam search of width efSearch returns the answer.

The visual demo below builds an HNSW with three layers and animates a query.

The top layers do the long-jump work that small-world rewiring did in a flat NSW, but in a structured way that does not depend on insertion order. The asymmetry is gone. Every layer is a small-world graph at its scale; the descent zooms in by an effective factor of $$M$$ at each step. Total search cost works out to $O(\log N)$ in expectation, the same as a skip list, for the same reason: a constant number of greedy hops per layer, times a logarithmic number of layers.

What you tune in practice

The four parameters that matter:

Parameter	Meaning	What it trades
`M`	Max neighbors per node at each layer	Larger $$M$$ = denser graph, higher recall, more memory and slower per-step search. Typical: 8 to 64.
`efConstruction`	Beam width during construction	Wider beam = better graph, slower indexing. One-time cost. Typical: 100 to 500.
`efSearch`	Beam width during query	Wider beam = higher recall, slower queries. The main runtime knob. Typical: 50 to 1000.
$$m_L$$	Level multiplier, defaults to $1/\ln M$	Sets the geometric distribution of node levels; rarely tuned by hand.

Recall (the fraction of the true nearest neighbors the search actually returns) and latency (the wall-clock time per query) are the two axes any vector database benchmark plots. On a million 128-dimensional vectors with a typical HNSW configuration, you can expect 95% recall@10 (averaged across queries, the search returns 9.5 of the true top-10 nearest neighbors) in around 1ms per query on a single core. Higher recall costs more efSearch, which scales sub-linearly with latency. The exact numbers vary with dimension, data distribution, and hardware, but the curve shape is consistent across implementations.

What's still missing

HNSW lives in RAM. Every edge is a pointer into the graph, every node holds a full vector for the inner-loop distance computations, and a billion 768-dimensional float32 vectors take up around 3 terabytes. That does not fit on a normal machine.

Two techniques close the gap. Quantization compresses each vector down to a few bytes (scalar quantization, product quantization, residual quantization), trading a small recall hit for a 10x to 100x memory saving. Disk-based indexes like DiskANN and Vamana keep the graph on SSD and use traversal patterns that keep the IO cost sublinear in the index size.

Together they are how production systems run vector search at billion scale. They are the topic of the next article in this series.

Parameter	Meaning	What it trades
`M`	Max neighbors per node at each layer	Larger $\(M\)$ = denser graph, higher recall, more memory and slower per-step search. Typical: 8 to 64.
`efConstruction`	Beam width during construction	Wider beam = better graph, slower indexing. One-time cost. Typical: 100 to 500.
`efSearch`	Beam width during query	Wider beam = higher recall, slower queries. The main runtime knob. Typical: 50 to 1000.
$\(m_L\)$	Level multiplier, defaults to $1/\ln M$	Sets the geometric distribution of node levels; rarely tuned by hand.