3 Minutes to Start Your Research in Nearest Neighbor Search
Spotify likely represents each song as a vector in a high-dimensional space (say, around 100 dimensions). Sounds overly complex, but that's how they predict your taste (though not always exactly).
I recently got involved in research on nearest neighbor search and here's what I've learned about the fundamentals: where it's used, the main algorithms, evaluation metrics, and the datasets used for testing. I’ll use simple examples and high-level explanations so you can get the core idea in one read.
Where You've Already Used This
You definitely use nearest neighbor search at least several times a week. Most commonly in recommendation systems like music apps and YouTube, since the easiest way to recommend content is to show you things similar to what you already liked. These are the neighbors such systems try to find. These algorithms have many other applications too, like identifying similar faces in social networks, or matching similar molecules in drug discovery.
You can encode virtually anything for these algorithms: music and videos, dance movements, even legal documents, as long as you can convert it into a list of numbers.
How Precise Does the Search Need to Be?
These algorithms split into two categories: exact and approximate. If you want to find the X objects most similar to a given one, an exact algorithm guarantees the X truly nearest neighbors, while an approximate algorithm returns X close neighbors that aren't necessarily the closest. Approximate search is usually faster.
It's not obvious, but approximate search is actually more common in recommendation systems. YouTube doesn't need to show you a video that's absolutely identical to one you just watched; it's fine if it's "roughly" similar, which also helps them learn more about your interests.
The High-Level Idea That Connects These Algorithms
I noticed that nearest neighbor algorithms share one common challenge but approach it differently, and that difference is what distinguishes them.
Imagine searching for a specific person in a stadium with 50,000 spectators. The naive approach is to check everyone. The smarter approach is to first locate their section, then row, then seat, skipping most of the stadium entirely. Most neighbor search algorithms do something similar: they figure out how NOT to check most of the data.
This is the key technique for speeding up search. You'll likely need to compare your point against some neighbors anyway; the crucial task is selecting which ones to check to reach the result as quickly as possible.
Brief Overview of Main Algorithms
- Brute-force: Check everything, slow but accurate, like scrolling through all your phone contacts to find a specific person.
- Ball tree: Divides space into zones (typically nested balls) and records each zone's center. If the center is far from your query point, all points inside that ball are likely far too, so you can skip the entire zone.
- HNSW: Builds multiple graph layers with exponentially fewer points at each higher layer. Upper layers get you close quickly, since fewer points mean each hop covers more distance; lower layers, which you descend to once the upper layers have roughly located the right zone, refine the search with denser local connections.
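To make the brute-force baseline concrete, here is a minimal sketch in Python. The points, query, and function names are my own illustration, not taken from any particular library:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(points, query, k):
    # Score every point against the query, then keep the k closest:
    # exact results, but O(n) distance computations per query.
    return sorted(points, key=lambda p: euclidean(p, query))[:k]

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.5, 0.5)]
print(brute_force_knn(points, (1.0, 0.0), k=2))
# → [(1.5, 0.5), (0.0, 0.0)]
```

The smarter algorithms above exist precisely to avoid that `sorted` over the whole dataset: they use extra structure (balls, graph layers) to discard most candidates without ever computing their distances.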
What Format Is the Data In?
It's important to understand the data format these algorithms work with and why having 100 dimensions is completely normal. Each object (point) in space is simply a list of numbers. The number of "dimensions" equals the number of values in that list. Each number represents one learned or defined feature of the object. These features can be anything; it depends on how the implementers decide to define them.
These number lists can describe all kinds of objects. For example, your home coordinates are just two numbers (longitude and latitude); a desk's position in a room needs three (x, y, z). An image can be represented by numbers: either the raw pixel colors, or higher-level patterns that a computer can learn, like shapes, colors, or textures. For example, one number in the list might indicate whether there’s a cube in the image, while another number could indicate whether there’s a traffic light. These numbers allow the computer models to "understand" the image in a way that’s similar to how a human perceives patterns. The key is that all objects must be encoded the same way so you can find similar ones.
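As a toy illustration of this encoding, here is a sketch where songs are three-number feature lists. The feature names and values are invented for the example:

```python
import math

# Each "song" is a list of numbers; each position is one feature,
# e.g. (tempo, energy, acousticness), scaled to [0, 1].
song_a = [0.8, 0.9, 0.1]   # fast, energetic, not acoustic
song_b = [0.7, 0.8, 0.2]   # similar profile to song_a
song_c = [0.2, 0.1, 0.9]   # slow acoustic track

def distance(u, v):
    # Euclidean distance between two feature lists.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# song_b is the nearer neighbor of song_a, matching intuition.
print(distance(song_a, song_b) < distance(song_a, song_c))  # → True
```

Because all three songs are encoded the same way, "similar sound" becomes "small distance", and any of the algorithms above can search over them.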
How Can You Test the Quality of Neighbor Search?
To evaluate these algorithms, specialized datasets are often used. These datasets consist of a list of points in space and a list of queries for which we want to find neighbors. The algorithm processes these queries, and the results are compared against "ground truth" answers provided by the dataset itself.
For example, the SIFT1M dataset contains one million 128-dimensional points and provides ground truth results for 10,000 queries. Each point represents SIFT features extracted from an image. This dataset serves as a benchmark to test how an algorithm would behave when searching for similar images in high-dimensional space.
Other datasets may contain information about text documents, molecular structures, or other types of data depending on the application domain.
More About Search Quality Metrics
- Recall@k - what fraction of the true k nearest neighbors were correctly retrieved (verified against the provided ground truth)
- Query latency - how long the search took for a single query
- Build time - how long it took to construct the necessary data structures, if any. For example, ball trees must build their hierarchical tree structure, while brute force needs no preprocessing.
But you can also go much deeper into metrics depending on your goals: how much data was processed, how often memory is accessed, and the cost of each memory access.
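Recall@k, for instance, takes only a few lines to compute. This sketch assumes the ground-truth and retrieved neighbors are given as lists of point IDs, which is how benchmarks like SIFT1M typically provide them:

```python
def recall_at_k(retrieved, ground_truth, k):
    # Fraction of the true k nearest neighbors that appear
    # among the first k retrieved results.
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# The algorithm returned 3 of the true top-4 neighbors:
print(recall_at_k([7, 2, 9, 4], [2, 7, 4, 1], k=4))  # → 0.75
```

Exact algorithms score 1.0 by construction; for approximate ones, recall@k is the main knob you trade against query latency.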
Additional Research Questions
If this topic interests you, here are several questions that can deepen your understanding. Most of them have already been well studied.
- How do different distance metrics (Euclidean, cosine, Manhattan) affect performance?
- How do these algorithms scale to billions of points?
- Can we combine multiple algorithms for better results?
- What preprocessing techniques improve performance?
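On the first question, even a tiny example shows that the choice of metric can change who counts as "nearest". The vectors here are invented purely for illustration:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    # 1 minus the cosine of the angle between the vectors:
    # ignores magnitude, compares direction only.
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1 - dot / norm

q = [1.0, 1.0]
a = [10.0, 10.0]  # same direction as q, but far away
b = [0.0, 1.0]    # close to q, but pointing differently

print(euclidean(q, a) > euclidean(q, b))              # → True: b wins by Euclidean
print(cosine_distance(q, a) < cosine_distance(q, b))  # → True: a wins by cosine
```

So the "right" metric depends on what similarity means for your data, which is exactly why this question is worth studying per application.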
If you have comments or think I'm wrong somewhere, please email: firstname.lastname@warwick.ac.uk.