ʕ´•ᴥ•`ʔ roman bikbulatov

3 Minutes to Start Your Research in Nearest Neighbor Search

Spotify likely represents each song as a vector in a high-dimensional space (say, around 100 dimensions). Sounds overly complex, but that's how they predict your taste (though not always exactly).

I recently got involved in research on nearest neighbor search and here's what I've learned about the fundamentals: where it's used, the main algorithms, evaluation metrics, and the datasets used for testing. I’ll use simple examples and high-level explanations so you can get the core idea in one read.

Where You've Already Used This

You definitely use nearest neighbor search at least several times a week. Most commonly in recommendation systems like music apps and YouTube, since the easiest way to recommend content is to show you things similar to what you already liked. These are the neighbors such systems try to find. These algorithms have many other applications too, like identifying similar faces in social networks, or matching similar molecules in drug discovery.

You can encode virtually anything into these algorithms: music and videos, dance movements, even legal documents, as long as you can convert it into numbers.

How Precise Does the Search Need to Be?

These algorithms split into two categories: exact and approximate. If you want the X objects most similar to a specific one, an exact algorithm guarantees the X truly nearest neighbors, while an approximate algorithm returns X close neighbors that aren't necessarily the closest. Approximate search is usually faster.
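To make the exact case concrete, here is a minimal sketch of exact search: a brute-force scan that computes the distance from the query to every point and keeps the k closest. The function name and the toy 2D points are my own illustration, not from any particular library.

```python
import numpy as np

def exact_knn(points, query, k):
    """Exact k-nearest-neighbor search by checking every point.

    Guarantees the true k nearest neighbors, at the cost of
    comparing the query against the entire dataset.
    """
    dists = np.linalg.norm(points - query, axis=1)  # distance to every point
    return np.argsort(dists)[:k]                    # indices of the k closest

# Toy example: 5 points in 2D, find the 2 nearest to the origin.
pts = np.array([[0.0, 1.0], [5.0, 5.0], [0.5, 0.0], [3.0, 1.0], [0.1, 0.2]])
print(exact_knn(pts, np.array([0.0, 0.0]), k=2))  # → [4 2]
```

Approximate algorithms accept that this scan is too slow at scale and trade a little accuracy for speed.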

It's not obvious, but approximate search is actually more common in recommendation systems. YouTube doesn't need to show you a video that's absolutely identical to one you just watched; it's fine if it's "roughly" similar, which also helps them learn more about your interests.

The High-Level Idea That Connects These Algorithms

I noticed that nearest neighbor algorithms share one common challenge but approach it differently, and that approach is what distinguishes them.

Imagine searching for a specific person in a stadium with 50,000 spectators. The naive approach is to check everyone. The smarter approach is to first locate their section, then row, then seat, skipping most of the stadium entirely. Most neighbor search algorithms do something similar: they figure out how NOT to check most of the data.

This is the key technique for speeding up search. You'll likely need to compare your point against some neighbors anyway; the crucial task is selecting which ones to check to reach the result as quickly as possible.

Brief Overview of Main Algorithms

What Format Is the Data In?

It's important to understand the data format these algorithms work with and why having 100 dimensions is completely normal. Each object (point) in space is simply a list of numbers. The number of "dimensions" equals the number of values in that list. Each number represents one learned or defined feature of the object. These features can be anything; it depends on how the implementers decide to define them.

These number lists can describe all kinds of objects. For example, your home coordinates are just two numbers (longitude and latitude); a desk's position in a room needs three (x, y, z). An image can be represented by numbers too: either the raw pixel colors, or higher-level patterns that a computer can learn, like shapes, colors, or textures. One number in the list might indicate whether there's a cube in the image, while another could indicate whether there's a traffic light. These numbers allow computer models to "understand" the image in a way that's similar to how a human perceives patterns. The key is that all objects must be encoded the same way so you can find similar ones.
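Here is a toy illustration of that format. The feature names and values are entirely made up (real systems learn features rather than hand-pick them); the point is that once objects are lists of numbers encoded the same way, similarity becomes a simple computation, such as cosine similarity.

```python
import numpy as np

# Hypothetical hand-made features: [has_cube, has_traffic_light, brightness].
# Real embeddings are learned and have ~100+ dimensions; 3 is just for clarity.
street_photo   = np.array([0.0, 1.0, 0.6])
another_street = np.array([0.0, 0.9, 0.5])
toy_blocks     = np.array([1.0, 0.0, 0.8])

def cosine_similarity(a, b):
    """Higher means more similar (1.0 = pointing the same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(street_photo, another_street))  # close to 1
print(cosine_similarity(street_photo, toy_blocks))      # noticeably lower
```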

To evaluate these algorithms, specialized datasets are often used. These datasets consist of a list of points in space and a list of queries for which we want to find neighbors. The algorithm processes the queries, and its results are compared against "ground truth" answers provided by the dataset itself.

For example, the SIFT1M dataset contains one million 128-dimensional points and provides ground truth results for 10,000 queries. Each point represents SIFT features extracted from an image. This dataset serves as a benchmark to test how an algorithm would behave when searching for similar images in high-dimensional space.

Other datasets may contain information about text documents, molecular structures, or other types of data depending on the application domain.
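A common way to score that comparison against ground truth is recall@k: the fraction of the true k nearest neighbors that the algorithm actually found. Loading a real dataset like SIFT1M is format-specific, so this sketch just uses toy index lists.

```python
def recall_at_k(found, ground_truth, k):
    """Fraction of true top-k neighbors that the algorithm returned,
    averaged over all queries."""
    hits = sum(len(set(f[:k]) & set(g[:k]))
               for f, g in zip(found, ground_truth))
    return hits / (k * len(ground_truth))

# Toy example: 2 queries, k=2. The algorithm missed one true neighbor
# on the second query, so recall is 3 hits out of 4 possible.
truth = [[4, 2], [1, 7]]
found = [[4, 2], [1, 3]]
print(recall_at_k(found, truth, k=2))  # → 0.75
```

An exact algorithm always scores 1.0 here; approximate algorithms report curves of recall versus speed.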

More About Search Quality Metrics

Depending on your goals, you can also go much deeper into metrics: how much data was processed, how often memory is accessed, and the cost of each memory access.
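One of those deeper metrics, the amount of data processed, is easy to instrument yourself. A minimal sketch: wrap the distance function in a counter, so you can see how many comparisons a search performs (the class and names here are my own illustration).

```python
class CountingDistance:
    """Distance function that counts how many times it is called."""
    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean

dist = CountingDistance()
points = [(0, 1), (5, 5), (0.5, 0), (3, 1)]
# Brute force touches every point, so the counter ends at len(points).
nearest = min(range(len(points)), key=lambda i: dist(points[i], (0, 0)))
print(nearest, dist.calls)  # → 2 4
```

Two algorithms with the same recall can then be compared by how few distance computations they need.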

Additional Research Questions

If this topic interests you, here are several questions that can deepen your understanding. Most of them have already been well studied.

If you have comments or think I'm wrong somewhere, please email: firstname.lastname@warwick.ac.uk.