Billion-Scale Approximate Nearest Neighbor Search Challenge: NeurIPS'21 competition track

[Code, data, results on github] [Medium channel]

NeurIPS'21 leaderboard and session schedule

Track T1/T2 leaderboard.
Track T3 leaderboard.

The NeurIPS session for this competition is scheduled for Dec 8. NeurIPS registration is required for access to the session.
Overview Talk and Break-out session schedule (GMT).

Abstract for Invited talk: "Learning to Hash Robustly, with Guarantees"
There is a gap between the high-dimensional nearest neighbor search (NNS) algorithms achieving the best worst-case guarantees and the top-performing ones in practice. The former are based on indexing via randomized Locality Sensitive Hashing (LSH) and its derivatives. The latter "learn" the best indexing method in order to speed up NNS, crucially adapting to the structure of the given dataset. Alas, the latter also almost always come at the cost of losing guarantees of either correctness or robust performance on adversarial queries (or apply only to datasets with an assumed extra structure/model). How can we bridge these two perspectives and bring the best of both worlds? As a step in this direction, we will talk about an NNS algorithm that has worst-case guarantees essentially matching those of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We will discuss the algorithm's ability to optimize for a given dataset from both theoretical and practical perspectives.

Abstract for Invited talk: "Iterative Repartitioning for Learning to Hash and the Power of k-Choices"
Dense embedding models are commonly deployed in commercial search engines, wherein all the vectors are pre-computed and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing an NNS hurts the query time and accuracy of these models. In this talk, we argue that high-dimensional and ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing it with simple lookups, while high dimensionality ensures that the embeddings are informative even when sparse. However, learning extremely high-dimensional embeddings leads to a blow-up in model size. To make training feasible, we propose a partitioning algorithm that learns such high-dimensional embeddings across multiple GPUs without any communication. We theoretically prove that our way of one-sided learning is equivalent to learning both query and label embeddings. We call our novel system built on sparse embeddings IRLI (pronounced `early'), which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data. Furthermore, IRLI employs a superior power-of-k-choices based load-balancing strategy. We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing. IRLI surpasses the best baseline's precision on multi-label classification while being 5x faster at inference. For near-neighbor search tasks, the same method outperforms the state-of-the-art learned hashing approach NeuralLSH, requiring only about 1/6th of the candidates for the same recall. IRLI is both data and model parallel, making it ideal for distributed GPU implementation. We demonstrate this advantage by indexing 100 million dense vectors and surpassing the popular FAISS library by >10%.

Why this competition?

In the past few years, we’ve seen a lot of new research and creative approaches for large-scale ANNS, including:

In addition to an uptick in academic interest, many implementations of these algorithms at scale now appear in production and high-availability datacenter contexts: powering enterprise-grade, mission-critical, and web-scale search applications. In these deployment scenarios, metrics such as cost, preprocessing time, and power consumption become just as important as the recall-vs-latency tradeoff. Despite this, most empirical evaluations of algorithms have focused on smaller datasets of about a million points. However, deploying recent algorithmic advances in ANNS techniques for search, recommendation, and ranking at scale requires support at billion scale or substantially larger. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale.

We believe that this challenge will be impactful in several ways: by providing a platform for those interested in this problem, we aim to encourage more collaboration and collectively advance the field at a more rapid pace. Researchers can request Azure compute credit from a pool sponsored by Microsoft Research.


Standard Hardware Tracks (T1 and T2)

There are two standard hardware tracks:

Participants are expected to release their code for index building and search, which the organizers will run on separate machines. Participants must provide a configuration for their index build code that completes within 4 days for each dataset. The protocol for evaluation is as follows:

Finalized details for build and search hardware timing will be released along with the eval framework.

Custom Hardware Track (T3)

Participants can use non-standard hardware such as GPUs, AI accelerators, FPGAs, and custom in-memory silicon. In this track, participants will either 1) send their hardware, such as PCI boards, to GSI Technology, or 2) run the evaluation themselves using the scripts made available by the organizers. For T3 participants sending hardware, we will make specific delivery arrangements at the participant's expense. We will install the hardware on a system under the organizers' control (we have a few bare-metal options available) and follow any installation directions provided. Participants will be allowed to temporarily log into the machine to finalize any installation and configuration, or for debugging installation as needed. For T3 participants running the evaluation themselves, we request remote ssh access and sudo accounts on the systems so that the organizers can verify the system and hardware (such as IPMI support and minimum resource availability, e.g. disk storage for datasets). The evaluation phase will proceed like T1/T2, with a few modifications.

T3 will maintain a separate leaderboard for each dataset based on the following benchmarks. We will provide the exact details of how we collect and compute these benchmarks, as well as additional machine and operating-system specifications, before the competition begins.

Benchmark Datasets

We intend to use the following six billion-point datasets.

All datasets are in a common binary format that starts with an 8-byte header consisting of num_points (uint32_t) and num_dimensions (uint32_t), followed by num_points x num_dimensions x sizeof(type) bytes of data stored one vector after another. Data files have suffixes .fbin, .u8bin, and .i8bin to represent float32, uint8, and int8 type data, respectively. Note that a different query set will be used for evaluation. The details of the datasets, along with links to the base, query, and sample sets and the ground truth nearest neighbors of the query set, are listed below.
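The binary layout above can be parsed in a few lines. The sketch below is an illustrative reader (not the official tooling); the function name and the dtype argument are our own choices.

```python
import numpy as np

def read_bin(path, dtype):
    """Read a vector file in the common binary format: an 8-byte header
    of num_points (uint32) and num_dimensions (uint32), followed by
    num_points x num_dimensions values of the file's element type."""
    with open(path, "rb") as f:
        n, d = np.fromfile(f, dtype=np.uint32, count=2)
        data = np.fromfile(f, dtype=dtype, count=int(n) * int(d))
    return data.reshape(int(n), int(d))

# Suffix determines the element type:
#   .fbin -> np.float32, .u8bin -> np.uint8, .i8bin -> np.int8
```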

The ground truth binary files for k-NN search consist of the following information: num_queries (uint32_t) and K (uint32_t), followed by num_queries x K x sizeof(uint32_t) bytes of data representing the IDs of the K nearest neighbors of the queries, followed by num_queries x K x sizeof(float) bytes of data representing the distances to the corresponding points. The distances help identify neighbors tied in distance. In recall calculation, returning a neighbor not in the ground truth set but whose distance is tied with an entry in the ground truth is counted as a success.
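The tie-aware recall rule above can be sketched as follows. This is an illustration of the stated rule, not the official scoring code: any ID whose ground-truth distance ties (within a small epsilon) with the k-th neighbor's distance is accepted.

```python
import numpy as np

def knn_recall(result_ids, gt_ids, gt_dists, k, eps=1e-6):
    """Recall@k for one query with tie handling: ground-truth entries
    whose distance equals the k-th neighbor's distance (within eps)
    are all acceptable answers, even beyond position k."""
    threshold = gt_dists[k - 1] + eps
    valid = set(gt_ids[np.asarray(gt_dists) <= threshold])
    return len(valid.intersection(result_ids[:k])) / k
```

For example, if the 3rd and 4th ground-truth neighbors are equidistant, returning either one counts toward recall@3.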

The ground truth binary files for range search consist of the following information: num_queries (int32_t), followed by the total number of results total_res (int32_t), followed by num_queries x sizeof(int32_t) bytes corresponding to num_results_per_query for each query, followed by total_res x sizeof(int32_t) bytes corresponding to the IDs of the neighbors of each query, one after the other.
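A minimal sketch of a reader for this range-search ground truth layout, again illustrative rather than the official tooling:

```python
import numpy as np

def read_range_gt(path):
    """Read range-search ground truth: header of num_queries (int32)
    and total_res (int32), then per-query result counts, then the
    flat array of neighbor IDs for all queries concatenated."""
    with open(path, "rb") as f:
        num_queries, total_res = np.fromfile(f, dtype=np.int32, count=2)
        counts = np.fromfile(f, dtype=np.int32, count=int(num_queries))
        ids = np.fromfile(f, dtype=np.int32, count=int(total_res))
    # Split the flat ID array into one array of neighbors per query.
    offsets = np.cumsum(counts)[:-1]
    return np.split(ids, offsets)
```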

The ground truth files for the first 10M slice, the first 100M slice, and the complete 1B set of each dataset against the respective query set can be downloaded here(10M), here(100M), and here(1B).

| Dataset | Datatype | Dimensions | Distance | Range/k-NN | Base data | Sample data | Query data | Ground truth | Release terms |
|---|---|---|---|---|---|---|---|---|---|
| BIGANN | uint8 | 128 | L2 | k-NN | 1B points | 100M base points | 10K queries | link | CC0 |
| Facebook SimSearchNet++* | uint8 | 256 | L2 | Range | 1B points | N/A | 100K queries | link | CC BY-NC |
| Microsoft Turing-ANNS* | float32 | 100 | L2 | k-NN | 1B points | N/A | 100K queries | link | link to terms |
| Microsoft SPACEV* | int8 | 100 | L2 | k-NN | 1B points | 100M base points | 29.3K queries | link | O-UDA |
| Yandex DEEP | float32 | 96 | L2 | k-NN | 1B points | 350M base points | 10K queries | link | CC BY 4.0 |
| Yandex Text-to-Image* | float32 | 200 | inner-product | k-NN | 1B points | 50M queries | 100K queries | link | CC BY 4.0 |
* new datasets
We recommend using Axel for downloading BIGANN, Facebook-SSN++, Yandex DEEP1B and T2I datasets.
We recommend using AzCopy for downloading Microsoft datasets.


The competition will measure recall@10 of the algorithms on the 6 datasets with a private query set (unreleased) at a fixed query throughput. Track T1 measures recall at 10000 queries/second (on 32 vCPUs), Track T2 at 1500 queries/second, and Track T3 at 2000 queries/second. The primary metric for comparison in each track will be the sum, over all datasets, of improvements in recall over the baseline at the target QPS. Additionally, track T3 will also rank entries by power and cost per query. See this notebook for power and cost analysis. A team has to publish an algorithm and commit to benchmarking on at least 3 datasets to be considered for ranking. A recall regression on a dataset selected by a team will be counted as a negative score. The recall@10 (AP for the SSN++-1B dataset) of the baseline algorithms on each dataset for the public query set is listed below.
| Track | Algorithm | Search Machine | Target Queries/sec | BIGANN-1B | SSN++-1B | Turing-ANNS-1B | SPACEV-1B | DEEP-1B | Text-to-Image-1B |
|---|---|---|---|---|---|---|---|---|---|
| Track 1 | FAISS-CPU | Azure F32s_v2: 32 vCPUs + 64GB RAM | 10000 | 0.634 | 0.753 | 0.703 | 0.728 | 0.650 | 0.069 |
| Track 2 | DiskANN | Azure L8s_v2: 8 vCPUs + 64GB RAM + 1TB SSD | 1500 | 0.949 | 0.16274 | 0.936 | 0.901 | 0.937 | 0.488 |
| Track 3 | FAISS-GPU | NVIDIA V100 + 700GB RAM | 2000 | 0.927 | TBA | 0.910 | 0.850 | 0.942 | 0.86 |
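The primary ranking metric described above (sum of per-dataset recall improvements over the baseline, with regressions counted negatively) could be computed as in this hypothetical sketch; the function name and the submission values are illustrative, and only the baseline numbers come from the table.

```python
def primary_metric(submission, baseline):
    """Sum over datasets of (submission recall - baseline recall)
    at the target QPS; regressions contribute negatively."""
    return sum(submission[d] - baseline[d] for d in submission)

# Baseline recalls from the T1 row above; submission values are made up.
baseline = {"BIGANN-1B": 0.634, "DEEP-1B": 0.650}
submission = {"BIGANN-1B": 0.700, "DEEP-1B": 0.630}
score = primary_metric(submission, baseline)  # 0.066 - 0.020 = 0.046
```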

Baseline DiskANN indices for T2 can be downloaded using "azcopy copy '' 'local_folder' --recursive". Note that this will take some time, as the indices are large. All indices were built with the R and L parameters set to 100. Search for T2 used 16 threads and a beamwidth of 4. The Ls parameter was varied to tune recall vs. QPS.
Update: T2 baseline results have been modified after measuring via the pybind11 interface on Docker. There was a 30-40% QPS loss using this interface as compared to direct measurement of the C++ code from the command line. As a result, the QPS target has been lowered, and recall is reported at this threshold.

Call for Participation and Timeline

Participation is open to all teams interested in developing new algorithms or re-implementing existing algorithms more efficiently either in software or hardware. Participants are requested to submit a brief document through CMT for each track they will be competing in. The document should contain the following details:

For Track T3, the document should contain the following additional details to help organizers plan and assess eligibility for separate leaderboards:

Consent Forms

Please review and complete the consent form for participation in Tracks T1/T2 and Track T3. Note that there are separate consent forms for the standard and custom hardware tracks. Completing the form is necessary for participation.

Timeline (subject to change)

Organizers and Dataset Contributors

Organizers can be reached at

We thank Microsoft Research for help in organizing this competition, and contributing compute credits. We thank Microsoft Bing and Turing teams for creating datasets for release.