Billion-Scale Approximate Nearest Neighbor Search Challenge: NeurIPS'21 competition track

Evaluation framework on github. Subscribe to our Medium channel for discussions and announcements.

NeurIPS'21 leaderboard and session schedule

Track T1/T2 leaderboard.
Track T3 leaderboard.

The NeurIPS session for this competition is scheduled for Dec 8. NeurIPS registration is required for access to the session.
Overview Talk and Break-out session schedule (GMT).
  • 11:05-11:25: Overview Talk (slides)
  • 12:00-12:45: Overview of results presented by organizers, followed by Q&A
    • Standard hardware tracks T1 and T2 results (slides)
    • Custom hardware track T3 results (slides)
  • 12:45-13:20: Invited talk 1 by Prof. Alexandr Andoni: Learning to Hash Robustly, with Guarantees (slides)
  • 13:20-13:55: Invited talk 2 by Prof. Anshumali Shrivastava:Iterative Repartitioning for Learning to Hash and the Power of k-Choices (slides)
  • 13:55-14:30: Talks from track winners.
    • Track 1: kst_ann_t1 Li Liu, Jin Yu, Guohao Dai, Wei Wu, Yu Qiao, Yu Wang, Lingzhi Liu, Kuaishou Technology and Tsinghua University (slides)
    • Track 2: BBANN Xiaomeng Yi, Xiaofan Luan, Weizhi Xu, Qianya Cheng, Jigao Luo, Xiangyu Wang, Jiquan Long, Xiao Yan, Zheng Bian, Jiarui Luo, Shengjun Li, Chengming Li, Zilliz and Southern University of Science and Technology (slides)
    • Track 3: OptaNNe Sourabh Dongaonkar, Mark Hildebrand, Mariano Tepper, Cecilia Aguerrebere, Ted Willke, Jawad Khan, Intel Corporation, Intel Labs and UC Davis (slides)
  • 14:30-15:00: Open discussion on competition and future directions (github thread)

Abstract for Invited talk: "Learning to Hash Robustly, with Guarantees"
There is a gap between the high-dimensional nearest neighbor search (NNS) algorithms achieving the best worst-case guarantees and the top-performing ones in practice. The former are based on indexing via the randomized Locality Sensitive Hashing (LSH), and its derivatives. The latter "learn" the best indexing method in order to speed-up NNS, crucially adapting to the structure of the given dataset. Alas, the latter also almost always come at the cost of losing the guarantees of either correctness or robust performance on adversarial queries (or apply to datasets with an assumed extra structure/model). How can we bridge these two perspectives and bring the best of both worlds? As a step in this direction, we will talk about an NNS algorithm that has worst-case guarantees essentially matching that of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We will discuss the algorithm's ability to optimize for a given dataset from both theoretical and practical perspective.

Abstract for Invited talk: "Iterative Repartitioning for Learning to Hash and the Power of k-Choices"
Dense embedding models are commonly deployed in commercial search engines, wherein all the vectors are pre-computed, and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing an NNS hurts the query time and accuracy of these models. In this talk, we argue that high-dimensional and ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing them with simple lookups, while its high dimensionality ensures that the embeddings are informative even when sparse. However, learning extremely high dimensional embeddings leads to blow-up in the model size. To make the training feasible, we propose a partitioning algorithm that learns such high-dimensional embeddings across multiple GPUs without any communication. We theoretically prove that our way of one-sided learning is equivalent to learning both query and label embeddings. We call our novel system designed on sparse embeddings as IRLI (pronounced `early'), which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data. Furthermore, IRLI employs a superior power-of-k-choices based load balancing strategy. We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing. IRLI surpasses the best baseline's precision on multi-label classification while being 5x faster on inference. For near-neighbor search tasks, the same method outperforms the state-of-the-art Learned Hashing approach NeuralLSH by requiring only ~ {1/6}^th of the candidates for the same recall. IRLI is both data and model parallel, making it ideal for distributed GPU implementation. We demonstrate this advantage by indexing 100 million dense vectors and surpassing the popular FAISS library by >10%.

Why this competition?

In the past few years, we’ve seen a lot of new research and creative approaches for large-scale ANNS, including:
  • Partition-based, and graph-based indexing strategies (as well as hybrid indexing approaches).
  • Mixing RAM and SSD storage to efficiently store and process large datasets that exceed the size of RAM.
  • Using accelerator hardware such as GPUs, FPGAs, and other custom in-memory silicon.
  • Leveraging machine learning for dimensionality reduction of the original vectors.

In addition to an uptick in academic interest, many implementations of these algorithms at scale now appear in production and high availability datacenter contexts: powering enterprise-grade, mission-critical, and web-scale search applications. In these deployment scenarios, benchmarks such as cost, preprocessing time, power consumption become just as important as the recall-vs-latency tradeoff. Despite this, most empirical evaluations of algorithms have focused on smaller datasets of about a million points, e.g. However, deploying recent algorithmic advances in ANNS techniques for search, recommendation and ranking at scale requires support at billion or substantially larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale.

We believe that this challenge will be impactful in several ways:
  • Provide a comparative understanding of algorithmic ideas and their application at scale.
  • Promote the development of new techniques for the problem and demonstration of their value.
  • Provide a compilation of datasets, many new, to enable future development of algorithms.
  • Introduce a standard benchmarking approach.
By providing a platform for those interested in this problem, we aim to encourge more collaboration and collectively advance the field at a more rapid pace. Researchers can request Azure compute credit from a pool sponsored by Microsoft Research.


Standard Hardware Tracks (T1 and T2)

There are two standard standard hardware tracks:

  • Track 1: In-memory indices with FAISS as the baseline. Search would use Azure Standard_F32s_v2 VMs with 32 vCPUs and 64GB RAM. Index construction would use Azure Standard_F64s_v2 VM with 64vCPUs, 128GB RAM and an additional 4TB of SSD to be used for storing the data, index and other intermediate data.
  • Track 2: Out-of-core indices with DiskANN as the baseline. In addition to the limited DRAM in T1, index can use an SSD for search. Search would use Azure Standard_L8s_v2 VMs with 8 vCPUS, 64GB RAM and a local SSD Index constrained to 1TB. Construction would use Azure Standard_F64s_v2 VM with 64vCPU, 128GB RAM and an additional 4TB of SSD to be used for storing the data, index and other intermediate data.
Participants are expected to release their code for index building and search which the organizers will run on separate machines. Participants provide a configuration for their index build code that would complete in 4 days for each dataset. The protocol for evaluation is as follows:
  • [on indexing machine] participants will be given a local path with 1B vector dataset.
  • [on indexing machine] participants build an index from the 1B vectors and store back to local disk.
  • [on indexing machine] Stored index is copied out to a temporary cloud storage location by the eval framework.
  • [on search machine] organizers load the index from cloud storage to a local path and provide the path to the search code.
  • [on search machine] organizers perform searches with held-out query set and measure recall and time to process the queries with several sets of parameters.

Finalized details for build and search hardware timing will be released along with the the eval framework.

Custom Hardware Track (T3)

Participants can use non-standard hardware such as GPUs, AI accelerators, FPGAs, and custom in-memory silicon. In this track, participants will either 1) send their hardware, such as PCI boards to GSI Technology or 2) evaluate themselves using the scripts made available by the organizers. For T3 participants sending hardware, we will make specific delivery arrangements at participant’s expense. We will install the hardware on a system under the organizers control (we have a few bare-metal options available) and follow any installation directions provided. Participants will be allowed to temporarily log into the machine to finalize any installation and configuration, or for debugging installation as needed. For T3 participants running the evaluation themselves, we request remote ssh access and sudo accounts on the systems so that the organizers can verify the system and hardware (such as IPMI support, minimum resource availability such as disk storage for datasets). The evaluation phase will proceed like T1/T2, with a few modifications.

  • For participants that send their hardware, T3 organizers will provide remote access to a separate indexing machine.
    • [on separate indexing machine] participants download 1B vector dataset and store to local disk
    • [on separate indexing machine] participants build an index from the 1B vectors and store back to local disk
    • Stored index is copied to eval machine
    • [on eval machine] T3 organizers load the index from local disk
    • [on eval machine] T3 organizers provide index with held-out query set and measure recall and time to process the queries with several sets of parameters. Index search code can use internal parallelism to batch process the queries.
  • For participants that give us remote access to systems, participants are responsible for building their index.
    • [on indexing machine] participants download 1B vector dataset and store to local disk
    • [on indexing machine] participants build an index from the 1B vectors and store back to local disk
    • Stored index is copied to eval machine
    • [on eval machine] T3 organizers load the index from local disk
    • [on eval machine] T3 organizers perform searches with held-out query set and measure recall and search time with several sets of parameters.
T3 will maintain different leaderboards for each dataset based on the following benchmarks:
  • Recall vs throughput using the same ranking formula as the T1/T2 track
  • Power- recall vs throughput/watt and a similar ranking formula to the T1/T2 track.
  • Cost measured as cost/watt (measured as queries/second/watt and MSRP/watt)
  • Total cost normalized across all tracks.
We will provide the exact details on how we collect and compute these benchmarks as well as additional machine and operating system specification before the competition begins.

Benchmark Datasets

We intend to use the following 6 billion point datasets.
  • BIGANN consists of SIFT descriptors applied to images from extracted from a large image dataset.
  • Facebook SimSearchNet++ is a new dataset released by Facebook for this competition. It consists of features used for image copy detection for integrity purposes. The features are generated by Facebook SimSearchNet++ model.
  • Microsoft Turing-ANNS-1B is a new dataset being released by the Microsoft Turing team for this competition. It consists of Bing queries encoded by Turing AGI v5 that trains Transformers to capture similarity of intent in web search queries. An early version of the RNN-based AGI Encoder is described in a SIGIR'19 paper and a blogpost.
  • Microsoft SPACEV-1B is a new web search related dataset released by Microsoft Bing for this competition. It consists of document and query vectors encoded by Microsoft SpaceV Superior model to capture generic intent representation.
  • Yandex DEEP-1B image descriptor dataset consisting of the projected and normalized outputs from the last fully-connected layer of the GoogLeNet model, which was pretrained on the Imagenet classification task.
  • Yandex Text-to-Image-1B is a new cross-model dataset (text and visual), where database and query vectors have different distributions in a shared representation space. The base set consists of Image embeddings produced by the Se-ResNext-101 model, and queries are textual embeddings produced by a variant of the DSSM model. Since the distributions are different, a 50M sample of the query distribution is provided.

All datasets are in the common binary format that starts with 8 bytes of data consisting of num_points(uint32_t) num_dimensions(uint32) followed by num_pts X num_dimensions x sizeof(type) bytes of data stored one vector after another. Data files will have suffixes .fbin, .u8bin, and .i8bin to represent float32, uint8 and int8 type data. Note that a different query set will be used for evaluation. The details of the datasets along with links to the base, query and sample sets, and the ground truth nearest neighbors of the query set are listed below.

The ground truth binary files for k-NN search consist of the following information: num_queries(uint32_t) K-NN(uint32) followed by num_queries X K x sizeof(uint32_t) bytes of data representing the IDs of the K-nearest neighbors of the queries, followed by num_queries X K x sizeof(float) bytes of data representing the distances to the corresponding points. The distances help identify neighbors tied in terms of distances. In recall calculation, returning a neighbor not in the ground truth set but whose distance is tied with an entry in the ground truth is counted as success.

The ground truth binary files for range search consist of the following information: num_queries(int32_t) followed by the total number of results total_res(int32_t) followed by num_queries X size(int32_t) bytes corresponding to num_results_per_query for each query, followed by total_res X sizeof(int32_t) bytes corresponding to the IDs of the neighbors of each query one after the other.

The ground truth files for the first 10M slice, the first 100M slice, and the complete 1B set of each dataset against the respective query set can be downloaded here(10M), here(100M), and here(1B).

Dataset Datatype Dimensions Distance Range/k-NN Base data Sample data Query data Ground truth Release terms
BIGANN uint8 128 L2 k-NN 1B points 100M base points 10K queries link CC0
Facebook SimSearchNet++* uint8 256 L2 Range (squared distance: 96237) 1B points N/A 100k queries link CC BY-NC
Microsoft Turing-ANNS* float32 100 L2 k-NN 1B points N/A 100K queries link link to terms
Microsoft SPACEV* int8 100 L2 k-NN 1B points 100M base points 29.3K queries link O-UDA
Yandex DEEP float32 96 L2 k-NN 1B points 350M base points 10K queries link CC BY 4.0
Yandex Text-to-Image* float32 200 inner-product k-NN 1B points 50M queries 100K queries link CC BY 4.0
* new datasets
We recommend using Axel for downloading BIGANN, Facebook-SSN++, Yandex DEEP1B and T2I datasets.
We recommend using AzCopy for downloading Microsoft datasets.


The competition will measure recall@10 of the algorithms on the 6 data sets a private query set (unreleased) at a fixed query throughput. Track T1 measures recall of algorithms at 10000 Queries/second (on 32 vCPUs), T2 measures recall at 1500 Queries/second, T2 measures recall at 2000 Queries/second. The primary metric for comparison in each track will be the sum of improvements in recall over the baseline at the target QPS over all datasets. Additionally, track T3 will also rank entries by power and cost per query. See this notebook for power and cost analysis. A team has to publish an algorithm and commit to benchmarking on at least 3 datasets to be considered for ranking. Recall regression on a dataset selected by a team will be continued as a negative score. The recall@10(AP for SSN++-1B dataset) of the baseline algorithms on each dataset for the public query set is listed below.
Track Algorithm Search MachineTarget Queries/sec BIGANN-1B SSN++-1B Turing-ANNS-1B SPACEV-1B DEEP-1B Text-to-Image-1B
Track 1 FAISS-CPU Azure F32s_v2 32vCPUs + 64GB RAM 10000 0.634 0.753 0.703 0.728 0.650 0.069
Track 2 DiskANN Azure L8s_v2 8vCPUs + 64GB RAM + 1TB SSD 1500 0.949 0.16274 0.936 0.901 0.937 0.488
Track 3 FAISS-GPU NVIDIA V100 + 700GB RAM 2000 0.927 TBA 0.910 0.850 0.942 0.86

Baseline DiskANN indices for T2 can be downloaded using "azcopy copy '' 'local_folder' --recursive". Note that this would take some time as the indices are large. All indices were built using R and L parameters set to 100. Search for T2 used 16 threads and beamwidth 4. The Ls parameter was varied to tune recall vs QPS.
Update: T2 baseline results have been modified after measuring via pybind11 interface on docker. There was a 30-40% QPS loss using this interface as compared to direct measurements of C++ code from commandline. As a result, the QPS target has now been lowered, and the recall is reported at this threshold.

Call for Participation and Timeline

Participation is open to all teams interested in developing new algorithms or re-implementing existing algorithms more efficiently either in software or hardware. Participants are requested to submit a brief document through CMT for each track they will be competing in. The document should contain the following details:

  • Name, email and affiliation of each participant in the team
  • A name and/or URL for the submission.
  • [Optional] To receive Azure credits for developing new ideas, please submit your request by June 30th with preliminary data on smaller scale datasets and why you think your algorithm will work well at billion scale. This will be used by the organizers to select strong entries. We request teams who already have access to infrastructure (e.g. those from industry or with access to large university clusters) to skip this.
For Track T3, the document should contain the following additional details to help organizers plan and assess eligibility for seperate leaderboards:
  • Type of hardware, e.g., PCIe extension board, rack-mounted system, or other.
  • Evidence of the retail MSRP of the hardware, i.e., pricing on website or copy of the customer invoice.
  • If hardware will be sent to GSI Technology (at the participants expense) or if organizers will given remote access to the systems. For remote system access participants, whether their system supports standard IPMI power monitoring. If not IPMI, then an equivalent power monitoring interface must be available.
  • Operating system requirements.
  • Whether the participant requires a separate machine for index building. We have limited Azure-based Fsv2-series machines and some bare-metal machines managed by the T3 organizers.

Consent Forms

Please review and complete the consent form for participation in Tracks T1/T2 and Track T3. Note that there are separate consent forms for the standard and custom hardware tracks. Completing the form is necessary for participation.

Timeline (subject to change)

  • May: release of data, guidelines, and a call for participation. Registration open.
  • June: Baseline results, testing infrastructure and final ranking metrics released.
  • July 11th: Participants in need of compute resources to submit an expression of interest.
  • Mid-July: Allocation of compute resources.
  • July 30th: Final deadline for participants to submit an expression of interest through CMT.
  • October 22nd: End of competition period. Teams to release of code in a containerized form, and complete a pull request to the eval framework with code to run the algorithms.
  • October 29th: Participants submit a brief report outlining their algorithm and results.
  • Mid-November: Release of preliminary results on standardized machines. Review of code by organizers and participants. Participants can raise concerns about the evaluation.
  • Early December: Final results published, and competition results archived (the competition will go on if interest continues).
  • During NeurIPS, organizers will provide an overview of the competition and results. Organizers will also request the best entries (including leaderboard toppers, or promising new approaches) to present an overview for further discussion.