Billion-Scale Approximate Nearest Neighbor Search Challenge

Watch this space for finalized datasets, baselines and metrics. Subscribe to our Medium channel for discussions and announcements.

Why this competition?

In the past few years, we’ve seen a lot of new research and creative approaches for large-scale ANNS, including:
  • Partition-based and graph-based indexing strategies (as well as hybrid indexing approaches).
  • Mixing RAM and SSD storage to efficiently store and process large datasets that exceed the size of RAM.
  • Using accelerator hardware such as GPUs, FPGAs, and other custom in-memory silicon.
  • Leveraging machine learning for dimensionality reduction of the original vectors.

In addition to an uptick in academic interest, many implementations of these algorithms at scale now appear in production and high-availability datacenter contexts, powering enterprise-grade, mission-critical, and web-scale search applications. In these deployment scenarios, metrics such as cost, preprocessing time, and power consumption become just as important as the recall-vs-latency tradeoff. Despite this, most empirical evaluations of algorithms have focused on smaller datasets of about a million points, e.g. ann-benchmarks.com. However, deploying recent algorithmic advances in ANNS techniques for search, recommendation, and ranking at scale requires support at billion scale or substantially larger. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale.

We believe that this challenge will be impactful in several ways:
  • Provide a comparative understanding of algorithmic ideas and their application at scale.
  • Promote the development of new techniques for the problem and demonstration of their value.
  • Provide a compilation of datasets, many new, to enable future development of algorithms.
  • Introduce a standard benchmarking approach.
By providing a platform for those interested in this problem, we aim to encourage more collaboration and collectively advance the field at a more rapid pace. Researchers can request Azure compute credits from a pool sponsored by Microsoft Research.

Tracks

Standard Hardware Tracks (T1 and T2)

There are two standard hardware tracks in which the trade-off between recall and throughput will be evaluated.

  • (T1) In-memory indices with FAISS as the baseline. DRAM is constrained to 64GB for search and 128GB for build. 4TB scratch space is provided for build.
  • (T2) Out-of-core indices with DiskANN as the baseline. In addition to the limited DRAM in T1, the machine will also host a ~1TB SSD for search.
Participants are expected to release their code for index building and search, which the organizers will run on separate machines. Participants provide a configuration for their index build code that would complete in 4 days on an Azure Standard_F64s_v2 VM with 4TB of SSD to be used for storing the data, the index, and other intermediate data (details likely to change). For search, participants are allowed up to 10 hyperparameter configurations. The protocol for evaluation is as follows (a hypothetical interface sketch appears after this list):
  • [on indexing machine] participants will be given a local path with 1B vector dataset.
  • [on indexing machine] participants build an index from the 1B vectors and store back to local disk.
  • [on indexing machine] the stored index is copied out to a temporary cloud storage location by the eval framework.
  • [on search machine] organizers load the index from cloud storage to a local path and provide the path to the search code.
  • [on search machine] organizers perform searches with held-out query set and measure recall and time to process the queries with several sets of parameters.
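
To make the split between build and search code concrete, here is a minimal Python sketch (assuming numpy) of what a submission wrapper might look like. The class and method names, and the brute-force "index", are illustrative only and are not the eval framework's actual API.

```python
# Hypothetical sketch of a T1/T2 submission; names are illustrative, not the
# eval framework's API. The "index" here is just the raw vectors searched by
# brute force, so the protocol steps above map onto short, concrete calls.
import os
import numpy as np

class ToyANNSubmission:
    def __init__(self, index_dir):
        self.index_dir = index_dir      # local path where the stored index lives
        self.vectors = None
        self.config = None

    def build(self, dataset_path, dtype=np.float32):
        # [on indexing machine] read the base vectors from the provided local path
        # (binary format described under "Benchmark Datasets") and store the index.
        with open(dataset_path, "rb") as f:
            n, d = np.fromfile(f, dtype=np.uint32, count=2)
            self.vectors = np.fromfile(f, dtype=dtype).reshape(int(n), int(d))
        np.save(os.path.join(self.index_dir, "index.npy"), self.vectors)

    def load(self):
        # [on search machine] load the index the organizers copied to a local path.
        self.vectors = np.load(os.path.join(self.index_dir, "index.npy"))

    def set_query_arguments(self, config):
        # apply one of the (at most 10) search-time hyperparameter configurations.
        self.config = config

    def search(self, queries, k=10):
        # return the ids of the k nearest neighbors (L2) for each held-out query.
        d2 = ((queries[:, None, :].astype(np.float32)
               - self.vectors[None, :, :].astype(np.float32)) ** 2).sum(axis=-1)
        return np.argsort(d2, axis=1)[:, :k]
```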

Finalized details for build and search hardware timing will be released along with the eval framework.

Custom Hardware Track (T3)

Participants can use non-standard hardware such as GPUs, AI accelerators, FPGAs, and custom in-memory silicon. In this track, participants will either 1) send their hardware, such as PCIe boards, to GSI Technology, or 2) run the evaluation themselves using the scripts made available by the organizers. For T3 participants sending hardware, we will make specific delivery arrangements at the participant's expense. We will install the hardware on a system under the organizers' control (we have a few bare-metal options available) and follow any installation directions provided. Participants will be allowed to temporarily log into the machine to finalize installation and configuration, or to debug the installation as needed. For T3 participants running the evaluation themselves, we request remote ssh access and sudo accounts on the systems so that the organizers can verify the system and hardware (e.g., IPMI support and minimum resource availability such as disk storage for the datasets). The evaluation phase will proceed like T1/T2, with a few modifications.

  • For participants that send their hardware, T3 organizers will provide remote access to a separate indexing machine.
    • [on separate indexing machine] participants download 1B vector dataset and store to local disk
    • [on separate indexing machine] participants build an index from the 1B vectors and store back to local disk
    • Stored index is copied to eval machine
    • [on eval machine] T3 organizers load the index from local disk
    • [on eval machine] T3 organizers perform searches with held-out query set and measure recall and time to process the queries with several sets of parameters.
  • For participants that give us remote access to systems, participants are responsible for building their index.
    • [on indexing machine] participants download 1B vector dataset and store to local disk
    • [on indexing machine] participants build an index from the 1B vectors and store back to local disk
    • Stored index is copied to eval machine
    • [on eval machine] T3 organizers load the index from local disk
    • [on eval machine] T3 organizers perform searches with held-out query set and measure recall and search time with several sets of parameters.
T3 will maintain different leaderboards for each dataset based on the following benchmarks (see the sketch after this list):
  • Recall vs. throughput, using the same ranking formula as the T1/T2 tracks.
  • Power: recall vs. throughput/watt, with a ranking formula similar to the T1/T2 tracks.
  • Cost: measured as queries/second/watt and MSRP/watt.
  • Total cost normalized across all tracks.
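
As a rough illustration of the quantities behind these leaderboards, the Python sketch below computes recall, throughput, and throughput per watt. The exact ranking formulas and power-measurement procedure are defined by the eval framework; the function names here are ours.

```python
# Rough sketch of the quantities named above: recall, throughput, and
# throughput per watt. The official ranking formulas and power-measurement
# procedure come from the eval framework; this is only an illustration.

def recall_at_k(found_ids, true_ids, k):
    """Average fraction of the true k nearest neighbors returned by the algorithm."""
    hits = sum(len(set(f[:k]) & set(t[:k])) for f, t in zip(found_ids, true_ids))
    return hits / (k * len(true_ids))

def throughput_qps(num_queries, elapsed_seconds):
    """Queries per second over the held-out query set."""
    return num_queries / elapsed_seconds

def queries_per_second_per_watt(qps, average_watts):
    """Power benchmark: throughput normalized by average power draw during the run."""
    return qps / average_watts
```
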
We will provide the exact details of how we collect and compute these benchmarks, as well as additional machine and operating system specifications, before the competition begins.

Benchmark Datasets

We intend to use the following six datasets of one billion points each (datasets are subject to change).
  • BIGANN consists of SIFT descriptors extracted from images in a large image dataset.
  • Facebook-SimSearchNet++ is a new dataset released by Facebook for this competition. It consists of features used for image copy detection for integrity purposes. The features are generated by the Facebook SimSearchNet++ model.
  • Microsoft-Turing-ANNS is a new dataset being released by the Microsoft Turing team for this competition. It consists of Bing queries encoded by Turing AGI v5 that trains Transformers to capture similarity of intent in web search queries. An early version of the RNN-based AGI Encoder is described in a SIGIR'19 paper and a blogpost.
  • Microsoft SPACEV1B is a new web search related dataset released by Microsoft Bing for this competition. It consists of document and query vectors encoded by Microsoft SpaceV Superior model to capture generic intent representation.
  • Yandex DEEP1B is an image descriptor dataset consisting of the projected and normalized outputs from the last fully-connected layer of the GoogLeNet model, which was pretrained on the ImageNet classification task.
  • Yandex Text-to-Image-1B is a new cross-modal dataset (text and visual), where database and query vectors can potentially have different distributions in a shared representation space. Image embeddings are produced by the SE-ResNeXt-101 model, and queries are textual embeddings produced by a variant of the DSSM model.

All datasets, including ground truth data, are in a common binary format that starts with an 8-byte header consisting of num_points (uint32) followed by num_dimensions (uint32), followed by num_points × num_dimensions × sizeof(type) bytes of data stored one vector after another. Data files have the suffixes .fbin, .u8bin, and .i8bin for float32, uint8, and int8 data, respectively. Note that a different query set will be used for evaluation. The details of the datasets, along with links to the base, query, and sample sets and the ground truth nearest neighbors of the query set, are listed below.
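
A minimal Python reader for this format might look as follows (assuming numpy; the helper names are ours and not part of the eval framework):

```python
# Minimal sketch of a reader for the binary format described above, assuming
# numpy. The function names and the suffix-to-dtype map are ours.
import numpy as np

SUFFIX_DTYPES = {".fbin": np.float32, ".u8bin": np.uint8, ".i8bin": np.int8}

def read_bin(path, dtype):
    """Read a .fbin/.u8bin/.i8bin file into a (num_points, num_dimensions) array."""
    with open(path, "rb") as f:
        num_points, num_dims = np.fromfile(f, dtype=np.uint32, count=2)
        data = np.fromfile(f, dtype=dtype, count=int(num_points) * int(num_dims))
    return data.reshape(int(num_points), int(num_dims))

# For billion-point base files, reading everything into RAM is impractical; a
# memory map of the payload (skipping the 8-byte header) is one alternative.
def map_bin(path, dtype, num_dims):
    return np.memmap(path, dtype=dtype, mode="r", offset=8).reshape(-1, int(num_dims))
```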

| Dataset | Datatype | Dimensions | Distance | Range/k-NN | Base data | Sample data | Query data | Ground truth | Release terms |
|---|---|---|---|---|---|---|---|---|---|
| BIGANN | uint8 | 128 | L2 | k-NN | 1B points | 100M points | 10K queries | link | CC0 |
| Facebook-SimSearchNet++* | uint8 | 256 | L2 | Range | 1B points | TBA | 100K queries | link | CC BY-NC |
| Microsoft-Turing-ANNS* | float32 | 100 | L2 | k-NN | 1B points | TBA | 100K queries | link | link to terms |
| Microsoft-SPACEV1B* | int8 | 100 | L2 | k-NN | 1B points | 100M points | 29.3K queries | link | O-UDA |
| Yandex DEEP1B | float32 | 96 | L2 | k-NN | 1B points | 350M points | 10K queries | link | CC BY 4.0 |
| Yandex Text-to-Image-1B* | float32 | 200 | inner-product | k-NN | 1B points | 50M points | 100K queries | link | CC BY 4.0 |

* new datasets
We recommend using Axel for downloading BIGANN, Facebook-SSN++, Yandex DEEP1B and T2I datasets.
We recommend using AzCopy for downloading Microsoft datasets.
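
For orientation on the two distance measures in the table (all datasets use L2 except Yandex Text-to-Image-1B, which uses inner product), here is a brute-force Python sketch that computes exact neighbors for a small sample. It assumes numpy and is not the code used to produce the official ground truth.

```python
# Brute-force exact nearest neighbors for a small sample under the two measures
# used by the datasets above: L2 distance and (maximum) inner product.
# Assumes numpy; not the official ground-truth generation code.
import numpy as np

def exact_knn(base, queries, k=10, metric="L2"):
    # cast to float32 so uint8/int8 datasets do not overflow during arithmetic
    base = np.asarray(base, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    if metric == "L2":
        # squared L2 distance: smaller is closer
        scores = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(axis=-1)
        order = np.argsort(scores, axis=1)
    else:
        # inner product: larger is closer, so negate before ascending sort
        scores = queries @ base.T
        order = np.argsort(-scores, axis=1)
    return order[:, :k]
```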

Call for Participation and Timeline

Participation is open to all teams interested in developing new algorithms or re-implementing existing algorithms more efficiently either in software or hardware. Participants are requested to submit a brief document through CMT for each track they will be competing in. The document should contain the following details:

  • Name, email and affiliation of each participant in the team
  • A name and/or URL for the submission.
  • [Optional] To receive Azure credits for developing new ideas, please submit your request by June 30th with preliminary data on smaller scale datasets and why you think your algorithm will work well at billion scale. This will be used by the organizers to select strong entries. We request teams who already have access to infrastructure (e.g. those from industry or with access to large university clusters) to skip this.
For Track T3, the document should contain the following additional details to help organizers plan and assess eligibility for separate leaderboards:
  • Type of hardware, e.g., PCIe extension board, rack-mounted system, or other.
  • Evidence of the retail MSRP of the hardware, i.e., pricing on website or copy of the customer invoice.
  • Whether hardware will be sent to GSI Technology (at the participant's expense) or whether organizers will be given remote access to the systems. For participants providing remote access, whether their system supports standard IPMI power monitoring; if not, an equivalent power-monitoring interface must be available.
  • Operating system requirements.
  • Whether the participant requires a separate machine for index building. We have a limited number of Azure-based Fsv2-series machines and some bare-metal machines managed by the T3 organizers.

Consent Forms

Please review and complete the consent form for participation in Tracks T1/T2 and Track T3. Note that there are separate consent forms for the standard and custom hardware tracks. Completing the form is necessary for participation.

Timeline (subject to change)

  • May: release of data, guidelines, and a call for participation. Registration open.
  • Mid-June: Baseline results, testing infrastructure and final ranking metrics released.
  • June 30th: Participants in need of compute resources to submit an expression of interest.
  • Mid-July: Allocation of compute resources.
  • July 30th: Final deadline for participants to submit an expression of interest through CMT.
  • October 22nd: End of competition period. Teams to release code in containerized form and complete a pull request to the eval framework with code to run their algorithms.
  • October 29th: Participants submit a brief report outlining their algorithm and results.
  • Mid-November: Release of preliminary results on standardized machines. Review of code by organizers and participants. Participants can raise concerns about the evaluation.
  • Early December: Final results published, and competition results archived (the competition will go on if interest continues).
  • During NeurIPS, organizers will provide an overview of the competition and results. Organizers will also request the best entries (including leaderboard toppers, or promising new approaches) to present an overview for further discussion.