Evaluation framework on github. Subscribe to our
for discussions and announcements.
Why this competition?
In the past few years, we’ve seen a lot of new research and creative approaches for large-scale ANNS, including:
- Partition-based, and graph-based indexing strategies (as well as hybrid indexing approaches).
- Mixing RAM and SSD storage to efficiently store and process large datasets that exceed the size of RAM.
- Using accelerator hardware such as GPUs, FPGAs, and other custom in-memory silicon.
- Leveraging machine learning for dimensionality reduction of the original vectors.
In addition to an uptick in academic interest, many implementations of these algorithms at scale now appear in production
and high availability datacenter contexts: powering enterprise-grade, mission-critical, and web-scale search applications.
In these deployment scenarios, benchmarks such as cost, preprocessing time, power consumption become just as important as
the recall-vs-latency tradeoff. Despite this, most empirical evaluations of algorithms have focused on smaller datasets
of about a million points, e.g. ann-bechmarks.com. However, deploying recent algorithmic advances in ANNS techniques for
search, recommendation and ranking at scale requires support at billion or substantially larger scale. Barring a few recent
papers, there is limited consensus on which algorithms are effective at this scale.
We believe that this challenge will be impactful in several ways:
- Provide a comparative understanding of algorithmic ideas and their application at scale.
- Promote the development of new techniques for the problem and demonstration of their value.
- Provide a compilation of datasets, many new, to enable future development of algorithms.
- Introduce a standard benchmarking approach.
By providing a platform for those interested in this problem, we aim to encourge more collaboration and collectively advance the field at a more rapid pace.
Researchers can request Azure compute credit from a pool sponsored by Microsoft Research.
Standard Hardware Tracks (T1 and T2)
There are two standard standard hardware tracks:
- Track 1: In-memory indices with FAISS as the baseline.
Search would use Azure Standard_F32s_v2 VMs
with 32 vCPUs and 64GB RAM. Index construction would use Azure
with 64vCPUs, 128GB RAM and an additional 4TB of SSD to be used for storing the data, index and other intermediate data.
- Track 2: Out-of-core indices with DiskANN as the baseline.
In addition to the limited DRAM in T1, index can use an SSD for search.
Search would use Azure
Standard_L8s_v2 VMs with 8 vCPUS, 64GB RAM and a local SSD Index constrained to 1TB.
Construction would use Azure
with 64vCPU, 128GB RAM and an additional 4TB of SSD to be used for storing the data, index and other intermediate data.
Participants are expected to release their code for index building and search which the organizers will run on separate machines.
Participants provide a configuration for their index build code that would complete in 4 days for each dataset
The protocol for evaluation is as follows:
- [on indexing machine] participants will be given a local path with 1B vector dataset.
- [on indexing machine] participants build an index from the 1B vectors and store back to local disk.
- [on indexing machine] Stored index is copied out to a temporary cloud storage location by the eval framework.
- [on search machine] organizers load the index from cloud storage to a local path and provide the path to the search code.
- [on search machine] organizers perform searches with held-out query set and measure recall and time to process the queries with several sets of parameters.
Finalized details for build and search hardware timing will be released along with the the eval framework.
Custom Hardware Track (T3)
Participants can use non-standard hardware such as GPUs, AI accelerators, FPGAs, and custom in-memory silicon.
In this track, participants will either 1) send their hardware, such as PCI boards to GSI Technology or 2) evaluate
themselves using the scripts made available by the organizers. For T3 participants sending hardware,
we will make specific delivery arrangements at participant’s expense. We will install the hardware on a system under
the organizers control (we have a few bare-metal options available) and follow any installation directions provided.
Participants will be allowed to temporarily log into the machine to finalize any installation and configuration,
or for debugging installation as needed. For T3 participants running the evaluation themselves, we request remote ssh
access and sudo accounts on the systems so that the organizers can verify the system and hardware (such as IPMI support,
minimum resource availability such as disk storage for datasets).
The evaluation phase will proceed like T1/T2, with a few modifications.
- For participants that send their hardware, T3 organizers will provide remote access to a separate indexing machine.
- [on separate indexing machine] participants download 1B vector dataset and store to local disk
- [on separate indexing machine] participants build an index from the 1B vectors and store back to local disk
- Stored index is copied to eval machine
- [on eval machine] T3 organizers load the index from local disk
- [on eval machine] T3 organizers provide index with held-out query set and measure recall and time to process the queries with several sets of parameters.
Index search code can use internal parallelism to batch process the queries.
- For participants that give us remote access to systems, participants are responsible for building their index.
- [on indexing machine] participants download 1B vector dataset and store to local disk
- [on indexing machine] participants build an index from the 1B vectors and store back to local disk
- Stored index is copied to eval machine
- [on eval machine] T3 organizers load the index from local disk
- [on eval machine] T3 organizers perform searches with held-out query set and measure recall and search time with several sets of parameters.
T3 will maintain different leaderboards for each dataset based on the following benchmarks:
- Recall vs throughput using the same ranking formula as the T1/T2 track
- Power- recall vs throughput/watt and a similar ranking formula to the T1/T2 track.
- Cost measured as cost/watt (measured as queries/second/watt and MSRP/watt)
- Total cost normalized across all tracks.
We will provide the exact details on how we collect and compute these benchmarks as well as additional machine and operating system specification before the competition begins.
We intend to use the following 6 billion point datasets.
- BIGANN consists of SIFT descriptors applied to images from extracted from a large image dataset.
- Facebook SimSearchNet++ is a new dataset released by Facebook for this competition.
It consists of features used for image copy detection for integrity purposes.
The features are generated by Facebook SimSearchNet++ model.
- Microsoft Turing-ANNS-1B is a new dataset being released by the Microsoft Turing team for this competition.
It consists of Bing queries encoded by Turing AGI v5 that trains Transformers to capture similarity of intent in
web search queries. An early version of the RNN-based AGI Encoder is described in a
SIGIR'19 paper and a blogpost.
- Microsoft SPACEV-1B is a new web search related dataset
released by Microsoft Bing for this competition.
It consists of document and query vectors encoded by Microsoft SpaceV Superior model to capture generic intent representation.
- Yandex DEEP-1B image descriptor dataset consisting of the projected
and normalized outputs from the last fully-connected layer of the GoogLeNet model, which was pretrained on the Imagenet classification task.
- Yandex Text-to-Image-1B is a new cross-model dataset (text and visual),
where database and query vectors have different distributions in a shared representation space. The base set consists of Image embeddings produced by the
Se-ResNext-101 model, and queries are textual embeddings produced by a variant of the DSSM model. Since the distributions are different, a 50M sample
of the query distribution is provided.
All datasets are in the common binary format that starts with 8 bytes of data consisting of num_points(uint32_t)
num_dimensions(uint32) followed by num_pts X num_dimensions x sizeof(type) bytes of data stored one vector after another. Data files
will have suffixes .fbin, .u8bin, and .i8bin to represent float32, uint8 and int8 type data. Note that a different query set
will be used for evaluation. The details of the datasets along with links to the base, query and sample sets, and the ground truth nearest neighbors
of the query set are listed below.
The ground truth binary files for k-NN search consist of the following information: num_queries(uint32_t)
K-NN(uint32) followed by num_queries X K x sizeof(uint32_t) bytes of data representing the IDs of the K-nearest neighbors of the
queries, followed by num_queries X K x sizeof(float) bytes of data representing the distances to the corresponding points. The distances
help identify neighbors tied in terms of distances. In recall calculation, returning a neighbor not in the ground truth set but whose distance is tied
with an entry in the ground truth is counted as success.
The ground truth binary files for range search consist of the following information: num_queries(int32_t) followed by the total number
of results total_res(int32_t) followed
by num_queries X size(int32_t) bytes corresponding to num_results_per_query for each query, followed by total_res X sizeof(int32_t)
bytes corresponding to the IDs of the neighbors of each query one after the other.
The ground truth files for the first 10M slice, the first 100M slice, and the complete 1B set of each dataset against the respective query set can be downloaded
* new datasets
We recommend using Axel
for downloading BIGANN, Facebook-SSN++, Yandex DEEP1B and T2I datasets.
We recommend using AzCopy
for downloading Microsoft datasets.
The competition will measure recall@10 of the algorithms on the 6 data sets a private query set (unreleased) at a fixed query throughput.
Track T1 measures recall of algorithms at 10000 Queries/second (on 32 vCPUs), T2 measures recall at 1500 Queries/second, T2 measures recall at 2000 Queries/second.
The primary metric for comparison in each track will be the sum of improvements in recall over the baseline at the target QPS over all datasets.
Additionally, track T3 will also rank entries by power and cost per query. See this
for power and cost analysis.
A team has to publish an algorithm and commit to benchmarking on at least 3 datasets to be considered for ranking. Recall regression on a dataset selected
by a team will be continued as a negative score.
The recall@10(AP for SSN++-1B dataset) of the baseline algorithms on each dataset for the public query set is listed below.
| Track ||Algorithm ||Search Machine||Target Queries/sec
||BIGANN-1B ||SSN++-1B ||Turing-ANNS-1B ||SPACEV-1B ||DEEP-1B ||Text-to-Image-1B
| Track 1 || FAISS-CPU ||Azure F32s_v2 32vCPUs + 64GB RAM
||10000 || 0.634 ||0.753 ||0.703 ||0.728 ||0.650 ||0.069
| Track 2 || DiskANN ||Azure L8s_v2 8vCPUs + 64GB RAM + 1TB SSD ||1500 || 0.949 ||0.16274 ||0.936 ||0.901 ||0.937 ||0.488
| Track 3 || FAISS-GPU ||NVIDIA V100 + 700GB RAM
||2000 ||0.927 ||TBA ||0.910 ||0.850 ||0.942 ||0.86
Baseline DiskANN indices for T2 can be downloaded using "azcopy copy 'https://comp21storage.blob.core.windows.net/publiccontainer/comp21/diskann-T2-baseline-indices' 'local_folder' --recursive".
Note that this would take some time as the indices are large. All indices were built using R and L parameters set to 100.
Search for T2 used 16 threads and beamwidth 4. The Ls parameter was varied to tune recall vs QPS.
T2 baseline results have been modified after measuring via pybind11 interface on docker. There was a 30-40% QPS loss using this interface
as compared to direct measurements of C++ code from commandline. As a result, the QPS target has now been lowered, and the recall is reported at this threshold.
Call for Participation and Timeline
Participation is open to all teams interested in developing new algorithms or re-implementing
existing algorithms more efficiently either in software or hardware. Participants are
requested to submit a brief document through CMT
for each track they will be competing in. The document should contain the following details:
- Name, email and affiliation of each participant in the team
- A name and/or URL for the submission.
- [Optional] To receive Azure credits for developing new ideas, please submit your request
by June 30th with preliminary data on smaller scale datasets and why you think
your algorithm will work well at billion scale. This will be used by the organizers to select strong
entries. We request teams who already have access to infrastructure (e.g. those from industry or
with access to large university clusters) to skip this.
For Track T3, the document should contain the following additional details to help organizers plan
and assess eligibility for seperate leaderboards:
- Type of hardware, e.g., PCIe extension board, rack-mounted system, or other.
- Evidence of the retail MSRP of the hardware, i.e., pricing on website or copy of the customer invoice.
- If hardware will be sent to GSI Technology (at the participants expense) or if organizers will given remote access to the systems.
For remote system access participants, whether their system supports standard IPMI power monitoring.
If not IPMI, then an equivalent power monitoring interface must be available.
- Operating system requirements.
- Whether the participant requires a separate machine for index building. We have limited Azure-based
Fsv2-series machines and some bare-metal machines managed by the T3 organizers.
Please review and complete the consent form for participation in Tracks T1/T2
and Track T3
. Note that there are separate consent forms
for the standard and custom hardware tracks. Completing the form is necessary for participation.
Timeline (subject to change)
- May: release of data, guidelines, and a call for participation. Registration open.
- June: Baseline results, testing infrastructure and final ranking metrics released.
- July 11th: Participants in need of compute resources to submit an expression of interest.
- Mid-July: Allocation of compute resources.
- July 30th: Final deadline for participants to submit an expression of interest through CMT.
- October 22nd: End of competition period. Teams to release of code in a containerized form, and complete a pull request to the eval framework with code to run the algorithms.
- October 29th: Participants submit a brief report outlining their algorithm and results.
- Mid-November: Release of preliminary results on standardized machines. Review of code by organizers and participants. Participants can raise concerns about the evaluation.
- Early December: Final results published, and competition results archived (the competition will go on if interest continues).
- During NeurIPS, organizers will provide an overview of the competition and results. Organizers will also request the best entries
(including leaderboard toppers, or promising new approaches) to present an overview for further discussion.
Organizers and Dataset Contributors
- Harsha Vardhan Simhadri, Microsoft Research India
- George Williams, GSI Technology
- Martin Aumüller, IT University of Copenhagen
- Artem Babenko, Yandex
- Dmitry Baranchuk, Yandex
- Qi Chen, Microsoft Research Asia
- Matthijs Douze, Facebook AI Research
- Lucas Hosseini, Facebook AI Research
- Ravishankar Krishnaswamy, Microsoft Research India and IIT Madras
- Gopal Srinivasa, Microsoft Research India
- Suhas Jayaram Subramanya, Carnegie Mellon University
- Jingdong Wang, Microsoft Research Asia
Organizers can be reached at email@example.com.
We thank Microsoft Research for help in organizing this competition, and contributing compute credits.
We thank Microsoft Bing and Turing teams for creating datasets for release.