NeurIPS'23 Competition Track: Big-ANN
New: the latest ongoing leaderboard has been released (Feb 1st, 2024).
See here for the full ongoing leaderboard and rules. Next update: March 1st, 2024.
Note: entries by Pinecone and Zilliz are not open source.
This challenge encourages the development of indexing data structures and search algorithms for practical variants of the Approximate Nearest Neighbor (ANN) search problem, also called vector search. These variants are increasingly relevant as vector search becomes commonplace. The challenge has four tracks covering sparse, filtered, out-of-distribution and streaming variants of ANN search. Each variant requires adapted search algorithms and strategies with different tradeoffs. Participants are encouraged to develop and submit new algorithms that improve on the baselines for these variants. To stay accessible to participants, the competition limits the scale of the datasets to about 10 million points.
Tracks: Datasets, Metrics and Baselines
The evaluation hardware is normalized to Azure Standard D8lds v5 (8 vCPUs and 16GB DRAM). The index build time on this machine is limited to 12 hours, except for the streaming track, which has stricter time limits.
The challenge consists of 4 tracks with separate leaderboards and participants can choose to submit entries to one or more tracks:
- Filtered Search: This task will use a random 10M slice of the YFCC 100M dataset transformed with CLIP embeddings. In addition, we associate with each image a "bag" of tags: words extracted from the description, the camera model, the year the picture was taken and the country. The tags are from a vocabulary of 200386 possible tags. The 100,000 queries consist of one image embedding and one or two tags that must appear in the database elements to be considered.
- Out-Of-Distribution: This task will use the Yandex Text-to-Image 10M cross-modal dataset, in which the database and query vectors have different distributions in the shared vector space. The base set is a 10M subset of the Yandex visual search database of 200-dimensional image embeddings produced with the Se-ResNext-101 model. The query embeddings correspond to user-specified textual search queries; these text embeddings are extracted with a variant of the DSSM model.
- Sparse: This task is based on the common MSMARCO passage retrieval dataset, which has 8,841,823 text passages, encoded into sparse vectors using the SPLADE model. The vectors have a large dimension (about 30,000), but each vector in the base dataset has an average of approximately 120 nonzero elements. The query set contains 6,980 text queries, embedded by the same SPLADE model. The average number of nonzero elements in the query set is approximately 49 (since text queries are generally shorter). Given a sparse query vector, the index should return the top-k results according to the maximal inner product between the vectors.
- Streaming Search: This task uses a 30M slice of the MS Turing dataset released in the previous challenge. The index starts with zero points and must implement the "runbook" provided -- a sequence of insertion, deletion, and search operations (roughly 4:4:1 ratio) -- within a time bound of 1 hour and 8GB DRAM. The intention is for the algorithm to process the operations and maintain a compact index over the active points, rather than index the entire anticipated set of points and use tombstones or flags to mark active elements. More details to come. The runbook is provided in `final_runbook.yaml`, which is generated with `final_runbook_gen.py`.
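The query semantics of the Filtered and Sparse tracks in the list above can be illustrated with exact brute-force references. This is only a minimal sketch, not the competition baselines: the function names, the in-memory layout (tag bags as Python sets, sparse vectors as `{dimension: value}` dicts) and the use of squared Euclidean distance for the filtered track are all assumptions made for illustration.

```python
import heapq


def filtered_search(query_vec, query_tags, base_vecs, base_tags, k):
    """Exact filtered k-NN (hypothetical helper).

    base_tags is a list of sets. An item qualifies only if its tag bag
    contains every query tag. Distance is squared Euclidean, which is an
    assumption and not stated in the track description.
    """
    required = set(query_tags)
    candidates = []
    for idx, (vec, tags) in enumerate(zip(base_vecs, base_tags)):
        if required <= tags:  # all query tags must appear in the item's bag
            dist = sum((a - b) ** 2 for a, b in zip(query_vec, vec))
            candidates.append((dist, idx))
    return [idx for _, idx in heapq.nsmallest(k, candidates)]


def sparse_top_k(query, base, k):
    """Exact maximum-inner-product top-k over sparse vectors (hypothetical helper).

    Each vector is a dict mapping dimension index to a nonzero value.
    """
    scored = []
    for idx, vec in enumerate(base):
        # Only the query's nonzero dimensions can contribute to the inner product.
        score = sum(w * vec.get(dim, 0.0) for dim, w in query.items())
        scored.append((score, idx))
    return [idx for _, idx in heapq.nlargest(k, scored)]
```

Both functions scan every base item, so they are useful only as ground-truth generators on small samples; a competitive entry would replace the scan with an index that exploits the tag filter or the sparsity pattern.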
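Datasets, metrics and baselines (surviving table entries):
- Ranking metric: QPS @ 90% recall
- Filtered: YFCC-10M + CLIP; release terms: CC BY 4.0
- Out-Of-Distribution: release terms: CC BY 4.0
- Sparse: MS MARCO / SPLADE; float32, sparse format; release terms for SPLADE: CC BY NC SA
- Streaming: baseline result of 0.883 recall@10 (45mins)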
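The two metrics named above, recall@10 and QPS at 90% recall, can be sketched as follows. This assumes the common definition of recall@k (the fraction of the true k nearest neighbors that appear in the returned top-k, averaged over queries) and ignores distance ties; the function names and the `(recall, qps)` run format are hypothetical, and the competition's evaluation framework remains the authoritative definition.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true top-k neighbors present in the returned top-k."""
    if len(retrieved) != len(ground_truth):
        raise ValueError("one result list per query is required")
    total = 0.0
    for found, truth in zip(retrieved, ground_truth):
        total += len(set(found[:k]) & set(truth[:k])) / k
    return total / len(retrieved)


def best_qps_at_recall(runs, threshold=0.9):
    """Highest throughput among runs that reach the recall threshold.

    runs is a list of (recall, qps) pairs, one per search configuration.
    Returns None when no run reaches the threshold.
    """
    qualifying = [qps for recall, qps in runs if recall >= threshold]
    return max(qualifying) if qualifying else None
```

Under this reading, a configuration that is very fast but reaches only 89% recall does not count toward the ranking, which is why entries are typically tuned to sit just above the threshold.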
Track Winners and Presentations
- RoarANN: Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search
- Puck: Efficient Multi-level Index Structure for Approximate Nearest Neighbor Search in Practice [slides]
- To participate, please express interest through the CMT portal.
- To request cloud compute credits ($1000) towards development, please select the "Requesting cloud credit" field in your CMT entry and share a brief overview of the ideas you plan to develop with these credits in your CMT entry.
- To get started, please see the instructions in the README file, and submit a Pull Request corresponding to your algorithm(s).
- For questions and discussions, please use the GitHub issues or the Discord channel.
Timeline (subject to change)
- June: Baseline results, testing infrastructure, CFP and final ranking metrics released.
- August 30th: Suggested deadline for requesting allocation of cloud compute credits for development. Credits will be provided on an ongoing basis.
- September 15th: Final deadline for participants to submit an expression of interest through CMT.
- October 30th: End of competition period. Teams to release code in a containerized form, and complete a pull request to the eval framework with code to run the algorithms.
- Mid-November: Release of preliminary results on standardized machines. Review of code by organizers and participants. Participants can raise concerns about the evaluation.
- Early December: Final results published, and competition results archived (the competition will go on if interest continues).
- During NeurIPS: Organizers will provide an overview of the competition and results. Organizers will also request the best entries (including leaderboard toppers, or promising new approaches) to present an overview for further discussion.
Organizers and Dataset Contributors
Organizers can be reached at firstname.lastname@example.org. We thank Microsoft Research, Meta, Pinecone, Yandex, and Zilliz for help in preparing and organizing this competition. We thank Microsoft for cloud credits towards running the competition, and AWS and Pinecone for compute credits for participants.