NeurIPS'23 Competition Track: Big-ANN

New: the latest ongoing leaderboard has been released (March 1st, 2024).
Top entries:

Filter track			OOD track			Sparse track
Rank	Algorithm	QPS@90% recall	Rank	Algorithm	QPS@90% recall	Rank	Algorithm	QPS@90% recall
1	Pinecone-filter	85,491	1	Pinecone-ood	38,088	1	Zilliz	10,749
2	Zilliz	84,596	2	Zilliz	33,241	2	Pinecone_smips	10,440
3	ParlayANN IVF²	37,902	3	RoarANN	22,555	3	PyANNS	8,732
4	Puck	19,193	4	PyANNS	22,296	4	shnsw	7,137
...	...	...	...	...	...	...	...	...
Baseline	FAISS	3,032	Baseline	Diskann	4,133	Baseline	Linscan	93

Note: the entries by Pinecone and Zilliz are not open source.

Full Leaderboard, Plots, and Rules

This challenge encourages the development of indexing data structures and search algorithms for practical variants of the Approximate Nearest Neighbor (ANN) search problem, also known as vector search. These variants are increasingly relevant as vector search becomes commonplace. The challenge has four tracks covering sparse, filtered, out-of-distribution, and streaming variants of ANN search (ANNS). These variants require adapted search algorithms and strategies with different tradeoffs. Participants are encouraged to develop and submit new algorithms that improve on the baselines for these variants. The competition aims to be accessible to participants by limiting the scale of the datasets to about 10 million points.

Tracks: Datasets, Metrics and Baselines

The evaluation hardware is normalized to Azure Standard D8lds v5 (8 vCPUs and 16GB DRAM). The index build time on this machine will be limited to 12 hours, except for the streaming index, which has stricter time limits.

The challenge consists of 4 tracks with separate leaderboards and participants can choose to submit entries to one or more tracks:

Filtered Search: This task will use a random 10M slice of the YFCC 100M dataset transformed with CLIP embeddings. In addition, we associate with each image a "bag" of tags: words extracted from the description, the camera model, the year the picture was taken, and the country. The tags are drawn from a vocabulary of 200,386 possible tags. Each of the 100,000 queries consists of one image embedding and one or two tags; a database element is considered as a result only if it carries those tags.
Out-Of-Distribution: This task will use Yandex Text-to-Image 10M, a cross-modal dataset in which the database and query vectors have different distributions in the shared vector space. The base set is a 10M subset of the Yandex visual search database of 200-dimensional image embeddings, which are produced with the Se-ResNext-101 model. The query embeddings correspond to user-specified textual search queries. The text embeddings are extracted with a variant of the DSSM model.
Sparse: This task is based on the common MSMARCO passage retrieval dataset, which has 8,841,823 text passages, encoded into sparse vectors using the SPLADE model. The vectors have a large dimension (about 30,000), but each vector in the base dataset has an average of approximately 120 nonzero elements. The query set contains 6,980 text queries, embedded by the same SPLADE model. The average number of nonzero elements in the query set is approximately 49 (since text queries are generally shorter). Given a sparse query vector, the index should return the top-k results according to the maximal inner product between the vectors.
Streaming Search: This task uses a 30M slice of the MS Turing dataset released in the previous challenge. The index starts with zero points and must implement the "runbook" provided -- a sequence of insertion, deletion, and search operations (roughly 4:4:1 ratio) -- within a time bound of 1 hour and 8GB DRAM. The intention is for the algorithm to process the operations and maintain a compact index over the active points, rather than index the entire anticipated set of points and use tombstones or flags to mark active elements. More details to come. The runbook is provided in `final_runbook.yaml`, which is generated with `final_runbook_gen.py`.
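
As a rough illustration of the query semantics in the filtered, sparse, and streaming tracks described above, here is a minimal brute-force sketch in plain Python. It is not the evaluation framework's code. Several things are assumptions made for illustration only: Euclidean distance for the dense tracks, sparse vectors stored as `{dimension: weight}` dicts, and the step fields `operation`, `start`, `end`, and `queries`, which are hypothetical names and not the schema of `final_runbook.yaml`.

```python
import heapq
import math


def filtered_search(query_vec, query_tags, base_vecs, base_tags, k):
    """Brute-force filtered k-NN: a point is eligible only if its tag set
    contains every query tag (Euclidean distance is an assumption)."""
    required = set(query_tags)
    candidates = [
        (math.dist(query_vec, vec), idx)
        for idx, (vec, tags) in enumerate(zip(base_vecs, base_tags))
        if required <= tags
    ]
    return [idx for _, idx in heapq.nsmallest(k, candidates)]


def sparse_mips(query, base, k):
    """Brute-force top-k by inner product over {dimension: weight} dicts."""
    def dot(a, b):
        if len(a) > len(b):  # iterate over the shorter vector
            a, b = b, a
        return sum(w * b[d] for d, w in a.items() if d in b)

    scored = [(dot(query, vec), idx) for idx, vec in enumerate(base)]
    return [idx for _, idx in heapq.nlargest(k, scored)]


def run_runbook(steps, data, k):
    """Replay insert/delete/search steps, keeping only active points.
    The step field names here are hypothetical."""
    active = {}
    results = []
    for step in steps:
        op = step["operation"]
        if op == "insert":
            for i in range(step["start"], step["end"]):
                active[i] = data[i]
        elif op == "delete":
            for i in range(step["start"], step["end"]):
                active.pop(i, None)  # drop the point outright, no tombstone
        elif op == "search":
            for q in step["queries"]:
                nearest = heapq.nsmallest(
                    k, active, key=lambda i: math.dist(q, active[i])
                )
                results.append(nearest)
    return results
```

A real entry would replace each linear scan with an index structure; the sketch only pins down which results count as correct for a given query under the stated assumptions.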

Track	Dataset	Dimensions	Data type	Baseline algo	QPS @ 90% recall	Release terms
Filtered	YFCC-10M + CLIP	192	uint8	filter-FAISS	3200	CC BY 4.0
OOD	Text2Image-10M	200	float32	diskann	4882	CC BY 4.0
Sparse	MS MARCO / SPLADE	~30K	float32, sparse format	Linscan	101	MS-MARCO: Free NC; SPLADE: CC BY NC SA
Streaming	MSTuring-30M-clustered	100	float32	fresh-diskann	0.883 recall@10 (45mins)	O-UDA
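
The table reports "QPS @ 90% recall" and "recall@10" without spelling out how they are computed. The sketch below shows one plausible reading, offered as an assumption and not as the official scoring code: recall@k is the fraction of true top-k neighbors that appear in the first k returned IDs, averaged over queries, and QPS at 90% recall is the highest throughput among the measured operating points whose recall is at least 0.9. The function names and the `(recall, qps)` tuple layout are illustrative, and tie handling between equidistant neighbors is ignored.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Average fraction of the true top-k IDs found in the first k retrieved IDs."""
    hits = sum(
        len(set(r[:k]) & set(g[:k])) for r, g in zip(retrieved, ground_truth)
    )
    return hits / (k * len(ground_truth))


def qps_at_recall(operating_points, min_recall=0.9):
    """Highest QPS among (recall, qps) points with recall >= min_recall.
    Returns None if no point reaches the threshold."""
    eligible = [qps for recall, qps in operating_points if recall >= min_recall]
    return max(eligible) if eligible else None
```

Under this reading, an algorithm is typically run with several search-parameter settings, each yielding one `(recall, qps)` point, and only the settings that clear the recall threshold compete on throughput.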

We recommend using Axel for downloading non-Microsoft datasets. We recommend using AzCopy for downloading Microsoft datasets.

Track Winners and Presentations

Filtered Search

ParlayANN IVF²: Fusing Classic and Spatial Inverted Indices for Fast Filtered ANNS [slides] Authors: Ben Landrum (UMD), Magdalen Dobson Manohar (CMU), Mazin Karjikar (UMD), Laxman Dhulipala (UMD)

Out-Of-Distribution

RoarANN: Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search Authors: Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng. All authors from Fudan University.
PyANNS Authors: Zihao Wang, Shanghai Jiao Tong University^*

Sparse

PyANNS Authors: Zihao Wang, Shanghai Jiao Tong University^*
GrassRMA: GRAph-based Sparse Vector Search with Reducing Memory Accesses Authors: Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng. All authors from Fudan University.

Streaming Search

Puck: Efficient Multi-level Index Structure for Approximate Nearest Neighbor Search in Practice [slides] Authors: Jie Yin, Ben Huang, Baidu.

^* Zihao Wang is also an employee of Zilliz. However, he declares that the PyANNS entry was created on his time off, without any involvement from Zilliz or any of the other organizers. This entry did not declare a conflict with the organizers before participating.

Organizer Presentations

Invited Talks

Corey Nolet: Accelerating vector search on the GPU with RAPIDS RAFT
Yury Malkov: Approximate Nearest Neighbor Search in Recommender Systems

Participation

Guidelines

To participate, please express interest through the CMT portal.
To request cloud compute credits ($1000) towards development, please select the "Requesting cloud credit" field in your CMT entry and include a brief overview of the ideas you plan to develop with these credits.
To get started, please see the instructions in the README file, and submit a Pull Request corresponding to your algorithm(s).
For questions and discussions, please use the GitHub issues or the Discord channel.

Timeline (subject to change)

June: Baseline results, testing infrastructure, CFP and final ranking metrics released.
~~End-July~~ August 30th: Suggested deadline for requesting allocation of cloud compute credits for development. Credits will be provided on an ongoing basis.
~~August 30th~~ September 15th: Final deadline for participants to submit an expression of interest through CMT.
October 30th: End of competition period. Teams release their code in containerized form and complete a pull request to the eval framework with code to run the algorithms.
Mid-November: Release of preliminary results on standardized machines. Review of code by organizers and participants. Participants can raise concerns about the evaluation.
Early December: Final results published, and competition results archived (the competition will go on if interest continues).
During NeurIPS: Organizers will provide an overview of the competition and results. Organizers will also request the best entries (including leaderboard toppers, or promising new approaches) to present an overview for further discussion.

Organizers and Dataset Contributors

Organizers can be reached at big-ann-organizers@googlegroups.com. We thank Microsoft Research, Meta, Pinecone, Yandex, and Zilliz for help in preparing and organizing this competition. We thank Microsoft for cloud credits towards running the competition, and AWS and Pinecone for compute credits for participants.