This list covers recent high-dimensional filtered vector search methods and datasets.
- ACORN
- CAPS
- FilteredVamana
- StitchedVamana
- 2D-Segment Graph: introduces a segment graph that compresses multiple graph-based indexes into a single structure, reducing memory consumption while maintaining high query performance for range-filtered ANNS.
- SuperPostFilter: proposes a tree-based framework to efficiently solve the c-approximate window search problem.
- NHQ:
- HSIG:
- NGT: an ANNS library developed by Yahoo Japan that processes hybrid queries using the post-filtering strategy.
- Vearch: a high-dimensional vector retrieval system developed by Jingdong that supports hybrid queries through the post-filtering strategy.
- ADBV: a hybrid analytic engine developed by Alibaba. It enhances PQ for hybrid ANNS and proposes accuracy-aware, cost-based optimization to generate optimal execution plans.
- Milvus: partitions datasets based on commonly utilized attributes and implements ADBV within each subset.
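Several entries above rely on the post-filtering strategy. As a rough illustration only, the sketch below contrasts post-filtering (search first, then apply the attribute predicate) with pre-filtering (apply the predicate first, then search). It uses brute-force L2 search in place of a real ANN index, and the function names and the `overfetch` parameter are hypothetical; none of this is taken from NGT, Vearch, or any other listed system.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter_search(vectors, attrs, query, predicate, k, overfetch=4):
    """Post-filtering: retrieve k * overfetch nearest vectors, then drop
    candidates whose attribute fails the predicate (hypothetical helper)."""
    n_candidates = min(len(vectors), k * overfetch)
    candidates = heapq.nsmallest(
        n_candidates, range(len(vectors)), key=lambda i: l2(vectors[i], query)
    )
    # Candidates are already sorted by distance, so filtering keeps the order.
    return [i for i in candidates if predicate(attrs[i])][:k]


def pre_filter_search(vectors, attrs, query, predicate, k):
    """Pre-filtering: keep only ids that satisfy the predicate, then search."""
    allowed = [i for i in range(len(vectors)) if predicate(attrs[i])]
    return heapq.nsmallest(k, allowed, key=lambda i: l2(vectors[i], query))


# Toy data: six points on a line with alternating labels.
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
attrs = ["a", "b", "a", "b", "a", "b"]
query = (0.0, 0.0)


def is_b(label):
    return label == "b"


print(post_filter_search(vectors, attrs, query, is_b, k=2, overfetch=1))
print(post_filter_search(vectors, attrs, query, is_b, k=2, overfetch=4))
print(pre_filter_search(vectors, attrs, query, is_b, k=2))
```

With `overfetch=1` the post-filter variant can return fewer than `k` results, because matching points are discarded after the search; this is the usual weakness of post-filtering under selective predicates, and one reason dedicated filtered indexes exist.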
Following this paper, we categorize the datasets into eight types. RC (Relative Contrast) is the mean distance divided by the nearest-neighbor distance; a smaller RC means the dataset is harder. LID (Local Intrinsic Dimensionality) is the second hardness measure; a higher LID means the dataset is harder to process:
Name | n (×10^3) | d | RC | LID | Category | Distance |
---|---|---|---|---|---|---|
Nus | 269 | 500 | 1.67 | 24.5 | Image | L2 |
Gist | 983 | 960 | 1.94 | 18.9 | Image | |
Rand | 1,000 | 100 | 3.05 | 58.7 | Synthetic | |
Glove | 1,192 | 100 | 1.82 | 20.0 | Text | |
Cifa | 50 | 512 | 1.97 | 9.0 | Image | |
Mnist | 69 | 784 | 2.38 | 6.5 | Image | L2 |
Sun | 79 | 512 | 1.94 | 9.9 | Image | |
Enron | 95 | 1,369 | 6.39 | 11.7 | Text | L2 |
Trevi | 100 | 4,096 | 2.95 | 9.2 | Image | |
Notre | 333 | 128 | 3.22 | 9.0 | Image | |
SIFT | 994 | 128 | 3.50 | 9.3 | Image | L2 |
Deep1M | 1,000 | 128 | 1.96 | 12.1 | Image | L2 |
Ben | 1,098 | 128 | 1.96 | 8.3 | Image | |
Gauss | 2,000 | 512 | 3.36 | 19.6 | Synthetic | |
Imag | 2,340 | 150 | 2.54 | 11.6 | Image | |
BANN | 10,000 | 128 | 2.60 | 10.3 | Image | L2 |
Audio | 50 | 192 | 2.97 | 5.6 | Audio | |
Msong | 922 | 420 | 3.81 | 9.5 | Audio | |
Yout | 346 | 1,770 | 2.29 | 12.6 | Video | |
UQ-V | 3,038 | 256 | 8.39 | 7.2 | Video | |
Kosarak | 75 | 27,983 | - | - | - | Jaccard |
NYTimes | 256 | 290 | - | - | Text | Angular |
Fashion-MNIST | 60 | 784 | - | - | Image | L2 |
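
The RC and LID columns in the table above can be approximated on a small sample. The sketch below is illustrative only: RC is computed as the average mean distance divided by the average nearest-neighbor distance over a set of queries, and LID uses the common maximum-likelihood estimator over the k nearest neighbors. Both the aggregation over queries and the choice of estimator are assumptions, since the source does not state which variants produced the table values, and the function names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_contrast(data, queries):
    """RC = (average over queries of the mean distance to the data)
          / (average over queries of the nearest-neighbor distance).
    Assumes queries are not members of data, so no distance is zero."""
    mean_dists, min_dists = [], []
    for q in queries:
        dists = [l2(q, x) for x in data]
        mean_dists.append(sum(dists) / len(dists))
        min_dists.append(min(dists))
    return (sum(mean_dists) / len(queries)) / (sum(min_dists) / len(queries))


def lid_mle(data, query, k):
    """Maximum-likelihood LID estimate at `query` from its k nearest neighbors:
    LID = -1 / mean(ln(r_i / r_k)), where r_k is the k-th smallest distance."""
    dists = sorted(d for d in (l2(query, x) for x in data) if d > 0.0)[:k]
    r_k = dists[-1]
    mean_log = sum(math.log(r / r_k) for r in dists) / len(dists)
    if mean_log == 0.0:
        # All neighbors are equidistant; the estimator is undefined.
        return math.inf
    return -1.0 / mean_log


# One query at the origin; data points at distances 1 and 3.
print(relative_contrast([(1.0, 0.0), (3.0, 0.0)], [(0.0, 0.0)]))  # 2.0

# Neighbors at distances 1, 2, 4 give an estimate of 1 / ln(2).
print(lid_mle([(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)], (0.0, 0.0), k=3))
```

On real datasets both quantities are usually estimated from a random sample of queries, and the LID values are averaged over those queries.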

Name | Dim | Data-Type | Category | Distance |
---|---|---|---|---|
YFCC-10M + CLIP | 192 | uint8 | Image | L2 |

Name | n (×10^3) | d | Category | Distance |
---|---|---|---|---|
DRP10M | 10,000 | 768 | Text | IP |
Open-images13M | 13,000 | 512 | Image | IP |
RQA10M | 10,000 | 768 | Text | IP |
WIT1M | 1,000 | 512 | Cross-Modal | IP |
- Vector Filtering Benchmarks
- DBWangGroupUNSW_NNS_Benchmark
- Xinjing Hu, Xuanhua Shi, Shixuan Sun, et al. CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion. arXiv preprint arXiv:2406.19651 (2024). [github] [paper]