A benchmark covering all types of vector search (dense vector, sparse vector, filtered vector search, ...)

RichardWang11/Filtered-ANN-Search-Benchmark

Filtered-Vector-Search-Benchmarks & Datasets

Covers recent high-dimensional filtered vector search methods and datasets.

Filter ANNS Papers & Methods

Methods

  • ACORN
  • CAPS
  • FilteredVamana
  • StitchedVamana
  • 2D-Segment Graph: introduces a segment graph that compresses multiple graph-based indexes into a single structure, reducing memory consumption while maintaining high query performance for range-filtered ANNS.
  • SuperPostFilter: proposes a tree-based framework that efficiently solves the c-approximate window search problem.
  • NHQ:
  • HSIG:
  • NGT: an ANNS library developed by Yahoo Japan that processes hybrid queries using the post-filtering strategy.
  • Vearch: a high-dimensional vector retrieval system developed by Jingdong that supports hybrid queries through the post-filtering strategy.
  • ADBV: a hybrid analytic engine developed by Alibaba. It enhances PQ for hybrid ANNS and proposes accuracy-aware, cost-based optimization to generate optimal execution plans.
  • Milvus: partitions the dataset by commonly used attributes and applies ADBV within each subset.
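Several of the systems above rely on post-filtering: retrieve more nearest neighbors than requested, then discard the candidates that fail the attribute predicate. The following is a minimal sketch of that idea, not the implementation used by NGT or Vearch. It assumes a brute-force L2 scan as a stand-in for a real ANN index, and the names `post_filter_search`, `predicate`, and `expand` are hypothetical.

```python
import numpy as np

def post_filter_search(base, attrs, query, predicate, k, expand=4):
    """Return up to k ids of the nearest vectors whose attribute passes `predicate`.

    A brute-force scan stands in for the ANN index (assumption for illustration).
    """
    n = base.shape[0]
    dists = np.linalg.norm(base - query, axis=1)
    order = np.argsort(dists, kind="stable")   # candidate ids, nearest first
    fetch = min(n, k * expand)                 # over-fetch to survive filtering
    while True:
        hits = [int(i) for i in order[:fetch] if predicate(attrs[i])]
        if len(hits) >= k or fetch == n:
            return hits[:k]
        fetch = min(n, fetch * 2)              # too few survivors: widen the search

# Hypothetical data: 1,000 random vectors, each with one integer label.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 16))
attrs = rng.integers(0, 10, size=1000)
query = rng.normal(size=16)

result = post_filter_search(base, attrs, query, lambda a: a == 3, k=5)
print(result)
```

The over-fetch factor is the main tuning knob: when the predicate is highly selective, most candidates are discarded and the search has to widen repeatedly, which is the weakness that the graph-based filtered methods above aim to avoid.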

Datasets

Following this paper, we categorize the datasets into eight types. RC (Relative Contrast) is the ratio of the mean distance to the nearest-neighbor distance; a smaller RC means the dataset is harder. LID (Local Intrinsic Dimensionality) measures the local dimensionality around a point; a higher LID means the dataset is harder to process:
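As a rough illustration of the two hardness measures, the sketch below estimates both on a small sample with exact L2 distances. It rests on assumptions not stated in this README: RC is taken as the ratio of the query-averaged mean distance to the query-averaged nearest-neighbor distance, and LID uses the common maximum-likelihood estimator over the k nearest distances. The function names are hypothetical, and the queries must not be exact copies of base points, since a zero distance breaks both formulas.

```python
import numpy as np

def pairwise_l2(queries, base):
    """Exact L2 distance matrix of shape (num_queries, num_base)."""
    diff = queries[:, None, :] - base[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def relative_contrast(base, queries):
    """RC = (mean distance, averaged over queries) / (NN distance, averaged over queries)."""
    d = pairwise_l2(queries, base)
    return float(d.mean(axis=1).mean() / d.min(axis=1).mean())

def lid_mle(base, queries, k=20):
    """Average MLE estimate of LID: -1 / mean(ln(r_i / r_k)) over the k nearest distances."""
    d = np.sort(pairwise_l2(queries, base), axis=1)[:, :k]
    log_ratio = np.log(d / d[:, -1:])          # ln(r_i / r_k), always <= 0
    return float(np.mean(-1.0 / log_ratio.mean(axis=1)))

# Hypothetical sample: Gaussian data, with queries drawn separately from the base set.
rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 32))
queries = rng.normal(size=(100, 32))

print("RC :", relative_contrast(base, queries))
print("LID:", lid_mle(base, queries, k=20))
```

On a full dataset the distance matrix would be computed in batches, or over a sample of queries, to keep memory bounded.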

Dense Vector Datasets

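| Name | n (×10^3) | d | RC | LID | Category | Distance |
|---|---|---|---|---|---|---|
| Nus | 269 | 500 | 1.67 | 24.5 | Image | L2 |
| Gist | 983 | 960 | 1.94 | 18.9 | Image | |
| Rand | 1,000 | 100 | 3.05 | 58.7 | Synthetic | |
| Glove | 1,192 | 100 | 1.82 | 20.0 | Text | |
| Cifa | 50 | 512 | 1.97 | 9.0 | Image | |
| Mnist | 69 | 784 | 2.38 | 6.5 | Image | L2 |
| Sun | 79 | 512 | 1.94 | 9.9 | Image | |
| Enron | 95 | 1,369 | 6.39 | 11.7 | Text | L2 |
| Trevi | 100 | 4,096 | 2.95 | 9.2 | Image | |
| Notre | 333 | 128 | 3.22 | 9.0 | Image | |
| SIFT | 994 | 128 | 3.50 | 9.3 | Image | L2 |
| Deep1M | 1,000 | 128 | 1.96 | 12.1 | Image | L2 |
| Ben | 1,098 | 128 | 1.96 | 8.3 | Image | |
| Gauss | 2,000 | 512 | 3.36 | 19.6 | Synthetic | |
| Imag | 2,340 | 150 | 2.54 | 11.6 | Image | |
| BANN | 10,000 | 128 | 2.60 | 10.3 | Image | L2 |
| Audio | 50 | 192 | 2.97 | 5.6 | Audio | |
| Msong | 922 | 420 | 3.81 | 9.5 | Audio | |
| Yout | 346 | 1,770 | 2.29 | 12.6 | Video | |
| UQ-V | 3,038 | 256 | 8.39 | 7.2 | Video | |
| Kosarak | 75 | 27,983 | - | - | - | Jaccard |
| NYTimes | 256 | 290 | - | - | Text | Angular |
| Fashion-MNIST | 60 | 784 | - | - | Image | L2 |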

Filtered Dataset

| Name | Dim | Data-Type | Category | Distance |
|---|---|---|---|---|
| YFCC-10M + CLIP | 192 | uint8 | Image | L2 |

Deep Learning Datasets

| Name | n (×10^3) | d | Category | Distance |
|---|---|---|---|---|
| DRP10M | 10,000 | 768 | Text | IP |
| Open-images13M | 13,000 | 512 | Image | IP |
| RQA10M | 10,000 | 768 | Text | IP |
| WIT1M | 1,000 | 512 | Cross-Modal | IP |

Other useful links

Additional Benchmarks
