Commit 709b448

Add parent join support for faiss hnsw (#1398)
* Add patch to support multi vector in faiss (#1358)
  Signed-off-by: Heemin Kim <[email protected]>
* Initialize id_map as null (#1363)
  Signed-off-by: Heemin Kim <[email protected]>
* Add support of multi vector in jni (#1364)
  Signed-off-by: Heemin Kim <[email protected]>
* Multi vector support for Faiss HNSW (#1371)
  Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.
  Signed-off-by: Heemin Kim <[email protected]>
* Add data generation script for nested field (#1388)
  Signed-off-by: Heemin Kim <[email protected]>
* Add perf test for nested field (#1394)
  Signed-off-by: Heemin Kim <[email protected]>

---------
Signed-off-by: Heemin Kim <[email protected]>
1 parent 45e9e54 commit 709b448
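
Reviewer note: the #1371 description says results are deduplicated by parentId so that k distinct parent documents come back. The sketch below only illustrates that intended outcome in plain Python. It is not the Faiss patch itself, which applies the filter inside the HNSW search. The function name `top_k_by_parent`, the tuple layout, and the assumption that a smaller distance means a closer match are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical candidate layout: (doc_id, parent_id, distance).
Candidate = Tuple[int, int, float]


def top_k_by_parent(candidates: Iterable[Candidate], k: int) -> List[Tuple[int, int, float]]:
    """Keep the closest child per parent, then return the k closest parents.

    Returns (parent_id, doc_id, distance) tuples sorted by ascending distance.
    Assumes a smaller distance means a better match.
    """
    best: Dict[int, Tuple[int, float]] = {}  # parent_id -> (doc_id, distance)
    for doc_id, parent_id, distance in candidates:
        current = best.get(parent_id)
        if current is None or distance < current[1]:
            best[parent_id] = (doc_id, distance)
    ranked = sorted(
        ((parent_id, doc_id, distance) for parent_id, (doc_id, distance) in best.items()),
        key=lambda hit: hit[2],
    )
    return ranked[:k]


hits = top_k_by_parent(
    [(0, 10, 0.9), (1, 10, 0.2), (2, 11, 0.5), (3, 12, 0.1), (4, 11, 0.7)],
    k=2,
)
```

Without the parent-level deduplication, a plain top-k over child vectors could return several children of the same parent and therefore fewer than k distinct documents.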

46 files changed, +3333 -99 lines changed

.github/workflows/CI.yml (+9)

@@ -38,6 +38,15 @@ jobs:
         with:
           submodules: true
 
+      # Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
+      - name: Apply Git Patch
+        # Deleting file at the end to skip `git apply` inside CMAKE file
+        run: |
+          cd jni/external/faiss
+          git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+          rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+        working-directory: ${{ github.workspace }}
+
       - name: Setup Java ${{ matrix.java }}
         uses: actions/setup-java@v1
         with:

.github/workflows/test_security.yml (+9)

@@ -38,6 +38,15 @@ jobs:
         with:
           submodules: true
 
+      # Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
+      - name: Apply Git Patch
+        # Deleting file at the end to skip `git apply` inside CMAKE file
+        run: |
+          cd jni/external/faiss
+          git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+          rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+        working-directory: ${{ github.workspace }}
+
       - name: Setup Java ${{ matrix.java }}
         uses: actions/setup-java@v1
         with:

CHANGELOG.md (+1)

@@ -15,6 +15,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 ## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.12...2.x)
 ### Features
 * Add parent join support for lucene knn [#1182](https://github.com/opensearch-project/k-NN/pull/1182)
+* Add parent join support for faiss hnsw [#1398](https://github.com/opensearch-project/k-NN/pull/1398)
 ### Enhancements
 * Increase Lucene max dimension limit to 16,000 [#1346](https://github.com/opensearch-project/k-NN/pull/1346)
 * Tuned default values for ef_search and ef_construction for better indexing and search performance for vector search [#1353](https://github.com/opensearch-project/k-NN/pull/1353)

DEVELOPER_GUIDE.md (+7)

@@ -229,6 +229,13 @@ For users that want to get the most out of the libraries, they should follow [th
 and build the libraries from source in their production environment, so that if their environment has optimized
 instruction sets, they take advantage of them.
 
+### Custom patch on JNI Library
+If you want to make a custom patch on JNI library
+1. Make a change on top of current version of JNI library and push the commit locally.
+2. Create a patch file for the change using `git format-patch -o patches HEAD^`
+3. Place the patch file under `jni/patches`
+4. Make a change in `jni/CmakeLists.txt`, `.github/workflows/CI.yml` to apply the patch during build
+
 ## Run OpenSearch k-NN
 
 ### Run Single-node Cluster Locally
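
Reviewer note: the patch workflow added to DEVELOPER_GUIDE.md, together with the `git apply` step in the workflow files, can be sketched as two command builders. The git flags are copied from this commit. The helper names `format_patch_command` and `apply_patch_command` are hypothetical and exist only for illustration.

```python
import shlex
from typing import List


def format_patch_command(output_dir: str = "patches") -> List[str]:
    """Step 2 of the guide: write the latest commit as a patch file into output_dir."""
    return ["git", "format-patch", "-o", output_dir, "HEAD^"]


def apply_patch_command(patch_path: str) -> List[str]:
    """The CI step: apply a patch with the whitespace-tolerant, 3-way flags used in CI.yml."""
    return [
        "git",
        "apply",
        "--ignore-space-change",
        "--ignore-whitespace",
        "--3way",
        patch_path,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    patch = "../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch"
    for command in (format_patch_command(), apply_patch_command(patch)):
        print(shlex.join(command))
```

To actually run a command, one option is `subprocess.run(command, cwd="jni/external/faiss", check=True)`, which mirrors the `cd jni/external/faiss` line in the workflow step.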

benchmarks/perf-tool/README.md (+56)

@@ -270,6 +270,26 @@ Ingests a dataset of multiple context types into the cluster.
 | ----------- | ----------- | ----------- |
 | took | Total time to ingest the dataset into the index.| ms |
 
+#### ingest_nested_field
+
+Ingests a dataset with nested field into the cluster.
+
+##### Parameters
+
+| Parameter Name | Description | Default |
+| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
+| index_name | Name of index to ingest into | No default |
+| field_name | Name of field to ingest into | No default |
+| dataset_path | Path to data-set | No default |
+| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default |
+| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}] Order is important and must match order of attributes column in dataset file. It should contains { name: 'parent_id', type: 'int'} | No default |
+
+##### Metrics
+
+| Metric Name | Description | Unit |
+| ----------- | ----------- | ----------- |
+| took | Total time to ingest the dataset into the index.| ms |
+
 #### query
 
 Runs a set of queries against an index.
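
Reviewer note: the `attribute_spec` parameter of `ingest_nested_field` says that the spec order must match the attribute columns and that it must contain `parent_id` of type `int`. The sketch below shows one way such a spec could be used to group vectors under their parent. How the perf tool builds documents internally is not shown in this diff, so `parse_attributes`, `group_by_parent`, the supported type names other than `int`, and the output shape are assumptions.

```python
from typing import Any, Dict, List, Sequence

# Spec entry taken from the README text; order must match the attribute columns.
ATTRIBUTE_SPEC = [{"name": "parent_id", "type": "int"}]

# Assumed type names; only 'int' appears in the README.
_CASTS = {"int": int, "float": float, "str": str}


def parse_attributes(row: Sequence[Any], spec: List[Dict[str, str]]) -> Dict[str, Any]:
    """Cast one attribute row to typed values, column by column, in spec order."""
    if len(row) != len(spec):
        raise ValueError("attribute row and attribute_spec differ in length")
    return {entry["name"]: _CASTS[entry["type"]](value) for entry, value in zip(spec, row)}


def group_by_parent(
    vectors: Sequence[Sequence[float]],
    attribute_rows: Sequence[Sequence[Any]],
    spec: List[Dict[str, str]],
    field_name: str,
) -> Dict[int, List[Dict[str, List[float]]]]:
    """Collect the vectors that share a parent_id into one list of nested entries."""
    docs: Dict[int, List[Dict[str, List[float]]]] = {}
    for vector, row in zip(vectors, attribute_rows):
        parent_id = parse_attributes(row, spec)["parent_id"]
        docs.setdefault(parent_id, []).append({field_name: list(vector)})
    return docs


nested = group_by_parent(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [["0"], ["0"], ["1"]],
    ATTRIBUTE_SPEC,
    field_name="vec",
)
```

Each key of `nested` would correspond to one parent document whose nested field holds the grouped vectors.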
@@ -330,6 +350,36 @@ Runs a set of queries with filter against an index.
 | recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
 | recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
 
+
+#### query_nested_field
+
+Runs a set of queries with nested field against an index.
+
+##### Parameters
+
+| Parameter Name | Description | Default |
+| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
+| k | Number of neighbors to return on search | 100 |
+| r | r value in Recall@R | 1 |
+| index_name | Name of index to search | No default |
+| field_name | Name field to search | No default |
+| calculate_recall | Whether to calculate recall values | False |
+| dataset_format | Format the dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
+| dataset_path | Path to dataset | No default |
+| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
+| neighbors_path | Path to neighbors dataset | No default |
+| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default |
+| query_count | Number of queries to create from data-set | Size of the data-set |
+
+##### Metrics
+
+| Metric Name | Description | Unit |
+| ----------- | ----------- | ----------- |
+| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
+| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
+| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
+| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
+
 #### get_stats
 
 Gets the index stats.
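
Reviewer note: the `recall@K` and `recall@R` metrics listed for `query_nested_field` can be read as the two per-query ratios below. This is a sketch that follows the wording of the metrics table only. The perf tool's own aggregation across queries may differ, and the function names and the handling of empty inputs are assumptions.

```python
from typing import Sequence


def recall_at_k(results: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Ratio of the returned results (at most k) that are ground-truth nearest neighbors."""
    returned = list(results[:k])
    if not returned:
        return 0.0
    truth = set(ground_truth[:k])
    return sum(1 for doc_id in returned if doc_id in truth) / len(returned)


def recall_at_r(results: Sequence[int], ground_truth: Sequence[int], r: int, k: int) -> float:
    """Ratio of the top r ground-truth neighbors that appear in the k returned results."""
    top_truth = list(ground_truth[:r])
    if not top_truth:
        return 0.0
    returned = set(results[:k])
    return sum(1 for doc_id in top_truth if doc_id in returned) / len(top_truth)


results = [1, 2, 3, 4]        # ids returned for one query, best first
ground_truth = [2, 1, 9, 8]   # true nearest neighbors, best first
at_k = recall_at_k(results, ground_truth, k=4)
at_1 = recall_at_r(results, ground_truth, r=1, k=4)
```

For nested fields, the ids compared here would be parent document ids, since the search returns one hit per parent.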
@@ -369,6 +419,12 @@ python add-filters-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dat
 
 After that new dataset(s) can be referred from testcase definition in `ingest_extended` and `query_with_filter` steps.
 
+To generate dataset with parent doc id based on vectors only dataset, use following command pattern:
+```commandline
+python add-parent-doc-id-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dataset_with_parent_id>
+```
+This will generate neighbours dataset as well. This new dataset(s) can be referred from testcase definition in `ingest_nested_field` and `query_nested_field` steps.
+
 ## Contributing
 
 ### Linting
