remove duplicate code (#802)
remove duplicate code for dindex
MXueguang authored Oct 5, 2021
1 parent 19cfcfc commit 58d286c
Showing 7 changed files with 25 additions and 421 deletions.
34 changes: 18 additions & 16 deletions docs/usage-dense-indexes.md
@@ -7,7 +7,7 @@ Pyserini creates dense indexes for collections in JSONL format:
```json
{
"id": "doc1",
- "contents": "this is the contents."
+ "contents": "title\nthis is the contents."
}
```
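
The change above puts the title and the body into a single `contents` string, separated by a newline. A minimal sketch of writing and reading such records follows; the helper names `make_record`, `split_contents`, and `to_jsonl` are hypothetical and not part of Pyserini.

```python
import json


def make_record(doc_id, title, text):
    # "contents" holds the title and the body joined by a newline,
    # mirroring the sample record in the documentation.
    return {"id": doc_id, "contents": f"{title}\n{text}"}


def split_contents(contents):
    # Split on the first newline only, so newlines inside the body survive.
    title, _, text = contents.partition("\n")
    return title, text


def to_jsonl(records):
    # One JSON object per line; json.dumps escapes the inner newline as \n.
    return "\n".join(json.dumps(r) for r in records) + "\n"


records = [make_record("doc1", "title", "this is the contents.")]
jsonl_text = to_jsonl(records)
```

Because `json.dumps` escapes the separator, each record still occupies exactly one physical line of the JSONL file.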

@@ -19,12 +19,13 @@ Then, you can invoke the indexer:

Here is an example that indexes a collection with the DPR passage encoder:
```bash
- python -m pyserini.dindex --corpus integrations/resources/sample_collection_jsonl \
-     --encoder facebook/dpr-ctx_encoder-multiset-base \
-     --index indexes/dindex-sample-dpr-multi \
-     --batch 64 \
-     --device cuda:0 \
-     --title-delimiter '\n'
+ python -m pyserini.encode input --corpus integrations/resources/sample_collection_jsonl \
+     --fields title text \ # fields in collection contents
+     output --embeddings indexes/dindex-sample-dpr-multi \
+     --to-faiss \
+     encoder --encoder facebook/dpr-ctx_encoder-multiset-base \
+     --fields title text \ # fields to encode
+     --batch 32
```
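
One caveat when copying the new command: in a POSIX shell, a backslash followed by a space and a `#` comment does not continue the line, so the inline comments need to be removed before running it. As a sketch, the same invocation can be assembled as an argument list in Python, where the comments live outside the command. The function `build_encode_command` is a hypothetical helper; the flags are exactly those shown in the command above.

```python
import shlex


def build_encode_command(corpus, embeddings, encoder,
                         fields=("title", "text"), batch=32):
    # Hypothetical helper: only assembles the argument list, it does not run it.
    return [
        "python", "-m", "pyserini.encode",
        # input section: the corpus and the fields present in "contents"
        "input", "--corpus", corpus, "--fields", *fields,
        # output section: where embeddings go, stored as a Faiss index
        "output", "--embeddings", embeddings, "--to-faiss",
        # encoder section: the model, the fields to encode, and the batch size
        "encoder", "--encoder", encoder, "--fields", *fields,
        "--batch", str(batch),
    ]


cmd = build_encode_command(
    corpus="integrations/resources/sample_collection_jsonl",
    embeddings="indexes/dindex-sample-dpr-multi",
    encoder="facebook/dpr-ctx_encoder-multiset-base",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` in an environment where Pyserini is installed.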

Once this is done, you can use `SimpleDenseSearcher` to search the index:
@@ -43,17 +44,18 @@ for i in range(0, 10):
If you want to speed up passage embedding generation, you can build the index in shards.
For example, the command below creates a sub-index for the first 1/4 of the collection:
```bash
- python -m pyserini.dindex --corpus integrations/resources/sample_collection_jsonl \
-     --encoder facebook/dpr-ctx_encoder-multiset-base \
-     --index indexes/dindex-sample-dpr-multi-0 \
-     --batch 64 \
-     --device cuda:0 \
-     --title-delimiter '\n' \
-     --shard-id 0 \
-     --shard-num 4
+ python -m pyserini.encode input --corpus integrations/resources/sample_collection_jsonl \
+     --fields title text \ # fields in collection contents
+     --shard-id 0 \
+     --shard-num 4 \
+     output --embeddings indexes/dindex-sample-dpr-multi-0 \
+     --to-faiss \
+     encoder --encoder facebook/dpr-ctx_encoder-multiset-base \
+     --fields title text \ # fields to encode
+     --batch 32
```
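
To make the `--shard-id` / `--shard-num` pair above concrete, here is a minimal sketch of how a collection could be cut into contiguous shards. It assumes equally sized, contiguous slices, which matches the "first 1/4" wording but is not confirmed here as Pyserini's exact rule; `shard_bounds` is a hypothetical name.

```python
import math


def shard_bounds(total, shard_id, shard_num):
    # Assumption: contiguous shards of ceil(total / shard_num) documents each;
    # the last shard may be shorter.
    if not 0 <= shard_id < shard_num:
        raise ValueError("shard_id must be in [0, shard_num)")
    size = math.ceil(total / shard_num)
    start = min(shard_id * size, total)
    end = min(start + size, total)
    return start, end


# Ten documents split four ways: every document lands in exactly one shard.
bounds = [shard_bounds(10, i, 4) for i in range(4)]
```

With this scheme, shard 0 covers the first quarter of the documents, and the shards together cover the collection without overlap.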
You can run 4 processes on 4 GPUs to speed up encoding by roughly 4 times.
Once they are done, you can create the full index by merging the sub-indexes:
```bash
- python -m pyserini.dindex.merge_indexes --prefix indexes/dindex-sample-dpr-multi- --shard-num 4
+ python -m pyserini.index.merge_faiss_indexes --prefix indexes/dindex-sample-dpr-multi- --shard-num 4
```
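
Before merging, it can help to confirm that every sub-index was actually written. The sketch below assumes shard directories are named by appending the shard id to the prefix, as the `indexes/dindex-sample-dpr-multi-0` example suggests; `shard_paths` and `missing_shards` are hypothetical helpers, not Pyserini functions.

```python
from pathlib import Path


def shard_paths(prefix, shard_num):
    # Assumption: shard i lives at "<prefix><i>", e.g. "...-multi-0".
    return [Path(f"{prefix}{i}") for i in range(shard_num)]


def missing_shards(prefix, shard_num):
    # Return the expected shard directories that do not exist yet.
    return [p for p in shard_paths(prefix, shard_num) if not p.is_dir()]


missing = missing_shards("indexes/dindex-sample-dpr-multi-", 4)
if missing:
    print("Not ready to merge; missing:", ", ".join(str(p) for p in missing))
```

If the list is empty, all four sub-indexes are in place and the merge command above can be run.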
90 changes: 0 additions & 90 deletions integrations/dense/test_create_index.py

This file was deleted.

21 changes: 0 additions & 21 deletions pyserini/dindex/__init__.py

This file was deleted.

104 changes: 0 additions & 104 deletions pyserini/dindex/__main__.py

This file was deleted.
