Hi, I'm currently testing HNSW with the SQ8 scalar quantizer. My dataset is quite large and can't be processed by the original number of reducers (and we don't want to increase the number of reducers because that would also affect search performance), so I tried splitting the embeddings into several batches and encoding and training each batch separately. However, recall dropped a lot with batching. I just want to check: is batch encoding a feasible approach, or should we encode all of the embeddings together instead of splitting them into batches? Thank you.
Replies: 1 comment 1 reply
Hi @Luciferre , I think splitting the embeddings into batches and training them separately is probably creating its own local graph for each batch, which leads to reduced recall. It's usually better to encode the entire dataset together without splitting it into batches; batching can only work if the batches are themselves a good representation of the full dataset or represent distinct clusters.

One solution could be sampling the data. For example, if the dataset has 1B rows and we can only train on 10M, select 10M of the 1B rows with a good sampling technique so that the sample still closely represents the full dataset.
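A minimal sketch of that sampling idea, assuming a FAISS-style HNSW+SQ8 index (the thread doesn't name the library, so `IndexHNSWSQ`, the dimensions, and the sample size here are all illustrative assumptions): train the quantizer once on a random sample that represents the whole dataset, then add every vector into a single shared graph.

```python
import numpy as np
import faiss  # assumed backend; adapt to whatever engine you actually use

d = 128              # vector dimensionality (illustrative)
M = 32               # HNSW graph degree (illustrative)
n = 200_000          # full dataset size (illustrative)
train_size = 20_000  # how many vectors we can afford to train on

# xb stands in for the full embedding matrix, shape (n, d), float32.
xb = np.random.rand(n, d).astype("float32")

# Uniform random sample used only for training the scalar quantizer,
# so the learned value ranges reflect the whole dataset rather than one batch.
rng = np.random.default_rng(42)
sample_ids = rng.choice(n, size=train_size, replace=False)
train_vecs = xb[sample_ids]

index = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, M)
index.hnsw.efConstruction = 200

index.train(train_vecs)  # train SQ8 on the representative sample
index.add(xb)            # add all vectors; they are encoded into one shared graph
```

The key point is that only the training step is downsampled; all vectors still go into one index, so you avoid the per-batch local graphs that hurt recall.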