Skip to content

Can faiss batch sa_encode embeddings? #4193

Answered by satymish
Luciferre asked this question in Q&A
Discussion options

You must be logged in to vote

Hi @Luciferre , I think splitting embeddings into batches and training them separately is probably creating its own local graph and leading to reduced recall. It's usually better to encode the entire dataset together without splitting it into batches but it can work if the split batches are themselves a good representation of the full dataset or are representing different clusters.

One solution could be sampling the data. For eg if the dataset is of 1B and we can train only on 10M, then selecting 10M out of 1B rows using a good sampling technique that extracts rows but still closely representing full dataset.

Replies: 1 comment 1 reply

Comment options

You must be logged in to vote
1 reply
@Luciferre
Comment options

Answer selected by Luciferre
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Category
Q&A
2 participants
Converted from issue

This discussion was converted from issue #4192 on February 18, 2025 05:34.