BM25 does not support generator as corpus #2434
Comments
@Witiko can you please have a look? Models (any models) in Gensim not supporting streaming is definitely a bug.
I think the tricky part of using a generator as a collection of documents is computing the average document length in the text collection. As far as I can see, Python has no "proper" way of getting the "length" of a generator. A trivial (and not very efficient) way could be the following:
from functools import reduce
init = (0, 0)  # running (total tokens, total documents)
total_sum, total_docs = reduce(lambda pair1, pair2: (pair1[0] + pair2[0], pair1[1] + pair2[1]), map(lambda d: (len(d), 1), corpus), init)
avgdl = total_sum / total_docs
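For comparison, the same average can be computed with a plain loop that reads the corpus exactly once. This is only a minimal sketch: `average_document_length` and `stream_documents` are hypothetical names made up for illustration, not part of Gensim, and each document is assumed to be a list of tokens.

```python
def average_document_length(corpus):
    """Return the mean token count over an iterable of tokenized documents.

    The corpus is read exactly once, so a generator works as input.
    """
    total_tokens = 0
    total_docs = 0
    for document in corpus:
        total_tokens += len(document)
        total_docs += 1
    if total_docs == 0:
        raise ValueError("corpus is empty")
    return total_tokens / total_docs


def stream_documents():
    """Hypothetical stand-in for a corpus that does not fit in memory."""
    yield ["the", "quick", "brown", "fox"]
    yield ["jumps"]
    yield ["over", "the", "lazy"]


avgdl = average_document_length(stream_documents())
```

A generator is exhausted after this single pass, so any other per-document statistics would have to be collected inside the same loop.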
We'd definitely want to avoid huge The main question for me is, why does the current implementation ask for a CC @fbarrios @menshikh-iv @horpto do you remember, can you help? Cheers.
@piskvorky There is no apparent reason why As for why this code was merged in the first place, I can't really tell. It has been in place since Gensim 3.7.2 (see ac7486a). The corresponding pull request (#324) does not discuss the
Thanks @Witiko, makes sense. @perezzini can you open a PR with a fix, as per @Witiko's suggestion?
@piskvorky I'll make a PR as soon as possible! Thanks for replying!
The __init__ method in the BM25 class takes a "list of list of str" as corpus instead of a generator. More precisely, this is what it looks like right now:
As we know, accepting a generator instead of a list would make it possible to handle large collections of documents that do not fit in memory.
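To illustrate the difference between a list, a one-shot generator, and a streamed corpus that can be read more than once, here is a small sketch. `RestartableCorpus` and `tokenized_lines` are hypothetical names used only for this example; they are not Gensim classes, and the two hard-coded lines stand in for a file too large to load.

```python
class RestartableCorpus:
    """Wrap a generator function so the corpus can be iterated repeatedly.

    Documents are still produced one at a time, so nothing is held in memory.
    """

    def __init__(self, make_iterator):
        self.make_iterator = make_iterator

    def __iter__(self):
        # A fresh generator is created on every pass over the corpus.
        return self.make_iterator()


def tokenized_lines():
    """Hypothetical document source; a real one might read lines from disk."""
    for line in ["first document here", "second one"]:
        yield line.split()


corpus = RestartableCorpus(tokenized_lines)
first_pass = [len(doc) for doc in corpus]
second_pass = [len(doc) for doc in corpus]

# A bare generator, by contrast, is empty on the second pass.
one_shot = tokenized_lines()
one_shot_first = [len(doc) for doc in one_shot]
one_shot_second = [len(doc) for doc in one_shot]
```

An implementation that needs several passes over the documents could accept an object like this, while one that needs a single pass could accept the bare generator directly.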