Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BM25 does not support generator as corpus #2434

Closed
perezzini opened this issue Apr 4, 2019 · 6 comments · Fixed by #2479
Closed

BM25 does not support generator as corpus #2434

perezzini opened this issue Apr 4, 2019 · 6 comments · Fixed by #2479

Comments

@perezzini
Copy link

__init__ method in BM25 class takes a "list of list of str" as corpus instead of a generator. More precisely, this is what it looks like right now:

def __init__(self, corpus):
        """
        Parameters
        ----------
        corpus : list of list of str
            Given corpus.
        """
        self.corpus_size = len(corpus)
        self.avgdl = 0
        self.doc_freqs = []
        self.idf = {}
        self.doc_len = []
        self._initialize(corpus)

As we know, considering a generator instead of a list would be great to handle large collections of documents that do not fit in memory.

@piskvorky
Copy link
Owner

piskvorky commented Apr 4, 2019

@Witiko can you please have a look?

Models (any models) in Gensim not supporting streaming is definitely a bug.

@perezzini
Copy link
Author

I think the tricky problem here, using a generator as a collection of documents, is computing the average document length in the text collection. As I can see, in Python, there's no "proper" way of getting the "length" of a generator. A trivial (and not so efficiently) way could be the following (supposing corpus is a generator of lists):

init = (0, 0)
total_sum, total_docs = reduce(lambda pair1, pair2: (pair1[0] + pair2[0], pair1[1] + pair2[1]), map(lambda d: (len(d), 1), corpus), init)
avgdl = total_sum/total_docs

@piskvorky
Copy link
Owner

piskvorky commented Apr 5, 2019

We'd definitely want to avoid huge reduce / map (not Pythonic), but the code sounds trivial in any case.

The main question for me is, why does the current implementation ask for a list in the first place? What design choices lead to this? How did it pass our reviews? (if indeed true that streaming is not supported)

CC @fbarrios @menshikh-iv @horpto do you remember, can you help? Cheers.

@Witiko
Copy link
Contributor

Witiko commented Apr 7, 2019

@piskvorky There is no apparent reason why corpus can't be an iterable. Instead of initializing self.corpus_size to len(corpus) in __init__, corpus_size can be accumulated in the for document in corpus loop in the _initialize method. The _initialize method only assumes that corpus is iterable.

As for why this code was merged in the first place, I can't really tell. It has been in place since Gensim 3.7.2 (see ac7486a). The corresponding pull request (#324) does not discuss the BM25 class at all. It seems that the entire gensim.summarization.bm25 module was originally designed to be just a private API.

@piskvorky
Copy link
Owner

piskvorky commented Apr 7, 2019

Thanks @Witiko , makes sense.

@perezzini can you open a PR with a fix, as per @Witiko 's suggestion?

@perezzini
Copy link
Author

@piskvorky I'll make a PR as soon as possible!

Thanks for replying!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants