BM25 does not support generator as corpus #2434

perezzini · 2019-04-04T12:13:11Z

The __init__ method of the BM25 class takes a "list of list of str" as its corpus rather than a generator. This is what it currently looks like:

def __init__(self, corpus):
    """
    Parameters
    ----------
    corpus : list of list of str
        Given corpus.
    """
    self.corpus_size = len(corpus)
    self.avgdl = 0
    self.doc_freqs = []
    self.idf = {}
    self.doc_len = []
    self._initialize(corpus)

Accepting a generator instead of a list would make it possible to handle large document collections that do not fit in memory.


piskvorky · 2019-04-04T12:16:51Z

@Witiko can you please have a look?

Models (any models) in Gensim not supporting streaming is definitely a bug.

perezzini · 2019-04-04T23:38:06Z

I think the tricky part of using a generator as the document collection is computing the average document length. As far as I can see, Python has no "proper" way of getting the length of a generator. A trivial (and not very efficient) way could be the following (assuming corpus is a generator of lists):

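from functools import reduce  # reduce is not a builtin in Python 3

init = (0, 0)
# Fold (length, 1) pairs into (total length, document count); this consumes the generator.
total_sum, total_docs = reduce(lambda pair1, pair2: (pair1[0] + pair2[0], pair1[1] + pair2[1]), map(lambda d: (len(d), 1), corpus), init)
avgdl = total_sum / total_docs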

piskvorky · 2019-04-05T08:33:16Z

We'd definitely want to avoid huge reduce / map (not Pythonic), but the code sounds trivial in any case.
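A plain loop is one way to compute the same average in a single pass without reduce / map. This is only a minimal sketch: the function name `average_document_length` and the `stream_documents` generator are hypothetical and not part of Gensim.

```python
def average_document_length(corpus):
    """Return the mean number of tokens per document in a single pass.

    `corpus` may be any iterable of token lists, including a generator.
    """
    total_tokens = 0
    total_docs = 0
    for document in corpus:
        total_tokens += len(document)
        total_docs += 1
    if total_docs == 0:
        raise ValueError("corpus is empty")
    return total_tokens / total_docs


def stream_documents():
    # Hypothetical generator standing in for a corpus too large for memory.
    yield ["black", "cat", "white", "cat"]
    yield ["cat", "outer", "space"]
    yield ["wag", "dog"]


avgdl = average_document_length(stream_documents())
```

Only two integers are kept while iterating, so memory use does not depend on the number of documents.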

The main question for me is, why does the current implementation ask for a list in the first place? What design choices led to this? How did it pass our reviews? (If it is indeed true that streaming is not supported.)

CC @fbarrios @menshikh-iv @horpto do you remember, can you help? Cheers.

Witiko · 2019-04-07T18:18:20Z

@piskvorky There is no apparent reason why corpus can't be an iterable. Instead of initializing self.corpus_size to len(corpus) in __init__, corpus_size can be accumulated in the for document in corpus loop in the _initialize method. The _initialize method only assumes that corpus is iterable.
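The suggestion could look roughly like the sketch below. It is not the actual Gensim code: `StreamingCorpusStats` is a hypothetical stand-in that keeps only the attributes quoted earlier in this thread, and the body of `_initialize` is an assumption about what such a loop would do.

```python
class StreamingCorpusStats:
    """Hypothetical stand-in for the statistics part of a BM25-like class."""

    def __init__(self, corpus):
        self.corpus_size = 0  # accumulated in _initialize, not len(corpus)
        self.avgdl = 0.0
        self.doc_freqs = []
        self.doc_len = []
        self._initialize(corpus)

    def _initialize(self, corpus):
        total_tokens = 0
        for document in corpus:
            self.corpus_size += 1
            self.doc_len.append(len(document))
            total_tokens += len(document)
            # Per-document term frequencies.
            frequencies = {}
            for word in document:
                frequencies[word] = frequencies.get(word, 0) + 1
            self.doc_freqs.append(frequencies)
        if self.corpus_size:
            self.avgdl = total_tokens / self.corpus_size


# A generator expression works because the corpus is only iterated once.
stats = StreamingCorpusStats(doc for doc in [["a", "b", "a"], ["c"]])
```

Note that in this sketch `doc_freqs` and `doc_len` still grow with the number of documents, so only the raw corpus is streamed, not the derived statistics.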

As for why this code was merged in the first place, I can't really tell. It has been in place since Gensim 3.7.2 (see ac7486a). The corresponding pull request (#324) does not discuss the BM25 class at all. It seems that the entire gensim.summarization.bm25 module was originally designed to be just a private API.

piskvorky · 2019-04-07T19:15:39Z

Thanks @Witiko, makes sense.

@perezzini can you open a PR with a fix, as per @Witiko's suggestion?

perezzini · 2019-04-08T14:07:59Z

@piskvorky I'll make a PR as soon as possible!

Thanks for replying!

saraswatmks mentioned this issue May 6, 2019

generator support in bm25 #2479

Merged

mpenkov closed this as completed in #2479 May 30, 2019
