Reduce memory consumption of summarizer #2298

Merged
merged 15 commits into from
Jan 18, 2019
89 changes: 88 additions & 1 deletion gensim/summarization/bm25.py
@@ -162,6 +162,48 @@ def get_scores(self, document):
        scores = [self.get_score(document, index) for index in range(self.corpus_size)]
        return scores

    def get_scores_bow(self, document):
        """Computes and returns BM25 scores of given `document` in relation to
        every item in corpus.

        Parameters
        ----------
        document : list of str
            Document to be scored.

        Returns
        -------
        list of (index, float)
            BM25 scores in a bag of weights format.

        """
        scores = []
        for index in range(self.corpus_size):
            score = self.get_score(document, index)
            if score > 0:
Contributor

In that case len(scores) <= self.corpus_size, why?

Contributor Author

Because it's actually quite a sparse array, isn't it?
In summarizer._set_graph_edge_weights, documents with such small weights are dropped anyway, so there is no reason to waste extra memory. What's more, if we need a dense array, we can expand this bow back into one.

Contributor

I don't get it: how do we know the ids of the documents that have 0 scores in that case?

Contributor Author

Easy. It works like words with 0 weight in a bow. We have the ids of the docs with non-zero weight; they are saved in the bag-of-docs (I should rename that part of the function name from bow, bag-of-weights, to bod, bag-of-docs). If a doc id isn't in the bag-of-docs, the weight of that doc is 0.

                scores.append((index, score))
        return scores
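
The review thread above turns on two points: a document id missing from the sparse result implies a score of 0, and a dense row can be rebuilt when needed. A minimal sketch of both, assuming the `(index, score)` pair format returned by `get_scores_bow`; the helper names `bow_to_dense` and `score_of` and the sample values are hypothetical and not part of this PR:

```python
def bow_to_dense(bow, corpus_size):
    """Expand sparse (doc_index, score) pairs into a dense list of scores."""
    dense = [0.0] * corpus_size
    for index, score in bow:
        dense[index] = score
    return dense


def score_of(bow, doc_index):
    """Look up one document's score; ids absent from the bag score 0."""
    return dict(bow).get(doc_index, 0.0)


# Hypothetical sparse row for a corpus of 5 documents:
# only documents 0 and 3 have a positive score.
bow = [(0, 1.25), (3, 0.5)]
dense = bow_to_dense(bow, corpus_size=5)
missing = score_of(bow, 2)
```

The dense form costs one float per corpus document per row, which is the memory the sparse form avoids when most scores are zero.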


def _get_scores_bow(bm25, document):
    """Helper function for retrieving bm25 scores of given `document` in parallel
    in relation to every item in corpus.

    Parameters
    ----------
    bm25 : BM25 object
        BM25 object fitted on the corpus where documents are retrieved.
    document : list of str
        Document to be scored.

    Returns
    -------
    list of (index, float)
        BM25 scores in a bag of weights format.

    """
    return bm25.get_scores_bow(document)


def _get_scores(bm25, document):
    """Helper function for retrieving bm25 scores of given `document` in parallel
@@ -183,6 +225,52 @@ def _get_scores(bm25, document):
    return bm25.get_scores(document)


def iter_bm25_bow(corpus, n_jobs=1):
    """Yield BM25 scores (weights) of documents in corpus.
    Each document has to be weighted with every document in given corpus.

    Parameters
    ----------
    corpus : list of list of str
        Corpus of documents.
    n_jobs : int
        The number of processes to use for computing bm25.

    Yields
    ------
    list of (index, float)
        BM25 scores in bag of weights format.

    Examples
    --------
    .. sourcecode:: pycon

        >>> from gensim.summarization.bm25 import iter_bm25_bow
        >>> corpus = [
        ...     ["black", "cat", "white", "cat"],
        ...     ["cat", "outer", "space"],
        ...     ["wag", "dog"]
        ... ]
        >>> result = iter_bm25_bow(corpus, n_jobs=-1)

    """
    bm25 = BM25(corpus)

    n_processes = effective_n_jobs(n_jobs)
    if n_processes == 1:
        for doc in corpus:
            yield bm25.get_scores_bow(doc)
        return

    get_score = partial(_get_scores_bow, bm25)
    pool = Pool(n_processes)

    for bow in pool.imap(get_score, corpus):
        yield bow
    pool.close()
    pool.join()
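
Because `iter_bm25_bow` is a generator, a caller can process one sparse row at a time instead of holding the full score matrix. The sketch below illustrates that consumption pattern only; `iter_sparse_rows` is a hypothetical stand-in for `iter_bm25_bow` (it does not compute BM25), and `collect_edges` with its `threshold` parameter is an assumed consumer, not the actual `summarizer._set_graph_edge_weights` code:

```python
def iter_sparse_rows(dense_rows):
    """Stand-in generator: yield each row as (index, weight) pairs, dropping zeros."""
    for row in dense_rows:
        yield [(j, w) for j, w in enumerate(row) if w > 0]


def collect_edges(rows, threshold=0.001):
    """Consume rows lazily, keeping only off-diagonal weights above threshold."""
    edges = {}
    for i, bow in enumerate(rows):
        for j, weight in bow:
            if i != j and weight > threshold:
                edges[(i, j)] = weight
    return edges


# Toy weights (not real BM25 scores) for a 3-document corpus.
toy_weights = [
    [1.0, 0.0, 0.4],
    [0.0, 2.0, 0.0],
    [0.4, 0.0, 1.5],
]
edges = collect_edges(iter_sparse_rows(toy_weights))
```

Only one row exists in memory at a time on the consumer side, and zero-weight pairs are never materialized at all.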


def get_bm25_weights(corpus, n_jobs=1):
    """Returns BM25 scores (weights) of documents in corpus.
    Each document has to be weighted with every document in given corpus.
@@ -224,5 +312,4 @@ def get_bm25_weights(corpus, n_jobs=1):
    weights = pool.map(get_score, corpus)
    pool.close()
    pool.join()

    return weights