
Module for automatic summarization #324

Merged (35 commits) on Jul 5, 2015

Conversation

fedelopez77
Contributor

This adds a module for automatic summarization based on TextRank. It features both uses introduced in the original paper: sentence extraction for summaries and keyword extraction.

The input can be either a gensim corpus or the raw text. The output is a summarized text, a list of sentences or a list of keywords.

from gensim import corpora
from gensim import summarization

text = "Jerry works in his father-in-law's car dealership and has gotten " + \
       "himself in financial problems. He tries various schemes to come " + \
       "up with money needed for a reason that is never really explained. " + \
       "It has to be assumed that his huge embezzlement of money from the " + \
       "dealership is about to be discovered by father-in-law. When all  " + \
       "else falls through, plans he set in motion earlier for two men to " + \
       "kidnap his wife for ransom to be paid by her wealthy father (who " + \
       "doesn't seem to have the time of day for son-in-law). From the " + \
       "moment of the kidnapping, things go wrong and what was supposed " + \
       "to be a non-violent affair turns bloody with more blood added by " + \
       "the minute. Jerry is upset at the bloodshed, which turns loose a " + \
       "pregnant sheriff from Brainerd, MN who is tenacious in attempting " + \
       "to solve the three murders in her jurisdiction."

sentences = text.split(".")
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

print(summarization.textrank_from_corpus(corpus, len(dictionary.token2id)))
[[(3, 1), (7, 1), (9, 1), (18, 1), (19, 1), (25, 1), (26, 2), (31, 1), 
(32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), 
(39, 1), (40, 1), (41, 1), (42, 1)]]

# Or from the raw text
print(summarization.summarize(text))
'It has to be assumed that his huge embezzlement of money from the 
dealership is about to be discovered by father-in-law.'

The summaries generated by the sentence extraction feature were evaluated with the ROUGE toolkit on the 2002 Document Understanding Conference corpus, as in the original paper. Our results were similar: TextRank performed better than the DUC baseline by 2.3%.
We include a test in which we reproduce the results of the paper using a sample article.

@piskvorky
Owner

Thanks @fedelopez77 !

This is a meaty addition, the review may take a while :)

In the meantime, can you fix the failing unit tests? Travis reports failures on 2.6, 3.3 and 3.4.

@nick-magnini

It would be very interesting to build the document vectors from a word2vec model trained on a different corpus, evaluate the summaries, and compare with the original results.

@ziky90
Contributor

ziky90 commented Apr 26, 2015

It looks awesome!

I have just a few minor notes.

  1. In summarization.summarize(), wouldn't it be good to also allow users to call it with a corpus in addition to text? Or another method could be added, for example summarization.summarize_corpora(). It seems to me a possible simplification of working with the summarization module. Does this make sense?

  2. From a formal point of view, shouldn't every Python file start with:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

Though I'm not sure about the licence line. This is maybe more of a question for @piskvorky.

  3. See the line comment in gensim/summarization/graph.py.

Points 2) and 3) I have prepared in my local clone, so if you don't mind I can push them (assuming the licence line is correct).
Point 1) I can do and push as well, but I'd better ask first: do you think a new method should be added, or should summarization.summarize() be able to handle corpora directly?
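Either way, point 1) would boil down to a small dispatcher. A hypothetical sketch of the routing idea (summarize_any and its callback parameters are made-up names, not gensim API):

```python
def summarize_any(doc, summarize_text, summarize_corpus):
    """Route raw text to the text-based summarizer and anything else
    (e.g. a bag-of-words corpus) to the corpus-based one.
    Hypothetical helper -- not part of the proposed module."""
    if isinstance(doc, str):
        return summarize_text(doc)
    return summarize_corpus(doc)
```

The alternative, of course, is to put this check directly inside summarization.summarize() instead of adding a second public entry point.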

@fedelopez77
I'd also like to ask whether you have made any performance measurements (e.g. how long it takes to summarize every article in the English Wikipedia, or the 567 articles referenced in the paper), and if so, whether you could publish them in the initial comment. (I am currently trying to do something similar myself, so this would save me some time.)

Also, do you plan to write some more tests? For example, for summarization.textrank_from_corpus you could try other inputs, cases that should not work, etc.

I have found several problems while experimenting with "pathological" cases, e.g. very short input text.
For example, summarization.summarize("Jerry works in his father himself in financial problems. He tries various schemes to come up with money needed for a reason that is never really explained.") throws ZeroDivisionError: float division by zero. I think this should at least be replaced by some custom gensim exception informing the user that summarization needs a longer input text.

Then, summarization.summarize("Jerry works in his father himself in financial problems. It has to be assumed that his huge embezzlement of money dealership is about to be discovered by father-in-law.") returns an empty summary. Again, I am not sure what the proper behaviour should be in a case like this.
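Until a proper fix lands, one way to fail fast on such inputs is a guard in front of the summarizer. A minimal sketch (safe_summarize and its naive sentence split are my own hypothetical names, not part of the PR; the real summarization function is passed in as a callable):

```python
def safe_summarize(text, summarizer, min_sentences=2):
    """Raise a clear error instead of letting the summarizer hit
    ZeroDivisionError on very short input. `summarizer` is the real
    summarization function, e.g. summarization.summarize."""
    # Naive period-based sentence split, used only for the length check.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < min_sentences:
        raise ValueError(
            "input has only %d sentence(s); need at least %d to summarize"
            % (len(sentences), min_sentences))
    return summarizer(text)
```

This only papers over the symptom; a custom gensim exception raised inside the module, as suggested above, would be the cleaner fix.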

from abc import ABCMeta, abstractmethod


class IGraph:

Just a formal note: it'd be nice to keep this consistent with the rest of gensim and inherit from object, i.e. class IGraph(object):.
Second, there is sometimes too much whitespace (per the PEP 8 style guide).

@piskvorky
Owner

Great job guys :)

Re. open source license: sure, LGPL, like the rest of gensim.

Re. corner cases: we want these to work too. Or, if there's a reason they can't, a clearer error message, so we don't confuse users. Good catch.

@ziky90, if you have some fixes ready, open a new PR against summanlp's PR branch. I'll play with the code some more today too.

A brief summary of resource use (CPU & RAM: big-O + practical tips) would be nice indeed.

@fedelopez77 @fbarrios, do you think you could write a short article about this new functionality?
It's large enough that I think we ought to give users a short intro and tutorial (motivation, background, solution approach, how to use it, weaknesses, strengths) :)

I'll publish that (or provide a link to your published version) + link there from API docs.

Thanks!

@fbarrios
Contributor

Thanks for the feedback :)

@ziky90 @piskvorky We have found that the input text must be at least around ten sentences long for the summary to make sense.
The summary length is calculated as the number of sentences times the ratio, and if that comes out below one, the resulting summary will be empty. Perhaps an exception should be thrown in that case.

The ZeroDivisionError is definitely a bug; it will happen every time a text with just one sentence is provided. It will also happen if the sentences have no words in common, or with meaningless text.
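The length rule described above is easy to state as code. A tiny illustration (expected_summary_length is a hypothetical helper for this thread, not module API; 0.2 is assumed as the default ratio):

```python
def expected_summary_length(num_sentences, ratio=0.2):
    """Number of sentences the extractive summary will contain:
    the input sentence count times the ratio, truncated down.
    A result of 0 means the returned summary is empty."""
    return int(num_sentences * ratio)

# With the assumed default ratio of 0.2, an input needs at least
# 5 sentences before the summary contains anything at all.
```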

@ziky90 Yes, we didn’t provide enough tests. We plan to add more soon.
About the performance: summarizing the 567 documents takes around 21 seconds on my Athlon II X2 240 CPU. We’ll be testing with Wikipedia and sharing the results when we have them :)

@piskvorky Yes, I agree we should write more documentation. We’re a little busy at the moment with university stuff, but we’ll be working on that as soon as we can.

Consistency with gensim and PEP 8
@piskvorky
Owner

Thanks @fbarrios !

The "tutorial article" is not blocking. We can review and merge without it. But if you have a presentation / text ready (maybe for your uni?), we could link to that.

@fbarrios
Contributor

We've got some updates on this, sorry for the delay.

We've been working on your suggestions. The border cases and the bug @ziky90 pointed out are both fixed. Thanks!
We added a few tests for those cases, although we still need to add some more code documentation.

We do have some texts about the project, but most of them are written in Spanish.

We made this script to summarize the English Wikipedia.
It's been running since yesterday morning (around 30 hours) and has processed around 50,000 articles so far. Currently, Wikipedia has around 5,000,000 articles, so I don't think the script will end anytime soon.

I'll keep you posted when we make advances with the documentation or have some performance results.

@nick-magnini

Hi guys,

Are you aware of any work on, or any clues about, using Word2Vec for summarization? It would be interesting to use word2vec similarity as the similarity function and see the results. Moreover, do you have any module to convert a text/sentence/document into a vector based on a w2v model, so that the model can be used to find the similarity among sentences?

@ziky90
Contributor

ziky90 commented Jun 24, 2015

@fbarrios I would like to ask how it is going with adding the tests. Might I help by writing some? I would like to play a bit with summarization in a toy project of mine, so I'd like to know how I can help towards getting summarization merged into gensim.

@piskvorky
Owner

We're planning to make a new release this weekend. It'd be great to get this in -- what is missing?

@fbarrios @fedelopez77 @ziky90 can you complete the PR, make it merge-able?

Cheers!

@fbarrios
Contributor

fbarrios commented Jul 5, 2015

@piskvorky We integrated the development branch to make the changes mergeable.
We also added detailed documentation to the public methods and a few more tests, but the build is failing because of a test in test_models.py.

We still have to write a more detailed description and more tests over the following days, but we think the PR is ready to be merged.

@piskvorky
Owner

Great! Merging now, thanks to the entire team!

A tutorial will be very welcome of course, that's what many people need to get started :)

piskvorky added a commit that referenced this pull request Jul 5, 2015
Module for automatic summarization
@piskvorky piskvorky merged commit d0e5e74 into piskvorky:develop Jul 5, 2015
@al7veda

al7veda commented Jul 9, 2015

Hello, GenSim folks.
I'm trying to use summa/textrank-0.07 from the command line (cd path/to/folder/summa/ && python textrank.py -t FILE) on a plain-text data set (10 MB; Windows 8.1, 12 GB RAM), but I'm getting a MemoryError:
C:\Python27\Lib\site-packages\summa>python textrank.py -t train_set_1SentByLine_clean.txt
Traceback (most recent call last):
  File "textrank.py", line 75, in <module>
    main()
  File "textrank.py", line 71, in main
    print textrank(text, summarize_by, ratio, words)
  File "textrank.py", line 60, in textrank
    return summarize(text, ratio, words)
  File "C:\Python27\Lib\site-packages\summa\summarizer.py", line 97, in summarize
    _set_graph_edge_weights(graph)
  File "C:\Python27\Lib\site-packages\summa\summarizer.py", line 17, in _set_graph_edge_weights
    graph.add_edge(edge, similarity)
  File "C:\Python27\Lib\site-packages\summa\graph.py", line 177, in add_edge
    self.set_edge_properties((u, v), label=label, weight=wt)
  File "C:\Python27\Lib\site-packages\summa\graph.py", line 226, in set_edge_properties
    self.edge_properties.setdefault((edge[1], edge[0]), {}).update(properties)
MemoryError
I'd greatly appreciate any advice on how to fix this error.
Thank you,
Al

@piskvorky
Owner

Hello @al7veda , this is the repository for gensim. For other Python packages, please use their respective support systems directly.

@al7veda

al7veda commented Jul 10, 2015

Sorry, I was looking at the summa/textrank.py docs and then switched to gensim trying to find a solution to my issue, but didn't realize it's a different package. I'll be more careful next time.
Best.

@piskvorky
Owner

No worries. If something fails in gensim, feel free to report an issue.
