Module for automatic summarization #324
Thanks @fedelopez77! This is a meaty addition, the review may take a while :) In the meantime, can you fix the failing unit tests? Travis reports failures on 2.6, 3.3 and 3.4.
It would be very interesting to build the doc vectors from a w2v model trained on a different corpus, evaluate the summaries, and compare them with the original results.
It looks awesome! I have just a few minor notes.
Though I'm not sure about the licence line. This is maybe more a question for @piskvorky.
Points 2) and 3) I have prepared in my local cloned version, so if you don't mind I can push it (also if the licence line is correct). @fedelopez77 Also do you plan to write some more tests? For example for I have found several problems trying to experiment with "pathological" cases, e.g. very short input text. Then trying
from abc import ABCMeta, abstractmethod


class IGraph:
Just a formal note: it'd be nice to keep this consistent with the rest of gensim and inherit from object: class IGraph(object):
Second, there is sometimes too much whitespace (per the PEP 8 style guide).
Great job guys :) Re. open source license: sure, LGPL, like the rest of gensim. Re. corner cases: we want these to work too. Or, if there's a reason they can't, a clearer error message, so we don't confuse users. Good catch. @ziky90, if you have some fixes ready, open a new PR against A brief summary of resource use (CPU & RAM: big-O + practical tips) would be nice indeed. @fedelopez77 @fbarrios, do you think you could write a short article about this new functionality? I'll publish that (or provide a link to your published version) + link there from API docs. Thanks!
Thanks for the feedback :) @ziky90 @piskvorky We have found that the input text must be at least around ten sentences long for the summary to make sense. The ZeroDivisionError issue is definitely a bug, and it will happen every time a text with just one sentence is provided. It will also happen if the text has sentences without any words in common, or if the text is meaningless. @ziky90 Yes, we didn’t provide enough tests. We plan to add them soon. @piskvorky Yes, I agree we should write more documentation. We’re a little busy at the moment with university stuff, but we’ll be working on that as soon as we can.
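For illustration only, and not the code in this PR: a minimal sketch of one place where a division by zero can arise in a word-overlap similarity in the style of the TextRank paper, together with a guard and a descriptive error for the cases described above. The names `tokenize`, `similarity` and `check_summarizable` are hypothetical, and counting distinct words instead of raw tokens is an assumption made to keep the example short.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(s1, s2):
    """Word-overlap similarity, guarded against a zero denominator."""
    w1, w2 = tokenize(s1), tokenize(s2)
    if not w1 or not w2:
        return 0.0
    denominator = math.log(len(w1)) + math.log(len(w2))
    if denominator == 0:  # both sentences contain exactly one distinct word
        return 0.0
    return len(w1 & w2) / denominator


def check_summarizable(sentences):
    """Raise a descriptive ValueError instead of failing later with ZeroDivisionError."""
    if len(sentences) < 2:
        raise ValueError("input must contain at least two sentences")
    has_edge = any(
        similarity(a, b) > 0
        for i, a in enumerate(sentences)
        for b in sentences[i + 1:]
    )
    if not has_edge:
        raise ValueError("sentences share no words; the similarity graph has no edges")
```

Calling `check_summarizable` before ranking would turn both reported failure modes (a single sentence, or sentences with no words in common) into an error message a user can act on.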
Consistency with gensim and pep 8
Thanks @fbarrios! The "tutorial article" is not blocking. We can review and merge without it. But if you have a presentation / text ready (maybe for your uni?), we could link to that.
Conflicts: gensim/summarization/summarizer.py
We've got some updates on this, sorry for the delay. We've been working on your suggestions. The border cases and the bug @ziky90 pointed out are both fixed. Thanks! We do have some texts about the project, but most of them are written in Spanish:
We made this script to summarize the English Wikipedia. I'll keep you posted when we make advances with the documentation or have some performance results.
Hi guys, are you aware of any work on, or any pointers for, using Word2Vec for summarization? It would be interesting to use word2vec similarity as the similarity function and see the results. Moreover, do you have any module to convert a text/sentence/document to a vector based on a w2v model, so that the model can be used to find the similarity among the sentences?
@fbarrios I would like to ask how things are looking with adding tests. Might I help by writing some? I would like to play a bit with summarization on my toy project, so I would like to ask how I can help towards merging summarization into gensim.
We're planning to make a new release this weekend. It'd be great to get this in -- what is missing? @fbarrios @fedelopez77 @ziky90 can you complete the PR, make it merge-able? Cheers!
@piskvorky We integrated the development branch to make the changes mergeable. We still have to write a more detailed description and more tests over the following days, but we think the PR is ready to be merged.
Great! Merging now, thanks to the entire team! A tutorial will be very welcome of course, that's what many people need to get started :)
Hello, GenSim folks.
Hello @al7veda, this is the repository for
Sorry, I was looking at the summa/textrank.py docs, and then switched to gensim trying to find a solution to my issue, but didn't realize that's a different package. I'll be more careful next time.
No worries. If something fails in gensim, feel free to report an issue.
This adds a module for automatic summarization based on TextRank. It features both uses introduced in the original paper: sentence extraction for summaries and keyword extraction.
The input can be either a gensim corpus or the raw text. The output is a summarized text, a list of sentences or a list of keywords.
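As a rough picture of the sentence-extraction use, here is a self-contained sketch of TextRank-style ranking in plain Python. It is not the module's implementation or API: `words`, `overlap`, `textrank_scores` and `summarize_sketch` are hypothetical names, the input is assumed to be already split into sentences, and the damping factor and iteration count are illustrative defaults.

```python
import math
import re


def words(sentence):
    """Return the set of distinct lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap(a, b):
    """Shared words, normalized by the log lengths of both sentences."""
    wa, wb = words(a), words(b)
    denom = math.log(len(wa)) + math.log(len(wb)) if wa and wb else 0.0
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the sentence similarity graph."""
    n = len(sentences)
    weights = [
        [overlap(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
        for i, a in enumerate(sentences)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight j -> i.
            rank = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize_sketch(sentences, count=2):
    """Keep the `count` highest-scoring sentences, in their original order."""
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest receives no incoming weight and stays at the minimum score, so it is the first to be dropped from the summary.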
The summaries generated using the sentence extraction feature were evaluated using the ROUGE toolkit and the 2002 Document Understanding Conference corpus, as in the original paper. The results we found were similar: TextRank performed better than the DUC baseline by 2.3%.
We include a test in which we reproduce the results of the paper using a sample article.