Add multiple language stopwords with customizable stop word paths #40
@@ -6,8 +6,12 @@

module ClassifierReborn
module Hasher
@stopwords_path = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
make a const?
Sure, consts are fun.
This is really awesome! How much does this weigh? Will wait for @Ch4s3's input.
So when you say 'weigh', you mean does this slow things down any?
@kreynolds Yes, but also the byte size increase in downloading the gem.
It's not any slower; it just takes up slightly more memory if you are classifying multiple languages in the same runtime. There are 25 KB of stopwords among all of the languages put together. It's probably worth noting that after this patch, I have another set of patches to improve performance, particularly around the Hasher (300% speedup, give or take).
@@ -18,22 +18,22 @@ def without_punctuation(str)

# Return a Hash of strings => ints. Each word in the string is stemmed,
# interned, and indexes to its frequency in the document.
def word_hash(str)
word_hash = clean_word_hash(str)
def word_hash(str, language='en')
Nit: we follow the GitHub Ruby Styleguide for new code. It states:
Use spaces around the = operator when assigning default values to method parameters.
Would you mind updating your changes to match this?
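For reference, a minimal sketch of the spacing the style guide asks for. The method name `word_hash_sketch` and its body are hypothetical and stand in for the real `word_hash`; only the parameter list matters here.

```ruby
# Hypothetical stand-in for the method under review; the body is
# illustrative only and does not reflect the gem's hashing logic.
#
# Flagged form:   def word_hash(str, language='en')
# Requested form: spaces around `=` in the default value.
def word_hash_sketch(str, language = 'en')
  [str, language]
end
```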
Done, and I removed Indonesian, which was empty.
LGTM. @Ch4s3? Please merge and update the history if you think it's good to merge.
Looks great. I'll merge it in as soon as I have time to update the history.
👍
This adds the ability to have stop words in multiple languages as well as prepend a custom stopword path. I personally have a much larger stopword list for English than the one that came with this library, but I wanted to write everything in a completely backwards-compatible way.
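A rough sketch of how a prepended stopword path could take priority over the bundled lists. This is not the code in this PR: the module `StopwordSketch` and its methods `add_custom_path` and `stopwords_for` are hypothetical names, and it assumes one whitespace-separated file per language code (for example `en`) inside each directory.

```ruby
require 'set'

# Hypothetical illustration of a per-language stopword lookup with a
# prependable search path. None of these names come from the gem.
module StopwordSketch
  @paths = []  # directories, searched in order; earlier entries win
  @cache = {}  # language code => Set of stopwords

  class << self
    attr_reader :paths

    # Put a directory at the front of the search path so its files
    # shadow files with the same language code in later directories.
    def add_custom_path(dir)
      @paths.unshift(dir)
      @cache.clear # previously loaded lists may now be shadowed
    end

    # Load (and memoize) the stopword set for a language. An unknown
    # language yields an empty set, so nothing is filtered out.
    def stopwords_for(language = 'en')
      @cache[language] ||= begin
        dir = @paths.find { |d| File.exist?(File.join(d, language)) }
        dir ? Set.new(File.read(File.join(dir, language)).split) : Set.new
      end
    end
  end
end
```

With this shape, callers that never pass a language keep getting the English list, which is one way to stay backwards compatible while letting a larger custom list shadow the bundled one.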