Ability to add custom stopwords at classifier initialization (#129)
* Ability to add custom stopwords at classifier initialization

* Downcased custom test stopwords

* Documented and improved custom stopwords handling

* Added test cases for custom stopwords and empty trainings, #125 and #130

* Added documentation for auto-categorization and custom stopwords
ibnesayeed authored and Ch4s3 committed Jan 18, 2017
1 parent 006d31a commit 0567c01
Showing 6 changed files with 168 additions and 16 deletions.
55 changes: 50 additions & 5 deletions docs/bayes.md
@@ -69,7 +69,7 @@ classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend

## Beyond the Basics

Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trival applications.
Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trivial applications.
Consider the following program.

```ruby
@@ -80,7 +80,7 @@ require 'classifier-reborn'
training_set = DATA.read.split("\n")
categories = training_set.shift.split(',').map{|c| c.strip}

# pass :auto_categorize option to allow feeding previously unknown categories
# Pass :auto_categorize option to allow feeding previously unknown categories
classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true

training_set.each do |a_line|
@@ -90,7 +90,7 @@ training_set.each do |a_line|
end

puts classifier.classify "I hate bad words and you" #=> 'Uninteresting'
puts classifier.classify "I hate javascript" #=> 'Uninteresting'
puts classifier.classify "I hate JavaScript" #=> 'Uninteresting'
puts classifier.classify "JavaScript is bad" #=> 'Uninteresting'

puts classifier.classify "All you need is ruby" #=> 'Interesting'
@@ -107,7 +107,7 @@ interesting: The love boat, soon we will be taking another ride
interesting: Ruby don't take your love to town
uninteresting: Here are some bad words, I hate you
uninteresting: Bad bad leroy brown badest man in the darn town
uninteresting: Bad bad Leroy Brown badest man in the darn town
uninteresting: The good the bad and the ugly
uninteresting: Java, JavaScript, CSS front-end HTML
#
@@ -119,12 +119,57 @@ dog: A good hunting dog is a fine thing
dog: Man my dogs are tired
dog: Dogs are better than cats in soooo many ways

cat: The fuzz ball spilt the milk
cat: The fuzz ball spilled the milk
cat: Got rats or mice get a cat to kill them
cat: Cats never come when you call them
cat: That dang cat keeps scratching the furniture
```

If no categories are specified at initialization then `:auto_categorize` is set to `true` by default.
However, dynamic methods like `train_some_category` or `untrain_some_category` will not work unless corresponding categories exist.

```ruby
require 'classifier-reborn'

classifier = ClassifierReborn::Bayes.new
classifier.train("cat", "I can has cat")
# The above method will work, but the following will throw an error
# classifier.train_cat "I can has cat"
```
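
The restriction on dynamic methods can be pictured with a small stand-in class. This is an illustrative sketch only, not the `ClassifierReborn::Bayes` implementation: `TinyClassifier` and its internals are hypothetical names, and the error raised here (`NoMethodError`) is an assumption that may differ from the error the library raises.

```ruby
# Illustrative stand-in class; NOT the ClassifierReborn::Bayes implementation.
class TinyClassifier
  attr_reader :categories

  def initialize(*categories)
    # Map each category name to the list of texts trained under it.
    @categories = categories.map { |c| [c.to_s.downcase, []] }.to_h
  end

  # General training method: creates the category on demand.
  def train(category, text)
    (@categories[category.to_s.downcase] ||= []) << text
  end

  # Dynamic train_<category> calls resolve only for categories that exist.
  def method_missing(name, *args)
    match = name.to_s.match(/\Atrain_(\w+)\z/)
    return super unless match && @categories.key?(match[1])

    train(match[1], *args)
  end

  def respond_to_missing?(name, include_private = false)
    match = name.to_s.match(/\Atrain_(\w+)\z/)
    (!match.nil? && @categories.key?(match[1])) || super
  end
end
```

In this sketch, calling `train_cat` on a fresh `TinyClassifier.new` fails because no `cat` category exists yet, while `train('cat', ...)` succeeds and creates the category.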

## Custom Stopwords

The library ships with stopword files in various languages.
However, some domain-specific classifiers need a custom stopword list.
Custom stopwords can be specified at classifier initialization by supplying either an array of stopwords or the path to a stopwords file.
These stopwords apply only to the language of the classifier instance.
To disable stopwords completely, pass an empty string (`""`) or an empty array (`[]`) as the value of the `:stopwords` parameter.

```ruby
require 'classifier-reborn'

custom_stopwords = ["custom", "stop", "words"]
classifier = ClassifierReborn::Bayes.new stopwords: custom_stopwords
# Or from a file
classifier = ClassifierReborn::Bayes.new stopwords: "/path/to/custom/stopwords/file"
# Or to disable stopwords
classifier = ClassifierReborn::Bayes.new stopwords: ""
# Alternatively, to disable stopwords
classifier = ClassifierReborn::Bayes.new stopwords: []
```

Training and untraining with empty strings or strings that consist only of stopwords will be skipped.
An attempt to classify such strings will return `nil` or a category with score `Infinity`, depending on whether the threshold is enabled.
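
The skipping behavior can be sketched without the library. The example below is a minimal illustration under stated assumptions: `SKETCH_STOPWORDS`, `significant_words`, and `train_sketch` are hypothetical names, the tokenizer is a plain ASCII word scan, and no stemming is applied, so it does not reproduce how the library hashes words.

```ruby
require 'set'

# Hypothetical, tiny stopword list used only for this illustration.
SKETCH_STOPWORDS = Set.new(%w[to be or not])

# Split text into lowercase words and drop the stopwords.
def significant_words(text, stopwords = SKETCH_STOPWORDS)
  text.downcase.scan(/[a-z]+/).reject { |word| stopwords.include?(word) }
end

# Count words under the category, unless nothing is left to learn from.
def train_sketch(model, category, text)
  words = significant_words(text)
  return model if words.empty? # skip empty or stopword-only input

  counts = (model[category] ||= Hash.new(0))
  words.each { |word| counts[word] += 1 }
  model
end

model = {}
train_sketch(model, 'Ruby', '')                        # skipped
train_sketch(model, 'Ruby', 'To be or not to be')      # skipped
train_sketch(model, 'Ruby', 'A really sweet language') # trained
```

Because the first two calls return before touching `model`, the `Ruby` category appears only after the third call, which matches the behavior the test cases in this commit check for.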

Supplying the `:stopwords` option overwrites the existing stopwords for the language of the classifier instance.
However, to supplement the existing set of stopwords, more directory paths containing stopword files can be added.
In this case, each stopwords file name needs to be the same as the corresponding language code, such as `en` for English or `ar` for Arabic.


```ruby
ClassifierReborn::Hasher.add_custom_stopword_path("/path/to/additional/stopwords/directory")
```
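
The file naming convention can be illustrated with a standalone sketch that looks for a file named after the language code in each directory. The helper `load_stopwords_sketch` is a hypothetical name, and merging the words from every matching directory is an assumption made for illustration; the library's `Hasher` may resolve multiple directories differently.

```ruby
require 'set'
require 'tmpdir'

# Collect stopwords for a language code from every directory that holds a
# file named after that code (for example "en" or "ar").
# Assumption: matches from all directories are merged.
def load_stopwords_sketch(language, directories)
  directories.each_with_object(Set.new) do |dir, words|
    path = File.join(dir, language)
    next unless File.file?(path)

    words.merge(File.read(path, encoding: 'utf-8').split)
  end
end

Dir.mktmpdir do |dir|
  # One whitespace-separated stopword file per language code.
  File.write(File.join(dir, 'en'), "these\nare\ncustom\nstopwords\n")
  load_stopwords_sketch('en', [dir]) # finds the "en" file
  load_stopwords_sketch('ar', [dir]) # no "ar" file here, so the set is empty
end
```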

## Knowing the Score

When you ask a Bayesian classifier to classify text against a set of trained categories it does so by generating a score (as a Float) for each possible category.
25 changes: 24 additions & 1 deletion lib/classifier-reborn/bayes.rb
@@ -2,6 +2,8 @@
# Copyright:: Copyright (c) 2005 Lucas Carlson
# License:: LGPL

require 'set'

require_relative 'category_namer'
require_relative 'backends/bayes_memory_backend'
require_relative 'backends/bayes_redis_backend'
@@ -16,10 +18,11 @@ class Bayes
#
# Options available are:
# language: 'en' Used to select language specific stop words
# auto_categorize: false When true, enables ability to dynamically declare a category
# auto_categorize: false When true, enables ability to dynamically declare a category; the default is true if no initial categories are provided
# enable_threshold: false When true, enables a threshold requirement for classification
# threshold: 0.0 Default threshold, only used when enabled
# enable_stemmer: true When false, disables word stemming
# stopwords: nil Accepts path to a text file or an array of words, when supplied, overwrites the default stopwords; assign empty string or array to disable stopwords
# backend: BayesMemoryBackend.new Alternatively, BayesRedisBackend.new for persistent storage
def initialize(*args)
initial_categories = []
@@ -51,6 +54,10 @@ def initialize(*args)
initial_categories.each do |c|
add_category(c)
end

if options.key?(:stopwords)
custom_stopwords options[:stopwords]
end
end

# Provides a general training method for all categories specified in Bayes#new
@@ -236,5 +243,21 @@ def add_category(category)
end

alias_method :append_category, :add_category

    private

    # Overwrites the default stopwords for current language with supplied list of stopwords or file
    def custom_stopwords(stopwords)
      unless stopwords.is_a?(Enumerable)
        if stopwords.strip.empty?
          stopwords = []
        elsif File.exist?(stopwords)
          stopwords = File.read(stopwords).force_encoding("utf-8").split
        else
          return # Do not overwrite the default
        end
      end
      Hasher::STOPWORDS[@language] = Set.new stopwords
    end
  end
end
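
The decision logic of `custom_stopwords` can be tried in isolation with the standalone sketch below. It mirrors the branches of the method above but returns the resolved set instead of assigning to `Hasher::STOPWORDS`. The name `resolve_stopwords_sketch` is hypothetical, and the explicit `nil` guard is an addition of this sketch that the method above does not have.

```ruby
require 'set'
require 'tempfile'

# Returns the Set to use as the stopword list, or nil to keep the defaults.
def resolve_stopwords_sketch(option)
  return nil if option.nil?                           # added in this sketch only
  return Set.new(option) if option.is_a?(Enumerable)  # array (or set) of words
  return Set.new if option.strip.empty?               # "" disables stopwords
  return nil unless File.exist?(option)               # unknown path: keep defaults

  # Treat the string as a path to a whitespace-separated stopword file.
  Set.new(File.read(option).force_encoding('utf-8').split)
end

resolve_stopwords_sketch(%w[custom stop words]) # words supplied directly
resolve_stopwords_sketch('')                    # empty set, stopwords disabled

Tempfile.create('stopwords') do |file|
  file.write("these\nare\ncustom\nstopwords\n")
  file.flush
  resolve_stopwords_sketch(file.path)           # words read from the file
end
```

A string is not `Enumerable` in Ruby, which is why the array case and the path case can be told apart with a single `is_a?` check.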
49 changes: 47 additions & 2 deletions test/bayes/bayesian_common_tests.rb
@@ -36,14 +36,14 @@ def test_add_category

def test_dynamic_category_succeeds_with_auto_categorize
classifier = auto_categorize_classifier
classifier.train('Ruby', 'I really sweet language')
classifier.train('Ruby', 'A really sweet language')
assert classifier.categories.include?('Ruby')
end

def test_dynamic_category_succeeds_with_empty_categories
classifier = empty_classifier
assert classifier.categories.empty?
classifier.train('Ruby', 'I really sweet language')
classifier.train('Ruby', 'A really sweet language')
assert classifier.categories.include?('Ruby')
assert_equal 1, classifier.categories.size
end
@@ -133,4 +133,49 @@ def test_untrain
classification_after_untrain = @classifier.classify 'seven'
refute_equal classification_of_bad_data, classification_after_untrain
end

def test_skip_empty_training_and_classification
classifier = empty_classifier
classifier.train('Ruby', '')
assert classifier.categories.empty?
classifier.train('Ruby', 'To be or not to be')
assert classifier.categories.empty?
classifier.train('Ruby', 'A really sweet language')
refute classifier.categories.empty?
assert_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
end

def test_empty_string_stopwords
classifier = empty_string_stopwords_classifier
classifier.train('Stopwords', 'To be or not to be')
refute classifier.categories.empty?
refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
end

def test_empty_array_stopwords
classifier = empty_array_stopwords_classifier
classifier.train('Stopwords', 'To be or not to be')
refute classifier.categories.empty?
refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
end

def test_custom_array_stopwords
classifier = array_stopwords_classifier
classifier.train('Stopwords', 'Custom stopwords')
assert classifier.categories.empty?
classifier.train('Stopwords', 'To be or not to be')
refute classifier.categories.empty?
assert_equal Float::INFINITY, classifier.classify_with_score('These stopwords')[1]
refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
end

def test_custom_file_stopwords
classifier = file_stopwords_classifier
classifier.train('Stopwords', 'Custom stopwords')
assert classifier.categories.empty?
classifier.train('Stopwords', 'To be or not to be')
refute classifier.categories.empty?
assert_equal Float::INFINITY, classifier.classify_with_score('These stopwords')[1]
refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
end
end
21 changes: 21 additions & 0 deletions test/bayes/bayesian_memory_test.rb
@@ -8,6 +8,11 @@ class BayesianMemoryTest < Minitest::Test

def setup
@classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
@old_stopwords = Hasher::STOPWORDS['en']
end

def teardown
Hasher::STOPWORDS['en'] = @old_stopwords
end

def another_classifier
@@ -29,4 +34,20 @@ def empty_classifier
def useless_classifier
ClassifierReborn::Bayes.new auto_categorize: false
end

def empty_string_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: ""
end

def empty_array_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: []
end

def array_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: ["these", "are", "custom", "stopwords"]
end

def file_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: File.dirname(__FILE__) + '/../data/stopwords/en'
end
end
30 changes: 24 additions & 6 deletions test/bayes/bayesian_redis_test.rb
@@ -12,12 +12,14 @@ def setup
@redis_backend.instance_variable_get(:@redis).config(:set, "save", "")
@alternate_redis_backend = ClassifierReborn::BayesRedisBackend.new(db: 1)
@classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend: @redis_backend
@old_stopwords = Hasher::STOPWORDS['en']
rescue Redis::CannotConnectError => e
skip(e)
end
end

def teardown
Hasher::STOPWORDS['en'] = @old_stopwords
@redis_backend.instance_variable_get(:@redis).flushdb
@alternate_redis_backend.instance_variable_get(:@redis).flushdb
end
@@ -34,11 +36,27 @@ def threshold_classifier(category)
ClassifierReborn::Bayes.new category, backend: @alternate_redis_backend
end

def empty_classifier
ClassifierReborn::Bayes.new backend: @alternate_redis_backend
end
def empty_classifier
ClassifierReborn::Bayes.new backend: @alternate_redis_backend
end

def useless_classifier
ClassifierReborn::Bayes.new auto_categorize: false, backend: @alternate_redis_backend
end
def useless_classifier
ClassifierReborn::Bayes.new auto_categorize: false, backend: @alternate_redis_backend
end

def empty_string_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: "", backend: @alternate_redis_backend
end

def empty_array_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: [], backend: @alternate_redis_backend
end

def array_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: ["these", "are", "custom", "stopwords"], backend: @alternate_redis_backend
end

def file_stopwords_classifier
ClassifierReborn::Bayes.new stopwords: File.dirname(__FILE__) + '/../data/stopwords/en', backend: @alternate_redis_backend
end
end
4 changes: 2 additions & 2 deletions test/data/stopwords/en
@@ -1,4 +1,4 @@
These
these
are
custom
stopwords
stopwords
