diff --git a/docs/bayes.md b/docs/bayes.md index 136bcd9..284b68e 100644 --- a/docs/bayes.md +++ b/docs/bayes.md @@ -69,7 +69,7 @@ classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend ## Beyond the Basics -Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trival applications. +Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trivial applications. Consider the following program. ```ruby @@ -80,7 +80,7 @@ require 'classifier-reborn' training_set = DATA.read.split("\n") categories = training_set.shift.split(',').map{|c| c.strip} -# pass :auto_categorize option to allow feeding previously unknown categories +# Pass :auto_categorize option to allow feeding previously unknown categories classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true training_set.each do |a_line| @@ -90,7 +90,7 @@ training_set.each do |a_line| end puts classifier.classify "I hate bad words and you" #=> 'Uninteresting' -puts classifier.classify "I hate javascript" #=> 'Uninteresting' +puts classifier.classify "I hate JavaScript" #=> 'Uninteresting' puts classifier.classify "JavaScript is bad" #=> 'Uninteresting' puts classifier.classify "All you need is ruby" #=> 'Interesting' @@ -107,7 +107,7 @@ interesting: The love boat, soon we will be taking another ride interesting: Ruby don't take your love to town uninteresting: Here are some bad words, I hate you -uninteresting: Bad bad leroy brown badest man in the darn town +uninteresting: Bad bad Leroy Brown badest man in the darn town uninteresting: The good the bad and the ugly uninteresting: Java, JavaScript, CSS front-end HTML # @@ -119,12 +119,57 @@ dog: A good hunting dog is a fine thing dog: Man my dogs are tired dog: Dogs are better than cats in soooo many ways -cat: The fuzz ball spilt the milk +cat: The fuzz ball spilled the milk cat: Got rats or mice get a cat to kill them cat: 
Cats never come when you call them cat: That dang cat keeps scratching the furniture ``` +If no categories are specified at initialization then `:auto_categorize` is set to `true` by default. +However, dynamic methods like `train_some_category` or `untrain_some_category` will not work unless corresponding categories exist. + +```ruby +require 'classifier-reborn' + +classifier = ClassifierReborn::Bayes.new +classifier.train("cat", "I can has cat") +# The above method will work, but the following will throw an error +# classifier.train_cat "I can has cat" +``` + +## Custom Stopwords + +The library ships with stopword files in various languages. +However, in certain situations a custom stopwords list is desired for domain-specific classifiers. +Custom stopwords can be specified at classifier initialization by supplying an array of stopwords or the path to a stopwords file. +These stopwords will only be applied for the language of the classifier instance. +To disable stopwords completely, pass an empty string (`""`) or empty array (`[]`) as the value of the `:stopwords` parameter. + +```ruby +require 'classifier-reborn' + +custom_stopwords = ["custom", "stop", "words"] +classifier = ClassifierReborn::Bayes.new stopwords: custom_stopwords +# Or from a file +classifier = ClassifierReborn::Bayes.new stopwords: "/path/to/custom/stopwords/file" +# Or to disable stopwords +classifier = ClassifierReborn::Bayes.new stopwords: "" +# Alternatively, to disable stopwords +classifier = ClassifierReborn::Bayes.new stopwords: [] +``` + +Training and untraining with empty strings or strings that consist only of stopwords will be skipped. +An attempt to classify such strings will return `nil` or a category with score `Infinity`, depending on whether the threshold is enabled. + +The above method of custom stopwords will overwrite the existing stopwords for the language of the classifier instance. 
+However, to supplement the existing set of stopwords, more directory paths containing stopword files can be added. +In this case, each stopwords file name needs to be the same as the corresponding language code, such as `en` for English or `ar` for Arabic. + + +```ruby +ClassifierReborn::Hasher.add_custom_stopword_path("/path/to/additional/stopwords/directory") +``` + ## Knowing the Score When you ask a Bayesian classifier to classify text against a set of trained categories it does so by generating a score (as a Float) for each possible category. diff --git a/lib/classifier-reborn/bayes.rb b/lib/classifier-reborn/bayes.rb index fd64787..da3014e 100644 --- a/lib/classifier-reborn/bayes.rb +++ b/lib/classifier-reborn/bayes.rb @@ -2,6 +2,8 @@ # Copyright:: Copyright (c) 2005 Lucas Carlson # License:: LGPL +require 'set' + require_relative 'category_namer' require_relative 'backends/bayes_memory_backend' require_relative 'backends/bayes_redis_backend' @@ -16,10 +18,11 @@ class Bayes # # Options available are: # language: 'en' Used to select language specific stop words - # auto_categorize: false When true, enables ability to dynamically declare a category + # auto_categorize: false When true, enables ability to dynamically declare a category; the default is true if no initial categories are provided # enable_threshold: false When true, enables a threshold requirement for classifition # threshold: 0.0 Default threshold, only used when enabled # enable_stemmer: true When false, disables word stemming + # stopwords: nil Accepts a path to a text file or an array of words; when supplied, overwrites the default stopwords. Assign an empty string or array to disable stopwords # backend: BayesMemoryBackend.new Alternatively, BayesRedisBackend.new for persistent storage def initialize(*args) initial_categories = [] @@ -51,6 +54,10 @@ def initialize(*args) initial_categories.each do |c| add_category(c) end + + if options.key?(:stopwords) + custom_stopwords options[:stopwords] + 
end end # Provides a general training method for all categories specified in Bayes#new @@ -236,5 +243,21 @@ def add_category(category) end alias_method :append_category, :add_category + + private + + # Overwrites the default stopwords for current language with supplied list of stopwords or file + def custom_stopwords(stopwords) + unless stopwords.is_a?(Enumerable) + if stopwords.strip.empty? + stopwords = [] + elsif File.exist?(stopwords) + stopwords = File.read(stopwords).force_encoding("utf-8").split + else + return # Do not overwrite the default + end + end + Hasher::STOPWORDS[@language] = Set.new stopwords + end end end diff --git a/test/bayes/bayesian_common_tests.rb b/test/bayes/bayesian_common_tests.rb index 38f26f6..6c86755 100644 --- a/test/bayes/bayesian_common_tests.rb +++ b/test/bayes/bayesian_common_tests.rb @@ -36,14 +36,14 @@ def test_add_category def test_dynamic_category_succeeds_with_auto_categorize classifier = auto_categorize_classifier - classifier.train('Ruby', 'I really sweet language') + classifier.train('Ruby', 'A really sweet language') assert classifier.categories.include?('Ruby') end def test_dynamic_category_succeeds_with_empty_categories classifier = empty_classifier assert classifier.categories.empty? - classifier.train('Ruby', 'I really sweet language') + classifier.train('Ruby', 'A really sweet language') assert classifier.categories.include?('Ruby') assert_equal 1, classifier.categories.size end @@ -133,4 +133,49 @@ def test_untrain classification_after_untrain = @classifier.classify 'seven' refute_equal classification_of_bad_data, classification_after_untrain end + + def test_skip_empty_training_and_classification + classifier = empty_classifier + classifier.train('Ruby', '') + assert classifier.categories.empty? + classifier.train('Ruby', 'To be or not to be') + assert classifier.categories.empty? + classifier.train('Ruby', 'A really sweet language') + refute classifier.categories.empty? 
+ assert_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1] + end + + def test_empty_string_stopwords + classifier = empty_string_stopwords_classifier + classifier.train('Stopwords', 'To be or not to be') + refute classifier.categories.empty? + refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1] + end + + def test_empty_array_stopwords + classifier = empty_array_stopwords_classifier + classifier.train('Stopwords', 'To be or not to be') + refute classifier.categories.empty? + refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1] + end + + def test_custom_array_stopwords + classifier = array_stopwords_classifier + classifier.train('Stopwords', 'Custom stopwords') + assert classifier.categories.empty? + classifier.train('Stopwords', 'To be or not to be') + refute classifier.categories.empty? + assert_equal Float::INFINITY, classifier.classify_with_score('These stopwords')[1] + refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1] + end + + def test_custom_file_stopwords + classifier = file_stopwords_classifier + classifier.train('Stopwords', 'Custom stopwords') + assert classifier.categories.empty? + classifier.train('Stopwords', 'To be or not to be') + refute classifier.categories.empty? 
+ assert_equal Float::INFINITY, classifier.classify_with_score('These stopwords')[1] + refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1] + end end diff --git a/test/bayes/bayesian_memory_test.rb b/test/bayes/bayesian_memory_test.rb index 67a3f00..8077026 100755 --- a/test/bayes/bayesian_memory_test.rb +++ b/test/bayes/bayesian_memory_test.rb @@ -8,6 +8,11 @@ class BayesianMemoryTest < Minitest::Test def setup @classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting' + @old_stopwords = Hasher::STOPWORDS['en'] + end + + def teardown + Hasher::STOPWORDS['en'] = @old_stopwords end def another_classifier @@ -29,4 +34,20 @@ def empty_classifier def useless_classifier ClassifierReborn::Bayes.new auto_categorize: false end + + def empty_string_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: "" + end + + def empty_array_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: [] + end + + def array_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: ["these", "are", "custom", "stopwords"] + end + + def file_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: File.dirname(__FILE__) + '/../data/stopwords/en' + end end diff --git a/test/bayes/bayesian_redis_test.rb b/test/bayes/bayesian_redis_test.rb index ac4a79a..1e0213d 100644 --- a/test/bayes/bayesian_redis_test.rb +++ b/test/bayes/bayesian_redis_test.rb @@ -12,12 +12,14 @@ def setup @redis_backend.instance_variable_get(:@redis).config(:set, "save", "") @alternate_redis_backend = ClassifierReborn::BayesRedisBackend.new(db: 1) @classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend: @redis_backend + @old_stopwords = Hasher::STOPWORDS['en'] rescue Redis::CannotConnectError => e skip(e) end end def teardown + Hasher::STOPWORDS['en'] = @old_stopwords @redis_backend.instance_variable_get(:@redis).flushdb @alternate_redis_backend.instance_variable_get(:@redis).flushdb end @@ -34,11 +36,27 @@ def 
threshold_classifier(category) ClassifierReborn::Bayes.new category, backend: @alternate_redis_backend end - def empty_classifier - ClassifierReborn::Bayes.new backend: @alternate_redis_backend - end + def empty_classifier + ClassifierReborn::Bayes.new backend: @alternate_redis_backend + end - def useless_classifier - ClassifierReborn::Bayes.new auto_categorize: false, backend: @alternate_redis_backend - end + def useless_classifier + ClassifierReborn::Bayes.new auto_categorize: false, backend: @alternate_redis_backend + end + + def empty_string_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: "", backend: @alternate_redis_backend + end + + def empty_array_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: [], backend: @alternate_redis_backend + end + + def array_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: ["these", "are", "custom", "stopwords"], backend: @alternate_redis_backend + end + + def file_stopwords_classifier + ClassifierReborn::Bayes.new stopwords: File.dirname(__FILE__) + '/../data/stopwords/en', backend: @alternate_redis_backend + end end diff --git a/test/data/stopwords/en b/test/data/stopwords/en index 271c6a6..e5d3723 100644 --- a/test/data/stopwords/en +++ b/test/data/stopwords/en @@ -1,4 +1,4 @@ -These +these are custom -stopwords \ No newline at end of file +stopwords
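The branching in the new private `custom_stopwords` method can be easier to follow when pulled out on its own. The following is a minimal standalone sketch, not the library's code: `resolve_stopwords` is a hypothetical helper name, and unlike the patched method it returns the resulting set instead of assigning to `Hasher::STOPWORDS[@language]`. It assumes the same three cases the diff handles: an enumerable of words, a blank string, and a path to an existing file.

```ruby
require 'set'
require 'tempfile'

# Hypothetical helper (not part of classifier-reborn) that mirrors the
# decision logic of the patched Bayes#custom_stopwords.
# Returns a Set of stopwords, or nil when the defaults should be kept.
def resolve_stopwords(stopwords)
  # An array (or any Enumerable) of words is used as-is.
  return Set.new(stopwords) if stopwords.is_a?(Enumerable)
  # A blank string disables stopwords entirely.
  return Set.new if stopwords.strip.empty?
  # A string that is not an existing file path leaves the defaults untouched.
  return nil unless File.exist?(stopwords)

  # Otherwise read the file and split it on whitespace.
  Set.new(File.read(stopwords).force_encoding('utf-8').split)
end

# Usage: write a temporary stopwords file and load it back.
Tempfile.create('stopwords') do |file|
  file.write("these\nare\ncustom\nstopwords")
  file.flush
  p resolve_stopwords(file.path)
end
```

Because a Ruby `String` is not `Enumerable`, the first check cleanly separates word lists from strings. Note that a string naming a nonexistent file silently keeps the default stopwords rather than raising an error, which matches the `return # Do not overwrite the default` branch in the diff.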