Spam Filter
You can create a Naive Bayes Classifier to filter the spam emails by following steps:
- Load the spam dataset by clicking the "Open file..." button.
- Select different attributes, and review the information in the right section.
- Remove the capital_run_length_average, capital_run_length_longest, and capital_run_length_total attributes.
- The remaining attributes represent relative frequencies (Numeric type) of various salient words and characters in emails. We will convert these to Boolean values (Nominal type) instead. To do this, click Choose button in the Filter frame and pick filters -> unsupervised -> attribute -> NumericToBinary. By clicking Apply button, all the numeric attributes are converted to Booleans. Each email is now represented by a 55 dimensional vector representing whether or not a particular word exists in an email.
- You can save this filtered dataset to a new file by clicking Save... button.
- Go to the Classify tab and click Choose button, then select the naive bayes model under classifiers -> bayes -> NaiveBayes.
- Use Percentage split and leave the default settings, then click Start to build the classifier. Study the result in the right section, most importantly the percentages of correctly and incorrectly classified instance.
- Examine the classifier models produced by WEKA.
- Next, we would classify a test data by using this classifier. Download the spam_test.arff. Under the Classify tab, select supplied test set and load the test file by clicking "Set..." -> "Open file...". Finally, run the Naive Bayes Classifier on this test set.