SPAM CLASSIFIER OUTLOOK ADD-IN
================================================
FULL PROJECT CODE AND INSTALLATION INSTRUCTIONS CAN BE FOUND ON GITHUB AT https://github.com/Ochism/InfoSecProject
This application is a spam email classifier that protects its users from potentially harmful
phishing or spam emails. It is implemented as a Microsoft Outlook Add-in that runs whenever the user
receives new mail. The Add-in classifies the mail as either SPAM or NOT SPAM, prepends the classification
and its confidence to the email's subject, then moves the email into the appropriate folder (Inbox or WatsonSpam).
COMPONENTS
================================================
Classifiers
Two classifiers were trained using IBM Watson's Natural Language Classifier service. Both were
trained on an online corpus of 4327 emails, split into 80% training data and 20% testing data.
One classifier classifies the subjects of emails and the other classifies the email
bodies.
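As an illustration of the 80/20 split described above, here is a minimal Python sketch. It is not the code
from the ClassifierCreation directory: the function name split_corpus, the shuffle, the seed, and the
rounding rule are all assumptions, and the README does not state the exact train/test counts that were used.

```python
import random


def split_corpus(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists.

    Hypothetical helper: the seed and the round-down rule are assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:cut], shuffled[cut:]


# Placeholder integer IDs stand in for the 4327 corpus emails.
train, test = split_corpus(range(4327))
print(len(train), len(test))  # prints: 3461 866
```

With this rounding rule, 4327 emails yield 3461 training items and 866 testing items; a different rounding
rule would shift one item between the two sets.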
Subject Classifier
- 92.96% Accuracy
- 97.79% Average Confidence
Body Classifier
- 94.77% Accuracy
- 95.55% Average Confidence
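The two figures reported for each classifier can be computed along the following lines. This Python sketch
assumes that "Average Confidence" means the mean confidence of the predicted class over the test emails;
the README does not define it. The function name evaluate and the toy data are made up for illustration
and are not results from this project.

```python
def evaluate(predictions, true_labels):
    """Return (accuracy, average_confidence) for a list of predictions.

    predictions: list of (predicted_label, confidence) pairs.
    true_labels: list of expected labels, in the same order.
    """
    if not predictions or len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must be non-empty and equal in length")
    correct = sum(
        1 for (label, _), truth in zip(predictions, true_labels) if label == truth
    )
    accuracy = correct / len(predictions)
    average_confidence = sum(conf for _, conf in predictions) / len(predictions)
    return accuracy, average_confidence


# Toy data only: four invented predictions, three of which are correct.
toy_predictions = [("SPAM", 0.99), ("NOT SPAM", 0.95), ("SPAM", 0.90), ("NOT SPAM", 0.96)]
toy_truth = ["SPAM", "NOT SPAM", "NOT SPAM", "NOT SPAM"]
accuracy, average_confidence = evaluate(toy_predictions, toy_truth)
print(f"{accuracy:.2%} accuracy, {average_confidence:.2%} average confidence")
```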
Kurtis Kuszmaul created, trained, and tested these classifiers. The code for these steps
can be found in the ClassifierCreation directory.
The email corpus can be found at http://www.csmining.org/index.php/spam-email-datasets-.html
Outlook Add-in
The Outlook Add-in runs in the background of Outlook and fires whenever new mail is received. It locates the
new mail item, extracts the subject and body from it, then classifies those two text fields using the
classifiers described above. The confidences of the two classifications are weighted and compared to determine a
final classification and confidence level. The classification and confidence percentage are prepended to the
subject, then the appropriate action is taken on the email.
Subject Class = Body Class
- The classification is based on the weighted sum of the subject and body classifier confidences
- Requires 85% confidence to keep the classification
Subject Class != Body Class
- The classification with the higher of the two weighted confidences is taken
- Requires 95% confidence to keep the classification
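The two rules above can be sketched in Python as follows. This is an illustration only, not the add-in's
code from the SpamClassifier directory. The 85% and 95% thresholds and the folder names come from this
README; everything else is an assumption: the weights 0.4 and 0.6, the choice to compare the winner's
unweighted confidence against 95% when the classes disagree, the subject tag format, and the names
combine, tag_subject, and target_folder. The README does not say what happens when a threshold is missed,
so the sketch only reports a "kept" flag.

```python
SPAM = "SPAM"
NOT_SPAM = "NOT SPAM"

# Hypothetical weights: the README does not give the real values.
SUBJECT_WEIGHT = 0.4
BODY_WEIGHT = 0.6

# Thresholds taken from the README.
AGREE_THRESHOLD = 0.85
DISAGREE_THRESHOLD = 0.95


def combine(subject_class, subject_conf, body_class, body_conf):
    """Return (label, confidence, kept) for one email."""
    if subject_class == body_class:
        # Classes agree: use the weighted sum of both confidences.
        label = subject_class
        confidence = SUBJECT_WEIGHT * subject_conf + BODY_WEIGHT * body_conf
        threshold = AGREE_THRESHOLD
    else:
        # Classes disagree: the higher weighted confidence wins.
        # Assumption: the winner's unweighted confidence is checked against 95%.
        if SUBJECT_WEIGHT * subject_conf >= BODY_WEIGHT * body_conf:
            label, confidence = subject_class, subject_conf
        else:
            label, confidence = body_class, body_conf
        threshold = DISAGREE_THRESHOLD
    return label, confidence, confidence >= threshold


def tag_subject(subject, label, confidence):
    """Prepend the label and confidence to the subject (tag format is hypothetical)."""
    return f"[{label} {confidence:.2%}] {subject}"


def target_folder(label):
    """Map a final classification to the folder names given in the README."""
    return "WatsonSpam" if label == SPAM else "Inbox"


label, confidence, kept = combine(SPAM, 0.90, SPAM, 0.95)
if kept:
    print(tag_subject("Win a prize", label, confidence), "->", target_folder(label))
```

Returning the "kept" flag separately leaves the fallback behavior to the caller instead of guessing it.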
Gregory Ochs, Ethan Knez, and Kurtis Kuszmaul designed and developed this Outlook Add-in.
Code for the add-in can be found in the SpamClassifier directory.
EXTERNAL COMPONENTS (NOT DEVELOPED BY TEAM)
================================================
IBM Watson Natural Language Classifier - https://github.com/watson-developer-cloud/dotnet-standard-sdk/tree/development/src/IBM.WatsonDeveloperCloud.NaturalLanguageClassifier.v1
- Used for custom classifications of text
Visual Studio Tools for Office - https://docs.microsoft.com/en-us/visualstudio/vsto/programming-vsto-add-ins
- Used for integrating custom Add-in functionality with Microsoft Outlook
CONTRIBUTORS
================================================
- Ethan Knez
- Kurtis Kuszmaul
- Gregory Ochs