review-classification

A script in both Python and R to classify reviews with an NLP model from NLTK and tm.

Why TSV instead of CSV?

Its likely that reviews could include commas, effectively ruining the way we read and write to CSV files. A TSV file means tab-separated, meaning that it uses the TAB as a delimiter over commas. This is a good choice, as its unlikely for us to see TABs in reviews.

What is Bag of Words?

Its a method we use to extract all the words from the training set, and turning each of them into columns for use in the NLP model. Before we do this, its necessary to "clean the text" in order to get to a base text that makes something like 'love, lovely, and loving' have just one column.

Fitting the models

For classification, we tested a Naive Bayes model against a Random Forest model, both in Python and R. We found that the Random Forest model attained a better accuracy over the Naive Bayes model, as the confusion matrix scored them 80% vs 73% respectively.

The dataset was obtained from superdatascience.com

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md
Restaurant_Reviews.tsv		Restaurant_Reviews.tsv
my.R		my.R
my.py		my.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

review-classification

Why TSV instead of CSV?

What is Bag of Words?

Fitting the models

About

Releases

Packages

Languages

Vishnu-Adda/review-classification

Folders and files

Latest commit

History

Repository files navigation

review-classification

Why TSV instead of CSV?

What is Bag of Words?

Fitting the models

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages