This repository contains my approach to sentiment analysis on the Amazon Reviews Polarity dataset. The workflow covers data preprocessing, downsampling, exploratory data analysis, building and training a model, and evaluating the model.
The full dataset contains 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, collected by the Stanford Network Analysis Project (SNAP). The polarity subset used here contains 1,800,000 training samples and 200,000 testing samples per sentiment class.
The Amazon Reviews Polarity dataset is constructed by taking reviews with scores 1 and 2 as negative and scores 4 and 5 as positive; reviews with score 3 are ignored. In the dataset, class 1 is negative and class 2 is positive, and each class has 1,800,000 training samples and 200,000 testing samples.
The files train.csv and test.csv contain the training and testing samples as comma-separated values. Each file has 3 columns: class index (1 or 2), review title, and review text. The review title and text are escaped using double quotes ("), any internal double quote is escaped by two double quotes (""), and new lines are escaped by a backslash followed by an "n" character, that is "\n".
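A minimal sketch of loading these files with pandas; the column names below are illustrative assumptions, since the CSV files ship without a header row.

```python
import pandas as pd

# Assumed column names for the 3 header-less columns described above.
cols = ["class_index", "review_title", "review_text"]
train = pd.read_csv("train.csv", names=cols, header=None)
test = pd.read_csv("test.csv", names=cols, header=None)

# Class 1 is negative and class 2 is positive, as defined by the dataset.
train["label"] = train["class_index"].map({1: "negative", 2: "positive"})
print(train.head())
```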
For preprocessing, the following `gpp` transformations are applied to the entire dataset (a sketch of the full pipeline follows this list):
- Removing punctuation (`gpp.strip_punctuation`): punctuation is stripped so that each word's sentiment is treated equally.
- Removing stopwords (`gpp.remove_stopwords`): words such as "the", "for", and "in" that convey no sentiment to the model are removed.
- Removing extra whitespace (`gpp.strip_multiple_whitespaces`): redundant whitespace is collapsed to save memory.
- Removing short words (`gpp.strip_short`): words with length < 3 are removed.
- Stripping HTML tags (`gpp.strip_tags`): HTML tags do not aid sentiment analysis and are removed.
- Removing numerics (`gpp.strip_numeric`): numbers do not necessarily convey any sentiment.
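A minimal sketch of the preprocessing pipeline listed above, assuming `gpp` is an alias for `gensim.parsing.preprocessing` (which provides exactly these filter functions); the filter order and the sample text are illustrative.

```python
import gensim.parsing.preprocessing as gpp

# The filters from the list above, applied one after another.
FILTERS = [
    gpp.strip_tags,                  # strip HTML tags
    gpp.strip_punctuation,           # remove punctuation
    gpp.strip_numeric,               # drop numbers
    gpp.remove_stopwords,            # drop stopwords such as "the", "for", "in"
    gpp.strip_short,                 # drop words shorter than 3 characters
    gpp.strip_multiple_whitespaces,  # collapse extra whitespace
]

def clean_review(text: str) -> str:
    """Apply each gensim filter in turn and return the cleaned review text."""
    for f in FILTERS:
        text = f(text)
    return text

print(clean_review("<b>Great</b> product!!! Bought 2 of them, works for me."))
```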
Since the full dataset contains 34,686,770 Amazon reviews, I downsampled the data, as it is not feasible to train and evaluate on the whole dataset. The downsampling keeps an equal number of samples from each class.
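A hedged sketch of class-balanced downsampling with pandas; the per-class sample size below is illustrative, not the size used in this project.

```python
import pandas as pd

cols = ["class_index", "review_title", "review_text"]
train = pd.read_csv("train.csv", names=cols, header=None)

n_per_class = 50_000  # illustrative value, not the actual downsampled size
downsampled = (
    train.groupby("class_index")        # keep the class distribution balanced
         .sample(n=n_per_class, random_state=42)
         .reset_index(drop=True)
)
print(downsampled["class_index"].value_counts())
```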
For exploratory data analysis I used TextBlob to compute sentiment polarity scores and found a high correlation between the TextBlob sentiment of a review and the class to which the review belongs.
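A minimal sketch of the TextBlob-based scoring used in the EDA; the example reviews are illustrative.

```python
from textblob import TextBlob

reviews = [
    "This product is fantastic, works exactly as described.",
    "Terrible quality, broke after one day.",
]
for text in reviews:
    # polarity ranges from -1.0 (most negative) to +1.0 (most positive)
    polarity = TextBlob(text).sentiment.polarity
    print(f"{polarity:+.2f}  {text}")
```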
Since individual words alone cannot define a user's sentiment, and combinations of words can convey different meanings, it is difficult to perform sentiment analysis well with classical ML models. Many models can generalise for sentiment analysis, but I found that BERT Experts, which have been fine-tuned specifically for sentiment-related tasks, performed better than the other models I tried.
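A hedged sketch of building a classifier on top of a BERT Expert from TensorFlow Hub; the specific model handles (the `wiki_books/sst2` expert and the `bert_en_uncased` preprocessor) and the classifier head are assumptions, not necessarily the exact setup used in this repository.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the ops required by the preprocessor)

# Assumed TF Hub handles for a sentiment-tuned BERT Expert and its matching preprocessor.
preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer("https://tfhub.dev/google/experts/bert/wiki_books/sst2/2",
                         trainable=True)

inputs = tf.keras.layers.Input(shape=(), dtype=tf.string, name="review_text")
encoder_outputs = encoder(preprocessor(inputs))
pooled = encoder_outputs["pooled_output"]          # sentence-level embedding
x = tf.keras.layers.Dropout(0.1)(pooled)
outputs = tf.keras.layers.Dense(1, activation="sigmoid", name="sentiment")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```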
- I did not train the model on the entire dataset.
- Data augmentation was not applied.
- Efficient feature engineering could be done on the data to help the model generalise better.
- Data augmentation could be applied to improve performance and accuracy.
- Increasing the batch size and training for more iterations could also help.