mukul13/Kaggle---Bag-of-Words-Meets-Bag-of-Popcorns-using-Word2vec-in-R

Kaggle Bag of Words Meets Bag of Popcorns using Word2vec in R

An entry to the Kaggle competition "Bag of Words Meets Bag of Popcorns", using word2vec in R.

The competition data is available on the Kaggle competition page.

#### Packages needed:

  • rword2vec
  • Rcpp and RcppArmadillo
  • rpart and randomForest
  • tm

#### Code Explanation:

  • Word vectors are obtained using the rword2vec package.
  • The binary output file is converted to a text file for further processing.
  • To build a training dataset for sentiment classification of the reviews from these word vectors, two popular methods can be used:
  1. Vector Averaging
  2. Clustering
  • In the first method, the word vectors are averaged for each row (review) of the labeled and test datasets. There are many ways to do this; I used Rcpp and RcppArmadillo (an R interface to C++) to speed up these compute-intensive operations.
  • In clustering, a bag of centroids is used instead of a bag of words: each review is represented by counts of word clusters rather than counts of individual words. This part is also done with Rcpp and RcppArmadillo for speed.
  • Finally, classification is done using a random forest.
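The vector averaging step can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the Rcpp/RcppArmadillo code in this repository. It assumes the converted text file follows the usual word2vec text layout (a `<vocab_size> <dim>` header line, then one `<word> <v1> <v2> ...` line per word). The toy vectors and the helper names `load_text_vectors` and `average_vector` are hypothetical.

```python
# Hypothetical toy data in the usual word2vec text layout:
# a "<vocab_size> <dim>" header, then "<word> <v1> <v2> ..." per line.
TEXT_VECTORS = """3 2
good 1.0 0.0
bad -1.0 0.0
movie 0.0 2.0
"""


def load_text_vectors(text):
    """Parse word2vec text-format content into a {word: [floats]} dict."""
    lines = text.strip().splitlines()
    _vocab_size, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def average_vector(words, vectors, dim):
    """Average the vectors of the words found in the vocabulary.

    Words without a vector are skipped; a review with no known words
    gets an all-zero feature vector.
    """
    total = [0.0] * dim
    count = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


vectors, dim = load_text_vectors(TEXT_VECTORS)
# One feature row for one (already tokenized) review.
features = average_vector("good movie unknownword".split(), vectors, dim)
```

Each review becomes one fixed-length row, so the labeled and test datasets turn into ordinary numeric matrices that a classifier such as a random forest can consume.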

#### Note: I'd recommend reading this Python tutorial series first for a better understanding of vector averaging and clustering.
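For orientation, the bag-of-centroids idea can be sketched as follows. This is a plain Python illustration, not the Rcpp code in this repository. It assumes the word vectors have already been clustered (for example with k-means), so that every vocabulary word maps to a cluster index. The mapping `WORD_TO_CLUSTER` and the helper `bag_of_centroids` are hypothetical.

```python
def bag_of_centroids(words, word_to_cluster, num_clusters):
    """Count how many words of a review fall into each word cluster."""
    counts = [0] * num_clusters
    for word in words:
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # skip words outside the vocabulary
            counts[cluster] += 1
    return counts


# Hypothetical clustering result: cluster 0 ~ positive words,
# cluster 1 ~ negative words, cluster 2 ~ film-related words.
WORD_TO_CLUSTER = {"good": 0, "great": 0, "bad": 1, "awful": 1, "movie": 2}

# One feature row for one (already tokenized) review.
features = bag_of_centroids("great movie good plot".split(), WORD_TO_CLUSTER, 3)
```

The feature vector has one entry per cluster instead of one per word, which is what makes it a bag of centroids rather than a bag of words.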

#### Test dataset results:

[Image: classification using vector averaging]

[Image: classification using clustering]

#### Results:

  • The accuracy obtained with vector averaging and with bag of centroids is above their respective thresholds, but it is still low.
  • Accuracy can be improved by using different machine learning algorithms such as GBM, xgboost, or neural networks, and by using techniques such as stacking, blending, and bagging.
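As a rough illustration of the last point, one simple form of blending is to average the predicted probabilities of several models before thresholding. The sketch below is plain Python and is not part of this repository. The probabilities are made-up placeholders, not results from this project, and `blend_probabilities` is a hypothetical helper.

```python
def blend_probabilities(prob_lists, weights=None):
    """Weighted average of per-review probabilities from several models."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # equal weights by default
    total_weight = sum(weights)
    blended = []
    for row in zip(*prob_lists):  # one tuple of model outputs per review
        blended.append(sum(w * p for w, p in zip(weights, row)) / total_weight)
    return blended


# Placeholder positive-sentiment probabilities from two hypothetical models.
model_a = [0.75, 0.5, 0.25]
model_b = [0.25, 0.75, 0.0]

blended = blend_probabilities([model_a, model_b])
labels = [1 if p >= 0.5 else 0 for p in blended]
```

Whether this helps depends on how different the models' errors are, so it would need to be checked on held-out data.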
