Binary Sentiment Analysis on movie reviews
We replicate the features used in the original paper. Each movie review is featurized as a bag of words, where each feature is the number of times a particular word occurs in the review. Any single review contains only a small fraction of the vocabulary, so although the feature vector has one dimension per word (74481 in this case), each review corresponds to a very sparse vector in this space.
Classifiers implemented (all from scratch except the neural network):
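As a rough illustration of the bag-of-words featurization described above, here is a minimal sketch using only the Python standard library. The function names (`build_vocabulary`, `featurize`), the whitespace tokenizer, and the toy reviews are assumptions made for this example; they are not taken from this repository or from the original paper.

```python
from collections import Counter


def build_vocabulary(reviews):
    """Map each distinct lowercase token to a column index (hypothetical helper)."""
    vocab = {}
    for text in reviews:
        for token in text.lower().split():
            # Assign the next free index the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def featurize(text, vocab):
    """Return a sparse count vector as {column index: count}.

    Only non-zero entries are stored; tokens outside the vocabulary are ignored.
    """
    counts = Counter(t for t in text.lower().split() if t in vocab)
    return {vocab[token]: n for token, n in counts.items()}


# Toy data, invented for illustration only.
reviews = ["a great great movie", "a dull movie"]
vocab = build_vocabulary(reviews)
vectors = [featurize(r, vocab) for r in reviews]
```

Storing only the non-zero counts keeps each review small even when the vocabulary has tens of thousands of entries.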
- Margin Perceptron: accuracy of 87.3% on testing data, 87% on validation data
- Average Perceptron: accuracy of 87.6% on testing data, 88.4% on validation data
- SVM: accuracy of 88.89% on testing data, 88.2% on validation data
- Logistic Regression: accuracy of 86.8% on testing data, 86.9% on validation data
- Naive Bayes: assumes a Gaussian likelihood for each feature, accuracy of 81.6% on testing data, 80.8% on validation data
- Neural Network: accuracy of 86.3% on testing data, 86.2% on validation data
All classifiers comply with scikit-learn's estimator API, so they can be used with scikit-learn's cross-validation and grid-search utilities. This allows multi-core training and cross-validation.
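To show what scikit-learn compatibility looks like in practice, here is a minimal sketch of a from-scratch classifier plugged into `GridSearchCV`. The class `TinyPerceptron`, its hyperparameters, and the toy data are hypothetical and written for this example; they are not the implementations in this repository.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV


class TinyPerceptron(ClassifierMixin, BaseEstimator):
    """Hypothetical from-scratch perceptron for labels in {0, 1}."""

    def __init__(self, n_epochs=5, learning_rate=1.0):
        # scikit-learn expects __init__ to do nothing but store hyperparameters.
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        signs = np.where(y == 1, 1.0, -1.0)  # map {0, 1} to {-1, +1}
        self.w_ = np.zeros(X.shape[1])
        self.b_ = 0.0
        for _ in range(self.n_epochs):
            for x_i, s_i in zip(X, signs):
                # Update only on misclassified (or zero-margin) examples.
                if s_i * (x_i @ self.w_ + self.b_) <= 0:
                    self.w_ += self.learning_rate * s_i * x_i
                    self.b_ += self.learning_rate * s_i
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (X @ self.w_ + self.b_ > 0).astype(int)


# Toy, linearly separable data invented for illustration.
X = np.array([[2, 0], [3, 1], [2, 1], [3, 0], [0, 2], [1, 3], [0, 3], [1, 2]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

search = GridSearchCV(
    TinyPerceptron(),
    param_grid={"n_epochs": [1, 5, 20]},
    cv=2,
    n_jobs=1,  # n_jobs=-1 would ask joblib to use all available cores
)
search.fit(X, y)
predictions = search.predict(X)
```

Because the class implements `fit` and `predict`, keeps its hyperparameters as plain `__init__` arguments, and inherits `score` from `ClassifierMixin`, the grid search can clone it, evaluate each setting with cross-validation, and refit the best one.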
The Kaggle competition gives more details about the dataset. The dataset can also be found at http://ai.stanford.edu/~amaas/data/sentiment/