Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Large Movie Review dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes (positive or negative sentiment).
INTRODUCTION: The Large Movie Review Dataset is a collection of movie reviews used in the research paper “Learning Word Vectors for Sentiment Analysis” by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, presented at the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). The dataset comprises 25,000 highly polar movie reviews for training and 25,000 for testing.
In iteration Take1, we constructed a bag-of-words model and analyzed it with a simple TensorFlow deep learning network. Due to the system’s memory limitation, we had to break up the script processing into two parts. Part A evaluated the model on the training dataset using five-fold cross-validation. Part B trained the final model with the entire training dataset and made predictions on a previously unseen test dataset.
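The two ideas in the Take1 description, bag-of-words encoding and five-fold cross-validation, can be illustrated with a minimal sketch in plain Python. This is not the project's script: the helper names (`build_vocabulary`, `bag_of_words`, `k_fold_indices`), the whitespace tokenizer, and the toy reviews are all assumptions made for illustration, and the actual project fed its features to a TensorFlow network.

```python
from collections import Counter


def build_vocabulary(documents, max_words):
    """Map the max_words most frequent tokens to column indices."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}


def bag_of_words(document, vocabulary):
    """Encode one document as a vector of token counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for tok in document.lower().split():
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Toy reviews (hypothetical, not taken from the dataset).
reviews = ["a great movie", "a dull movie", "great acting great plot"]
vocab = build_vocabulary(reviews, max_words=10)
vectors = [bag_of_words(r, vocab) for r in reviews]

# With 25,000 training reviews, each fold validates on 5,000 and trains on 20,000.
folds = list(k_fold_indices(25000, k=5))
```

In practice the rows would usually be shuffled before splitting so that each fold contains a mix of both classes; the contiguous split here only keeps the sketch short.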
This Take2 iteration will construct a word-embedding model and analyze it with a simple TensorFlow deep learning network.
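As a rough illustration of what a word-embedding representation does differently from bag-of-words, the sketch below encodes a review as a padded sequence of integer ids, looks each id up in an embedding table, and averages the resulting vectors. It is a framework-free approximation, not the project's TensorFlow model: the reserved ids, the helper names, the toy vocabulary, and the random (untrained) table are assumptions, and in a real network the table entries would be learned during training.

```python
import random

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens (assumed convention)


def encode(document, word_index, max_len):
    """Map tokens to integer ids, then truncate or right-pad to max_len."""
    ids = [word_index.get(tok, UNK) for tok in document.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def init_embeddings(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random vectors; the PAD row is zero."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]
    table[PAD] = [0.0] * dim
    return table


def average_embedding(ids, table):
    """Look up each non-padding id and average the vectors into one document vector."""
    rows = [table[i] for i in ids if i != PAD]
    if not rows:
        return [0.0] * len(table[0])
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy vocabulary (hypothetical); ids 0 and 1 are reserved above.
word_index = {"great": 2, "movie": 3, "dull": 4}
table = init_embeddings(vocab_size=5, dim=4)

ids = encode("A great movie", word_index, max_len=6)  # "a" is out of vocabulary
doc_vector = average_embedding(ids, table)
```

Unlike a count vector, whose length equals the vocabulary size, the document vector here has a fixed, small dimension regardless of how many distinct words the corpus contains.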
ANALYSIS: In iteration Take1, the bag-of-words model achieved an average accuracy of 87.18% after 20 epochs under five-fold cross-validation. The final model then scored 85.24% accuracy on the test dataset.
In this Take2 iteration, the word-embedding model achieved an average accuracy of 88.21% after 10 epochs under five-fold cross-validation. The final model then scored 85.84% accuracy on the test dataset.
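To put the reported figures side by side, the short calculation below derives the gap between cross-validation and test accuracy for each iteration, along with the change from Take1 to Take2. The accuracy values are the ones reported above; the dictionary layout and variable names are only illustrative.

```python
# Reported accuracies in percent, taken from the analysis above.
results = {
    "Take1": {"cv": 87.18, "test": 85.24, "epochs": 20},
    "Take2": {"cv": 88.21, "test": 85.84, "epochs": 10},
}

# Gap between cross-validation accuracy and test accuracy, in percentage points.
gaps = {name: round(r["cv"] - r["test"], 2) for name, r in results.items()}

# Change from Take1 to Take2, in percentage points.
cv_gain = round(results["Take2"]["cv"] - results["Take1"]["cv"], 2)
test_gain = round(results["Take2"]["test"] - results["Take1"]["test"], 2)

for name, gap in gaps.items():
    print(f"{name}: CV minus test = {gap:.2f} points")
print(f"Take2 vs Take1: CV {cv_gain:+.2f} points, test {test_gain:+.2f} points")
```

The word-embedding model gains about one point in cross-validation and 0.60 points on the test set while training for half as many epochs, although its gap between cross-validation and test accuracy is slightly wider (2.37 versus 1.94 points).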
CONCLUSION: In this modeling iteration, the word-embedding TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.
Dataset Used: Large Movie Review Dataset
Dataset ML Model: Binary class text classification with text-oriented features
Dataset Reference: https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.bib
One potential source of performance benchmarks: https://ai.stanford.edu/~amaas/data/sentiment/ and https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf
The HTML-formatted report is available on GitHub.