Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Large Movie Review dataset poses a binary classification problem in which we predict whether a review's sentiment is positive or negative.
INTRODUCTION: The Large Movie Review Dataset is a collection of movie reviews used in the research paper “Learning Word Vectors for Sentiment Analysis” by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). The dataset comprises 25,000 highly polar movie reviews for training and 25,000 for testing.
This Take1 iteration will construct a bag-of-words model and analyze it with a simple TensorFlow deep learning network. Due to the system’s memory limitation, we had to split the script into two parts. Part A will evaluate the model on the training dataset using five-fold cross-validation. Part B will train the model on the entire training dataset and make predictions on a previously unseen test dataset.
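The two data-preparation ideas in Part A, bag-of-words encoding and splitting the training data into five folds, can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual script: the function names, the whitespace tokenizer, and the toy reviews are all assumptions made for the example, and the TensorFlow network itself is left out.

```python
from collections import Counter


def build_vocabulary(documents, max_words=None):
    """Map the most frequent tokens to column indices (hypothetical helper)."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}


def to_bag_of_words(document, vocabulary):
    """Encode one document as a vector of token counts; word order is discarded."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        idx = vocabulary.get(token)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vector[idx] += 1
    return vector


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Toy reviews standing in for the real dataset (assumed for illustration only).
reviews = ["a great great film", "a dull film", "great cast", "dull plot", "fine film"]
vocabulary = build_vocabulary(reviews)
features = [to_bag_of_words(review, vocabulary) for review in reviews]
folds = list(k_fold_indices(len(reviews), k=5))
```

In each fold, a model would be trained on the `train` indices and scored on the `validation` indices, and the five scores would then be averaged. The sketch does not shuffle the samples, so a real run would likely shuffle them before splitting.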
ANALYSIS: In this Take1 iteration, the baseline model achieved an average accuracy of 87.18% after 20 epochs with ten iterations of cross-validation. Furthermore, the final model scored an accuracy of 85.24% on the test dataset.
CONCLUSION: In this modeling iteration, the bag-of-words TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.
Dataset Used: Large Movie Review Dataset
Dataset ML Model: Binary class text classification with text-oriented features
Dataset Reference: https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.bib
One potential source of performance benchmarks: https://ai.stanford.edu/~amaas/data/sentiment/ and https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf
The HTML-formatted report can be found on GitHub.