Tag: natural language processing

NLP Model for IMDB Movie Sentiment Using TensorFlow Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The IMDB Movie Sentiment dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: This dataset contains 50,000 movie reviews extracted from IMDB. The researchers have annotated the reviews with labels (0 = negative, 1 = positive) to indicate each review’s sentiment.

From iteration Take1, we created a bag-of-words model to perform binary classification (positive or negative) for the reviews. Due to memory capacity constraints, we split the work into two scripts. Part A focused on building the model with the training and validation datasets. Part B focused on testing the model with the training and test datasets.

In this Take2 iteration, we will create a word-embedding model to perform binary classification for the reviews.
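As a rough illustration of what a word-embedding input looks like, the sketch below integer-encodes a review, pads it to a fixed length, and averages the looked-up vectors. It is a minimal standard-library sketch and not the project's actual TensorFlow code: the names encode and embed_and_average, the toy word_index, and the randomly filled embedding_table are all assumptions made for illustration. In a trained model the table values would be learned rather than random.

```python
import random

def encode(document, word_index, max_length):
    """Integer-encode tokens (0 = padding, 1 = unknown), then truncate or pad."""
    ids = [word_index.get(token, 1) for token in document.lower().split()]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))

def embed_and_average(ids, embedding_table):
    """Look up one vector per non-padding token and average them."""
    vectors = [embedding_table[i] for i in ids if i != 0]
    if not vectors:
        return [0.0] * len(embedding_table[0])
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Hypothetical vocabulary; ids 0 and 1 are reserved for padding and unknown.
word_index = {"great": 2, "movie": 3, "dull": 4}

# Random stand-in for a learned embedding table (one row per id).
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(word_index) + 2)
]

ids = encode("A great movie", word_index, max_length=5)
features = embed_and_average(ids, embedding_table)
```

The averaged vector has a fixed size regardless of review length, which is what allows a dense classification layer to follow the embedding step.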

ANALYSIS: From iteration Take1, the preliminary model’s performance achieved an accuracy score of 88.80% on the validation dataset after ten epochs. Furthermore, the final model processed the test dataset with an accuracy measurement of 89.48%.

In this Take2 iteration, the preliminary model’s performance achieved an average accuracy score of 88.40% on the validation dataset after ten epochs. Furthermore, the final model processed the test dataset with an accuracy measurement of 89.66%.

CONCLUSION: In this iteration, the word-embedding TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: IMDB Movie Sentiment

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format

One potential source of performance benchmarks: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format

The HTML formatted report can be found here on GitHub.

NLP Model for IMDB Movie Sentiment Using TensorFlow Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The IMDB Movie Sentiment dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: This dataset contains 50,000 movie reviews extracted from IMDB. The researchers have annotated the reviews with labels (0 = negative, 1 = positive) to indicate each review’s sentiment.

In this Take1 iteration, we will create a bag-of-words model to perform binary classification (positive or negative) for the reviews. Due to memory capacity constraints, we will split the work into two scripts. Part A will focus on building the model with the training and validation datasets. Part B will focus on testing the model with the training and test datasets.
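To make the bag-of-words idea concrete, the sketch below turns each review into a vector of token counts over a shared vocabulary. It is a minimal standard-library sketch under simple assumptions (lowercasing and whitespace tokenization), not the project's actual preprocessing. The function names build_vocabulary and to_bag_of_words and the two sample reviews are made up for illustration.

```python
from collections import Counter

def build_vocabulary(documents, max_words=None):
    """Map the most frequent tokens to column indices."""
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    most_common = counts.most_common(max_words)
    return {token: index for index, (token, _) in enumerate(most_common)}

def to_bag_of_words(document, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector

# Made-up reviews, used only to exercise the functions.
reviews = ["a great great movie", "a dull movie"]
vocabulary = build_vocabulary(reviews)
vectors = [to_bag_of_words(review, vocabulary) for review in reviews]
```

Each vector has one column per vocabulary word, so memory use grows with both the number of reviews and the vocabulary size. Capping the vocabulary with max_words is one common way to keep that in check.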

ANALYSIS: In this Take1 iteration, the preliminary model’s performance achieved an average accuracy score of 88.80% after 25 epochs with ten iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 89.48%.

CONCLUSION: In this iteration, the bag-of-words TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: IMDB Movie Sentiment

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format

One potential source of performance benchmarks: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format

The HTML formatted report can be found here on GitHub.

NLP Model for Large Movie Review Using TensorFlow Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Large Movie Review dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: The Large Movie Review Dataset is a collection of movie reviews used in the research paper “Learning Word Vectors for Sentiment Analysis” by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). The dataset comprises 25,000 highly polar movie reviews for training and 25,000 for testing.

From iteration Take1, we constructed a bag-of-words model and analyzed it with a simple TensorFlow deep learning network. Due to the system’s memory limitation, we had to break up the script processing into two parts. Part A tested the model with the training dataset using five-fold cross-validation. Part B trained the final model with the entire training dataset and made predictions on a previously unseen test dataset.

This Take2 iteration will construct a word-embedding model and analyze it with a simple TensorFlow deep learning network.

ANALYSIS: From iteration Take1, the bag-of-words model’s performance achieved an average accuracy score of 87.18% after 20 epochs with five iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 85.24%.

In this Take2 iteration, the word-embedding model’s performance achieved an average accuracy score of 88.21% after 10 epochs with five iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 85.84%.

CONCLUSION: In this modeling iteration, the word-embedding TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: Large Movie Review Dataset

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.bib

One potential source of performance benchmarks: https://ai.stanford.edu/~amaas/data/sentiment/ and https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf

The HTML formatted report can be found here on GitHub.

NLP Model for Large Movie Review Using TensorFlow Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Large Movie Review dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: The Large Movie Review Dataset is a collection of movie reviews used in the research paper “Learning Word Vectors for Sentiment Analysis” by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). The dataset comprises 25,000 highly polar movie reviews for training and 25,000 for testing.

This Take1 iteration will construct a bag-of-words model and analyze it with a simple TensorFlow deep learning network. Due to the system’s memory limitation, we have to break up the script processing into two parts. Part A will test the model with the training dataset using five-fold cross-validation. Part B will train the model with the entire training dataset and make predictions on a previously unseen test dataset.
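The five-fold procedure in Part A can be sketched as an index split: the training set is shuffled, cut into five folds, and each fold takes one turn as the validation set while the other four are used for training. This is a minimal standard-library sketch, not the project's actual code. The function name k_fold_indices, the seed, and the tiny sample count are assumptions for illustration.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        validation = folds[i]
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, validation

# 25 samples stand in for the 25,000 training reviews.
splits = list(k_fold_indices(n_samples=25, n_folds=5))
```

A model would be trained from scratch on each training list and scored on the matching validation list. The reported figure is then the mean of the five validation accuracies.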

ANALYSIS: In this Take1 iteration, the baseline model’s performance achieved an average accuracy score of 87.18% after 20 epochs with five iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 85.24%.

CONCLUSION: In this modeling iteration, the bag-of-words TensorFlow model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: Large Movie Review Dataset

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.bib

One potential source of performance benchmarks: https://ai.stanford.edu/~amaas/data/sentiment/ and https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf

The HTML formatted report can be found here on GitHub.

NLP Model for Disaster Tweets Classification Using TensorFlow Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a text classification model using a neural network and document the end-to-end steps using a template. The Disaster Tweets Classification dataset presents a binary classification problem in which we attempt to predict one of two possible outcomes.

INTRODUCTION: Twitter has become an important communication channel in times of emergency. The ubiquitous nature of smartphones enables people to announce an emergency they are observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter. In this practice Kaggle competition, we want to build a machine learning model that predicts which Tweets are about real disasters and which ones are not. This dataset was created by Figure-Eight and shared initially on their ‘Data for Everyone’ website.

From iteration Take1, we deployed a bag-of-words model to classify the Tweets. We also made predictions on Kaggle’s test dataset and submitted the results for evaluation.

In this Take2 iteration, we will deploy a word-embedding model to classify the Tweets. We will also submit the test predictions to Kaggle and obtain the performance score for the model.
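Preparing the test predictions for submission can be sketched as follows. This is a minimal standard-library sketch and not the project's actual code. It assumes the competition expects a CSV file with an id column and a 0/1 target column, and that the model outputs a probability that is thresholded at 0.5. The function name write_submission and the sample ids and probabilities are made up for illustration.

```python
import csv
import io

def write_submission(ids, probabilities, out_file, threshold=0.5):
    """Write one 'id,target' row per Tweet, thresholding each probability."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "target"])  # assumed submission header
    for tweet_id, probability in zip(ids, probabilities):
        writer.writerow([tweet_id, int(probability >= threshold)])

# Made-up ids and probabilities; an in-memory buffer stands in for a file.
buffer = io.StringIO()
write_submission([0, 2, 3], [0.91, 0.12, 0.50], buffer)
submission_text = buffer.getvalue()
```

For a real submission, out_file would be a file opened with open(path, "w", newline="") so that the csv module controls the line endings.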

ANALYSIS: From iteration Take1, the bag-of-words model’s performance achieved an average accuracy score of 75.49% after 20 epochs with five iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 75.02%.

In this Take2 iteration, the word-embedding model’s performance achieved an average accuracy score of 72.45% after 20 epochs with five iterations of cross-validation. Furthermore, the final model processed the test dataset with an accuracy measurement of 74.65%.

CONCLUSION: In this modeling iteration, the word-embedding TensorFlow model did not do as well as the bag-of-words model. However, we should continue to experiment with both natural language processing techniques for further modeling.

Dataset Used: Disaster Tweets Classification

Dataset ML Model: Binary class text classification with text-oriented features

Dataset Reference: https://www.kaggle.com/c/nlp-getting-started/

The HTML formatted report can be found here on GitHub.