Tag: scikit-learn

Regression Tabular Model for Kaggle Playground Series Season 3 Episode 8 Using Python and Scikit-Learn

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Playground Series Season 3 Episode 8 Dataset is a regression modeling situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle aims to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, Kaggle has hosted playground-style competitions to give the community a variety of reasonably lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The dataset for this competition was generated from a deep learning model trained on the Gemstone Price Prediction dataset. Feature distributions are close to, but different from, the original.

ANALYSIS: The machine learning algorithms achieved an average RMSE benchmark of 894 after the preliminary training runs. We selected the Random Forest Regressor as the final model because it achieved an RMSE of 600 on the training dataset. When we evaluated the final model on the test dataset, it achieved an RMSE of 603.
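
The following is a minimal sketch of the kind of workflow described above: fitting a Random Forest Regressor with scikit-learn and scoring it with RMSE on a held-out split. The file name (train.csv) and column names ('id', 'price') are assumptions for illustration, not details taken from the report.

# Sketch: Random Forest regression with a held-out RMSE estimate.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # assumed competition training file
X = df.drop(columns=["id", "price"])  # assumed id and target column names
y = df["price"]

# Hold out a validation split to approximate the test-set evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# RMSE computed as the square root of MSE for broad scikit-learn compatibility.
rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"Validation RMSE: {rmse:,.0f}")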

CONCLUSION: In this iteration, the Random Forest model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Playground Series Season 3, Episode 8

Dataset ML Model: Regression with numerical features

Dataset Reference: https://www.kaggle.com/competitions/playground-series-s3e8

One source of potential performance benchmarks: https://www.kaggle.com/competitions/playground-series-s3e8/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Class Tabular Model for Kaggle Playground Series Season 3 Episode 7 Using Python and Scikit-Learn

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Playground Series Season 3 Episode 7 dataset is a binary-class modeling situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: Kaggle aims to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, Kaggle has hosted playground-style competitions to give the community a variety of reasonably lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The dataset for this competition was generated from a deep learning model trained on the Reservation Cancellation Prediction dataset. Feature distributions are close to, but different from, the original.

ANALYSIS: The machine learning algorithms achieved an average ROC/AUC benchmark of 0.8470 after training. We selected Random Forest as the final model because it achieved a ROC/AUC score of 0.8891 on the training dataset. When we evaluated the final model on the test dataset, it achieved a ROC/AUC score of 0.8716.
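
The following is a minimal sketch of training and scoring a Random Forest classifier with ROC/AUC, as described above. The file name (train.csv) and column names ('id', 'booking_status') are assumptions for illustration, not details taken from the report.

# Sketch: Random Forest classification scored with ROC/AUC on a held-out split.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # assumed competition training file
X = df.drop(columns=["id", "booking_status"])  # assumed id and target columns
y = df["booking_status"]

# Stratify the split so both classes appear in similar proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# ROC/AUC is computed from the predicted probability of the positive class.
val_proba = model.predict_proba(X_val)[:, 1]
print(f"Validation ROC/AUC: {roc_auc_score(y_val, val_proba):.4f}")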

CONCLUSION: In this iteration, the Random Forest model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Playground Series Season 3, Episode 7

Dataset ML Model: Binary-Class classification with numerical features

Dataset Reference: https://www.kaggle.com/competitions/playground-series-s3e7

One source of potential performance benchmarks: https://www.kaggle.com/competitions/playground-series-s3e7/leaderboard

The HTML formatted report can be found here on GitHub.

Regression Tabular Model for Kaggle Playground Series Season 3 Episode 6 Using Python and Scikit-Learn

NOTE: This notebook was produced using Amazon’s SageMaker Studio Lab. I re-created it to compare the results with an earlier version created using Google’s Colab.

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Playground Series Season 3 Episode 6 Dataset is a regression modeling situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle aims to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, Kaggle has hosted playground-style competitions to give the community a variety of reasonably lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The dataset for this competition was generated from a deep learning model trained on the Paris Housing Price Prediction dataset. Feature distributions are close to, but different from, the original.

ANALYSIS: The machine learning algorithms achieved an average RMSE benchmark of 454,623 after the preliminary training runs. We selected the Random Forest Regressor as the final model because it achieved an RMSE of 133,540 on the training dataset. When we evaluated the final model on the test dataset, it achieved an RMSE of 225,268.
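
The following is a minimal sketch of how an average RMSE benchmark across several algorithms might be produced with cross-validation before picking a final model. The candidate algorithms, file name (train.csv), and column names ('id', 'price') are assumptions for illustration, not the exact set used in the report.

# Sketch: spot-checking several regressors with cross-validated RMSE.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("train.csv")  # assumed competition training file
X = df.drop(columns=["id", "price"])  # assumed id and target column names
y = df["price"]

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42, n_jobs=-1),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # Scores are negative RMSE, so flip the sign for reporting.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:,.0f}")

print(f"Average RMSE across algorithms: {np.mean(list(cv_rmse.values())):,.0f}")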

CONCLUSION: In this iteration, the Random Forest model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Playground Series Season 3, Episode 6

Dataset ML Model: Regression with numerical features

Dataset Reference: https://www.kaggle.com/competitions/playground-series-s3e6

One source of potential performance benchmarks: https://www.kaggle.com/competitions/playground-series-s3e6/leaderboard

The HTML formatted report can be found on GitHub at https://github.com/daines-analytics/tabular-data-projects/tree/master/py_regression_kaggle_playground_series_s3e06.

Regression Tabular Model for Kaggle Playground Series Season 3 Episode 6 Using Python and Scikit-Learn

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Playground Series Season 3 Episode 6 Dataset is a regression modeling situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle aims to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, Kaggle has hosted playground-style competitions to give the community a variety of reasonably lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The dataset for this competition was generated from a deep learning model trained on the Paris Housing Price Prediction dataset. Feature distributions are close to, but different from, the original.

ANALYSIS: The machine learning algorithms achieved an average RMSE benchmark of 454,623 after the preliminary training runs. We selected the Random Forest Regressor as the final model because it achieved an RMSE of 133,540 on the training dataset. When we evaluated the final model on the test dataset, it achieved an RMSE of 225,268.
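
The following is a hypothetical sketch of tuning a Random Forest Regressor with cross-validated RMSE. The report does not state which hyperparameters, if any, were searched, so the grid below is illustrative only; the file name (train.csv) and column names ('id', 'price') are also assumptions.

# Sketch: grid search over a small Random Forest parameter grid, scored by RMSE.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("train.csv")  # assumed competition training file
X = df.drop(columns=["id", "price"])  # assumed id and target column names
y = df["price"]

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:,.0f}")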

CONCLUSION: In this iteration, the Random Forest model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Playground Series Season 3, Episode 6

Dataset ML Model: Regression with numerical features

Dataset Reference: https://www.kaggle.com/competitions/playground-series-s3e6

One source of potential performance benchmarks: https://www.kaggle.com/competitions/playground-series-s3e6/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Class Tabular Model for Kaggle Playground Series Season 3 Episode 4 Using Python and Scikit-Learn

SUMMARY: The project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Playground Series Season 3 Episode 4 dataset is a binary-class modeling situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: Kaggle aims to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, Kaggle has hosted playground-style competitions to give the community a variety of reasonably lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The dataset for this competition was generated from a deep learning model trained on the Credit Card Fraud Detection dataset. Feature distributions are close to, but different from, the original.

ANALYSIS: The machine learning algorithms achieved an average ROC/AUC benchmark of 0.8760 after training. We selected Extra Trees as the final model because it achieved a ROC/AUC score of 0.9100 on the training dataset. When we evaluated the final model on the test dataset, it achieved a ROC/AUC score of 0.8072.
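
The following is a minimal sketch of estimating ROC/AUC for an Extra Trees classifier with stratified cross-validation, which is a reasonable setup for an imbalanced fraud-style target. The file name (train.csv) and column names ('id', 'Class') are assumptions for illustration, not details taken from the report.

# Sketch: cross-validated ROC/AUC for an Extra Trees classifier.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("train.csv")  # assumed competition training file
X = df.drop(columns=["id", "Class"])  # assumed id and target column names
y = df["Class"]

model = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Stratified folds keep the positive/negative ratio stable in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc", n_jobs=-1)
print(f"Mean CV ROC/AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")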

CONCLUSION: In this iteration, the Extra Trees model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Playground Series Season 3, Episode 4

Dataset ML Model: Binary-Class classification with numerical features

Dataset Reference: https://www.kaggle.com/competitions/playground-series-s3e4

One source of potential performance benchmarks: https://www.kaggle.com/competitions/playground-series-s3e4/leaderboard

The HTML formatted report can be found here on GitHub.