Binary Classification Model for Customer Transaction Prediction Using Python (XGBoost Tuning Batch #2)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the eXtreme Gradient Boosting algorithm with the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: We applied different values to the max_depth, min_child_weight, subsample, and colsample_bytree parameters while keeping n_estimators fixed (at 1000 or 100). The max_depth values tested were 10, 15, 20, and 25. The min_child_weight values varied between 3 and 5 with different learning rates. The subsample and colsample_bytree values varied from 0.6 to 1.0. The following output files are available for comparison.

  • py-classification-santander-kaggle-XGB-take11
  • py-classification-santander-kaggle-XGB-take12
  • py-classification-santander-kaggle-XGB-take13
  • py-classification-santander-kaggle-XGB-take14
  • py-classification-santander-kaggle-XGB-take15
  • py-classification-santander-kaggle-XGB-take16
  • py-classification-santander-kaggle-XGB-take17
  • py-classification-santander-kaggle-XGB-take18
  • py-classification-santander-kaggle-XGB-take19
  • py-classification-santander-kaggle-XGB-take21
  • py-classification-santander-kaggle-XGB-take22
  • py-classification-santander-kaggle-XGB-take23
  • py-classification-santander-kaggle-XGB-take24
  • py-classification-santander-kaggle-XGB-take25
  • py-classification-santander-kaggle-XGB-take26
  • py-classification-santander-kaggle-XGB-take27
  • py-classification-santander-kaggle-XGB-take28
  • py-classification-santander-kaggle-XGB-take29

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (XGBoost Tuning Batch #1)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the eXtreme Gradient Boosting algorithm with the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: We applied different values for the max_depth, learning_rate, and n_estimators parameters. The max_depth values vary from 3 to 6. The learning_rate values vary from 0.1 to 0.5. The n_estimators values vary from 1000 to 4000. The following output files are available for comparison.

  • py-classification-santander-kaggle-XGB-take1
  • py-classification-santander-kaggle-XGB-take2
  • py-classification-santander-kaggle-XGB-take3
  • py-classification-santander-kaggle-XGB-take4
  • py-classification-santander-kaggle-XGB-take5
  • py-classification-santander-kaggle-XGB-take6
  • py-classification-santander-kaggle-XGB-take7-part1
  • py-classification-santander-kaggle-XGB-take7-part2
  • py-classification-santander-kaggle-XGB-take7-part3
  • py-classification-santander-kaggle-XGB-take7-part4
  • py-classification-santander-kaggle-XGB-take7-part5
  • py-classification-santander-kaggle-XGB-take8-part1
  • py-classification-santander-kaggle-XGB-take8-part2
  • py-classification-santander-kaggle-XGB-take8-part3
  • py-classification-santander-kaggle-XGB-take8-part4
  • py-classification-santander-kaggle-XGB-take8-part5
  • py-classification-santander-kaggle-XGB-take9-part1
  • py-classification-santander-kaggle-XGB-take9-part2
  • py-classification-santander-kaggle-XGB-take9-part3
  • py-classification-santander-kaggle-XGB-take9-part4
  • py-classification-santander-kaggle-XGB-take9-part5
  • py-classification-santander-kaggle-XGB-take10-part1
  • py-classification-santander-kaggle-XGB-take10-part2
  • py-classification-santander-kaggle-XGB-take10-part3
  • py-classification-santander-kaggle-XGB-take10-part4
  • py-classification-santander-kaggle-XGB-take10-part5

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Kaggle Competition: Banco Santander Customer Transaction Prediction Update 3

If you are new to Python machine learning like me, you might find the current Kaggle competition “Santander Customer Transaction Prediction” interesting.

The competition is essentially a binary classification problem with a decently large dataset (200 attributes and 200,000 rows of training data). I have not participated in a Kaggle competition before and will use this one to get some learning under my belt.

I ran the training data through a list of machine learning algorithms (see below) and iterated through four stages. This blog post will serve as the meta post that summarizes the progress.

The current plan and milestones are as follows:

Stage 1: Gather the Baseline Performance.

  • LogisticRegression: completed and posted on Monday 25 February 2019
  • DecisionTreeClassifier: completed and posted on Wednesday 27 February 2019
  • KNeighborsClassifier: completed and posted on Friday 1 March 2019
  • BaggingClassifier: completed and posted on Sunday 3 March 2019
  • RandomForestClassifier: completed and posted on Monday 4 March 2019
  • ExtraTreesClassifier: completed and posted on Wednesday 6 March 2019
  • GradientBoostingClassifier: completed and posted on Friday 8 March 2019

Stage 2: Feature Selection using the Attribute Importance Ranking technique

  • LogisticRegression: completed and posted on Monday 11 March 2019
  • BaggingClassifier: completed and posted on Wednesday 13 March 2019
  • RandomForestClassifier: completed and posted on Friday 15 March 2019
  • ExtraTreesClassifier: completed and posted on Sunday 17 March 2019
  • GradientBoostingClassifier: completed and posted on Monday 18 March 2019
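
For readers unfamiliar with the attribute-importance-ranking technique named in Stage 2, here is an illustrative sketch on synthetic data (not the actual competition script): rank the attributes by a tree ensemble's importance scores, then keep only those above the median importance.

```python
# Sketch of feature selection via attribute importance ranking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=7)

# Rank attributes by importance, best first
ranker = ExtraTreesClassifier(n_estimators=100, random_state=7).fit(X, y)
ranking = np.argsort(ranker.feature_importances_)[::-1]
print("top 5 attributes:", ranking[:5])

# Keep only the attributes whose importance exceeds the median importance
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=7),
    threshold="median").fit(X, y)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```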

Stage 3: Over-Sampling (SMOTE) and Balancing Ensembles techniques

  • LogisticRegression: completed and posted on Wednesday 20 March 2019
  • ExtraTreesClassifier: completed and posted on Friday 22 March 2019
  • RandomForestClassifier: completed and posted on Monday 25 March 2019
  • GradientBoostingClassifier: completed and posted on Wednesday 27 March 2019
  • Balanced Bagging: completed and posted on Friday 29 March 2019
  • Balanced Boosting: completed and posted on Sunday 31 March 2019
  • Balanced Random Forest: completed and posted on Monday 1 April 2019
  • XGBoost with Full Feature: completed and posted on Wednesday 3 April 2019
  • XGBoost with SMOTE: completed and posted on Friday 5 April 2019

Stage 4: eXtreme Gradient Boosting Tuning Batches

  • Batch #1: planned for Monday 8 April 2019
  • Batch #2: planned for Wednesday 10 April 2019

I have posted all Python scripts here on GitHub. The final submission deadline is 10 April 2019.

Feel free to take a look at the scripts and experiment. Who knows, you might have something you can turn in by the time April comes around. Happy learning and good luck!

Binary Classification Model for Customer Transaction Prediction Using Python (eXtreme Gradient Boosting with SMOTE)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the eXtreme Gradient Boosting algorithm with the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.9129. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.9584. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.6560.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (eXtreme Gradient Boosting with Full Features)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the eXtreme Gradient Boosting (XGBoost) algorithm with the full set of features for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.8315. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8908. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.6496.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.