Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Metro Interstate Traffic Volume dataset is a regression problem where we are trying to predict the value of a continuous variable.
INTRODUCTION: This dataset captures the hourly measurements of westbound Interstate 94 traffic volume at MN DoT ATR station 301. The station is roughly midway between Minneapolis and St Paul, MN. The dataset also includes hourly weather and holiday attributes for assessing their impact on traffic volume.
In iteration Take1, we established the baseline root mean squared error (RMSE) without much feature engineering. That round of modeling also excluded the date-time and weather description attributes.
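As a rough illustration, the Take1 baseline might be set up as in the sketch below; the file name, column names, and the 10-fold cross-validation scheme are assumptions based on the UCI dataset page, not a verbatim copy of the project script.

# Take1 baseline sketch: drop the date-time and weather description
# attributes, one-hot encode the remaining categorical features, and
# measure a cross-validated RMSE. File and column names are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
df = df.drop(columns=['date_time', 'weather_description'])

X = pd.get_dummies(df.drop(columns=['traffic_volume']))
y = df['traffic_volume']

scores = cross_val_score(GradientBoostingRegressor(), X, y,
                         scoring='neg_root_mean_squared_error', cv=10)
print('Baseline RMSE: %.0f' % -scores.mean())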
In iteration Take2, we included the timestamp feature and observed its effect on the prediction accuracy.
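A minimal sketch of the Take2 change, assuming the date_time column from the UCI description and deriving simple calendar attributes from it:

# Take2 sketch: expand the date_time stamp into calendar features the
# models can consume directly. The derived attributes are assumptions.
import pandas as pd

df = pd.read_csv('Metro_Interstate_Traffic_Volume.csv',
                 parse_dates=['date_time'])
df['hour'] = df['date_time'].dt.hour
df['day_of_week'] = df['date_time'].dt.dayofweek
df['month'] = df['date_time'].dt.month
df = df.drop(columns=['date_time'])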
In iteration Take3, we re-engineered (scaled and/or discretized) the weather-related features and observed their effect on the prediction accuracy.
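A sketch of the Take3 re-engineering, continuing from the frame above; the choice of scaler, the bin count, and the temp and clouds_all column names are assumptions:

# Take3 sketch: scale the temperature reading and discretize the
# cloud-cover percentage into ordinal bins. Column names are assumptions.
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

df[['temp']] = MinMaxScaler().fit_transform(df[['temp']])
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
df[['clouds_all']] = binner.fit_transform(df[['clouds_all']])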
In this iteration, we will re-engineer (scale and/or
binarize) the holiday and weather-related features and observe their effect on
the prediction accuracy.
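A sketch of this iteration's re-engineering, again continuing from the frame above. Treating the 'None' marker as a non-holiday follows the UCI description; the zero thresholds for rain and snow are assumptions:

# Take4 sketch: binarize the holiday attribute and the precipitation
# readings into simple indicator features. Thresholds are assumptions.
df['is_holiday'] = (df['holiday'] != 'None').astype(int)
df['has_rain'] = (df['rain_1h'] > 0).astype(int)
df['has_snow'] = (df['snow_1h'] > 0).astype(int)
df = df.drop(columns=['holiday', 'rain_1h', 'snow_1h'])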
ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 2099. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result with an RMSE of 1895. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an RMSE of 1899, slightly worse than the result from the training data.
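Each iteration's tuning step followed the same pattern; a hedged sketch of it, using GridSearchCV with the X and y frames prepared in the earlier sketches (the parameter grid and the 80/20 split are illustrative assumptions, not the settings actually used):

# Illustrative tuning sketch: search a small parameter grid with
# cross-validated RMSE, then score the held-out test set. The grid
# values and split ratio are assumptions.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)
grid = GridSearchCV(GradientBoostingRegressor(random_state=7),
                    param_grid={'n_estimators': [100, 300, 500],
                                'max_depth': [3, 5, 7]},
                    scoring='neg_root_mean_squared_error', cv=10)
grid.fit(X_train, y_train)
rmse = mean_squared_error(y_test, grid.predict(X_test)) ** 0.5
print('Test RMSE: %.0f' % rmse)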
From iteration Take2, the baseline performance of the machine learning algorithms achieved an average RMSE of 972. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result with an RMSE of 480. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an RMSE of 479, slightly better than the result from the training data.
By including the date_time information and related attributes, the machine learning models predicted significantly better, with a much lower RMSE.
From iteration Take3, the baseline performance of the machine learning algorithms achieved an average RMSE of 814. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result with an RMSE of 474. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an RMSE of 472, slightly better than the result from the training data.
Re-engineering the weather-related features improved the average performance of all models. However, the changes appeared to have little impact on the performance of the ensemble algorithms.
In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 814. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result with an RMSE of 411. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an RMSE of 407, slightly better than the result from the training data.
Re-engineering the holiday and remaining weather-related features improved the average performance of all models over the baseline. Moreover, the changes appeared to have a further positive impact on the performance of the Random Forest ensemble algorithm.
CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, Random Forest could be considered for further modeling.
Dataset Used: Metro Interstate Traffic Volume Data Set
Dataset ML Model: Regression with numerical and categorical
attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume
One potential source of performance benchmarks:
https://www.kaggle.com/ramyahr/metro-interstate-traffic-volume
The HTML-formatted report can be found here on GitHub.