
Web Scraping of O’Reilly Velocity Conference 2019 San Jose Using R

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping R code leverages the rvest package.

INTRODUCTION: The Velocity Conference covers the full range of skills, approaches, and technologies for building and managing large-scale, cloud-native systems. This web scraping script automatically traverses the proceedings page, collects all links to PDF and PPTX documents, and downloads those documents as part of the scraping process.

Starting URLs: https://conferences.oreilly.com/velocity/vl-ca-2019/public/schedule/proceedings
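As an illustration of the approach, here is a minimal rvest sketch (not the exact project code) that reads the proceedings page and collects the PDF and PPTX links; the variable names are illustrative.

```r
# Minimal sketch: collect PDF/PPTX links from the proceedings page with rvest.
library(rvest)

proceedings_url <- "https://conferences.oreilly.com/velocity/vl-ca-2019/public/schedule/proceedings"

# Read the page and pull the href attribute from every anchor tag
page  <- read_html(proceedings_url)
links <- html_attr(html_nodes(page, "a"), "href")

# Keep only links that point to PDF or PPTX documents
doc_links <- links[!is.na(links) & grepl("\\.(pdf|pptx)$", links, ignore.case = TRUE)]
doc_links
```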

The source code and HTML output can be found here on GitHub.

Multi-Class Classification Model for Sensorless Drive Diagnosis Using R Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Sensorless Drive Diagnosis dataset is a multi-class classification situation where we are trying to predict one of several possible outcomes.

INTRODUCTION: The dataset contains features extracted from electric current drive signals. The drive has both intact and defective components. The signals fall into 11 different classes representing different drive conditions. Each condition has been measured several times under 12 different operating conditions, such as speeds, load moments, and load forces.

In iteration Take1, we established the baseline accuracy measurement for comparison with future rounds of modeling.

In this iteration, we will standardize the numeric attributes and observe the impact of scaling on modeling accuracy.
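A minimal sketch of this standardization step with caret's preProcess is shown below; the data frame name (sensorless_df) and the assumption that the class label is the last column are illustrative, not taken from the report.

```r
# Minimal sketch: standardize the numeric predictors (mean 0, standard deviation 1).
library(caret)

predictors <- sensorless_df[, -ncol(sensorless_df)]   # assumes the label is the last column
label      <- sensorless_df[, ncol(sensorless_df)]    # assumed to be a factor column

preproc           <- preProcess(predictors, method = c("center", "scale"))
sensorless_scaled <- cbind(predict(preproc, predictors), label = label)
```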

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.53%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was comparable to the accuracy achieved with the training data.

In this iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.34%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was comparable to the accuracy achieved with the training data.

With the dataset features standardized, the ensemble algorithms continued to perform well. However, standardization appeared to have little impact on the overall modeling accuracy.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, Random Forest could be considered for further modeling.

Dataset Used: Sensorless Drive Diagnosis Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Sensorless Drive Diagnosis Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Sensorless Drive Diagnosis dataset is a multi-class classification situation where we are trying to predict one of several possible outcomes.

INTRODUCTION: The dataset contains features extracted from electric current drive signals. The drive has both intact and defective components. The signals fall into 11 different classes representing different drive conditions. Each condition has been measured several times under 12 different operating conditions, such as speeds, load moments, and load forces.

In this iteration, we will establish the baseline accuracy measurement for comparison with future rounds of modeling.
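A baseline of this kind is typically established by spot-checking several algorithms under the same cross-validation scheme. The caret sketch below illustrates the idea; the data frame and column names (sensorless_df, label) are assumptions, not the exact code from the report.

```r
# Minimal sketch: spot-check two candidate algorithms with 10-fold cross-validation.
library(caret)

control <- trainControl(method = "cv", number = 10)

set.seed(888)
fit_rf  <- train(label ~ ., data = sensorless_df, method = "rf",
                 metric = "Accuracy", trControl = control)
set.seed(888)
fit_gbm <- train(label ~ ., data = sensorless_df, method = "gbm",
                 metric = "Accuracy", trControl = control, verbose = FALSE)

# Compare cross-validated accuracy across the candidates
summary(resamples(list(RF = fit_rf, GBM = fit_gbm)))
```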

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 85.53%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was comparable to the accuracy achieved with the training data.
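The tuning trials can be sketched roughly as a grid search over Random Forest's mtry parameter with caret; the grid values below are illustrative, not the settings used in the report.

```r
# Minimal sketch: tune Random Forest's mtry with 10-fold cross-validation in caret.
library(caret)

control <- trainControl(method = "cv", number = 10)
grid    <- expand.grid(mtry = c(6, 12, 24))   # illustrative values only

set.seed(888)
fit_rf_tuned <- train(label ~ ., data = sensorless_df, method = "rf",
                      metric = "Accuracy", tuneGrid = grid, trControl = control)
print(fit_rf_tuned)
```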

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, Random Forest could be considered for further modeling.

Dataset Used: Sensorless Drive Diagnosis Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

The HTML formatted report can be found here on GitHub.

Web Scraping of O’Reilly Software Architecture Conference 2019 San Jose Using R

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping R code leverages the rvest package.

INTRODUCTION: The Software Architecture Conference covers the full range of topics in the software architecture discipline, including leadership and business skills, product management, and domain-driven design. This web scraping script automatically traverses the proceedings page, collects all links to PDF and PPTX documents, and downloads those documents as part of the scraping process.

Starting URLs: https://conferences.oreilly.com/software-architecture/sa-ca-2019/public/schedule/proceedings
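The download step can be sketched as follows; doc_links is assumed to hold the links collected by the scraping code, and relative links are resolved against an assumed base URL.

```r
# Minimal sketch: download each collected PDF/PPTX link to the working directory.
# doc_links comes from the link-collection step; base_url is an assumption.
base_url <- "https://conferences.oreilly.com"

for (link in doc_links) {
  file_url  <- ifelse(grepl("^https?://", link), link, paste0(base_url, link))
  dest_file <- basename(file_url)
  # mode = "wb" keeps binary documents intact on Windows
  download.file(file_url, destfile = dest_file, mode = "wb")
}
```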

The source code and HTML output can be found here on GitHub.

Regression Model for Metro Interstate Traffic Volume Using R Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Metro Interstate Traffic Volume dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset captures the hourly westbound traffic volume on Interstate 94 at MN DoT ATR station 301, roughly midway between Minneapolis and St. Paul, MN. The dataset also includes hourly weather and holiday attributes for assessing their impact on traffic volume.

In iteration Take1, we established the baseline root mean squared error (RMSE) with minimal feature engineering. That round of modeling also did not include the date-time and weather description attributes.

In iteration Take2, we included the timestamp feature and observed its effect on the prediction accuracy.

In iteration Take3, we re-engineered (scaled and/or discretized) the weather-related features and observed their effect on the prediction accuracy.

In this iteration, we will re-engineer (scale and/or binarize) the holiday and weather-related features and observe their effect on the prediction accuracy.
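A minimal sketch of this re-engineering step is shown below, assuming a data frame named traffic_df with the column names from the UCI page (holiday, temp, rain_1h, snow_1h, clouds_all); treat the names and the specific transformations as illustrative rather than the exact code from the report.

```r
# Minimal sketch: binarize the holiday attribute and standardize the numeric weather attributes.
library(dplyr)

traffic_engineered <- traffic_df %>%
  mutate(
    is_holiday = ifelse(holiday == "None", 0, 1),   # 1 for any named holiday, 0 otherwise
    temp       = as.numeric(scale(temp)),
    rain_1h    = as.numeric(scale(rain_1h)),
    snow_1h    = as.numeric(scale(snow_1h)),
    clouds_all = as.numeric(scale(clouds_all))
  ) %>%
  select(-holiday)
```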

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 2099. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an RMSE metric of 1895. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an RMSE of 1899, which was comparable to the RMSE from the training data.

From iteration Take2, the baseline performance of the machine learning algorithms achieved an average RMSE of 972. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an RMSE metric of 480. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an RMSE of 479, which was slightly better than the prediction from the training data.

With the date_time information and related attributes included, the machine learning models predicted significantly better, with a much lower RMSE.

From iteration Take3, the baseline performance of the machine learning algorithms achieved an average RMSE of 814. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 474. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an RMSE of 472, which was slightly better than the prediction from the training data.

Re-engineering the weather-related features improved the average performance across all models. However, the changes appeared to have little impact on the performance of the ensemble algorithms.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 814. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 411. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an RMSE of 407, which was slightly better than the prediction from the training data.

Re-engineering the holiday and weather-related features improved the average performance of all models over the baseline. Moreover, the changes appeared to have a further positive impact on the performance of the Random Forest ensemble algorithm.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, the Random Forest algorithm could be considered for further modeling.

Dataset Used: Metro Interstate Traffic Volume Data Set

Dataset ML Model: Regression with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

One potential source of performance benchmarks: https://www.kaggle.com/ramyahr/metro-interstate-traffic-volume

The HTML formatted report can be found here on GitHub.