Month: October 2018

Updated Machine Learning Templates for R

As I work on practicing and solving machine learning (ML) problems, I find myself repeating a common set of steps and activities.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a set of project templates that can be used to support regression ML problems using R.

Version 5 of the templates contains several minor adjustments and corrections that address discrepancies in the previous versions of the template.

The new templates also standardize the dataframes used in the scripts as follows:

originalDataset: This dataframe contains the original data imported from the data source.

xy_train: Training dataframe that has the attributes and the target/class variable.

x_train: Training dataframe that has the attributes only.

y_train: Training dataframe that has the target/class variable only.

xy_test: Test dataframe that has the attributes and the target/class variable.

x_test: Test dataframe that has the attributes only.

y_test: Test dataframe that has the target/class variable only.
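Although the templates themselves are written in R, the naming convention maps directly onto a Python workflow. Below is a minimal sketch using pandas and scikit-learn; the toy data and the `target` column name are placeholders for illustration, not part of the actual templates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# originalDataset: the original data imported from the data source
originalDataset = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6, 7, 8],
    "feature2": [8, 7, 6, 5, 4, 3, 2, 1],
    "target":   [0, 1, 0, 1, 0, 1, 0, 1],
})

# xy_train / xy_test: attributes plus the target/class variable
xy_train, xy_test = train_test_split(originalDataset, test_size=0.25,
                                     random_state=7)

# x_* frames hold the attributes only; y_* hold the target only
x_train = xy_train.drop(columns=["target"])
y_train = xy_train["target"]
x_test = xy_test.drop(columns=["target"])
y_test = xy_test["target"]

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

Keeping the combined `xy_*` frames alongside the split `x_*`/`y_*` frames makes it easy to feed whichever form a given modeling function expects.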

You will find the R templates on the Machine Learning Project Templates page.

Honest Signals

In his podcast, Akimbo [https://www.akimbo.me/], Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

Through evolution, animals have developed various traits that serve as signals to others. Some signals help fend off predators, and some are useful for mating and reproduction. We humans, through the conditioning of culture, also send many signals to other humans.

Cal Newport has written about how ineffective open offices are. For many types of work that require highly concentrated effort, open offices do not make sense. So why build one? Newport argued that it is a signaling strategy aimed at investors and prospective hires. The signal says: we are so smart, so productive, and making such a ruckus that we can afford to cram people into this bullpen, so do you want to join us? Even though it is less productive, it appeals to a certain kind of employee and a certain kind of investor.

The first consideration for a signaling strategy is whether the signals are honest or dishonest. We have a choice when we send signals to the world, and it is challenging to send dishonest signals. If we are bluffing and the bluff fails to work, we lose credibility, which is very difficult to undo these days. When we go into the marketplace, we must decide whether we have built up enough of a resource, skill, capital, or reputation to send honest signals. Will we choose to invest or to bluff?

The second consideration is that many people, particularly people without a lot of experience, don’t know the signals they are supposed to send. As a result, they have talent that does not get found. If you have talent, passion, and skill, signals matter. Figure out what the signals are and over-invest in them. When we over-invest in our signals, they are more likely to work.

If we are looking for a resource, the obvious way to get a bigger return on investment is to ignore the expensive signals our competitors look for and to find other signals. This is the theory behind Michael Lewis's Moneyball. Billy Beane came up with a new statistical way to find talent, and he was able to get players who were way cheaper. He was looking for signals that correlated with their real skills. The insightful thing to do is to look at which signals we think matter and seek out the people who are sending the signals that do matter.

Thanks to the Internet, it is easier than ever to figure out which signals the masses are looking at and to fake them. We harm ourselves all the time in the world of social media because people are intentionally or artificially boosting the signals. Once we realize that a signal has been corrupted, we have a choice. We can embrace the fact that other people are still looking at the old, corrupted signal. Or we can walk away and invent new, more honest signals that we want to live with. At times, we may need to do both. As diversity creates more value, we are going to have to discard many of the old signals and embrace new ones, ones that are more relevant and useful going forward.

Multi-Class Classification Model for Human Activity Recognition with Smartphone Using R Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments involving a group of 30 volunteers, each of whom performed six activities while wearing a smartphone on the waist. Using the phone's embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance regarding accuracy and processing time.

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was to eliminate collinear attributes based on a threshold of 85%.

In iteration Take3, we explored the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Afterward, we eliminated the features that did not contribute to a cumulative importance of 0.99.

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, which was slightly below the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.07%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%, which was slightly worse than the accuracy from the training data and possibly due to over-fitting.

From the previous iteration Take3, the baseline performance of the ten algorithms achieved an average accuracy of 91.59%. The Random Forest and Stochastic Gradient Boosting algorithms achieved the top two accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.74%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.42%. The accuracy on the validation dataset was slightly worse than the training data and possibly due to over-fitting.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 90.62%. Three algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an average accuracy of 97.75%. Using the optimized tuning parameter available, the Random Forest algorithm processed the validation dataset with an accuracy of 87.21%. The accuracy on the validation dataset was noticeably worse than the training data and possibly due to over-fitting.

From the model-building activities, the number of attributes went from 561 down to 41 after eliminating 520 variables. The processing time went from 8 hours 16 minutes in iteration Take1 down to 2 hours and 25 minutes in iteration Take4. That was a noticeable reduction in comparison to Take2, which reduced the processing time down to 7 hours 15 minutes. It also was a noticeable reduction in comparison to Take3, which reduced the processing time down to 5 hours 22 minutes.

In conclusion, the RFE technique helped by cutting down the number of attributes, and the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Stochastic Gradient Boosting algorithm with attribute importance ranking should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Human Activity Recognition with Smartphone Using Python Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments involving a group of 30 volunteers, each of whom performed six activities while wearing a smartphone on the waist. Using the phone's embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance in terms of accuracy and processing time.

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was eliminating collinear attributes based on a correlation threshold of 85%.
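One common way to implement this collinearity filter is to scan the upper triangle of the absolute correlation matrix and drop any column whose correlation with an earlier column exceeds the threshold. The sketch below uses pandas and NumPy on a toy frame; the column names and data are illustrative, not taken from the project script.

```python
import numpy as np
import pandas as pd

# Toy frame: f2 is almost a copy of f1, so their correlation exceeds 0.85
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [1.1, 2.0, 3.2, 4.1, 5.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
reduced = df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Scanning only the upper triangle ensures that for each highly correlated pair, one member survives rather than both being discarded.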

In iteration Take3, we explored the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Afterward, we eliminated the features that did not contribute to a cumulative importance of 0.99.
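The cumulative-importance cut can be sketched with scikit-learn's GradientBoostingClassifier on synthetic data. The dataset, threshold handling, and variable names below are stand-ins for illustration, not the project's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the smartphone data: 10 features, a few informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=7)

model = GradientBoostingClassifier(random_state=7).fit(X, y)

# Sort features by importance, then keep the smallest prefix of that
# ranking whose importances sum to at least 0.99
order = np.argsort(model.feature_importances_)[::-1]
cumulative = np.cumsum(model.feature_importances_[order])
keep = order[:np.searchsorted(cumulative, 0.99) + 1]
X_reduced = X[:, sorted(keep)]
print(len(keep), "features retained out of", X.shape[1])
```

Because tree-based importances are normalized to sum to 1, a 0.99 cutoff discards only the features that collectively explain the last 1% of the model's split gains.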

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.
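A minimal RFE sketch with scikit-learn is shown below on synthetic data. The step size, estimator choice, and dataset shape are illustrative assumptions, not necessarily what the project script uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in with many attributes, mimicking a wide feature set
X, y = make_classification(n_samples=150, n_features=120, n_informative=10,
                           random_state=7)

# Recursively drop the weakest features (10 per round) until 50 remain
selector = RFE(estimator=GradientBoostingClassifier(n_estimators=50,
                                                    random_state=7),
               n_features_to_select=50, step=10)
selector.fit(X, y)

X_reduced = selector.transform(X)
print(X_reduced.shape)  # (150, 50)
```

RFE refits the estimator after every elimination round, so it is far more expensive than a one-shot importance ranking; capping the final feature count (and using a coarse `step`) is what keeps the training time manageable.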

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 84.68%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top result using the training data. It achieved an average accuracy of 95.43%. Using the optimized tuning parameter available, the Linear Discriminant Analysis algorithm processed the validation dataset with an accuracy of 96.23%, which was even better than the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 83.54%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an average accuracy of 93.34%. Using the optimized tuning parameter available, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 93.82%, which was slightly better than the accuracy from the training data.

From the previous iteration Take3, the baseline performance of the ten algorithms achieved an average accuracy of 85.49%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top result using the training data. It achieved an average accuracy of 95.52%. Using the optimized tuning parameter available, the Linear Discriminant Analysis algorithm processed the validation dataset with an accuracy of 96.06%, which was slightly better than the accuracy from the training data.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 86.76%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an average accuracy of 95.83%. Using the optimized tuning parameter available, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 94.19%, which was slightly below the accuracy from the training data.

From the model-building activities, the number of attributes went from 561 down to 50 after the RFE process eliminated 511 variables. The processing time went from 8 hours 16 minutes in iteration Take1 down to 1 hour and 16 minutes in iteration Take4. That was a further reduction in comparison to Take2, which reduced the processing time down to 2 hours 7 minutes. It also was a noticeable reduction in comparison to Take3, which reduced the processing time down to 8 hours and 9 minutes.

In conclusion, the RFE technique should have benefited the tree methods the most, but the Linear Discriminant Analysis algorithm held its own for this modeling iteration. Furthermore, by reducing the number of attributes, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Linear Discriminant Analysis and Support Vector Machine algorithms should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.

Every Day

(From a writer I like and respect, Seth Godin)

Are you building assets for yourself every day?

Every single day?

What constitutes another piece of your intellectual property?

What turns the assets you own into something more valuable?

What have you truly learned?

Each day adds up to many days. In the long run, it is also easy to let ourselves put things off for just one more day.

But the long run is built from persistence in the short term.