Month: August 2018

Binary-Class Classification Model for Seismic Bumps Take 2 Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

INTRODUCTION: Mining activity has always been connected with the occurrence of dangers commonly called mining hazards. A special case of such a threat is a seismic hazard, which frequently occurs in many underground mines. Seismic hazards are among the hardest natural hazards to detect and predict, comparable in this respect to earthquakes. The complexity of seismic processes and the large disproportion between the number of low-energy seismic events and the number of high-energy phenomena make statistical techniques insufficient for predicting seismic hazards. Therefore, it is essential to search for new opportunities for better hazard prediction, including machine learning methods.

In the Take 1 iteration, three algorithms produced high accuracy scores but dismal Kappa scores. For this iteration, we will examine the viability of using ROC scores to rank and choose the models.

CONCLUSION: From the previous Take 1 iteration, the baseline performance of the eight algorithms achieved an average accuracy of 93.11%. Three algorithms (Random Forest, Support Vector Machine, and Adaboost) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, all three algorithms turned in the identical accuracy result of 93.42%, with an identical Kappa score of 0.0. With the imbalanced dataset we have on hand, we will need to look for another metric or another approach to evaluate the models.

From the current iteration, the baseline performance of the eight algorithms achieved an average ROC score of 71.99%. Three algorithms (Random Forest, Adaboost, and Stochastic Gradient Boosting) achieved the top three ROC scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the best ROC result of 78.59%, but with a dismal sensitivity score of 0.88%.

The ROC metric has given us a more viable way to evaluate the models than the accuracy scores alone. However, with the imbalanced dataset we have on hand, we still need to look for another approach to further validate our modeling effort.
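To make the ranking step concrete: the report itself was built with R's caret package, but the following minimal scikit-learn sketch illustrates the same idea of scoring several candidate models by cross-validated ROC AUC on an imbalanced dataset. The synthetic data, the model list, and the fold count are illustrative assumptions, not the report's actual setup.

```python
# Illustrative sketch: rank classifiers by cross-validated ROC AUC on an
# imbalanced binary dataset. Not the report's R/caret code.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data with roughly the same class imbalance as the seismic-bumps set
X, y = make_classification(n_samples=2500, weights=[0.93], random_state=7)

models = {
    "Random Forest": RandomForestClassifier(random_state=7),
    "Adaboost": AdaBoostClassifier(random_state=7),
    "Stochastic GBM": GradientBoostingClassifier(random_state=7),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
    print(f"{name}: mean ROC AUC = {scores.mean():.4f} (std {scores.std():.4f})")
```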

The HTML formatted report can be found here on GitHub.

Entrepreneurial Strategies, Part 2

In his book, Innovation and Entrepreneurship, Peter Drucker presented how innovation and entrepreneurship can be a purposeful and systematic discipline. That discipline is still as relevant to today’s business environment as when the book was published back in 1985. The book explains the challenges faced by many organizations and analyzes the opportunities which can be leveraged for success.

Drucker wrote that entrepreneurship requires two combined approaches: entrepreneurial strategies and entrepreneurial management. Entrepreneurial management comprises the practices and policies that live inside the enterprise. Entrepreneurial strategies, on the other hand, are the practices and policies required for working with the external element: the marketplace.

Drucker further believed that there are four important and distinct entrepreneurial strategies we should be aware of. These are:

  1. Being “Fustest with the Mostest”
  2. “Hitting Them Where They Ain’t”
  3. Finding and occupying a specialized “ecological niche.”
  4. Changing the economic characteristics of a product, a market, or an industry.

These four strategies need not be mutually exclusive. A successful entrepreneur often combines two, sometimes even three, of these elements in one strategy.

“Hitting Them Where They Ain’t” manifests in one of two ways: creative imitation and entrepreneurial judo.

Creative imitation describes a strategy in which the entrepreneur does something somebody else has already done, but does it better than the original innovators. The creative imitator waits until somebody else has established the new market, but only “approximately.” It then goes to work and, within a short time, comes out with something similar that truly satisfies the customer. The creative imitator then sets the standard and takes over the market.

Like being “Fustest with the Mostest,” creative imitation is a strategy aimed at market or industry leadership, but it is much less risky. By the time the creative imitator moves, the market has been established. There is usually more demand for it than the original innovator can supply, so the creative imitator perfects and positions it. As such, creative imitation starts out with markets rather than with products, and with customers rather than with producers. It is both market-focused and market-driven.

Creative imitation does not exploit the failure of the pioneers as failure is commonly understood. On the contrary, the pioneer must be successful. But the original innovators failed to understand their own success completely. This gap gives the creative imitator room to exploit the success of others.

The strategy of creative imitation also requires a rapidly growing market. Creative imitators do not succeed by taking away customers from the pioneers who have first introduced a new product or service. Instead, they serve markets the pioneers have created but do not adequately service. Creative imitation satisfies a demand that already exists rather than creating one.

The strategy has its risks, and they are considerable. Creative imitators are easily tempted to splinter their efforts in the attempt to hedge their bets. Another danger is to misread the trend and imitate creatively what then turns out not to be the winning development in the marketplace.

Binary-Class Classification Model for Seismic Bumps Take 1 Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Seismic Bumps Data Set is a binary-class classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Mining activity has always been connected with the occurrence of dangers commonly called mining hazards. A special case of such a threat is a seismic hazard, which frequently occurs in many underground mines. Seismic hazards are among the hardest natural hazards to detect and predict, comparable in this respect to earthquakes. The complexity of seismic processes and the large disproportion between the number of low-energy seismic events and the number of high-energy phenomena make statistical techniques insufficient for predicting seismic hazards. Therefore, it is essential to search for new opportunities for better hazard prediction, including machine learning methods.

CONCLUSION: The baseline performance of the eight algorithms achieved an average accuracy of 93.11%. Three algorithms (Random Forest, Support Vector Machine, and Adaboost) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, all three algorithms turned in the identical accuracy result of 93.42%, with an identical Kappa score of 0.0.

With the imbalanced dataset we have on hand, we will need to look for another metric or another approach to evaluate the models.
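To see why accuracy and Kappa can disagree so sharply, consider a degenerate classifier that always predicts the majority class. The sketch below is illustrative only, using simulated labels at roughly the class ratio reported for the seismic-bumps data rather than the report's R/caret code, but it reproduces the pattern above: accuracy near 93% with a Kappa of exactly 0.0.

```python
# Illustrative sketch: on heavily imbalanced labels, always predicting the
# majority class yields high accuracy but a Cohen's kappa of 0.0.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(7)
# Simulated labels: ~6.6% positives, approximating the seismic-bumps ratio
y_true = (rng.random(2584) < 0.066).astype(int)
y_pred = np.zeros_like(y_true)  # always predict "no hazard"

print("accuracy:", accuracy_score(y_true, y_pred))    # roughly 0.93
print("kappa:   ", cohen_kappa_score(y_true, y_pred)) # 0.0
```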

Dataset Used: Seismic Bumps Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/seismic-bumps

The HTML formatted report can be found here on GitHub.

A Price Worth Paying

(From a writer I like and respect, Seth Godin)

When you bring a product or service to market, the market decides what it is worth. If you do not want to be treated as a commodity (competing on low price alone), there are two paths:

1) Through scarcity: it is worth more because there is not much of it, or because we are the only ones who can provide it.

2) Through connectivity: it is worth more because other people are already using it.

A little, or a lot.

Few substitutes exist, either because the thing is hard to obtain or because you already have many people using it.

We do not mind paying extra because it is the only one, because we are very thirsty and there is nowhere else to buy water, because we believe it will add value, or because it is the best of our limited choices. Right here, right now, you are the best option. In other words, you are scarce.

Or…

Because we do not want to be left behind. Because it connects us with the other people using it, it is worth more.

Value is not only about profit. Innovations that are widely available and cheap do create real value. Profit, however, often follows a different calculus; in the end, a creator depends on someone deciding the product is worth paying extra for.

Multi-Class Classification Model for Letter Recognition Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Letter Recognition Data Set is a multi-class classification situation where we are trying to predict one of several possible outcomes.

INTRODUCTION: The objective is to identify each of many black-and-white rectangular-pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15.

CONCLUSION: The baseline performance of the eight algorithms achieved an average accuracy of 80.98%. Three algorithms (k-Nearest Neighbors, Support Vector Machine, and Extra Trees) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data, with an average accuracy of 97.37%. Using the optimized tuning parameters, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 97.46%, slightly better than the accuracy achieved on the training data.

For this project, the Support Vector Machine algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.
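The exact tuning grid is not reproduced in this summary, but a typical version of that SVM tuning step looks like the scikit-learn sketch below. The OpenML mirror name, the parameter grid, and the split sizes are assumptions for illustration, not the report's actual settings.

```python
# Illustrative sketch: grid search over an RBF-kernel SVM's C and gamma for
# the letter-recognition task. Grid values and splits are assumed, not the
# report's actual settings.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# The UCI letter-recognition data is assumed to be mirrored on OpenML as "letter"
X, y = fetch_openml("letter", version=1, return_X_y=True, as_frame=False)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
    cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("validation accuracy:", grid.score(X_val, y_val))
```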

Dataset Used: Letter Recognition

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition

One potential source of performance benchmarks: https://www.kaggle.com/c/ci-letter-recognition

The HTML formatted report can be found here on GitHub.