Machine learning (ML) algorithms have had a considerable impact on the technology industry over the past several years. Their applications continue to expand across diverse industry verticals, including robotics, healthcare, fraud detection, personalized marketing, recommendation engines, and autonomous vehicles. While the innovation potential of this technology is undisputed, the wide variety of emerging applications raises new challenges and important considerations. One of those challenges arises when teaching an algorithm with an unbalanced set of data.
Unbalanced datasets are those in which the classes of the target variable are represented in disproportionate ratios across observations. This kind of dataset is common when distinguishing fraudulent from non-fraudulent bank transactions, malignant from benign tumors, or defective from non-defective products within a production run.
To see why this is a challenge, it is important to start by defining a guideline to rate the performance of a machine learning algorithm on a given task. To an inexperienced eye, a good performance criterion could be accuracy; that is, the percentage of observations that the algorithm assigns to the correct class. However, this criterion may be inappropriate when we are faced with unbalanced data.
Assume a fictitious data set of 1000 mammogram observations collected from people in a community through an early cancer screening campaign. In this fictitious data, 10 persons have malignant tumors and the remaining 990 are healthy. A machine learning algorithm is trained on this data set with the objective of analyzing observations from a neighboring community in which 12 persons have malignant tumors and 988 are healthy. If the algorithm is not properly trained, it could make the mistake of learning only from the majority class (in this case, observations of healthy people) and classify the entire data set as healthy. Measured by accuracy, it would still classify 988 of the 1000 observations in the second community correctly.
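The scenario above can be reproduced in a few lines. This is a minimal sketch, assuming labels encoded as 1 for malignant and 0 for healthy; the variable names are hypothetical and the "classifier" is simply one that always predicts the majority class.

```python
# Test set from the neighboring community: 12 malignant (1), 988 healthy (0).
test_labels = [1] * 12 + [0] * 988

# A degenerate classifier that learned only the majority class.
predictions = [0] * len(test_labels)

# Accuracy: share of observations assigned to the correct class.
correct = sum(p == y for p, y in zip(predictions, test_labels))
accuracy = correct / len(test_labels)

# Number of malignant cases the classifier actually detected.
malignant_found = sum(p == 1 and y == 1 for p, y in zip(predictions, test_labels))

print(f"accuracy: {accuracy:.1%}")  # accuracy: 98.8%
print(f"malignant cases detected: {malignant_found} of 12")
```

The accuracy is high even though not a single malignant case is detected.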
These results could lead us to the erroneous conclusion that we are using a good algorithm, since it has an accuracy of 98.8%. On closer inspection, however, the algorithm is deeply flawed: it simply predicts the dominant class and fails to identify the truly important one, the 12 cases with malignant tumors. A similar scenario can be observed with the ongoing Coronavirus outbreak. While cases of infection remain a small percentage of the overall population, looking at the data this way does not tell an accurate story of the virus’s underlying dangers and is certainly not helpful in determining how to address it. To identify and handle this scenario adequately, we need other criteria that rate the performance of our algorithm more accurately and provide more insight into its behavior, which we’ll go over in the next section.
A very useful tool for measuring performance in a binary classification task is the confusion matrix, which allows us to easily compare the number of observations predicted in each class with the actual values of those observations. Figure 1 shows how a confusion matrix is built from the actual values of the observations in the test set and the predictions our algorithm produces for them.
Figure 1: Confusion Matrix for Binary Classification Task
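As an illustration, the four cells of a binary confusion matrix can be counted directly from the actual and predicted labels. This is a minimal sketch; the function name `confusion_counts` and the encoding (1 = malignant as the positive class) are assumptions made for the example.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    counts = Counter()
    for y, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct (tp) or a false alarm (fp).
            counts["tp" if y == positive else "fp"] += 1
        else:
            # Predicted negative: a missed positive (fn) or correct (tn).
            counts["fn" if y == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]

# The always-"healthy" classifier from the screening example.
actual = [1] * 12 + [0] * 988
predicted = [0] * 1000

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)  # 0 0 12 988
```

All 12 malignant cases land in the false-negative cell, which accuracy alone hides.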
From the confusion matrix it is possible to obtain other useful metrics among which are:
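Metrics commonly derived from a confusion matrix include precision, recall (sensitivity), specificity, and the F1 score. As a minimal sketch, assuming these standard definitions (the function names are hypothetical, and the second set of counts is invented purely for illustration), they can be computed from the four cells as follows:

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a convention)."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, fn, tn):
    precision = safe_div(tp, tp + fp)    # of predicted positives, how many are real
    recall = safe_div(tp, tp + fn)       # of real positives, how many were found
    specificity = safe_div(tn, tn + fp)  # of real negatives, how many were found
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Classifier that labels everyone as healthy (screening example).
majority_only = classification_metrics(tp=0, fp=0, fn=12, tn=988)

# Hypothetical classifier that finds 9 of the 12 malignant cases
# at the cost of 30 false alarms.
hypothetical = classification_metrics(tp=9, fp=30, fn=3, tn=958)

for name, result in [("majority only", majority_only), ("hypothetical", hypothetical)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

The hypothetical classifier has lower accuracy than the majority-only one, yet its recall of 0.75 shows it is far more useful for the class that matters.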
Let’s examine some examples of how to handle cases of unbalanced data.
1) Random Resampling Techniques: This approach balances the classes in the training data (a data preprocessing step) before the data is provided as input to the machine learning algorithm. The main objective is to either increase the frequency of the minority class or decrease the frequency of the majority class, so that both classes end up with approximately the same number of instances.
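Both variants can be sketched with the Python standard library. The example below is a minimal illustration, not a production implementation: the function names are hypothetical, the data are placeholder integers, and it assumes the minority class really is the smaller one.

```python
import random
from collections import Counter

def split_indices(labels, minority):
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx

def random_oversample(samples, labels, minority, seed=0):
    """Duplicate random minority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = rng.choices(minority_idx, k=n_extra)  # sampling with replacement
    idx = majority_idx + minority_idx + extra
    rng.shuffle(idx)
    return [samples[i] for i in idx], [labels[i] for i in idx]

def random_undersample(samples, labels, minority, seed=0):
    """Keep a random subset of the majority class the size of the minority class."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    kept = rng.sample(majority_idx, k=len(minority_idx))  # without replacement
    idx = kept + minority_idx
    rng.shuffle(idx)
    return [samples[i] for i in idx], [labels[i] for i in idx]

# Training set from the screening example: 10 malignant (1), 990 healthy (0).
samples = list(range(1000))  # placeholder feature values
labels = [1] * 10 + [0] * 990

_, over_labels = random_oversample(samples, labels, minority=1)
_, under_labels = random_undersample(samples, labels, minority=1)

over_counts = Counter(over_labels)
under_counts = Counter(under_labels)
print(over_counts[0], over_counts[1])    # 990 990
print(under_counts[0], under_counts[1])  # 10 10
```

Resampling should be applied to the training data only, so that the test set keeps the class ratio the model will face in practice.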
Advantages
Disadvantages
2) Clustering the majority class: This approach also obtains a sub-sample of the majority class, but instead of relying on random samples to cover the variety of training samples, it involves clustering the abundant class into “n groups”, with n being the number of cases in the minority class. For each group, only the centroid (the center of the cluster) is kept. The model is then trained on the rare class and the centroids only.
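The idea can be sketched with a small k-means implementation. This is a minimal illustration under stated assumptions: the source does not name a clustering algorithm, so plain k-means (Lloyd's algorithm) is assumed here, the helper names are hypothetical, and the 2-D data are randomly generated. The imbalanced-learn library offers a comparable under-sampler, `ClusterCentroids`.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_centroids(points, k, iterations=20, seed=0):
    """Plain k-means; returns the k cluster centers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if a cluster is empty.
        centroids = [
            mean_point(cluster) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids

# Synthetic data: 200 majority points and 10 minority points in 2-D.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

# One centroid per minority case, as described above.
centroids = kmeans_centroids(majority, k=len(minority))

# Balanced training set: centroids stand in for the majority class.
X_train = centroids + minority
y_train = [0] * len(centroids) + [1] * len(minority)
print(len(X_train), y_train.count(0), y_train.count(1))  # 20 10 10
```

Note that the centroids are averaged points rather than real observations, which is worth keeping in mind when interpreting the trained model.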
Advantages
Disadvantages
3) Informed Over Sampling: Synthetic Minority Oversampling Technique (SMOTE): This is an over-sampling technique in which samples of the minority class are not simply replicated; instead, new synthetic samples of the minority class are created and added to the original dataset. These synthetic observations are generated by interpolating between the features of minority class observations.
This approach aims to mitigate the overfitting that occurs when exact replicas of minority instances are added to the main dataset. SMOTE can be implemented easily with the imbalanced-learn library [1], a powerful Python library for handling imbalanced data.
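To make the interpolation step concrete, here is a simplified sketch of the SMOTE idea using only the standard library. It is an illustration rather than the imbalanced-learn implementation: the function name `smote_sketch`, the default of 5 neighbors, and the generated data are assumptions. In practice, imbalanced-learn provides a `SMOTE` class in `imblearn.over_sampling`, used through `fit_resample(X, y)`.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Other minority points, closest first.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Synthetic 2-D minority class with 10 observations.
rng = random.Random(7)
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

# Generate enough synthetic points to reach 50 minority observations.
new_points = smote_sketch(minority, n_new=40)
augmented_minority = minority + new_points
print(len(augmented_minority))  # 50
```

Because every synthetic point lies on a segment between two existing minority observations, the new samples stay within the region the minority class already occupies instead of duplicating single points.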
Advantages
Disadvantages
REFERENCES
[1] Imbalanced-learn, https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
Emanuel Hernández Castillo is a Data Scientist at Growth Acceleration Partners. He is passionate about robotics and reinforcement learning, highly experienced in supervised/unsupervised learning, modeling techniques, and developing machine learning algorithms using Python. Outside of work he enjoys traveling, hiking and spending time in nature. You can connect with Emanuel on LinkedIn or by sending him an email.