The Machine Learning challenge: “Measure what is measurable, and make measurable what is not.”
This quote from Galileo applies perfectly to Machine Learning Algorithms. How can we tell the difference between a well-performing Machine Learning Algorithm and a poor-performing one? Is there a metric that works like a magic wand to separate the wheat from the chaff?
Metrics are amazingly powerful. They are figures and statistics. As such, they have a tremendous influence. They appear as “truths.”
Yet, in substance, their real power is to tell a story. Their importance is demonstrated by how much damage they can do, whatever the use case and whatever the application field.
Enron and Indicators
Consider Enron, for instance. As a reminder, Enron was a Wall Street star company employing thousands of people until it collapsed due to a massive scam. They lied about their financial performance.
Why? Revenue and earnings are the key metrics by which financial analysts and investors assess corporate performance. Therefore, it is no surprise that companies such as Enron might surrender to the pressure to announce profits in what appears to be the best-looking way.
Investors asked Enron to tell a happy story. That’s what they did. That’s what metrics are: tools to tell a story. They can tell a happy story or a sad story.
Yet metrics can also be used to understand rather than evaluate and judge. For us, that’s where their benefit lies. Their power stems from the intrinsic ability of metrics to trigger debate.
Metrics are a genuine reason (or should we write pretext?) to ask the right questions. In our view, they are not self-sufficient pillars of truth but merely indicators. That is why we have decided to take the time to explain the main metrics used to evaluate Machine Learning Algorithms.
What is the definition of a metric?
According to the Cambridge English dictionary, it is “a set of numbers that give information about a particular process or activity.”
Machine Learning Algorithms: Classification
Classification and Regression are often used to categorize Machine Learning Algorithms.
A clear example is classifying emails as “spam” or “not spam.” There are two possible outcomes, hence the term ‘binary’ classification. Classification refers to predicting to which predefined class a given example of input data belongs. Other examples of binary classification problems include:
- Given a handwritten character, classify it as one of the known characters or not,
- Given recent user behavior, classify it as churn or not,
- Given a family history, is the new baby going to be a girl or a boy?
- Given a set of symptoms and demographic data on a specific person, is this person infected with COVID-19?
Binary classification assigns each example, described by a set of independent input variables, to one of two possible classes.
In a previous article about COVID-19 and general practitioners, we showed how it is possible to use Machine Learning to diagnose COVID-19 based on a set of real-world symptoms with little input data. But exactly how accurate are the predictions of our Machine Learning Algorithm for this use case? Which metrics did we use, and why?
We shall call COVID-19 infected people “positives” and non-infected ones “negatives.” When the algorithm predicts that a patient is “infected,” and the medical testing confirms it, we call this case a “true positive” (TP). Predicting “not infected” and getting a medical confirmation is a “true negative” (TN).
Conversely, predicting “infected” when the person is not is defined as a “false positive” (FP). It is crying wolf, i.e., raising a false alarm. And predicting that the person is not infected when she is, is a false negative (FN), i.e., a real error and medically the worst possible outcome. The ‘confusion matrix’ summarizes the four values: TP, FP, TN, FN as shown here:
This representation is named a confusion matrix because it shows how often infected people (positives) are confused with non-infected ones (negatives). Ideally, the predicted values and the real-life values are strongly dependent; in practice, they are not always.
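As a minimal sketch, the four counts can be tallied directly from paired lists of actual and predicted labels (the labels below are made up for illustration):

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for binary labels (1 = infected, 0 = not infected)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp, fp, tn, fn

# Hypothetical example: five patients, real status vs. the algorithm's guess
actual    = [1, 0, 1, 1, 0]
predicted = [1, 1, 0, 1, 0]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 1)
```

The four values always sum to the number of examples, so nothing is ever lost: every prediction lands in exactly one cell of the matrix.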
Welcome to the (confusion) Matrix
What does a confusion matrix look like in a real case of predicting COVID-19 infections? In a previous article, we explained how we used our Machine Learning Algorithms on the real-world Open Data database published by the Mexican government.
The training set used is a part of this database. Among people checking into a hospital, the proportion of COVID-19 infected people is higher than in the general population. Therefore, the sample we considered is biased, with about 65% of people infected. Let's think of an imaginary pessimistic person.
When asked to predict whether the next subject tested will be positive (i.e., COVID-19 infected), the pessimist will always say 'yes,' implying a 100% chance of COVID-19 infection. The confusion matrix (assuming a representative sample of 100 people extracted from the dataset under consideration) will be:
- TP: 65,
- FP: 35,
- TN: 0,
- FN: 0.
Conversely, if an optimistic person looks at the same dataset and has to predict the next subject’s infection status, the answer will be a ‘negative’ guess. The resulting confusion matrix is:
- TP: 0,
- FP: 0,
- TN: 35,
- FN: 65.
The pessimist and the optimist are two different algorithms. Which one is the best?
Metrics at work
The simplest way to estimate these predictions’ performance is to count how many times they were right. Accuracy is the metric that counts the proportion of correct predictions.
On the one hand, the pessimist made 65 correct predictions out of 100, hence a 65% accuracy. On the other hand, the optimist's accuracy is 35%, with 35 correct predictions out of 100. By this metric, the pessimist is the better Machine Learning Algorithm.
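The accuracy computation for both imaginary predictors can be sketched as follows, using the figures from the two confusion matrices above:

```python
def accuracy(tp, fp, tn, fn):
    # Proportion of correct predictions (TP + TN) among all predictions
    return (tp + tn) / (tp + fp + tn + fn)

# Pessimist: predicts 'infected' for everyone in the 100-person sample
print(accuracy(65, 35, 0, 0))  # 0.65
# Optimist: predicts 'not infected' for everyone
print(accuracy(0, 0, 35, 65))  # 0.35
```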
Here, the two classes (infected and not-infected) are relatively balanced (65% and 35%). There are cases where the classes are heavily unbalanced, e.g., one class represents 5% of the samples and the other 95%. The 'small' class is often the key one to identify, i.e., the target variable.
For instance, let’s consider a ‘normal’ population sample of the southeast of France. The goal is to predict who was already infected in real-life once by COVID-19 and who was not.
The COVID-19 prevalence in this area is around 5%. In this context, the optimist and the pessimist’s accuracy are respectively 95% and 5%.
The optimist algorithm looks great. Yet it missed the most important 5%: the true positives. In such a case, it is pertinent to compute the accuracy on positives and on negatives separately. The accuracy on positives is called the sensitivity.
And similarly, the accuracy on negatives is the specificity.
Sensitivity indicates how well we detect positives (hits among all actual positives), and specificity how well we detect negatives (correct rejections among all actual negatives). In our first example, based on Mexico's dataset, the pessimist's sensitivity would be 100% and the specificity 0%, with 65% accuracy. The optimist (with an accuracy of 35%) has a sensitivity of 0% and a 100% specificity.
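Expressed in code, both quantities follow directly from the confusion-matrix counts (a minimal sketch, using the pessimist's and optimist's figures):

```python
def sensitivity(tp, fn):
    # Also called recall or true positive rate: hits among all actual positives
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    # Correct rejections among all actual negatives
    return tn / (tn + fp) if (tn + fp) else 0.0

# Pessimist on the Mexican sample (TP=65, FP=35, TN=0, FN=0)
print(sensitivity(65, 0), specificity(0, 35))  # 1.0 0.0
# Optimist (TP=0, FP=0, TN=35, FN=65)
print(sensitivity(0, 65), specificity(35, 0))  # 0.0 1.0
```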
“Recall” is another name for sensitivity. Recall helps when the price of false negatives is high. What if we need to detect planes used as a means of terrorist attack? A false negative has disastrous outcomes. Get it wrong, and we get 9/11.
It is possible to combine sensitivity and specificity into a single summary value. The geometric mean is the square root of the product of sensitivity and specificity; hence it represents an average of both. In the four cases we have seen so far, the geometric mean is 0.
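A sketch of the geometric mean, showing why both trivial predictors score 0 (the 0.8/0.9 pair is a made-up example of a more balanced model):

```python
import math

def g_mean(sens, spec):
    # Geometric mean of sensitivity and specificity: 0 if either is 0
    return math.sqrt(sens * spec)

print(g_mean(1.0, 0.0))  # 0.0 -- the pessimist (and, symmetrically, the optimist)
print(g_mean(0.8, 0.9))  # ~0.849 -- a model that detects both classes reasonably well
```

Because it is a product, the geometric mean punishes a model that sacrifices one class entirely, which is exactly what the arithmetic mean of accuracy fails to do.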
Another way to quantify positives is to use an indicator called 'Precision.' When the model predicts positive, how frequently is it right? Precision measures the proportion of predicted positives that are true positives.
So let’s assume the problem involves the detection of breast cancer. If the model has very low precision, many patients will be told that they have breast cancer, including some misdiagnoses. When false positives are too high, those who monitor the outcomes will ignore them after being besieged with false alarms.
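Precision can be sketched in one line from the confusion-matrix counts:

```python
def precision(tp, fp):
    # Of everything predicted positive, how much was actually positive?
    return tp / (tp + fp) if (tp + fp) else 0.0

# Pessimist on the Mexican sample: 65 true positives out of 100 positive predictions
print(precision(65, 35))  # 0.65
```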
F1 score is an overall data science measure of a model's accuracy that combines precision and recall.
A sound F1 score suggests that you have few false positives and few false negatives: you are accurately classifying real perils and are not confused by false alarms. An F1 score of 1 is perfect, while 0 means the model is a total failure. Of course, neither extreme happens in practice.
All models generate some false positives and false negatives, except our imaginary optimist and pessimist, who exist solely for the sake of this explanation. In our first example (the dataset from hospitals in Mexico), our pessimist has an F1 score of about 79%, and our optimist an F1 score of 0.
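The F1 computation for both imaginary predictors can be sketched as the harmonic mean of precision and recall:

```python
def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall; 0 when either is 0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

# Pessimist: precision 0.65, recall 1.0
print(round(f1_score(65, 35, 0), 2))  # 0.79
# Optimist: no positive predictions at all
print(f1_score(0, 0, 65))  # 0.0
```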
There is another way to measure a binary classification algorithm's performance: treat the real-life class and the predicted class as two variables and calculate their correlation coefficient. In data science, this correlation is the Matthews Correlation Coefficient (MCC) for binary classification. The higher the correlation between actual and predicted values, the better the prediction.
MCC is always between -1 and 1. If the MCC is 0, the classifier is no better than a random dice throw. When the classifier is perfect (FP = FN = 0), the MCC is 1.
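A minimal sketch of the MCC from the four counts (returning 0 when the denominator degenerates, a common convention):

```python
import math

def mcc(tp, fp, tn, fn):
    # Matthews Correlation Coefficient for binary classification
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Both trivial predictors leave an empty row in the matrix -> MCC of 0
print(mcc(65, 35, 0, 0))  # 0.0
# A perfect classifier (FP = FN = 0) on the same sample scores 1
print(mcc(65, 0, 35, 0))  # 1.0
```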
Now comes the last and most exhaustive data science metric. We’ve generated a machine-learning algorithm to predict the likelihood of death in a set of COVID-19 infected patients. For each person, the model gives a probability of how likely they are to die.
After several days of disease, the True Positive Rate (TPR) is the proportion of actual deaths that the model correctly predicted. The False Positive Rate (FPR) is the proportion of surviving patients whom the model incorrectly predicted would die. Remember, the model provides a probability, not a final result.
If we were to choose 0.9 as a threshold, everyone with a predicted death probability above 0.9 is classified as dying, and everyone below as surviving. We could then compute the TPR and FPR and measure how well our model performed at this 0.9 decision boundary. If we lower the threshold to 0.5 to increase the TPR, it comes at the expense of a larger FPR: we flag more people as likely to die, catching more actual deaths, but also raising more false alarms. The threshold, the TPR, and the FPR are tied to one another, which makes models challenging to read and review.
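The threshold trade-off can be sketched as follows; the probability scores and outcomes below are made up for illustration:

```python
def rates_at_threshold(scores, labels, threshold):
    # Predict 'dies' (1) when the model's probability is at or above the threshold
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Hypothetical death probabilities and actual outcomes (1 = died, 0 = survived)
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0]
print(rates_at_threshold(scores, labels, 0.9))  # strict boundary: few alarms, low TPR
print(rates_at_threshold(scores, labels, 0.5))  # lenient boundary: higher TPR, higher FPR
```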
The Receiver Operating Characteristic Curve (ROC) is an uncomplicated way to bypass this. It is a graphical description of the trade-off between TPR and FPR at every achievable decision boundary. The Area Under the Curve (AUC) is a single number that can assess a model's performance, regardless of the chosen decision boundary.
An excellent machine learning model will have an AUC of 1.0, while a random one will have an AUC of 0.5. An acceptable model will be over 0.7; a great one will be over 0.85. The AUC is an excellent way to compare models across patient cohorts (in healthcare, for instance) and to deliver a sense of how dependable a model is in general.
The AUC gives a transparent, easy-to-interpret way to evaluate a model. Of course, it has limitations. For example, the ROC curve’s usefulness begins to break down with heavily imbalanced classes, obviously a big problem for healthcare data in the real world.
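The AUC can be computed without plotting the curve at all, using its equivalent rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative. The scores below are made up for illustration:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative
    (ties count half) -- equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical death probabilities and actual outcomes (1 = died, 0 = survived)
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0]
print(round(auc(scores, labels), 3))  # 0.889
```

This pairwise formulation makes the threshold-independence of the AUC explicit: no decision boundary appears anywhere in the computation.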
Take away and next steps
There are numerous data science metrics to measure the performance of binary classification and Machine Learning Algorithms. The confusion matrix is an excellent starting point in most cases. Whether we seek to optimize true positives, false negatives, or false positives, sensitivity (also called recall) and precision are good candidates. The F1 score and the MCC aggregate these metrics into a single number.
A very effective way to compare a model's performance across different data sets and decision thresholds is the ROC curve and its AUC.
In our next article, we will discuss the metrics involved in the Machine Learning Algorithms related to Regression.