Machine Learning algorithms are everywhere now. From satellite image classification to employee retention or customer satisfaction, they help business experts be more effective and achieve operational gains. But how can we be sure that we can trust them?

Enter metrics. These statistical indicators help us assess the quality of Machine Learning Algorithms and determine how trustworthy they are. Knowing what they mean and measure is key to a Small Data project.

**Basics of Binary Classification Metrics**

A Classification challenge means that the Machine Learning algorithm must determine which class a specific item belongs to. For instance, in our delivery delay use case, there were two classes: “In time” and “Delayed”. In that case, it is binary classification. If there are more than two classes, we talk about multi-classification. But which binary classification metrics do we use?

To train the algorithm, we used past data that had been labeled as “In time” or “Delayed”. When an item is “In time”, it is considered a “positive”, and when it is “Delayed”, it is a “negative”. So when the prediction is “In time” and the actual result is the same, we have a True Positive (TP). In the same way, a “Delayed” prediction for an actually “Delayed” package is a True Negative (TN).

Conversely, predicting “In time” when the package is not is a False Positive (FP). And predicting that the package is “Delayed” when it is not is a False Negative (FN). The Confusion matrix summarizes the four values as shown below.
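The four counts can be sketched in a few lines of code. This is a minimal illustration, assuming (as in the text) that “In time” is the positive class; the example labels are hypothetical:

```python
def confusion_counts(actual, predicted, positive="In time"):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted "In time", actually "In time"
            else:
                fp += 1  # predicted "In time", actually "Delayed"
        else:
            if a == positive:
                fn += 1  # predicted "Delayed", actually "In time"
            else:
                tn += 1  # predicted "Delayed", actually "Delayed"
    return tp, fp, tn, fn

actual    = ["In time", "Delayed", "In time", "Delayed", "In time"]
predicted = ["In time", "In time", "Delayed", "Delayed", "In time"]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```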

**Metrics at work**

Once we have the Confusion matrix, we can compute the actual Binary Classification metrics. The first and simplest one is Accuracy, the metric that counts the proportion of correct predictions.

It is the proportion of correct predictions among the total number of predictions made, as summarized in the below figure.
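In code, Accuracy can be sketched directly from the four confusion-matrix counts (the numbers here are hypothetical):

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions (TP + TN) among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=40, tn=50, fp=5, fn=5))  # 0.9
```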

Accuracy provides a good first indication of the quality of Machine Learning Algorithms. However, as classes can be unbalanced, we need to go a bit further.

Indeed, if we go back to our example, what we want to predict are the “Delayed” packages – and they were much less represented than “In time” ones. A good Accuracy can then mean that only “In time” packages are predicted well – and that the model is not that good.
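A small hypothetical example shows the trap: with 95 “In time” packages and only 5 “Delayed” ones, a model that always predicts “In time” still scores high on Accuracy while never catching a single delay.

```python
# Always-predict-"In time" model on an imbalanced set
# (95 actual "In time" = positives, 5 actual "Delayed" = negatives):
tp, fn = 95, 0  # every "In time" package is predicted correctly
tn, fp = 0, 5   # every "Delayed" package is missed
acc = (tp + tn) / (tp + tn + fp + fn)
print(acc)  # 0.95 -- looks good, yet no "Delayed" package is ever detected
```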

To check this, we use more advanced Binary classification metrics such as Sensitivity and Specificity. Sensitivity indicates how well we detect positives (hits among all actual positives). Specificity indicates how well we detect negatives (correct rejections among all actual negatives).

Sensitivity (also known as Recall) is the proportion of True Positives among the total number of Positives, i.e. the sum of True Positives and False Negatives, as illustrated below:

Similarly, Specificity is the proportion of True Negatives among the total number of Negatives, i.e. True Negatives and False Positives.

“Precision” is an additional metric that quantifies how trustworthy positive predictions are. It is the proportion of True Positives among the total number of Positive predictions.
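The three ratios can be sketched as follows; the counts fed in are hypothetical:

```python
def sensitivity(tp, fn):
    """Hit rate: True Positives among all actual positives (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Rejection rate: True Negatives among all actual negatives (TN + FP)."""
    return tn / (tn + fp)

def precision(tp, fp):
    """True Positives among all positive predictions (TP + FP)."""
    return tp / (tp + fp)

print(sensitivity(tp=40, fn=10))   # 0.8
print(specificity(tn=45, fp=5))    # 0.9
print(precision(tp=40, fp=5))      # ~0.889
```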

**When Precision is not enough**

Accuracy, Sensitivity, Specificity and Precision paint a good picture of Machine Learning Algorithm quality. However, they can prove insufficient when the problem is quite complex.

Here, let’s assume that we want our model to detect Breast Cancer. With low Precision, many patients will be diagnosed as ill, even people who are not. In that case, we cannot trust Precision as a key metric and might want to use even more advanced Binary classification metrics to measure the quality of the Machine Learning Algorithm.

Enter the F1-score, an overall measure of a model’s quality that combines Precision and Recall through their harmonic mean.

The F1-score goes from 0 to 1. The higher it is, the fewer false positives and false negatives there are: the model accurately classifies real threats and is not confused by false alarms. In Blueguard, the F1-score was the key metric, as we wanted to minimize the number of false negatives.
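A minimal sketch of the F1-score, computed from hypothetical confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of Precision and Recall (Recall == Sensitivity)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=40, fp=5, fn=10))  # ~0.842
```

Note that True Negatives do not appear in the formula, which is why the F1-score focuses on how well the positive class is handled.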

Another advanced approach is to measure the correlation between real-life observations and predictions. Known as the Matthews Correlation Coefficient (MCC), it ranges from -1 to 1. The closer the value is to 1, the better the predictions are.
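The MCC can be sketched from the same four counts; the example values are hypothetical:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, in [-1, 1]."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero
    return numerator / denominator if denominator else 0.0

print(mcc(tp=40, tn=45, fp=5, fn=10))  # ~0.70
```

Unlike Accuracy, the MCC uses all four cells of the Confusion matrix, which makes it more robust on unbalanced classes.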

Finally, we can also use the Area Under Curve (AUC) metric to determine the quality of a model generated by Machine Learning Algorithms. AUC measures the area under the ROC curve, which plots the True Positive rate against the False Positive rate at every possible classification threshold. Its values are between 0 and 1. The higher it is, the better the model is.
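One way to see the ROC AUC is as the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one. A minimal sketch using that equivalence, with hypothetical scores and labels (1 = positive, 0 = negative):

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0]
print(roc_auc(scores, labels))  # ~0.833
```

This pairwise formulation avoids building the curve explicitly; production libraries compute the same quantity more efficiently from sorted scores.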

**Metrics for the other Machine Learning challenges**

We hope that this article helped you get a better understanding of binary classification metrics and how they indicate how good a model is. Of course, binary classification is not the only type of prediction that Machine Learning algorithms perform. You can have a look at our articles about Regression and Multi-classification metrics. And feel free to contact us if you want to discuss your Small Data challenges!