
Multi-class classification metrics are quite close to the binary classification metrics we explored in another article. So it’s no wonder that we find the usual suspects: the confusion matrix, Accuracy, Recall, Precision, MCC and the F1 score. However, since multi-class classification involves more than two classes, computing these metrics gets more complex as the number of classes increases.
From Binary to Multi-Class Classification Metrics
In multi-class classification, the basis for metrics is once again a Confusion Matrix. The only difference is that it has n rows and columns instead of 2 for binary classification, n being the number of classes of the outcome.
For the sake of the explanation, let’s consider a model with 3 classes. To compute the accuracy, we will first build a Confusion Matrix as below.

The green cells on the diagonal represent the number of individuals for which the prediction and the actual house choice match. The Accuracy is then the ratio of these correct classifications over the total.
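Written as a formula, if we note C the n × n Confusion Matrix (here taking rows as the actual classes and columns as the predictions, an assumption since the original figure is not reproduced), this gives:

```latex
\text{Accuracy} = \frac{\sum_{i=1}^{n} C_{ii}}{\sum_{i=1}^{n} \sum_{j=1}^{n} C_{ij}}
```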

The Recall is the ratio of True Positives (TP) over the total number of actual instances of a class, i.e., True Positives plus False Negatives (FN). For multi-class classification metrics, the Macro-Recall is then the average of each class’s Recall. The Recall for each of the different classes is:
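In general terms, for a class k (the standard definition, not the article’s exact per-class figures):

```latex
\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}
\qquad
\text{Macro-Recall} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{Recall}_k
```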

The Macro-Recall is 69%.
We compute Precision the same way: TP over TP plus False Positives (FP). First, we compute it per class, then we average the classes to get the Macro-Precision.
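In formula form, for a class k:

```latex
\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}
\qquad
\text{Macro-Precision} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{Precision}_k
```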

These metrics generalize to n classes, whatever the number of elements to classify.
Macro F1-Score and beyond
The F1 score is defined as the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
The same approach is used for the F1 score: we compute it for each class, then average the classes to get the Macro-F1.
With our current values, we would get the per-class F1 scores and, as a consequence, the Macro-F1.
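To make these macro-averaged metrics concrete, here is a minimal Python sketch using scikit-learn. The labels below are placeholder values for a three-class problem rather than the counts from the article’s matrix, so the printed numbers will not match the 69% Macro-Recall above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Placeholder ground truth and predictions for a 3-class problem
# (illustrative values only, not the article's data).
y_true = ["A", "A", "B", "C", "C", "B", "A", "C", "B", "A"]
y_pred = ["A", "B", "B", "C", "A", "B", "A", "C", "C", "A"]
labels = ["A", "B", "C"]

# n x n confusion matrix: rows are the actual classes, columns the predictions
print(confusion_matrix(y_true, y_pred, labels=labels))

# average="macro" computes the metric per class, then takes the unweighted mean
print("Macro-Recall   :", recall_score(y_true, y_pred, average="macro"))
print("Macro-Precision:", precision_score(y_true, y_pred, average="macro"))
print("Macro-F1       :", f1_score(y_true, y_pred, average="macro"))
```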

So far, the metrics used to evaluate classification algorithms are familiar. However, the Kappa score is a new metric specifically used for multi-class classification.
From Psychology to Machine Learning: The Kappa Score
The Kappa score was initially designed in psychology to measure the Agreement between two experts rating the same subjects. It has since been used in Machine Learning and Computer Science to estimate performance as part of multi-class classification metrics.
Let’s use an example to understand how to compute it. Two persons are to rate 74 items as “Good”, “Average” or “Bad”. The more compatible their ratings, the higher their “Agreement” is. Furthermore, the higher the “Agreement” rate is, the more we can trust the ratings’ validity.
The Kappa score benchmarks the level of Agreement between the two persons. First, we need to calculate the Chance Agreement, i.e. the level of Agreement between the two persons if they were rating items randomly.
Chances and agreements
We can compile the results into a matrix that presents, on its diagonal, the number of items on which the two people agree and, off the diagonal, their disagreements, as shown below.

So now, how do we determine and compare the Agreement and Chance Agreement? Let’s start with the Chance Agreement.
The chance that person A randomly picks “Good” is the number of times they chose “Good” divided by the total number of items:

ChanceA(Good) = (number of items A rated “Good”) / 74
Applying the same approach, we can calculate ChanceB(Good), i.e. 0.20, as well as ChanceA(Bad), ChanceA(Average), ChanceB(Bad) and finally ChanceB(Average).
Then, the Chance Agreement for “Good” is the probability that persons A and B both randomly pick a “Good” item. It is the product of ChanceA(Good) and ChanceB(Good), which in our example gives 0.03. We repeat the operation for “Bad”, with a result of 0.29, and for “Average”, with a result of 0.08.
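Summing these per-class products gives the overall Chance Agreement:

```latex
\text{Chance Agreement} = \sum_{c} \text{Chance}_A(c) \times \text{Chance}_B(c) = 0.03 + 0.29 + 0.08 = 0.40
```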
Finally, we get a Chance Agreement of 0.4.
Then, we calculate the Agreement, i.e. the number of items on which both people agreed (the diagonal of the matrix) divided by the total number of items.

Once we have computed the Agreement and the Chance Agreement, we have the Kappa score:

Kappa = (Agreement - Chance Agreement) / (1 - Chance Agreement)
The Kappa score for our people rating the 74 items is then 0.57.
In Machine Learning terms, “Good”, “Bad” and “Average” are classes, and the two people are actually the reality and the prediction. This is how we translate the Kappa Score to Machine Learning.
When applied to multi-class classification metrics, the Kappa score reveals how well the model has assigned data to the different classes compared to a random class assignment. The closer it gets to 1, the better the algorithm determines the classes.
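As a minimal sketch of this translation, scikit-learn exposes the same computation as cohen_kappa_score. The labels below are placeholders rather than the 74 ratings from the example, with y_true standing in for the reality and y_pred for the model’s prediction.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder "reality" and "prediction" for a 3-class problem
# (illustrative values only, not the 74 ratings from the example above).
y_true = ["Good", "Bad", "Average", "Good", "Bad", "Bad", "Average", "Good"]
y_pred = ["Good", "Bad", "Average", "Average", "Bad", "Good", "Average", "Good"]

# Kappa = (Agreement - Chance Agreement) / (1 - Chance Agreement)
kappa = cohen_kappa_score(y_true, y_pred)
print(f"Kappa score: {kappa:.2f}")
```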
Multi-class classification metrics in a nutshell
Classic binary classification metrics stay readable and interpretable when they become multi-class classification metrics, even though they grow more complex as the number of classes expands.
We have also discovered the Kappa score, a metric specific to multi-class classification, which compares the level of True Positives obtained by the algorithm under scrutiny with what a random assignment would achieve.
Of course, these multi-class classification metrics can feel disorienting. This is why we incorporate them within Quaartz Enterprise, a platform that helps companies build and operate predictive models to improve business efficiency. Do you have business topics where you think Machine Learning and multi-class classification can help? We are here for you!