Sorting algorithms are to Machine Learning what the sorting hat is to students in the Harry Potter series: a way to assign each individual to a class (or a house). In Harry Potter, there were four different houses: Gryffindor, Slytherin, Ravenclaw, and Hufflepuff. The magical sorting hat assigns every first-year student into one of the four houses. The sorting hat ‘classifies’ the students into four different categories.

We have previously discussed binary classification. When the number of classes exceeds two, the terminology used is ‘multiclass’ classification.

We explained the various metrics used in binary classification, namely: the confusion matrix, Accuracy, Recall, Precision, MCC, F1 score. All these metrics apply to multiclass classification. Let’s see how.

**Usual Suspects in Sorting Algorithms Metrics: Accuracy and Co.**

Let’s consider a group of freshmen and assume that we have built a model to predict their house assignment. We want to evaluate the Accuracy of this prediction versus the actual house assignment. Let’s assume further that there are three houses (i.e., three classes): house G, house R, house H. Suppose we get the following confusion matrix:

The diagonal represents the number of individuals for which the prediction and the actual house choice match.

The Accuracy is the ratio of correct classifications over the total. Here:

The accuracy, in this case, is 74%.

The Recall is the number of True Positives (TP) over the total number predicted for one class, i.e., True Positives and False Negatives (FN). But in a multiclass configuration, the Macro-Recall becomes the average of the Recall for each category. The Recall for each of one the different houses are:

The Macro-Recall is 67%.

The Precision is calculated in the same way: TP over TP plus False Positives (FP). First, it is computed per class, and then the Macro-Precision is averaged over the three categories.

Metrics are explained here with three classes, and can be generalized to n classes whatever the number of elements to classify.

The F1 score is known as:

The same approach is used for the F1-score. It is calculated for each house (i.e., class) and averaged over the three classes to get the Macro-F1.

In our case of sorting students into houses, we would get the following values:

House | Recall | Precision | F1-score |

H | 0.46 | 0.7 | 0.55 |

G | 0.86 | 0.78 | 0.79 |

R | 0.8 | 0.69 | 0.74 |

As a consequence:

So far, the metrics used to evaluate sorting algorithms are familiar. However, there is another metric used specifically for multiclass classification, the Kappa score.

**The Kappa Score (aka Cohen’s Kappa Coefficient)**

The Kappa score was initially established in the field of psychology. It is used for measuring the Agreement between two experts (i.e., psychologists) when rating subjects (patients). The Machine Learning community adopted it to estimate classification performance. It has since been used often in computer science.

Here is an example of the Kappa score as specified at first. Let’s suppose we attend a culinary contest. Two expert chefs, Chef A and Chef B, are judging the dishes cooked by the contestants. The chefs can rank each recipe as ‘Exquisite,’ ‘No,’ ‘Maybe.’ The two chefs might agree on some dishes and disagree on some others.

The more compatible their ratings, the higher is their ‘Agreement.’ A high level of Agreement confirms that the grades are consistent and make sense, while a low level of Agreement sheds doubts on the grades’ validity. The worst case occurs when they disagree on every dish.

The Kappa score’s concept is to benchmark the level of Agreement, i.e., identical grades assigned by Chef A and Chef B, which were obtained versus the Chance Agreement. The Chance Agreement is the level of Agreement we would get if the two chefs were evaluating the dishes entirely randomly. The number of elements to classify in this example is 74.

For instance, Chef A and B could issue the following grades for the first 10 dishes (among 74 evaluated):

Dish | Chef A | Chef B |

1 | Exquisite | No |

2 | No | No |

3 | No | Maybe |

4 | Maybe | Maybe |

5 | Exquisite | Exquisite |

6 | Maybe | Exquisite |

7 | No | Exquisite |

8 | Maybe | No |

9 | No | Exquisite |

10 | Exquisite | Exquisite |

And proceed with the remainder of the 74 dishes up for culinary competition. We can compile the results into a matrix that presents the number of recipes for which the chefs agreed (diagonal) and the number of dishes of disagreement, as shown below.

Chef A | ||||

Chef B | Exquisite | No | Maybe | |

Exquisite | 7 | 6 | 2 | |

No | 2 | 32 | 5 | |

Maybe | 1 | 3 | 16 |

So now, how do we determine and compare the Agreement and Chance Agreement? Let’s start with the Chance Agreement.

The chance that Chef A picks randomly Exquisite is the number of times Chef A chose Exquisite over the total number of dishes:

Applying the same approach, we can calculate ChanceB(Exquisite), i.e., 0.28, also as the number of Exquisite selected by B divided by the total number of dishes.

Both Chef A and Chef B pick exquisite simultaneously by chance with a probability, which is the product of ChanceA(Exquisite) and ChanceB(Exquisite), i.e., 0.04. Following this method, we can calculate Chance(No), i.e., 0.28, and Chance(Maybe), i.e., 0.08. The overall chance agreement is the sum of these three chances: Chance(Exquisite), Chance(No), and Chance(Maybe), i.e., 0.4.

The Agreement is the sum of all the grades the chefs agreed on (the diagonal in the previous matrix) divided by the total number of occurrences, here it is:

Once we have computed both the Agreement and the Chance Agreement, the Kappa score is defined as:

The Kappa score for our chefs rating of the 74 dishes is 0.56. In the worst case, the Agreement is identical to the Chance Agreement.

The three options: Exquisite, No, and Maybe can be interpreted as three classes. And the two chefs, A and B, can be transposed in the machine learning world into ‘Predicted’ and ‘Actual.’ That’s how the Kappa score is transcribed from the world of psychology to Machine Learning and sorting algorithms. And that’s how computer science has adopted a concept stemming from psychology.

When applied to multiclass classification, the Kappa score reveals how well the model has predicted data assignments in different classes compared to a random class assignment. If the predictions have well identified the categories, the Kappa Score gets close to 1. On the contrary, if the sorting algorithm does not determine the classes well, the Kappa Score becomes close to 0 in the worst case.

## Multiclass and sorting algorithms in a nutshell

We have reviewed the metrics considered for binary classification and applied them to multiclass classification regardless of the number of elements. These metrics stay readable and interpretable, even in the context of an expanded number of classes.

We have also discovered a new metric, borrowed from the realm of psychology, the Kappa score. This metric correlates the level of True Positives obtained by the algorithm under scrutiny, versus a random procedure.

The metrics available to estimate the performance of sorting algorithms are varied. Some metrics are more appropriate when the classes are well balanced; others emphasize the algorithm’s effectiveness with unbalanced classes. In a single figure, the Kappa score shows how well the model performs compared to a random algorithm.

## References

https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1