Binary Classification Metrics

Author: Helen Cheyne, Data Scientist

Say you have a dataset with binary labels that you would like to predict, and some related measurements, called features, with which to predict them. You split your data into train, validation, and test sets. Maybe you train and validate a few times to tune your model parameters and find the ideal threshold probability for defining the predictions, until you have the BEST version of that model. Then you repeat this for a few different types of models. Finally, you use each of your models to make predictions on your as-yet-untouched test set so that you can compare them and decide which is the overall BEST model for your problem.

Sounds easy enough, right? But how do you define the BEST model? All models make some mistakes. So you need to decide which mistakes matter most, whether some correct predictions matter more than others, and in what combination you prefer these correct and incorrect predictions. And that depends on the underlying problem and the intended application of the model.

1. The Confusion Matrix

This is the first installment in a series that will explain various ways that the quality of a binary classification model can be summarized as metrics. Before such metrics can be discussed, the output from these models must be understood and organized.

1.1 Model output

A binary classification problem needs observations paired with known, binary, ground-truth labels. For simplicity, call the labels 0 and 1, with 1 being the label of interest, such as the presence of a disease. Once trained, a model provides both a probability for each label and a predicted label for each observation. For binary classification, only the probability of being labeled 1 matters.

  • $X$ is the matrix of $n$ observations of $m$ features, $X_{i,j}$ is the value of the $j^{th}$ feature for the $i^{th}$ observation

  • $y$ or y_true is the vector of $n$ known labels, $y_i$ is the known label for the $i^{th}$ observation

  • $p$ or y_prob is the vector of $n$ probabilities of being labeled 1, $p_i$ is the probability for the $i^{th}$ observation

    • in scikit-learn these can be calculated using y_prob = model.predict_proba(X)[:,1]

  • $\hat{y}$ or y_pred is the vector of $n$ predicted labels, $\hat{y}_i$ is the predicted class for the $i^{th}$ observation.

    • in scikit-learn this can be calculated using y_pred = model.predict(X), which uses the most likely label; that is, $\hat{y}_i = 1$ if $p_i \geq 0.5$ and $\hat{y}_i = 0$ otherwise, or

    • y_pred = (y_prob > t)*1 for some chosen threshold probability t, assuming y_prob is a NumPy array (a short end-to-end sketch follows this list).
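
For concreteness, here is a minimal sketch of those calls on hypothetical toy data; the features, labels, and the choice of LogisticRegression are illustrative only, not part of the setup above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical toy data: 8 observations, 2 features, binary labels.
    X = np.array([[0.2, 1.1], [1.5, 0.3], [0.4, 0.9], [1.2, 0.1],
                  [0.1, 1.4], [1.7, 0.2], [0.3, 1.0], [1.4, 0.4]])
    y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1])

    model = LogisticRegression().fit(X, y_true)

    # Probability of the positive class (label 1) for each observation.
    y_prob = model.predict_proba(X)[:, 1]

    # Default predictions, equivalent to a 0.5 cutoff on y_prob.
    y_pred_default = model.predict(X)

    # Predictions at a custom threshold t.
    t = 0.3
    y_pred = (y_prob >= t).astype(int)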

1.2 Confusion Matrix Values

All observations have both a $y_i$ and a $\hat{y}_i$, each of which can be 0 or 1. This pairing produces four possible outcomes for each observation: true positive, true negative, false negative, or false positive. The total count of each outcome is denoted by its capitalized initials, TP, TN, FN, and FP respectively, and these counts populate the confusion matrix.

Figure 1: The confusion matrix.

Figure 1 shows how the aforementioned four values are organized into the confusion matrix. This version also includes the marginal sums along the top and left-hand side, as well as the total number of observations in the top-left corner. The colour coding represents the combination of categories that make up each interior value. For example, true positive, green, is made up of observations that test positive, yellow, and have the disease, cyan; cyan and yellow make green.

Each model yields four values describing its particular performance, so models cannot be ranked without making some decisions about the importance of each value in order to combine them into a single one-dimensional metric. Classification metrics are calculated from the numbers in the confusion matrix, so understanding metrics starts with understanding the confusion matrix and the impact of each of its values.
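
As an illustration, the four counts can be tallied directly from y_true and y_pred, or read from scikit-learn's confusion_matrix; the label vectors below are hypothetical toy values.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical ground truth and predictions.
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

    # Counting each outcome straight from its definition.
    TP = np.sum((y_true == 1) & (y_pred == 1))   # predicted 1, truly 1
    TN = np.sum((y_true == 0) & (y_pred == 0))   # predicted 0, truly 0
    FP = np.sum((y_true == 0) & (y_pred == 1))   # predicted 1, truly 0
    FN = np.sum((y_true == 1) & (y_pred == 0))   # predicted 0, truly 1

    # scikit-learn arranges the same counts as [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

Different tools arrange these same four numbers in different layouts, so it is worth checking which cell is which before reading counts off a confusion matrix.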

True Positive: This is the outcome most universally understood as a success. The disease is, or will be, present and the model labeled it as such.

True Negative: This is sometimes considered an equal success, sometimes a secondary success, and other times it is not important at all. The disease is not, or will not be, present and the model labeled it as such. For rare diseases it is nice to predict this class correctly, but if this outcome is given too much weight it is very easy to end up with a model that labels every case as negative.

False Positive: In healthcare this is the case where a patient is misdiagnosed as having or developing the disease, or as needing treatment. At best this is a burden on the healthcare system, which must monitor more patients than necessary; at worst it means that a patient will undergo a risky and costly procedure unnecessarily.

False Negative: In healthcare this is the case where a patient who has, or will have, a disease is missed. This could result in a contagious patient spreading a disease, or in a delayed diagnosis and treatment resulting in worse patient outcomes.

In the next installment we will look into the most popular, and most often misused, metric: accuracy, along with some of its improvements.
