Evaluating Classification Models: Choosing the Right Metric (from Accuracy to AUC)
In classification problems, evaluation metrics help measure how well a model is performing. These metrics go beyond simple accuracy, especially when the classes are imbalanced.
To understand True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), let’s first define them in simple terms:
- True Positive (TP): The model predicted “Yes” (positive) and the actual answer was also “Yes.”
- True Negative (TN): The model predicted “No” (negative) and the actual answer was also “No.”
- False Positive (FP): The model predicted “Yes,” but the actual answer was “No.” (Also called a “Type I error”)
- False Negative (FN): The model predicted “No,” but the actual answer was “Yes.” (Also called a “Type II error”)
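A minimal sketch of how these four counts can be obtained from labels. The arrays `y_true` and `y_pred` below are made-up placeholders, and the snippet assumes scikit-learn is available:

```python
# Count TP, TN, FP, FN from ground-truth and predicted labels (1 = "Yes", 0 = "No").
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```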
Example 1: Medical Diagnosis (Detecting Disease)
- Scenario: You’re trying to detect whether a person has a disease.
- True Positive (TP): The test says the person has the disease, and they actually do.
- True Negative (TN): The test says the person doesn’t have the disease, and they actually don’t.
- False Positive (FP): The test says the person has the disease, but they don’t. (A wrong diagnosis)
- False Negative (FN): The test says the person doesn’t have the disease, but they do. (A missed diagnosis)
Example 2: Spam Email Detection
- Scenario: You’re trying to classify emails as spam or not spam.
- True Positive (TP): The system marks an email as spam, and it really is spam.
- True Negative (TN): The system marks an email as not spam, and it’s not spam.
- False Positive (FP): The system marks an email as spam, but it’s not (it’s a good email).
- False Negative (FN): The system marks an email as not spam, but it actually is spam.
Example 3: Loan Approval Prediction
- Scenario: You’re predicting if someone should get a loan.
- True Positive (TP): The model predicts the person will repay the loan, and they do.
- True Negative (TN): The model predicts the person won’t repay the loan, and they don’t.
- False Positive (FP): The model predicts the person will repay the loan, but they don’t.
- False Negative (FN): The model predicts the person won’t repay the loan, but they do.
Example 4: Fraud Detection in Transactions
- Scenario: You’re trying to detect fraudulent transactions.
- True Positive (TP): The system flags a transaction as fraudulent, and it really is.
- True Negative (TN): The system says a transaction is legitimate, and it is.
- False Positive (FP): The system flags a legitimate transaction as fraudulent.
- False Negative (FN): The system says a fraudulent transaction is legitimate.
In these examples, TP and TN indicate correct predictions, while FP and FN indicate errors.
Here’s a breakdown of the most common evaluation metrics used in classification:
Let us start with a confusion matrix. Using the spam email example above, it would look like this:
| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actual Spam | 80 | 10 |
| Actual Not Spam | 20 | 900 |
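Reading the table, TP = 80, FN = 10, FP = 20, and TN = 900. A short sketch of how these counts translate into the metrics summarized below (plain Python, no libraries needed):

```python
# Metrics worked out from the confusion matrix above (TP=80, FN=10, FP=20, TN=900).
tp, fn, fp, tn = 80, 10, 20, 900

accuracy    = (tp + tn) / (tp + tn + fp + fn)       # (80 + 900) / 1010 ≈ 0.970
precision   = tp / (tp + fp)                        # 80 / 100 = 0.800
recall      = tp / (tp + fn)                        # 80 / 90  ≈ 0.889
specificity = tn / (tn + fp)                        # 900 / 920 ≈ 0.978
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.842

print(accuracy, precision, recall, specificity, f1)
```

Note that accuracy looks excellent (about 97%) mainly because the "Not Spam" class dominates, while recall shows that roughly 1 in 9 spam emails still slips through.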
Summary Table of Metrics
| Metric | Definition | When to Use |
|---|---|---|
| Accuracy | Ratio of correct predictions to total predictions. | When classes are balanced. |
| Precision | Ratio of correct positive predictions to total positive predictions. | When false positives are costly (e.g., spam filters, medical tests). |
| Recall | Ratio of correct positive predictions to total actual positives. | When false negatives are costly (e.g., detecting rare diseases). |
| F1-Score | Harmonic mean of precision and recall. | When both precision and recall are important (e.g., imbalanced data). |
| Confusion Matrix | Summary of correct and incorrect predictions for each class. | To understand the types of errors your model makes. |
| ROC-AUC | Measures the ability to distinguish between classes (higher is better). | When you need to compare models' performance across all thresholds. |
| Specificity | Ratio of correct negative predictions to total actual negatives. | When true negatives are important (e.g., avoiding false alarms). |
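In practice, scikit-learn can compute all of these in a few lines. The sketch below uses a synthetic imbalanced dataset and a logistic regression model purely as placeholders to show the metric calls; it is not tied to any specific example in this article:

```python
# Compute precision, recall, F1 (per class) and ROC-AUC with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Synthetic, imbalanced toy data (about 10% positives), for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard labels for precision/recall/F1
y_score = model.predict_proba(X_test)[:, 1]  # predicted probabilities for ROC-AUC

print(classification_report(y_test, y_pred))       # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_test, y_score))  # threshold-independent ranking quality
```

Note that ROC-AUC is computed from predicted probabilities (or scores), not from hard class labels, because it evaluates the model across all possible thresholds.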
Conclusion:
Different metrics help evaluate classification models based on the task at hand. For balanced datasets, accuracy may be enough. For imbalanced datasets, or when specific types of errors are more costly, metrics like precision, recall, F1-score, and ROC-AUC become more important.