In this post we describe the problem of class imbalance in classification datasets, how it affects classifier learning and various evaluation metrics, and some ways to handle the problem. We illustrate with several examples in Python.
Many datasets have an imbalanced class distribution, with many examples of the frequent negative class and few examples of the positive class. For example, many classification datasets deal with rare events:
- Will a stock fall more than 20%?
- Does this person have a rare disease?
- Is this a fraudulent transaction?
For many classifiers, if the negative class has many more instances than the positive class (say 90% or more of the data), the classifier will perform well at out-of-sample prediction when the true class is negative, but poorly when the true class is positive.
How Will This Affect Accuracy Metrics?
Naively fitting a standard classifier to imbalanced data will affect the various accuracy metrics in different ways. Here is a list of common metrics and how each is likely to be affected. Throughout, TP=true positives, FN=false negatives, TN=true negatives, FP=false positives.
- Sensitivity: true positive rate, TP/(TP+FN)
- This will generally be low: the imbalance leads to many false negatives, so we miss most of the true positives. For instance, in credit card fraud detection, a naively fit model will likely miss many fraudulent transactions.
- Specificity: true negative rate, TN/(TN+FP)
- This will likely be high: many true negatives and few to no false positives. It will be rare to misclassify a non-fraudulent transaction as fraudulent.
- Precision: positive predicted value, TP/(TP+FP)
- Will likely be somewhat low: there are few true positives, but also few false positives, so the direction is harder to predict.
- Recall (sensitivity): already covered
- F1 Score, 2TP/(2TP+FP+FN)
- Low true positives, high false negatives, and low false positives; overall likely lower than in a balanced setting.
- Accuracy: (TP+TN)/(TP+TN+FP+FN)
- Accuracy may be high despite the classifier having serious problems, since the abundant negative class dominates the metric.
The big one to look at here is sensitivity (recall), as it involves true positives and false negatives, the latter being the major issue.
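To make the definitions above concrete, here is a toy sketch computing each metric from raw confusion-matrix counts. The `imbalance_metrics` helper and the counts are made up for illustration:

```python
def imbalance_metrics(tp, fn, tn, fp):
    """Compute the metrics listed above from raw confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # recall / true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    precision = tp / (tp + fp)             # positive predictive value
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, precision, f1, accuracy

# Hypothetical counts for a 1%-positive dataset where the classifier
# misses most positives: 20 TP, 80 FN, 9890 TN, 10 FP.
sens, spec, prec, f1, acc = imbalance_metrics(20, 80, 9890, 10)
print(acc)   # 0.991 -- looks great despite catching only 20% of positives
print(sens)  # 0.2
```

Note how accuracy stays high while sensitivity is terrible; this is exactly the failure mode described above.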
Don’t Put Too Much Stock Into ROC Curves
For many classification problems, we look at the ROC (receiver operating characteristic) curve and the AUC (area under the curve). Intuitively, this tells us how much we ‘pay’ in false positives to achieve true positives. However, for class imbalance, if the positive class is the rare one, false positives aren’t the problem: false negatives are. Thus we should be careful of putting too much stock into ROC curves when dealing with imbalanced datasets.
Introducing Balanced Accuracy
Accuracy is generally calculated as (TP+TN)/(TP+TN+FP+FN). However, for imbalanced datasets a better choice is balanced accuracy, given by (sensitivity + specificity)/2 = (TP/(TP+FN) + TN/(TN+FP))/2, the average of the sensitivity TP/(TP+FN) and the specificity TN/(TN+FP). Balanced accuracy will not produce very high numbers simply due to class imbalance and is a better metric here.
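As a quick sanity check of why balanced accuracy is more informative, the sketch below scores an all-negative predictor on toy data that is 1% positive. It uses scikit-learn's `accuracy_score` and `balanced_accuracy_score` (the latter requires scikit-learn 0.20 or newer):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels: 1% positive class; the "classifier" predicts all negatives.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))           # 0.99 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  -- no better than chance
```

Balanced accuracy immediately exposes that this predictor is useless on the positive class.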
Comparing Losses by Loss Function
In some cases we may care specifically about the test/out of sample total loss under different techniques, which may have an obvious interpretation, such as dollars lost. For instance, in credit card fraud detection, a false positive may have some cost associated with calling the customer and checking if they really intended to make that transaction. A false negative, on the other hand, will lead to the cost of the transaction being lost (assuming that the bank assumes the risk rather than the customer). We would like to evaluate a technique based on the total dollars lost.
How this Affects Two Classifiers: Naive Bayes and Logistic Regression
Let’s look at how class imbalance affects two classifiers: (Gaussian) Naive Bayes and logistic regression. To simplify matters, we will not do any feature selection.
In Gaussian Naive Bayes, let y_i be the class for observation i, and x_i = (x_i1, ..., x_ip) be the feature vector for the same observation. Then the conditional class probability is

P(y_i = c | x_i) ∝ P(y_i = c) ∏_j P(x_ij | y_i = c)

where each feature likelihood P(x_ij | y_i = c) is Gaussian.
For highly imbalanced datasets, the class prior P(y_i = c) will be very high for the negative class and very low for the positive class. Let’s look at this in practice. We will download the credit card fraud dataset from https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcardfraud.zip/3, load it, and then fit a Gaussian Naive Bayes model. The dataset has 492 frauds out of 284,807 transactions. For privacy reasons, most features other than time and amount are simply principal components of the original features (so some dimension reduction has essentially already been done). First we load some libraries and functions:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
Next we setup training and testing data with a 70/30 split. Note that if we were doing this for a real problem, we would also want to use validation data. However, since this is simply an illustrative example we skip that.
df_credit = pd.read_csv('creditcard.csv')
col_names = df_credit.columns
feature_cols = col_names[:-1]
df_negative = df_credit[df_credit[col_names[-1]] == 0]
df_positive = df_credit[df_credit[col_names[-1]] == 1]
X = df_credit[feature_cols]
y = df_credit[col_names[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(1. * np.sum(y_train) / len(y_train))
print(1. * np.sum(y_test) / len(y_test))
We can see that the positive class (fraudulent transactions) has about .17% marginal probability in both train and test. Let’s now fit Gaussian Naive Bayes and look at the class priors, which should be the same as the train marginal probabilities.
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(gnb.class_prior_)
Now let’s make predictions on the test set and evaluate their accuracy and sensitivity:
y_pred_nb = gnb.predict(X_test)

def display_summary(true, pred):
    tn, fp, fn, tp = confusion_matrix(true, pred).ravel()
    print('confusion matrix')
    print(np.array([[tp, fp], [fn, tn]]))
    print('sensitivity is', 1. * tp / (tp + fn))
    print('specificity is', 1. * tn / (tn + fp))
    print('accuracy is', 1. * (tp + tn) / (tp + tn + fp + fn))
    print('balanced accuracy is', 1. / 2 * (1. * tp / (tp + fn) + 1. * tn / (tn + fp)))

print('Gaussian NB')
display_summary(y_test, y_pred_nb)
Our results are as follows
Gaussian NB
confusion matrix
[[   95   543]
 [   52 84753]]
sensitivity is 0.6462585034013606
specificity is 0.9936339335959482
accuracy is 0.9930362932013155
balanced accuracy is 0.8199462184986543
Our confusion matrix has the following structure
| true positive  | false positive |
| false negative | true negative  |
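One detail worth flagging: scikit-learn's `confusion_matrix` puts the negative class first, so `.ravel()` returns the counts in the order (tn, fp, fn, tp), and `display_summary` rearranges them into the layout above. A tiny sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: three negatives, two positives.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

# ravel() flattens [[tn, fp], [fn, tp]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(np.array([[tp, fp], [fn, tn]]))
```

Getting this ordering wrong silently swaps sensitivity and specificity, so it is worth double-checking.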
As we can see, sensitivity is rather low: we have a large number of false negatives relative to true positives. Balanced accuracy is also not high.
In logistic regression, we posit the following model

P(y_i = 1 | x_i) = g^{-1}(β_0 + x_i'β)

where g, the link function, is the logit function g(p) = log(p/(1−p)), so g^{-1}(z) = 1/(1+e^{−z}). Class imbalance towards the negative class will lead to the intercept β_0 having a large negative magnitude. Let’s see what happens in practice.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_logistic = logreg.predict(X_test)
print(logreg.intercept_)
print('Logistic Regression')
display_summary(y_test, y_pred_logistic)
[-3.29981971]
Logistic Regression
confusion matrix
[[   77    11]
 [   70 85285]]
sensitivity is 0.5238095238095238
specificity is 0.9998710373288313
accuracy is 0.9990519995786665
balanced accuracy is 0.7618402805691775
With an intercept of about -3.3, and all features fixed to 0, the probability of a 1 is 1/(1 + e^{3.3}) ≈ 0.036. However, we don’t know what the distribution of features is, so this alone is not very useful information. What we can see from the evaluation is that sensitivity is even lower at 0.52, as is balanced accuracy at 0.76. Clearly we should not use logistic regression naively as our final model.
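Backing out that baseline probability from the intercept is just the inverse logit (the exact intercept will vary run to run; we use the fitted value above):

```python
import math

# Inverse logit of the fitted intercept: P(y=1 | x=0) = 1 / (1 + exp(-b0)).
intercept = -3.29981971
p = 1.0 / (1.0 + math.exp(-intercept))
print(p)  # about 0.036
```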
Other: Random Forest
We can also try out random forest.
clf_random = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf_random = clf_random.fit(X_train, y_train)
y_pred_rf = clf_random.predict(X_test)
print('Random Forest')
display_summary(y_test, y_pred_rf)
Random Forest
confusion matrix
[[   79    13]
 [   68 85283]]
sensitivity is 0.5374149659863946
specificity is 0.9998475895704371
accuracy is 0.9990519995786665
balanced accuracy is 0.7686312777784159
This performance is quite poor: about the same as logistic regression.
Method 1: Use a Classifier that is Robust to Class Imbalance
The simplest way to tackle the class imbalance problem is to use a classifier that is somewhat robust to it. The most obvious choice is the decision tree: if the rare class lies mostly in a specific region of feature space, most or all of its examples will end up in a single node (or a few nodes) of the tree.
Let’s fit a decision tree and look at the accuracy metrics.
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred_tree = clf.predict(X_test)
print('Decision Tree')
display_summary(y_test, y_pred_tree)
Decision Tree
confusion matrix
[[  108    28]
 [   39 85268]]
sensitivity is 0.7346938775510204
specificity is 0.9996717313824799
accuracy is 0.9992158515033414
balanced accuracy is 0.8671828044667502
The sensitivity has gone up a lot, and the balanced accuracy has as well. Sensitivity was 0.52 and 0.65 for logistic regression and Naive Bayes, respectively, and is now 0.73; balanced accuracy was 0.76 and 0.82, and is now 0.87.
Method 2: Change the Objective Function
We can change the objective function to take account that the cost of a false positive and a false negative may not be the same, and the cost of both may vary based on the details of the example. An obvious example of this is in a disease: the cost of a false positive may be the cost of more lab tests, while the cost of a false negative may be that the disease worsens, and very expensive treatments become necessary.
In the example of credit card fraud we mentioned, the cost of a false positive may be the cost of further inspection (similar to diseases), while the cost of a false negative is the transaction amount. We can change our objective function to reflect this. For instance, the standard logistic regression loss function is the negative log-likelihood

−Σ_i [ y_i log h(x_i; θ) + (1 − y_i) log(1 − h(x_i; θ)) ]

where h(x_i; θ) is the predicted probability P(y_i = 1 | x_i) under parameters θ.
However, if we say that each observation i has some true positive cost C_TP_i, true negative cost C_TN_i, false positive cost C_FP_i, and false negative cost C_FN_i associated with it, we can use the following objective function

Σ_i [ y_i ( h(x_i; θ) C_TP_i + (1 − h(x_i; θ)) C_FN_i ) + (1 − y_i) ( h(x_i; θ) C_FP_i + (1 − h(x_i; θ)) C_TN_i ) ]

This takes into account the actual financial costs associated with the different scenarios. See https://pdfs.semanticscholar.org/e133/e196ad6186557c447e8a986fdd670f5d987d.pdf for details.
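For hard 0/1 predictions, the objective above reduces to a per-example cost sum. Here is a minimal, hypothetical implementation (the `total_cost` helper and the toy transactions are invented for illustration), using the same cost structure as the credit card example: a fixed investigation cost for flagged transactions, the transaction amount for a missed fraud, and zero for a correctly passed transaction:

```python
import numpy as np

def total_cost(y_true, y_pred, c_fp, c_fn, c_tp, c_tn):
    """Sum of per-example costs for hard 0/1 predictions (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_example = (y_true * (y_pred * c_tp + (1 - y_pred) * c_fn)
                   + (1 - y_true) * (y_pred * c_fp + (1 - y_pred) * c_tn))
    return per_example.sum()

# Four transactions: two frauds (amounts 100 and 200) and two legitimate.
# Investigating costs 5, so c_fp = c_tp = 5; a missed fraud costs its amount.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]  # catch the first fraud, miss the second, one false alarm
amounts = np.array([100.0, 200.0, 50.0, 10.0])
total = total_cost(y_true, y_pred, c_fp=5.0, c_fn=amounts, c_tp=5.0, c_tn=0.0)
print(total)  # 5 (TP) + 200 (missed fraud) + 5 (false alarm) + 0 = 210.0
```

This is essentially what costcla's `cost_loss` computes from its cost matrix, and it makes clear why missing a large fraudulent transaction dominates the loss.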
Applying the COSTCLA Package
The Python package costcla can fit cost-sensitive logistic regression and decision trees (as well as several other models). We will load several functions from it to start with
from costcla.metrics import cost_loss, savings_score
from costcla.models import CostSensitiveLogisticRegression, CostSensitiveDecisionTreeClassifier, CostSensitiveRandomForestClassifier
We then set the cost of both false positives and true positives to be the cost of investigation, which we’ll set to 5. Further, we set the cost of false negatives to be the transaction amount, and the cost of true negatives to be 0.
# cost matrix columns: false positives, false negatives, true positives, true negatives
cost_mat_train = np.zeros((len(y_train), 4))
# false positives cost 5
cost_mat_train[:, 0] = 5
# false negatives cost the transaction amount
cost_mat_train[:, 1] = X_train['Amount']
# true positives also cost 5
cost_mat_train[:, 2] = 5

cost_mat_test = np.zeros((len(y_test), 4))
cost_mat_test[:, 0] = 5
cost_mat_test[:, 1] = X_test['Amount']
cost_mat_test[:, 2] = 5
Next we fit cost-sensitive logistic regression, a cost-sensitive random forest, and a cost-sensitive decision tree.
f = CostSensitiveLogisticRegression()
f.fit(np.array(X_train), np.array(y_train), cost_mat_train)
y_pred_logistic_cslr = f.predict(np.array(X_test))

g = CostSensitiveRandomForestClassifier()
g.fit(np.array(X_train), np.array(y_train), cost_mat_train)
y_pred_rf_cslr = g.predict(np.array(X_test))

h = CostSensitiveDecisionTreeClassifier()
h.fit(np.array(X_train), np.array(y_train), cost_mat_train)
y_pred_tree_cslr = h.predict(np.array(X_test))
We can then compare the performance on the cost-sensitive loss between these classifiers and the naive unweighted versions.
print('naive: logistic regression')
print(cost_loss(y_test, y_pred_logistic, cost_mat_test))
print('naive: random forest')
print(cost_loss(y_test, y_pred_rf, cost_mat_test))
print('naive: decision tree')
print(cost_loss(y_test, y_pred_tree, cost_mat_test))
print('logistic: cost sensitive learning')
print(cost_loss(y_test, y_pred_logistic_cslr, cost_mat_test))
print('random forest: cost sensitive learning')
print(cost_loss(y_test, y_pred_rf_cslr, cost_mat_test))
print('decision tree: cost sensitive learning')
print(cost_loss(y_test, y_pred_tree_cslr, cost_mat_test))
naive: logistic regression
12798.5
naive: random forest
10727.92
naive: decision tree
8517.28
logistic: cost sensitive learning
14558.42
random forest: cost sensitive learning
7806.55
decision tree: cost sensitive learning
7833.9
We see that for both the random forest and the decision tree, the naive model has a higher loss, while for logistic regression, surprisingly, cost-sensitive learning actually does slightly worse. This also matches our earlier finding that, among the naive methods, the decision tree is somewhat robust to class imbalance.
There are a number of further methods: most of them involve either oversampling the minority class or undersampling the majority class. We may discuss these in the future.
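As a flavor of those resampling approaches, here is a minimal sketch of random oversampling with NumPy on toy data (in practice a package such as imbalanced-learn provides ready-made samplers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 95 negatives, 5 positives.
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Random oversampling: resample the minority class (with replacement)
# until both classes have the same number of examples.
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
resampled_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
idx = np.concatenate([neg_idx, resampled_pos])

X_bal, y_bal = X[idx], y[idx]
print(y_bal.mean())  # 0.5 -- classes are now balanced
```

Oversampling is applied to the training set only; the test set keeps its natural class distribution so that evaluation remains honest.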
In this post we discussed how to deal with class imbalance, using two approaches: the first was simply the choice of classifier, and the second was cost-sensitive learning. Each improved the evaluation metrics we used, with the one exception of cost-sensitive learning for logistic regression.