Imbalanced Classes: Predicting Hotel Cancellations with Support Vector Machines
Published Jun 27, 2023 by Michael Grogan
When building a classification algorithm, one must often contend with the issue of an imbalanced dataset.
An imbalanced dataset is one where the sample sizes of the classes are unequal, which can significantly bias the predictions of the classifier towards the majority class.
Using the hotel booking dataset from Antonio, Almeida and Nunes (2019), a support vector machine (SVM) classification model is used to classify hotel booking customers in terms of cancellation risk, i.e. 1 if the model predicts that the customer will cancel their booking, 0 if the customer will follow through with the booking.
The H1 dataset is used to train and validate the model, while the predictions from the resulting model are then tested using the H2 data.
In this particular dataset, the sample size for the non-cancellation class (0) is significantly greater than the cancellation class (1). In a previous example, this was dealt with by removing numerous 0 entries in order to have an equal sample size between the two classes. However, this is not necessarily the best approach, as many data points are discarded during this process.
Instead, the SVM model can be modified to penalise wrong predictions on the minority class more heavily. Let's see how this affects the analysis.
Feature Selection
The features identified for inclusion in the analysis, using both the ExtraTreesClassifier and forward and backward feature selection methods, are as follows (a brief sketch of the tree-based ranking step appears after the list):
Lead time
Country of origin
Market segment
Deposit type
Customer type
Required car parking spaces
Arrival Date: Year
Arrival Date: Month
Arrival Date: Week Number
Arrival Date: Day of Month
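As a brief, hedged sketch of the tree-based part of that selection step (this assumes the candidate features have already been gathered into an array x1, with the cancellation label in y; the variable names simply mirror those used later in this article):

from sklearn.ensemble import ExtraTreesClassifier

feature_names = ['leadtime', 'countrycat', 'marketsegmentcat', 'deposittypecat',
                 'customertypecat', 'rcps', 'arrivaldateyear', 'arrivaldatemonthcat',
                 'arrivaldateweekno', 'arrivaldatedayofmonth']

# Fit an ensemble of randomised trees and rank the features by importance
tree_model = ExtraTreesClassifier(n_estimators=100, random_state=0)
tree_model.fit(x1, y)

# Features with consistently low importance are candidates for removal
for name, score in sorted(zip(feature_names, tree_model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {score:.3f}')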
What Is an SVM?
An SVM is a supervised learning model that can be used for both classification and regression tasks.
The SVM identifies which training points matter most in defining the decision boundary between the two classes.
The small subset of training points that lie closest to this boundary (on the margin) are called support vectors.
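As a minimal illustration on a toy dataset (not the hotel data), the support vectors of a fitted model can be inspected directly:

import numpy as np
from sklearn import svm

# Two small, well-separated groups of points
X_toy = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [5, 4]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = svm.SVC(kernel='linear').fit(X_toy, y_toy)

print(toy_clf.support_vectors_)  # the training points that define the margin
print(toy_clf.n_support_)        # number of support vectors per class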
Precision vs. Recall and f1-score
When assessing classification performance, we see that numerous readings are provided alongside each confusion matrix.
However, a particularly important distinction exists between precision and recall.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP = True Positives, FP = False Positives, FN = False Negatives.
The two readings are often at odds with each other, i.e. it is often not possible to increase precision without reducing recall, and vice versa.
An assessment as to the ideal metric to use depends in large part on the specific data under analysis. For example, in cancer screening a false negative (telling a patient they do not have cancer when in fact they do) is a serious error, so recall is the ideal metric in that scenario.
However, for a spam filter one might prefer to avoid false positives, i.e. sending an important email to the spam folder when in fact it is legitimate; here, precision is the more natural priority.
The f1-score takes both precision and recall into account when devising a more general score.
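As a quick illustration of how the three readings relate (the counts below are made up rather than taken from the model):

# Precision, recall and f1 from hypothetical confusion-matrix counts
TP, FP, FN = 80, 20, 40

precision = TP / (TP + FP)                           # 0.80
recall = TP / (TP + FN)                              # roughly 0.67
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)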
Which would be more important for predicting hotel cancellations?
Well, from the point of view of a hotel, it would likely wish to identify, with greater accuracy, the customers who are ultimately going to cancel their booking, since this allows the hotel to better allocate rooms and resources. Identifying customers who are not going to cancel may not add as much value, as the hotel knows that a significant proportion of customers will ultimately follow through with their bookings in any case.
SVM and Imbalanced Datasets
The relevant features outlined above are used to determine whether the customer will cancel their booking:
import numpy as np
import statsmodels.api as sm

y1 = y
x1 = np.column_stack((leadtime,countrycat,marketsegmentcat,deposittypecat,customertypecat,rcps,arrivaldateyear,arrivaldatemonthcat,arrivaldateweekno,arrivaldatedayofmonth))
x1 = sm.add_constant(x1, prepend=True)
The data is then split into training and validation data:
from sklearn.model_selection import train_test_split
x1_train, x1_val, y1_train, y1_val = train_test_split(x1, y1, random_state=0)
A ‘balanced’ class weight can be added to the SVM configuration, which adds a greater penalty to incorrect classifications on the minority class (in this case, the cancellation class).
from sklearn import svm

clf = svm.SVC(gamma='scale', class_weight='balanced')
clf.fit(x1_train, y1_train)
prclf = clf.predict(x1_val)
prclf
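For context, scikit-learn's 'balanced' option weights each class inversely to its frequency, i.e. n_samples / (n_classes * bincount(y)), so the rarer cancellation class costs more to misclassify. A minimal sketch of this heuristic, using purely illustrative class counts:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels only: 750 non-cancellations versus 250 cancellations
y_example = np.array([0] * 750 + [1] * 250)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_example),
                               y=y_example)
print(dict(zip(np.unique(y_example), weights)))
# The minority (cancellation) class receives the larger weight (2.0 vs roughly 0.67)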
Here is the classification performance of this model on the validation set:
>>> from sklearn.metrics import classification_report,confusion_matrix
>>> print(confusion_matrix(y1_val,prclf))
>>> print(classification_report(y1_val,prclf))
[[5959 1307]
 [1073 1676]]

              precision    recall  f1-score   support

           0       0.85      0.82      0.83      7266
           1       0.56      0.61      0.58      2749

    accuracy                           0.76     10015
   macro avg       0.70      0.71      0.71     10015
weighted avg       0.77      0.76      0.77     10015
Recall for class 1 comes in at 61%, while overall accuracy comes in at 76%.
k-fold Cross Validation
In the above example, a simple train-validation split was used to gauge how accurately the trained model could make predictions on the validation set.
However, this approach comes with limitations. We want to be able to ensure that the trained model will make accurate forecasts on completely unseen data.
In this regard, it is possible that the model may have made strong predictions on the validation set by chance, but there is no guarantee that this would translate into making strong predictions on real-world data.
To attempt to mitigate this risk, a technique called k-fold cross validation can be used. This technique works by partitioning the data into a specified number of folds. Let us say that we wish to split the data into five separate groups, i.e. k=5.
By doing this, one part of the data is withheld as the test set — while the remaining four parts of the dataset are used as the training data. By alternating the choice of test set in each instance, we are now training five separate models to make predictions on five separate test sets.
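As a quick illustration of the partitioning itself, here is a minimal sketch using scikit-learn's KFold on a toy array (cross_validate, used below, performs this rotation automatically):

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # ten illustrative observations

# Each fold holds out a different fifth of the data as the test set
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X_toy)):
    print(f'fold {fold}: train={train_idx}, test={test_idx}')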
Let us perform this technique and see what we come up with:
>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(clf, x1_train, y1_train, cv=5)
>>> cv_results
{'fit_time': array([15.81104469, 15.48872566, 16.87576699, 16.3652153 , 15.98278832]),
 'score_time': array([2.93023157, 3.10848165, 3.12357092, 3.08169103, 3.51586366]),
 'test_score': array([0.75852887, 0.76718256, 0.76035946, 0.7538692 , 0.76052588])}
When running the above, we can see that we have a test score of at least 0.75 in all five instances.
In this regard, we can now be more confident that our trained SVM model will be able to perform well at predicting unseen data — as the test results across the five folds have been consistent and in line with what we previously saw when using a standard train-test split.
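Since recall on the cancellation class is the metric of most interest here, one could also cross-validate on recall directly by passing a scoring argument; a hedged sketch (these recall scores are not reported in the original analysis and will depend on the data):

from sklearn.model_selection import cross_validate

# Score each fold on recall for the positive (cancellation) class rather than accuracy
cv_recall = cross_validate(clf, x1_train, y1_train, cv=5, scoring='recall')
print(cv_recall['test_score'])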
Test Set
Now, let's test the prediction performance on H2 (the test set), where a is the feature matrix and b the vector of actual cancellation labels, both constructed from the H2 data in the same way that x1 and y1 were constructed from H1.
>>> prh2 = clf.predict(a)
>>> prh2
array([0, 1, 1, ..., 0, 0, 0])
>>> from sklearn.metrics import classification_report,confusion_matrix
>>> print(confusion_matrix(b,prh2))
>>> print(classification_report(b,prh2))
[[34581 11647]
 [11247 21855]]

              precision    recall  f1-score   support

           0       0.75      0.75      0.75     46228
           1       0.65      0.66      0.66     33102

    accuracy                           0.71     79330
   macro avg       0.70      0.70      0.70     79330
weighted avg       0.71      0.71      0.71     79330
We see that recall for class 1 is now up to 66%, while overall accuracy comes in at 71%. Notably, accuracy is lower on the test set than on the validation set, but recall on the cancellation class is higher.
If it is assumed that false positives are more tolerable than false negatives in this situation, then one could argue that the model has performed quite well on this basis.
Conclusion
In this example, we have seen how support vector machines can be used to handle imbalanced datasets, and how to interpret confusion matrices and classification reports when assessing classification accuracy.
The datasets and notebooks for this example are available at the MGCodesandStats GitHub repository, along with further research on this topic.
References
Antonio, Almeida and Nunes (2019). Hotel Booking Demand Datasets
Cross Validated: How is the train_score from sklearn.model_selection.cross_validate calculated?
Elite Data Science: How to Handle Imbalanced Classes in Machine Learning
Machine Learning Mastery: A Gentle Introduction to k-fold Cross-Validation