XGBoost and Imbalanced Classes: Predicting Hotel Cancellations
Published June 27, 2023 by Michael Grogan
Boosting is often referred to as an ensemble method: a technique whereby a series of individual models (or weak learners) is combined to build a model that yields superior predictive power (a strong learner).
XGBoost is quite a popular boosting method — it stands for “extreme gradient boosting” and is an extension to gradient boosted decision trees.
In this example, boosting techniques are used to determine whether a customer will cancel their hotel booking or not.
Specifically, XGBoost will be used to build a model to predict hotel cancellations using the hotel booking dataset of Antonio, Almeida and Nunes (2019).
Data Overview and Feature Selection
The training data is imported from an AWS S3 bucket as follows:
import boto3
import botocore
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket = 'yourbucketname'
data_key_train = 'H1.csv'
data_location_train = 's3://{}/{}'.format(bucket, data_key_train)
train_df = pd.read_csv(data_location_train)
Hotel cancellations represent the response (or dependent) variable, where 1 = cancel, 0 = follow through with booking.
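For reference, the response variable can be extracted as follows (a minimal snippet, assuming the column is named IsCanceled as in the source dataset):

# Response variable: 1 = cancellation, 0 = booking honoured
# ('IsCanceled' is the column name in the source dataset)
IsCanceled = train_df['IsCanceled']
y1 = IsCanceled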
The features for analysis are as follows.
Interval
leadtime = train_df['LeadTime']
arrivaldateyear = train_df['ArrivalDateYear']
arrivaldateweekno = train_df['ArrivalDateWeekNumber']
arrivaldatedayofmonth = train_df['ArrivalDateDayOfMonth']
staysweekendnights = train_df['StaysInWeekendNights']
staysweeknights = train_df['StaysInWeekNights']
adults = train_df['Adults']
children = train_df['Children']
babies = train_df['Babies']
isrepeatedguest = train_df['IsRepeatedGuest']
previouscancellations = train_df['PreviousCancellations']
previousbookingsnotcanceled = train_df['PreviousBookingsNotCanceled']
bookingchanges = train_df['BookingChanges']
agent = train_df['Agent']
company = train_df['Company']
dayswaitinglist = train_df['DaysInWaitingList']
adr = train_df['ADR']
rcps = train_df['RequiredCarParkingSpaces']
totalsqr = train_df['TotalOfSpecialRequests']
Categorical
arrivaldatemonth = train_df.ArrivalDateMonth.astype("category").cat.codes
arrivaldatemonthcat = pd.Series(arrivaldatemonth)
mealcat = train_df.Meal.astype("category").cat.codes
mealcat = pd.Series(mealcat)
countrycat = train_df.Country.astype("category").cat.codes
countrycat = pd.Series(countrycat)
marketsegmentcat = train_df.MarketSegment.astype("category").cat.codes
marketsegmentcat = pd.Series(marketsegmentcat)
distributionchannelcat = train_df.DistributionChannel.astype("category").cat.codes
distributionchannelcat = pd.Series(distributionchannelcat)
reservedroomtypecat = train_df.ReservedRoomType.astype("category").cat.codes
reservedroomtypecat = pd.Series(reservedroomtypecat)
assignedroomtypecat = train_df.AssignedRoomType.astype("category").cat.codes
assignedroomtypecat = pd.Series(assignedroomtypecat)
deposittypecat = train_df.DepositType.astype("category").cat.codes
deposittypecat = pd.Series(deposittypecat)
customertypecat = train_df.CustomerType.astype("category").cat.codes
customertypecat = pd.Series(customertypecat)
reservationstatuscat = train_df.ReservationStatus.astype("category").cat.codes
reservationstatuscat = pd.Series(reservationstatuscat)
Using both the ExtraTreesClassifier and forward and backward feature selection methods, the following features were identified for inclusion in the analysis (a sketch of the tree-based importance ranking follows the list):
- Lead time
- Country of origin
- Market segment
- Deposit type
- Customer type
- Required car parking spaces
- Arrival Date: Year
- Arrival Date: Month
- Arrival Date: Week Number
- Arrival Date: Day of Month
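As an illustration of how the tree-based ranking might be produced, here is a minimal sketch (the stacked matrix x1 and the choice of candidate columns are assumptions here, not the original notebook's exact code):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Stack candidate features into a single matrix (abbreviated here to the
# features listed above; a full search would include all candidates)
x1 = np.column_stack((leadtime, countrycat, marketsegmentcat,
                      deposittypecat, customertypecat, rcps,
                      arrivaldateyear, arrivaldatemonthcat,
                      arrivaldateweekno, arrivaldatedayofmonth))

# Rank features by impurity-based importance
selector = ExtraTreesClassifier(n_estimators=100)
selector.fit(x1, y1)
print(selector.feature_importances_)

The same stacked matrix x1 is reused as the model's feature matrix in the snippets below.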
Boosting Techniques
XGBoost is a boosting technique that has become renowned for its execution speed and model performance, and it is increasingly being relied upon as a default boosting method. It implements the gradient boosting decision tree algorithm, which works in a similar manner to adaptive boosting. However, instance weights are no longer tweaked at every iteration as in the case of AdaBoost; instead, each new predictor is fitted to the residual errors made by the previous predictor.
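To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting for regression using plain scikit-learn decision trees (an illustration of the principle on toy data, not XGBoost's actual internals):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, 200)

# Start from a constant prediction, then repeatedly fit a small tree
# to the current residuals and add its (scaled) output to the ensemble
prediction = np.full_like(y, y.mean())
learning_rate = 0.1
trees = []
for _ in range(50):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)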
Precision vs. Recall and f1-score
When comparing model performance, we see that numerous readings are provided in each classification report.
However, a particularly important distinction exists between precision and recall.
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
The two readings are often at odds with each other, i.e. it is often not possible to increase precision without reducing recall, and vice versa.
An assessment as to the ideal metric depends in large part on the specific data under analysis. For example, in cancer screening, false negatives (indicating patients do not have cancer when in fact they do) are unacceptable. Under this scenario, recall is the ideal metric.
However, for email spam filtering, one might prefer to avoid false positives, i.e. sending an important legitimate email to the spam folder.
The f1-score, the harmonic mean of precision and recall, takes both into account to provide a more general score.
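As a quick illustration of these definitions (toy labels, purely for demonstration):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# precision = TP / (TP + FP): 3 true positives, 2 false positives
print(precision_score(y_true, y_pred))  # 0.6
# recall = TP / (TP + FN): 3 true positives, 1 false negative
print(recall_score(y_true, y_pred))     # 0.75
# f1 = harmonic mean of precision and recall
print(f1_score(y_true, y_pred))         # 0.666...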
Which would be more important for predicting hotel cancellations?
Well, from the point of view of a hotel, it would likely wish to identify customers who are ultimately going to cancel their booking with greater accuracy, allowing the hotel to better allocate rooms and resources. Identifying customers who are not going to cancel may not add as much value, as the hotel knows that a significant proportion of customers will ultimately follow through with their bookings in any case.
Analysis
The H1 dataset is first split into training and validation data, with the H2 dataset used as the test set for comparing the XGBoost predictions against actual cancellation incidences.
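The exact split call is not shown here; a minimal sketch, assuming the feature matrix x1 and response y1 from earlier and an illustrative 70/30 split:

from sklearn.model_selection import train_test_split

# Hold back 30% of H1 for validation; H2 remains untouched as the test set
x1_train, x1_val, y1_train, y1_val = train_test_split(
    x1, y1, test_size=0.3, random_state=42)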
Here is an implementation of the XGBoost algorithm:
import xgboost as xgb
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=3)
xgb_model.fit(x1_train, y1_train)
Note that the scale_pos_weight parameter in this instance is set to 3. The higher the weight, the greater the penalty imposed for errors on the minority class, in this case incidences of 1 in the response variable, i.e. hotel cancellations. The reason for doing this is that there are more 0s than 1s in the dataset, i.e. more customers follow through on their bookings than cancel.
Therefore, in order to have an unbiased model, errors on the minority class need to be penalised more severely.
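A common heuristic is to start with the ratio of negative to positive instances in the training data; a minimal sketch (assuming y1_train contains only 0s and 1s):

import numpy as np

# Ratio of the majority class (0) to the minority class (1)
neg, pos = np.bincount(y1_train)
print(neg / pos)  # a reasonable starting point for scale_pos_weight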
Performance on Validation Set
Here is the accuracy on the training and validation set:
>>> import xgboost as xgb
>>> xgb_model = xgb.XGBClassifier(learning_rate=0.001, max_depth = 1, n_estimators = 100, scale_pos_weight=3)
>>> xgb_model.fit(x1_train, y1_train)
>>> print("Accuracy on training set: {:.3f}".format(xgb_model.score(x1_train, y1_train)))
>>> print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x1_val, y1_val)))
Accuracy on training set: 0.571
Accuracy on validation set: 0.579
The predictions are generated:
>>> xgb_predict=xgb_model.predict(x1_val)
>>> xgb_predict
array([1, 1, 1, ..., 0, 1, 1])
Here is a confusion matrix comparing the predicted vs. actual cancellations on the validation set:
>>> from sklearn.metrics import classification_report,confusion_matrix
>>> print(confusion_matrix(y1_val,xgb_predict))
>>> print(classification_report(y1_val,xgb_predict))
[[3159 4107]
 [ 194 2555]]

              precision    recall  f1-score   support

           0       0.94      0.43      0.59      7266
           1       0.38      0.93      0.54      2749

    accuracy                           0.57     10015
   macro avg       0.66      0.68      0.57     10015
weighted avg       0.79      0.57      0.58     10015
Note that while the overall accuracy (57%) is modest, the recall score for class 1 (cancellations) is 93%. The model generates many false positives, which reduces overall accuracy, but this has had the effect of increasing recall: the model identifies 93% of the customers who will cancel their booking, even if this results in some false positives.
k-fold Cross Validation
In the above analysis, the accuracy and recall were gauged by a standard train-validation split, i.e. the model was trained on the training data and its predictions were then compared to the validation data.
However, there is the risk that the model may have made strong predictions on the validation set by chance, but there is no guarantee that this would translate into making strong predictions on real-world data.
To attempt to mitigate this risk, a technique called k-fold cross validation can be used. This technique works by partitioning the data into a specified number of folds. Let us say that we wish to split the data into five separate groups, i.e. k=5.
By doing this, one part of the data is withheld as the test set — while the remaining four parts of the dataset are used as the training data. By alternating the choice of test set in each instance, we are now training five separate models to make predictions on five separate test sets.
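The partitioning itself can be illustrated with scikit-learn's KFold (a sketch of the mechanics only; the evaluation below uses cross_validate directly):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(x1_train)):
    # Each fold holds out a different fifth of the data as the test set
    print("Fold {}: train={}, test={}".format(fold, len(train_idx), len(test_idx)))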
Let us perform this technique and see what we come up with:
>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(xgb_model, x1_train, y1_train, cv=5)
>>> cv_results
{'fit_time': array([0.09879589, 0.06915689, 0.06884503, 0.06944084, 0.06797481]),
 'score_time': array([0.00231791, 0.0023036 , 0.00233889, 0.00230122, 0.00229812]),
 'test_score': array([0.57547013, 0.58745216, 0.58029622, 0.57779997, 0.57580296])}
We can see that the test score (accuracy, the default scoring metric for a classifier) is at least 0.57 across all five folds. Therefore, we can be more confident that the 0.57 accuracy originally yielded would hold when testing on unseen data.
Given that recall is a metric of importance for this example, we can also gauge the recall score when performing k-fold cross validation:
>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(xgb_model, x1_train, y1_train, scoring="recall", cv=5)
>>> cv_results
{'fit_time': array([0.09710097, 0.06710696, 0.06669283, 0.0674212 , 0.06662488]),
 'score_time': array([0.00415516, 0.00414419, 0.00426912, 0.00424576, 0.00424981]),
 'test_score': array([0.92358209, 0.93373134, 0.92298507, 0.92532855, 0.92353644])}
The test score for recall is 0.92 or higher across all five folds. As a result, we can be more confident that such a recall score would hold when testing on unseen data.
Performance on Test Set
As before, the test set is imported from the relevant S3 bucket:
data_key_test = 'H2.csv'
data_location_test = 's3://{}/{}'.format(bucket, data_key_test)
h2data = pd.read_csv(data_location_test)
Here is the subsequent classification performance of the XGBoost model on H2, the test set in this instance (a and b denote the H2 feature matrix and response, constructed in the same manner as for H1).
>>> prh4 = xgb_model.predict(a)
>>> prh4
array([0, 1, 1, ..., 1, 1, 1])
>>> from sklearn.metrics import classification_report,confusion_matrix
>>> print(confusion_matrix(b,prh4))
>>> print(classification_report(b,prh4))
[[12650 33578]
 [ 1972 31130]]

              precision    recall  f1-score   support

           0       0.87      0.27      0.42     46228
           1       0.48      0.94      0.64     33102

    accuracy                           0.55     79330
   macro avg       0.67      0.61      0.53     79330
weighted avg       0.70      0.55      0.51     79330
The accuracy as indicated by the f1-score is slightly lower at 55%, but the recall for class 1 remains high at 94%.
Calibration: scale_pos_weight
In this instance, it is observed that using a scale_pos_weight of 3 resulted in a 94% recall while yielding an f1-score accuracy of 55%.
However, a high recall score can also be unreliable. For instance, suppose that the scale_pos_weight was set even higher — which meant that almost all of the predictions indicated a response of 1, i.e. all customers were predicted to cancel their booking.
This model has no inherent value if all the customers are predicted to cancel, since there is no longer any way of identifying the unique attributes of customers who are likely to cancel their booking versus those who do not.
In this regard, a more balanced solution is to have a high recall while also ensuring that the overall accuracy does not fall excessively low.
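A sketch of how such a calibration sweep might be run, refitting on the H1 training data and scoring against the H2 test set (variable names follow the earlier snippets):

from sklearn.metrics import classification_report

# Compare recall against overall accuracy across candidate weights
for weight in [2, 3, 4, 5]:
    model = xgb.XGBClassifier(learning_rate=0.001, max_depth=1,
                              n_estimators=100, scale_pos_weight=weight)
    model.fit(x1_train, y1_train)
    print("scale_pos_weight = {}".format(weight))
    print(classification_report(b, model.predict(a)))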
Here are the confusion matrix results for when respective weights of 2, 4, and 5 are used.
scale_pos_weight = 2
[[36926  9302]
 [12484 20618]]

              precision    recall  f1-score   support

           0       0.75      0.80      0.77     46228
           1       0.69      0.62      0.65     33102

    accuracy                           0.73     79330
   macro avg       0.72      0.71      0.71     79330
weighted avg       0.72      0.73      0.72     79330
scale_pos_weight = 4
[[ 1926 44302]
 [    0 33102]]

              precision    recall  f1-score   support

           0       1.00      0.04      0.08     46228
           1       0.43      1.00      0.60     33102

    accuracy                           0.44     79330
   macro avg       0.71      0.52      0.34     79330
weighted avg       0.76      0.44      0.30     79330
scale_pos_weight = 5
[[ 1926 44302]
 [    0 33102]]

              precision    recall  f1-score   support

           0       1.00      0.04      0.08     46228
           1       0.43      1.00      0.60     33102

    accuracy                           0.44     79330
   macro avg       0.71      0.52      0.34     79330
weighted avg       0.76      0.44      0.30     79330
When the scale_pos_weight was set to 3, recall came in at 94% while accuracy was at 55%. When the scale_pos_weight parameter is set to 5, recall is at 100% while the f1-score accuracy falls to 44%. Additionally, note that increasing the parameter from 4 to 5 does not result in any change in either recall or overall accuracy.
In this regard, using a weight of 3 allows for high recall while keeping overall classification accuracy above 50%, giving the hotel a baseline for differentiating between the attributes of customers who cancel their booking and those who do not.
Conclusion
In this example, you have seen the use of boosting methods to predict hotel cancellations. As mentioned, the boosting model in this instance was set to impose greater penalties on the minority class, which had the result of lowering the overall accuracy as measured by the f1-score, since more false positives were present.
However, the recall score increased greatly as a result. If it is assumed that false positives are more tolerable than false negatives in this situation, then one could argue that the model has performed quite well on this basis. For reference, an SVM model run on the same dataset demonstrated an overall accuracy of 63%, while recall on class 1 decreased to 75%.
We have also seen how k-fold cross validation can be used to determine whether the accuracy and recall readings from testing the model remain consistent when testing across several folds.
The datasets and notebooks for this example are available at the MGCodesandStats GitHub repository, along with further research on this topic.
Useful References
- Antonio, Almeida and Nunes (2019). Hotel Booking Demand Datasets
- Classification: Precision and Recall
- Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Geron
- Machine Learning Mastery: A Gentle Introduction to XGBoost for Applied Machine Learning
- ProjectPro: How To Check a Model’s Recall Score Using Cross-Validation in Python?
- What is LightGBM, How to implement it? How to fine tune the parameters?
- xgboost.readthedocs.io: Introduction to Boosted Trees