Forecasting Hotel Revenue: Predicting ADR Fluctuations with ARIMA
Published Jan 13, 2021 by Michael Grogan
Average daily rates (henceforth referred to as ADR) represent the average rate per day paid by a staying customer at a hotel. This is an important metric for a hotel, as it represents the overall profitability of each customer.
In this example, average daily rates for each customer are averaged over a weekly basis and then forecasted using an ARIMA model.
The below analysis is based on data from Antonio, Almeida and Nunes (2019): Hotel booking demand datasets.
Data Manipulation
In this particular dataset, the year and week number for each customer (along with each customer’s recorded ADR value) is provided separately.
Here are the date columns:
Here is the ADR column:
Firstly, the year and week number is combined into one string:
= df['ArrivalDateYear'].map(str) + df['ArrivalDateWeekNumber'].map(str)
df1 print (df1)
=pd.DataFrame(df1) df1
The new column is then concatenated with the ADR column using pandas:
= DataFrame(c, columns= ['ADR'])
df2 =pd.concat([df1, df2], axis = 1)
df3= ['FullDate', 'ADR'] df3.columns
These values are then sorted by date:
'FullDate','ADR'], ascending=True) df3.sort_values([
The next step is to then obtain the average ADR value per week, e.g. for the entry 201527 (which is week 27, year 2015), all the ADR values are averaged, and so on for each subsequent week.
= df3.groupby('FullDate').agg("mean")
df4 'FullDate'], ascending=True) df4.sort_values([
ARIMA
Using this newly formed time series, an ARIMA model can now be used to make forecasts for ADR.
For this model, the first 100 weeks of data are used as training data, while the last 15 weeks are used as test data to be compared with the predictions made by the model.
Here is a plot of the newly formed time series:
From an initial view of the graph, seasonality does seem to be present. However, an autocorrelation function is generated on the training data to verify this.
=60, zero=False); plot_acf(train_df, lags
We can see that the correlations (after seeing a dip roughly between weeks 10 to 45), the autocorrelation peaks once again at lag 52, which implies yearly seasonality. This essentially means that there is a strong correlation between the ADR values recorded every 52 weeks.
Using this information, m=52 is set as the seasonal component in the ARIMA model, and pmdarima is used to automatically select the p, d, q parameters for the model.
>>> Arima_model=pm.auto_arima(train_df, start_p=0, start_q=0, max_p=10, max_q=10, start_P=0, start_Q=0, max_P=10, max_Q=10, m=52, stepwise=True, seasonal=True, information_criterion='aic', trace=True, d=1, D=1, error_action='warn', suppress_warnings=True, random_state = 20, n_fits=30)
Performing stepwise search to minimize aic0,1,0)(0,1,0)[52] : AIC=422.399, Time=0.38 sec
ARIMA(1,1,0)(1,1,0)[52] : AIC=inf, Time=5.43 sec
ARIMA(0,1,1)(0,1,1)[52] : AIC=inf, Time=6.19 sec
ARIMA(0,1,0)(1,1,0)[52] : AIC=inf, Time=4.16 sec
ARIMA(0,1,0)(0,1,1)[52] : AIC=inf, Time=2.92 sec
ARIMA(0,1,0)(1,1,1)[52] : AIC=inf, Time=6.10 sec
ARIMA(1,1,0)(0,1,0)[52] : AIC=414.708, Time=0.22 sec
ARIMA(1,1,0)(0,1,1)[52] : AIC=inf, Time=5.70 sec
ARIMA(1,1,0)(1,1,1)[52] : AIC=inf, Time=8.11 sec
ARIMA(2,1,0)(0,1,0)[52] : AIC=413.878, Time=0.35 sec
ARIMA(2,1,0)(1,1,0)[52] : AIC=inf, Time=7.61 sec
ARIMA(2,1,0)(0,1,1)[52] : AIC=inf, Time=8.12 sec
ARIMA(2,1,0)(1,1,1)[52] : AIC=inf, Time=9.30 sec
ARIMA(3,1,0)(0,1,0)[52] : AIC=414.514, Time=0.37 sec
ARIMA(2,1,1)(0,1,0)[52] : AIC=415.165, Time=0.78 sec
ARIMA(1,1,1)(0,1,0)[52] : AIC=413.365, Time=0.48 sec
ARIMA(1,1,1)(1,1,0)[52] : AIC=415.351, Time=24.42 sec
ARIMA(1,1,1)(0,1,1)[52] : AIC=inf, Time=16.10 sec
ARIMA(1,1,1)(1,1,1)[52] : AIC=inf, Time=26.36 sec
ARIMA(0,1,1)(0,1,0)[52] : AIC=411.433, Time=0.36 sec
ARIMA(0,1,1)(1,1,0)[52] : AIC=413.418, Time=18.48 sec
ARIMA(0,1,1)(1,1,1)[52] : AIC=inf, Time=24.16 sec
ARIMA(0,1,2)(0,1,0)[52] : AIC=413.343, Time=0.50 sec
ARIMA(1,1,2)(0,1,0)[52] : AIC=415.196, Time=1.16 sec
ARIMA(0,1,1)(0,1,0)[52] intercept : AIC=413.377, Time=0.62 sec
ARIMA(
0,1,1)(0,1,0)[52]
Best model: ARIMA(178.417 seconds Total fit time:
An ARIMA model configuration of ARIMA(0, 1, 1)(0, 1, 0)[52] is indicated as the model of best fit according to the lowest AIC value.
Now, the model can be used to make predictions for the next 15 weeks, with those predictions compared to the test set:
=pd.DataFrame(Arima_model.predict(n_periods=15), index=test_df)
predictions=np.array(predictions) predictions
Before comparing the predictions with the values from the test set, the predictions are reshaped so as to be in the same format as the test set:
>>> predictions=predictions.reshape(15,-1)
>>> predictions
88.0971519 ],
array([[ 103.18056307],
[117.93678827],
[121.38546969],
[112.9812769 ],
[120.69309927],
[144.4014371 ],
[166.36546077],
[181.69684755],
[190.12507961],
[204.36831063],
[218.85150166],
[216.59090879],
[197.74194692],
[156.98273524]]) [
Now, the test values can be compared with the predictions on the basis of the root mean squared error (RMSE) — with a lower value indicating less error.
>>> mse = mean_squared_error(test_df, predictions)
>>> rmse = math.sqrt(mse)
>>> print('RMSE: %f' % rmse)
10.093574
RMSE:
>>> np.mean(test_df)
160.492142162915
An RMSE of 10 is obtained compared to a mean of 160. This represents an error of just over 6% of the mean, indicating that the model predicts ADR trends with quite high accuracy.
Here is a plot of the predicted vs. actual values.
Based on the RMSE reading and also a visual scan of the data, we can see that the ARIMA model did quite well in predicting ADR trends across the test set.
Conclusion
In this example, we have seen:
How to manipulate time series data using Python
The use of auto_arima to make time series forecasts
How to determine model accuracy using the root mean squared error
The datasets and notebooks for this example are available at the MGCodesandStats GitHub repository, along with further research on this topic.