1 Introduction

Compared to other sectors, tourism is characterized as a highly volatile business with significant demand fluctuations (Chen et al., 2019). Long-term trends, like the increase of tourism demand over time, are typically overlaid by seasonal, i.e., cyclic, demand fluctuations. Furthermore, crises like SARS or the Covid-19 pandemic, as well as natural disasters, like volcanic eruptions (e.g., Eyjafjallajökull), cause additional irregular and, as in the case of Covid-19, even dramatic fluctuations of tourism demand, thus threatening the existence of tourism businesses (Gretzel et al., 2020; UNWTO, 2022).

While long-term trends and seasonal fluctuations of tourism demand can be forecasted adequately even by relatively simple approaches, like the seasonal Naïve method, or by more complex time series models, like (S)ARIMA (Höpken et al., 2021), irregular demand fluctuations, like those caused by the Covid-19 pandemic, can hardly be predicted by autoregressive approaches, since past demand is an unsuitable predictor in such extraordinary situations. Even when a crisis is already under way, the current extent of the decline in demand is typically unknown, since tourism statistics only become available with a considerable time delay. Thus, what is missing from a business perspective is real-time monitoring of tourism demand that makes demand fluctuations transparent even in extraordinary situations, such as global pandemics.

User generated content (UGC), and especially online reviews provided on platforms like TripAdvisor or Booking.com, is increasingly being used by tourists to provide feedback during or shortly after their trip (Dedeoğlu et al., 2020). In general, 5–10% of customers provide feedback on online platforms (Hester, 2021). TripAdvisor, for example, covers over 1 billion reviews and opinions for tourism businesses (PRNewswire, 2022), and tourists typically provide their feedback within 24–48 h (TripAdvisor, 2022). As UGC usage can be observed directly on online review platforms, like TripAdvisor, the UGC volume can be monitored in real time. If we additionally hypothesize that a certain and stable fraction of tourists provide feedback on online review platforms (a hypothesis which is validated by our study), UGC constitutes a promising input to estimate current tourism demand on a short-term basis as a means of near real-time monitoring of tourism demand.

Based on these considerations, this study proposes an extended time series analysis approach which utilizes UGC data from TripAdvisor as input to estimate current tourism demand with a short time delay. The approach is validated both under normal circumstances, without any extraordinary demand fluctuations, and for a period including the Covid-19 crisis, thus considering extraordinary demand fluctuations. The approach is compared with the autoregressive seasonal Naïve approach as the baseline for estimating current arrivals based on past values. In this way, the study answers the research question of whether UGC from tourists’ online reviews enables a short-term estimation of current tourism arrivals with superior accuracy compared to a seasonal Naïve autoregressive prediction in case of extraordinary demand fluctuations.

The paper is structured as follows. First, we offer a brief literature review on tourism demand forecasting, especially focusing on approaches using UGC as input to forecast tourist demand. The method section presents the techniques used for collecting tourism demand and UGC data as well as the methods to estimate tourism demand based on UGC volume. The next section discusses the empirical findings and the outlook section sketches an agenda for future research activities.

2 Related Work

Tourism demand forecasting is considered a thought-provoking science and art (Li et al., 2018). In fact, following Song et al. (2019), there is no single model that consistently outperforms other models in all situations. Consequently, the methods adopted for modeling and forecasting tourism demand are highly diverse. In previous tourism literature, non-causal time series models and causal econometric models are the two dominant approaches used for quantitative demand modeling and forecasting (Li & Jiao, 2020). Moro and Rita (2016) show that time series-based approaches are the most widely applied forecasting methods and that the seasonality phenomenon in tourism continues to justify this use. More concretely, autoregressive integrated moving average (ARIMA) models (Box & Jenkins, 1970) appear most frequently in the literature (Song et al., 2019). Moreover, exponential smoothing models (Cho, 2003) as well as shift-share techniques (Fuchs et al., 2000) are also found for modeling and forecasting tourism demand through non-causal time series-based methods. By contrast, econometric approaches offer the advantage of enabling the analysis of causal relationships between tourism demand (i.e., the dependent variable) and its explanatory variables (Höpken et al., 2021, p. 1000). The literature has proposed a broad range of determinants explaining tourism demand, like destinations’ consumer price index, substitute prices, gross domestic product per capita, currency exchange rates, interest and unemployment rates as well as export and import rates (Athanasopoulos et al., 2018). Additionally, mega-events and advertising investments (Kronenberg et al., 2016), financial crises, terrorist attacks (Smeral, 2009; 2017), disasters and pandemics, like SARS and Covid-19, have been shown to significantly impact tourism demand (Zhang, Song et al., 2021).

Following Önder (2017), a major challenge in tourism demand forecasting is access to timely and cost-effective data. Other challenges comprise demand volatility and the lack of historical time series data (Song et al., 2019; Zhang, 2020). The lack of historical time series data became especially critical after the breakdown of international tourism in the aftermath of the Covid-19 pandemic (Zhang, Song et al., 2021). In fact, traditional approaches for tourism demand modeling and forecasting (both non-causal and econometric models) stop working reliably in case of extraordinary demand fluctuations. Therefore, new data sources and demand modeling techniques are crucial concerns of contemporary tourism research (Li et al., 2021). Big data sources in particular, like web and search engine traffic (Önder, 2017; Höpken et al., 2021; Zhang, Li et al., 2021), show the capacity to overcome the challenge of lacking historical time series data due to extraordinary demand fluctuations. Only recently have big data-based approaches employing UGC from social media and review platforms as well as online news media been utilized to estimate tourism demand.

Fronzetti Colladon et al. (2019) apply social network and semantic analysis to extract variables from UGC on TripAdvisor, which were subsequently integrated into traditional forecasting models to forecast arrivals to several European city destinations. Results highlight communication network centralization and language complexity as key predictors. Notably, the extracted social media-based variables could improve accuracy compared with a model containing only volume-based web search query data as a predictor in most cases and over nearly all forecasting horizons. The study by Park et al. (2021) tests the role of online news in forecasting tourist arrivals in Hong Kong by employing structural topic modeling. Again, the inclusion of extracted news topics in seasonal ARIMA models significantly improved forecasting performance, especially when the Hong Kong destination was experiencing social unrest. The research by Hu et al. (2022) incorporates tourist-generated online review data regarding tourist attractions, hotels, and shopping markets to forecast tourist arrivals in Hong Kong. Findings indicate that mixed-data sampling models outperform other approaches, especially when high-frequency online review data are included in traditional time series models. Finally, Wu et al. (2022) explore the potential of sentiment information from customer reviews to enhance hotel demand forecasts in Macau. A deep learning-based model is employed to extract sentiment information from reviews on the two major travel-related social media platforms in China, i.e., Ctrip.com and Qunar.com. Subsequently, sentiment indices (i.e., a bullish, average and variance index, respectively) are constructed. Findings indicate that the inclusion of the sentiment indices into an ARIMA model could significantly improve forecast accuracy.

While the approaches above utilize past UGC to predict future tourism demand, our approach intends to estimate current tourism arrivals from the current UGC volume as a means of real-time monitoring of extraordinary demand fluctuations. To the best of our knowledge, no similar approaches exist in the tourism literature so far.

3 Method

3.1 Data Collection

The dataset regarding tourism arrivals has been extracted from the Statistical Information System Berlin-Brandenburg (StatIS-BBB), an information service which provides official statistical data from various areas for the German states Berlin and Brandenburg. The extracted dataset consists of a total of 132 entries, containing 13 numerical attributes and one nominal attribute. Tourism arrivals are aggregated on a monthly level, spanning a period of 11 years. Each entry consists of information on monthly tourism arrivals per district (e.g., Spandau) and of the total guest number (calculated as the sum of all districts) of the state Berlin. In this study, only the total number of guests on a monthly basis is used.

The user generated content (i.e., hotel reviews) has been extracted from the travel platform TripAdvisor. TripAdvisor offers a huge portfolio of user reviews regarding hotels, bookings, trips or sightseeing attractions in general. Although reviews can be submitted for all kinds of purposes, in the context of this study, only reviews regarding hotels in the state of Berlin have been extracted using a web crawler. The extracted dataset consists of around 360,000 entries, each entry composed of 8 attributes, falling into the categories of general information (review title, review date, review text), trip information (trip type, trip date), and hotel information (hotel name, hotel-id, hotel-rating). In this study, we only used the number of hotel reviews aggregated on a monthly level to match the structure of the tourism arrivals dataset. Figure 1 shows the two time series hotel reviews and tourism arrivals for the time period 2010–2020, including the period of the Covid-19 pandemic.

The hotel review time series shows a significant decline beginning in 2017/18, corresponding to the figures of TripAdvisor’s average monthly unique users having fallen from 490 million in 2018 to 411 million in the first quarter of 2019 (Dedeoğlu et al., 2020), probably caused by rising competition from Google as a main competitor. Furthermore, from January 2020 onwards there is a noticeable drop in tourism arrivals and hotel reviews attributed to the Covid-19 crisis and its corresponding regulations and travel stops. Accordingly, the time series tourism arrivals and hotel reviews have been prepared to represent both a “normal” and a “crisis” period, respectively.

Fig. 1. Original time series of tourism arrivals and hotel reviews

3.2 Data Preparation

Basic pre-processing steps, such as removing missing values or aggregating the hotel reviews on a monthly level, have been executed using the data mining tool set RapidMiner©. Furthermore, hotel reviews have been classified into different categories (i.e., friends and family or couples) to represent the total quantity of visitors more accurately. To this end, reviews of the category friends and family have been multiplied by a factor of four, whereas reviews of the category couples have been multiplied by a factor of two. The resulting visitor quantity has been added to the dataset as an alternative input variable.
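The weighting of reviews by trip type can be illustrated with a minimal sketch. It is not the RapidMiner process used in the study: the trip-type labels, the weight of one for all remaining trip types, and the function name `monthly_visitor_quantity` are assumptions made for illustration only.

```python
from collections import defaultdict

# Factor four for "friends and family" and factor two for "couples" follow the
# text; a weight of one for every other trip type is an assumption.
TRIP_TYPE_WEIGHTS = {"friends and family": 4, "couples": 2}


def monthly_visitor_quantity(reviews):
    """Aggregate (month, trip_type) review records into weighted monthly counts."""
    totals = defaultdict(int)
    for month, trip_type in reviews:
        totals[month] += TRIP_TYPE_WEIGHTS.get(trip_type, 1)
    return dict(totals)


# Made-up sample records, not taken from the study's dataset.
sample = [
    ("2019-07", "couples"),
    ("2019-07", "friends and family"),
    ("2019-07", "business"),
    ("2019-08", "couples"),
]
visitor_quantity = monthly_visitor_quantity(sample)
```

In this toy sample, July contains three reviews but yields a weighted visitor quantity of seven.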

Additionally, two separate datasets representing the “normal” and “crisis” periods have been prepared. While the first dataset, representing “normal” times, contains hotel reviews from the years 2010–2017, the second contains hotel reviews from the years 2010–2020, thus including the Covid-19 decline of tourism arrivals.

3.3 Component Model

Time series are commonly split into several different components. In the context of this study, both datasets, representing the “normal” and “crisis” periods, have been split into their corresponding trend, seasonal and irregular components based on the component model approach (Harvey, 2001), and the following steps have been executed:

  1. Estimation of the trend component based on the moving average approach with a time window of 12 months

  2. Calculation of a rigid seasonal figure based on the moving average of 2 months

  3. Subtraction of the determined components from the original time series, resulting in the irregular component

By decomposing the time series, interfering patterns that could affect the results of the analysis are removed. Additionally, a better understanding of the time series can be attained, while spurious correlations are avoided. From a statistical perspective, subtracting the trend and seasonal components leads to a stationary time series, which is a prerequisite for typical time series analysis methods.
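The decomposition steps can be sketched in a few lines of Python. This is a hedged illustration of a generic additive component model rather than the exact procedure of the study: the centered 12-month moving average and the seasonal figure computed as the mean detrended value per calendar month are assumptions, as are all function names and the synthetic series.

```python
def centered_moving_average(values, window=12):
    """Centered moving average for an even window; None where it does not fit."""
    half = window // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        # The two end points get half weight so the window is centered on t.
        total = 0.5 * values[t - half] + 0.5 * values[t + half]
        total += sum(values[t - half + 1 : t + half])
        trend[t] = total / window
    return trend


def decompose_additive(values, period=12):
    """Split a series into trend, rigid seasonal figure and irregular remainder."""
    trend = centered_moving_average(values, period)
    # Collect detrended values by their position within the seasonal cycle.
    buckets = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(values, trend)):
        if m is not None:
            buckets[t % period].append(y - m)
    raw = [sum(b) / len(b) for b in buckets]
    # Center the seasonal figure so that it sums to zero over one cycle.
    offset = sum(raw) / period
    figure = [s - offset for s in raw]
    seasonal = [figure[t % period] for t in range(len(values))]
    irregular = [
        None if m is None else y - m - s
        for y, m, s in zip(values, trend, seasonal)
    ]
    return trend, seasonal, irregular


# Synthetic three-year monthly series: linear trend plus a fixed seasonal pattern.
pattern = [-30, -20, -10, 0, 10, 20, 30, 20, 10, 0, -10, -20]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, irregular = decompose_additive(series)
```

Because the synthetic series contains no noise, the irregular component is zero wherever the trend is defined; for real data it holds whatever trend and seasonality do not explain.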

3.4 Estimation of Tourism Arrivals Based on UGC

The correlation between the datasets tourism arrivals and hotel reviews has been determined using the Pearson correlation (Li & Jiao, 2020). Furthermore, the estimation of tourist arrivals based on the amount of hotel reviews was executed using linear regression. The linear regression models have been validated using a split validation, randomly selecting 70% of data entries as training data and 30% as test data, respectively. In total, four regression models were built based on each of the two datasets:

  1. The first regression model was used to measure the explanatory power of hotel reviews as input to estimate tourism arrivals. The resulting performance measurements are intended to provide a general explanation of the extent to which the two original time series (hotel reviews and tourism arrivals) are interrelated and influence each other.

  2. For the second and third regression model (trend-adjusted and seasonally adjusted), the trend or the seasonal component was removed from the original time series, in order to identify the goodness of estimating tourism arrivals without one of these components, or, put differently, to identify their contribution to the explanatory power of model one.

  3. Ultimately, the fourth and final regression model was built based on the irregular component of the original time series, thus subtracting both the trend and the seasonal component. The results of this model represent the explanatory power of the estimation based solely on the stationary component of the original time series, without any non-stationary influences, like long-term or seasonal trends.

In order to better compare the results of the fourth model, based on the irregular component, with the actual tourism arrivals, and thus with the first model, we additionally transformed the estimated results of the fourth model back into the original value domain by adding the trend and seasonal components again.
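The estimation procedure described in this section can be sketched as follows. The sketch is a simplified stand-in for the RapidMiner workflow, under stated assumptions: the squared correlation is computed as the squared Pearson coefficient between actual and predicted test values, the random seed and all function names are hypothetical, and the data are synthetic.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def split_validate(x, y, train_share=0.7, seed=42):
    """Fit on a random 70% of the rows and score on the remaining 30%."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    cut = round(train_share * len(indices))
    train, test = indices[:cut], indices[cut:]
    intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
    actual = [y[i] for i in test]
    predicted = [intercept + slope * x[i] for i in test]
    return pearson(actual, predicted) ** 2, rmse(actual, predicted)


def recompose(trend, seasonal, irregular_estimate):
    """Map irregular-component estimates back to the original value domain."""
    return [t + s + e for t, s, e in zip(trend, seasonal, irregular_estimate)]


# Synthetic, made-up numbers: arrivals are a noisy linear function of reviews.
reviews = [float(i) for i in range(1, 41)]
arrivals = [1000 + 50 * r + (5 if i % 2 else -5) for i, r in enumerate(reviews)]
squared_correlation, test_rmse = split_validate(reviews, arrivals)
```

For the fourth model, `split_validate` would be applied to the irregular components of both series, and `recompose` would add the trend and seasonal components of the arrivals series back to the estimates before they are compared with the actual arrivals.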

3.5 Seasonal Naïve Arrivals Prediction

In the context of this study, the seasonal Naïve prediction method was used as a simple autoregressive baseline against which the results of the different linear regression models described above are compared. The seasonal Naïve approach simply extrapolates the long-term trend and seasonal fluctuations into the future. This comparison was conducted on the two previously mentioned datasets representing the “normal” and “crisis” time periods, respectively.
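A minimal sketch of such a baseline is given below. It implements the standard textbook form of the seasonal Naïve method, which predicts each month by the value observed twelve months earlier; whether the baseline in this study additionally extrapolates the trend is not specified, so this form is an assumption. Function names and numbers are illustrative.

```python
import math


def seasonal_naive(values, period=12):
    """Predict each value by the observation one full season earlier."""
    return [None] * period + list(values[:-period])


def rmse_skip_missing(actual, predicted):
    """RMSE over all positions for which a prediction exists."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if p is not None]
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Made-up monthly arrivals: the second year repeats the first, shifted up by 10.
year_one = [100, 90, 110, 130, 150, 170, 190, 185, 160, 140, 115, 120]
year_two = [v + 10 for v in year_one]
arrivals = year_one + year_two
baseline = seasonal_naive(arrivals)
baseline_rmse = rmse_skip_missing(arrivals, baseline)
```

In this toy series every month of the second year lies exactly 10 above the prediction, so the baseline error equals the year-over-year shift. A sudden break such as a crisis month would enter the error in full, which is the weakness the UGC-based estimate is meant to address.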

4 Findings

4.1 Component Model

As described in the methods section, the time series regarding hotel reviews and tourism arrivals have been split into their corresponding components based on the additive component model. Figure 2 illustrates the original time series hotel reviews and tourism arrivals together with their trend, seasonal and irregular components, showcasing data spanning from the period 2010–2020.

Fig. 2. Original time series and components for tourism arrivals and hotel reviews

4.2 Estimation of Tourism Arrivals Based on UGC

In the following section, the results of the Naïve baseline, the correlation analysis and the linear regression models are compared. First, the results based on the dataset spanning from the years 2010–2017 are presented. Afterwards, the results of the models based on the dataset comprising the crisis (2010–2020) are showcased and further analyzed.

Correlation and regression analysis – The normal case. Table 1 summarizes the key insights gained based on the time series representing a normal time period, covering data from the years 2010–2017.

Table 1. Correlation and regression analysis results for normal case

The first row of Table 1 shows the Pearson correlation coefficients between the tourism arrivals and hotel reviews for the original time series as well as the different components. The extent to which the trend and the seasonal component affect the correlation and regression results can be seen from the trend-adjusted and seasonally adjusted time series, in comparison to the original time series. In this context, it can be observed that both time series show a strong positive linear correlation. The irregular component, on the other hand, shows little to no correlation and contains fluctuations that are probably caused by individual occurrences, such as an increase in the number of guests due to events like concerts. Ultimately, however, such fluctuations have little to no impact on the amount of UGC.

The second and third row of Table 1 show the squared correlation and the root mean squared error (RMSE) of the different regression models. The overall prediction model, making use of the irregular component and adding the trend and seasonal component afterwards, can predict tourism arrivals based on UGC with a squared correlation of 97.6% and an RMSE of around 25,049 tourism arrivals. The prediction based on the seasonal Naïve baseline, on the other hand, predicts tourism arrivals with an RMSE of around 25,052.

Based on these results, it can be concluded that the model offers very little to no improvement over established forecasting methods such as the seasonal Naïve approach. Thus, in the normal case the irregular component contributes no explanatory value, and arrivals can be derived almost entirely from the trend and seasonal components, as underpinned by a squared correlation of zero for the regression model built on the irregular component alone.

Correlation and regression analysis – The crisis case.

Table 2 summarizes the key insights gained based on the time series comprising a crisis (i.e., the Covid-19 pandemic) including data from the years 2010–2020.

Table 2. Correlation and regression analysis results for crisis case

Again, the first row of Table 2 shows the Pearson correlation coefficients between the tourism arrivals and hotel reviews for the original time series as well as the different components. When comparing results of the crisis case to the normal case presented above, it can be observed that the correlation coefficient of the irregular component has increased dramatically, although all other influences (i.e., the trend and seasonal components) have been removed. Thus, in the crisis case, the irregular component seems to represent an appropriate input to estimate tourism arrivals.

The second and third row of Table 2, again, show the squared correlation and the root mean squared error (RMSE) of the different regression models. The overall prediction model is now able to predict tourism arrivals based on UGC with a squared correlation of 80.5% and an RMSE of around 118,844 tourism arrivals. The seasonal Naïve baseline, however, shows worse results. Arrivals can only be predicted with a squared correlation of around 62.4% and an RMSE of around 162,068. When comparing the results of the model with the Naïve baseline, predictions regarding tourism arrivals have become more accurate by 26.6%.
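The reported gain in accuracy can be retraced as the relative reduction in RMSE. The short calculation below uses the rounded RMSE values quoted in the text; interpreting the accuracy gain as a relative RMSE reduction is an assumption, and the function name is illustrative.

```python
def relative_rmse_reduction(baseline_rmse, model_rmse):
    """Share by which the model's RMSE lies below the baseline's RMSE."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Rounded RMSE values as reported for the crisis case and the normal case.
crisis_reduction = relative_rmse_reduction(162_068, 118_844)
normal_reduction = relative_rmse_reduction(25_052, 25_049)
```

With these rounded inputs the crisis-case reduction comes out at roughly 0.267, which agrees with the reported figure up to rounding, whereas the normal-case reduction is close to zero.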

We can thus conclude that, in a case of extraordinary demand fluctuations, our model offers a significant improvement over established forecasting methods, such as the seasonal Naïve approach. The irregular component now has a clear impact in the regression model, as underpinned by the squared correlation of 0.421 based on the irregular component alone.

Additionally, the analyses described above have been repeated for the feedback-prepared datasets as well (i.e., weighting the number of hotel reviews by the presumed number of visitors). While in normal times the performance of the overall model to predict tourist arrivals improved slightly by 2%, in the crisis period results are worse by 7%. Thus, since the feedback-preparation process has not enhanced the results significantly, the results based on the feedback-prepared time series will not be considered for further evaluations.

4.3 Discussion of Results

The aim of this study is to answer the question of whether UGC in the form of online reviews from platforms such as TripAdvisor enables a short-term estimation of current tourism arrivals with superior accuracy compared to a seasonal Naïve autoregressive prediction in case of abnormal and extraordinary demand fluctuations.

Overall, the results show that, based on the time series representing “normal” times, the estimation approach leads to little to no improvement when compared to the Naïve baseline. Thus, tourism arrivals can be best explained by the constant seasonal fluctuations and the linear trend. In case of a crisis, however, when comparing the approach to the Naïve baseline, tourism arrivals can be estimated around 27% more accurately. This gain in accuracy is mainly attributed to the fact that a rise and fall of tourism arrivals is directly and immediately reflected in the amount of UGC. Thus, as a theoretical contribution, we proposed a novel approach to estimate current tourism arrivals based on UGC in the form of TripAdvisor-based hotel reviews, clearly outperforming the seasonal Naïve baseline as a traditional forecasting approach in a crisis period with abnormal demand fluctuations.

As a main managerial contribution, our approach enables near real-time monitoring of tourism demand as a robust way to estimate current tourism demand fluctuations, especially those caused by a crisis like a pandemic or a natural disaster. Official tourism statistics, by contrast, are typically available only with a time delay of several months.

Moreover, traditional approaches of forecasting tourism demand based on long-term trends or seasonal fluctuations are of little use in case of extraordinary demand fluctuations. In such situations, our approach fills a gap and provides valuable insights about current demand fluctuations with a short time delay as relevant knowledge for tourism planning and management. The presented approach further supports benchmarking activities and enables a hotel manager, for example, to compare their own demand fluctuations with a peer group of relevant competitors, in order to appropriately assess their own performance under circumstances of crisis, such as Covid-19.

5 Conclusion and Outlook

This study presented a novel approach to estimate tourism arrivals based on UGC from travel platforms such as TripAdvisor. The dataset on tourism arrivals has been extracted from the Statistical Information System Berlin-Brandenburg, the dataset on hotel reviews from the online platform TripAdvisor. Both time series have been decomposed into a trend, seasonal and irregular component based on the additive component model, as input to a correlation and linear regression analysis. Furthermore, time series have been prepared to represent both a “normal” and “crisis” period to compare the suitability of the presented approach to estimate tourism arrivals utilizing UGC in these two modelling settings. The presented approach has been compared to a seasonal Naïve prediction method as a baseline.

Results demonstrate that the presented approach, when applied to the time series representing normal times, offers little to no improvement over established forecasting methods, such as the seasonal Naïve approach. This, however, changes when comparing the results with the Naïve baseline based on the time series comprising a crisis, since tourism arrivals can now be predicted more accurately by 27%. In conclusion, short-term fluctuations of tourism arrivals are directly reflected in the amount of UGC; thus, the latter can be utilized to estimate tourism arrivals more reliably. This relationship is especially pronounced during crisis situations, such as the Covid-19 pandemic, where past statistics are of little value and cannot be used to reliably infer current tourism arrivals.

Notably, the study at hand does not come without limitations. First, the significant decline of hotel reviews on TripAdvisor, beginning in 2017/18, probably caused by a rising competition by Google, constitutes a potential bias and degradation of model performance. Thus, in future research, additional online review platforms could be used to cross-validate results and even further improve model performance. Second, the period showcasing the crisis has been combined with the normal period (2010–2017), as the Covid-19 crisis was still in its early stages when this study was conducted. For future work however, an even more expressive time series representing the Covid-19 crisis could be attained, containing data spanning from the beginning of 2020 to the end of 2022. Third, the approach is tested on one data set only, restricted to the region of Berlin. The same analysis should be executed on a broader scale, covering other tourism regions and countries as well.