1 Introduction

The statistical measurement of tourism has been a vital task for all stakeholders in the tourism field since the emergence of tourism in the modern economy [1,2]. Historically, major supranational organizations such as the United Nations Statistical Commission (UNSC) and the World Tourism Organization (WTO), along with national and regional tourism entities, have provided official tourism data to the public. However, these data largely rely on conventional surveys, resulting in inconsistencies across countries, costly data collection, problems with respondents’ mobility, and variability in the sampled population [3,4,5,6,7,8]. Big data provide an alternative, low-cost source of data tracing tourists’ movements, preferences, points of interest, behaviors, and even expenditures [9], together with novel data collection methodologies [10]. In the big data domain, social media is particularly promising due to its availability, seamless collection, good spatial coverage at multiple scales, and rich content [11], as has been convincingly demonstrated in multiple studies [12,13,14,15].

Meanwhile, a frequent criticism of social media data is its bias toward the population of social media users, which leaves its representativeness of the entire population unknown [16,17]. Complicating the issue, population representativeness may vary over time and across social media platforms [11]. The inherent bias of social media data has long been debated [18], yet attempts to measure its extent are extremely limited [19,20]. The purpose of this study is to cross-validate the reliability and validity of the visitation patterns of tourist destinations retrieved from social media against alternative independent data sources. The primary social media data are TripAdvisor reviews of Florida attractions, restaurants, and hotels. The inferred visitation patterns were validated against two independent datasets: cellphone tracking data and official visitor surveys.

2 Data and Methods

2.1 Social Media Data

We collected all TripAdvisor reviews of Florida attractions, hotels, and restaurants (hereafter, properties) published from January 2003 to October 2019. The collected variables included the reviewers’ self-reported place of residence, the total number of reviews, the property location, and the review date. The data were cleaned as follows: we (1) filtered out abnormally active reviewers ranking in the top 5%; (2) used the Google location API to geotag the reviewers’ place of residence (at a city, county, state, or country level); and (3) classified the visitors into three groups based on their origins: Floridians, US domestic visitors, and international visitors. Home locations were kept at a granularity of at least city for Floridians, state for domestic visitors, and nation for international visitors.
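Cleaning steps (1) and (3) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that "abnormally active" is ranked by each reviewer's total number of reviews, and the `Reviewer` record, its field names, and the country and state codes are hypothetical. The geotagging step (2) is assumed to have already produced the home country and state.

```python
from typing import NamedTuple, Optional


class Reviewer(NamedTuple):
    reviewer_id: str
    total_reviews: int         # assumed: the reviewer's total review count
    home_country: str          # geotagged country code, e.g. "US" (assumed)
    home_state: Optional[str]  # geotagged US state code, else None (assumed)


def drop_most_active(reviewers, share=0.05):
    """Remove the `share` of reviewers with the highest review totals."""
    ranked = sorted(reviewers, key=lambda r: r.total_reviews, reverse=True)
    n_drop = round(len(ranked) * share)  # nearest whole number of reviewers
    return ranked[n_drop:]


def visitor_group(reviewer):
    """Assign one of the three origin groups used in the study."""
    if reviewer.home_country != "US":
        return "international"
    if reviewer.home_state == "FL":
        return "Floridian"
    return "domestic"


# Hypothetical sample: 20 Floridian reviewers with 1..20 reviews each.
sample = [Reviewer(f"r{i}", i, "US", "FL") for i in range(1, 21)]
kept = drop_most_active(sample)
groups = [visitor_group(r) for r in kept]
```

Ranking by a share of reviewers rather than by a fixed review count keeps the filter independent of the platform's overall activity level, but the exact cutoff rule used in the study is not stated.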

Data cleaning resulted in a total of 2,162,249 reviews generated by 250,844 reviewers (visitors) of 51,525 Florida properties. Among the reviewers, 24.4% were Floridians, 57.4% were domestic visitors, and 18.2% were international tourists. These groups contributed 42.6%, 39.6%, and 13.6% of reviews, respectively. Based on each visitor’s origin (place of residence) and destination (location of the visited property), the database was rearranged into monthly visitation frequencies for each visitor group in origin-destination (OD) format (see Table 1).
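The rearrangement into monthly OD frequencies can be illustrated with a short sketch. The record layout, the county and state names, and the use of the review date as a proxy for the visit month are all assumptions made for illustration.

```python
from collections import Counter


def monthly_od_counts(reviews):
    """Count reviews per (group, month, origin, destination).

    `reviews` is an iterable of (group, origin, destination, iso_date)
    tuples, where iso_date is a 'YYYY-MM-DD' string.
    """
    counts = Counter()
    for group, origin, destination, iso_date in reviews:
        month = iso_date[:7]  # keep only 'YYYY-MM'
        counts[(group, month, origin, destination)] += 1
    return counts


# Hypothetical records for illustration only.
reviews = [
    ("Floridian", "Alachua", "Orange", "2019-03-02"),
    ("Floridian", "Alachua", "Orange", "2019-03-17"),
    ("domestic", "Georgia", "Orange", "2019-03-09"),
    ("Floridian", "Alachua", "Orange", "2019-04-01"),
]
od = monthly_od_counts(reviews)
```

Each key of `od` is one cell of the OD table for a given visitor group and month, which is the shape needed for the comparisons with the other datasets.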

2.2 Cellphone Data

The primary independent dataset used for cross-validation was the trilaterated mobile phone signal tower data provided by AirSage (www.airsage.com). The anonymized data (over 8 billion records) covered Florida and adjacent areas from October 2018 to September 2019 and were organized in the form of OD trip counts for visitors from different home zones at census tract granularity. The raw data were preprocessed to filter out non-tourism travel and aggregated at a monthly time scale. The data were then separated into two market segments: Floridians and domestic visitors. The origins of the domestic visitors were aggregated at the state level. Information on international visitors was largely unavailable in the cellphone database and was excluded from the analysis (Table 1).

2.3 VISIT FLORIDA Survey Data

The secondary cross-validation dataset was the Florida Visitor Study survey from Visit Florida (visitflorida.org). The annual survey is the premier reference guide on visitors to Florida. These data largely rely on conventional survey tools such as questionnaires and interviews. The data used in this study cover 2015–2018 and include quarterly statistics on domestic and international visitors: their origins at state and nation scales and the total number of Florida visitors. Data on the destinations visited within Florida are not provided, and Floridian tourists are not included. A data summary is provided in Table 1.

Table 1. The data used in this research.

2.4 Methods

Based on data availability and spatial resolution, the validation methodology was as follows:

  • to validate the origins of Floridians inferred from social media, their spatial distribution was compared with that obtained from the cellphone data. Pearson's r between the log-transformed paired counts of visits from each origin was used to estimate the agreement between the data sources.

  • in a similar way, to validate the origins of domestic visitors, the destinations of Floridians, and the travel flows of Floridians, their respective representations in the different datasets were compared.
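The comparison described above amounts to computing Pearson's r on log-transformed paired counts. The sketch below shows one way to do this; the choice of base-10 logarithms, the dropping of origins with a zero or missing count in either source, and the sample county counts are assumptions, since the text does not specify them.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_log_correlation(social, cellphone):
    """Correlate log10 counts over spatial units present in both sources."""
    # Assumption: units with a zero or missing count are dropped,
    # because the logarithm is undefined for them.
    shared = sorted(
        k for k in social
        if social[k] > 0 and cellphone.get(k, 0) > 0
    )
    xs = [math.log10(social[k]) for k in shared]
    ys = [math.log10(cellphone[k]) for k in shared]
    return pearson_r(xs, ys)


# Hypothetical trip counts per origin county, for illustration only.
social = {"Orange": 1200, "Duval": 310, "Leon": 45, "Gulf": 0}
cellphone = {"Orange": 131000, "Duval": 28000, "Leon": 5200, "Gulf": 300}
r = log_log_correlation(social, cellphone)
```

The same function applies unchanged to destinations or OD pairs: only the dictionary keys change (for example, `(origin, destination)` tuples for travel flows).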

3 Results

3.1 Validation of Trip Origins

The validation of the origins of Floridian travel was based on the social media and cellphone data at county resolution. The data on the top travel origins from both datasets are shown in Table 2. The inferred numbers of trips (log-transformed) from the same origins estimated from social media and cellphone data are highly correlated (r = 0.93, p < 0.001). The preliminary estimation implies that one TripAdvisor trip corresponds to approximately 100 trip counts in the cellphone data (Fig. 1).

Fig. 1. Correlation of log(social media) versus log(cellphone) trip origin counts.

Table 2. Top origin counties of Floridians

Validation of the origins of domestic US visitors was based on a comparison of social media, cellphone, and survey data at state-level resolution. The data on the top 15 origin states provided in the survey were compared with the data from the other two datasets (Table 3) and demonstrated high cross-correlation (Fig. 2). The data imply that one TripAdvisor trip count is equivalent to approximately 100 trips inferred from the cellphone data and 2000 trips inferred from the Visit Florida survey, providing a basis for translating the social media and cellphone records into real visitation data.
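A conversion ratio of this kind can be estimated in several ways, and the text does not state which was used. The sketch below takes one plausible option, the geometric mean of the per-state ratios, which is consistent with working on log-transformed counts. The function name and all sample counts are hypothetical.

```python
import math


def scale_factor(reference, social):
    """Geometric-mean ratio of reference counts to social media counts."""
    # Only units with positive counts in both sources enter the estimate.
    shared = [
        k for k in social
        if social[k] > 0 and reference.get(k, 0) > 0
    ]
    mean_log_ratio = sum(
        math.log10(reference[k] / social[k]) for k in shared
    ) / len(shared)
    return 10 ** mean_log_ratio


# Hypothetical trip counts per origin state, for illustration only.
social = {"Georgia": 50, "New York": 40, "Texas": 20}
cellphone = {"Georgia": 5000, "New York": 4000, "Texas": 2000}
survey = {"Georgia": 100000, "New York": 80000, "Texas": 40000}

cell_per_review = scale_factor(cellphone, social)
survey_per_review = scale_factor(survey, social)
```

With these made-up counts the two factors come out near 100 and 2000, mirroring the orders of magnitude reported above. A geometric mean is less sensitive to a few very large origins than a ratio of totals would be.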

Fig. 2. Correlations of origin trip counts estimated from three datasets.

Table 3. Top origin states for the US domestic visitors

4 Validation of Destinations

Validation of the destination choices of Floridian travelers was based on social media and cellphone data at county-level resolution. The comparative data for the top destinations from both datasets are given in Table 4. The numbers of trips in the two datasets are highly correlated (r = 0.89, p < 0.0001) (Fig. 3). The preliminary estimation implies that each trip count from social media corresponds to approximately 100 trip counts in the cellphone data.

Fig. 3. Correlation of log(social media) versus log(cellphone) destination trip counts. Floridian travelers only.

Table 4. Top destination counties for Floridian tourists.

5 Validation of Travel Flows

The validation of the origin-destination travel flows of Floridians was based on social media and cellphone data at county-level resolution. The numbers of trips for the top network links are shown in Table 5. The numbers of OD trips are strongly correlated (r = 0.72, p < 0.01) (Fig. 4). One trip estimated from social media corresponds to approximately 180 trips estimated from the cellphone data.

Fig. 4. Cross-plot of log(social media) versus log(cellphone) OD trip counts.

Table 5. Top OD flows for Floridian tourists

6 Conclusions

We found that social media is a reliable source of data on tourism visitation, representative not only of social media users but also of the general population. The travel patterns extracted from social media are strongly correlated with those retrieved from cellphone tracking data and official tourist surveys. The reliability of social media data is evidenced not only in the counts of tourists arriving from various origins or going to various destinations, but also in the origin-to-destination travel flows. A longitudinal comparison based on temporal visitation patterns is suggested for a future study to improve the robustness of our results.

This strong correlation also implies that social media has the potential to represent real visitation data when the high-resolution social media data are fused with the overall tourism measurements from state or national tourism organizations. In our data, one trip count from social media represents approximately 2000 visitations in the survey data.
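One simple way to picture such a fusion is proportional allocation: an official aggregate total is distributed across fine-grained units according to their share of social media counts. The paper does not specify a fusion method, so the approach, the function name, and the numbers below are assumptions for illustration only.

```python
def allocate_total(social_counts, official_total):
    """Split an official total across units by social media share."""
    total_social = sum(social_counts.values())
    return {
        unit: official_total * count / total_social
        for unit, count in social_counts.items()
    }


# Hypothetical inputs: review counts per destination county and a
# survey-based visitor total for the same period.
social_counts = {"Orange": 60, "Miami-Dade": 30, "Monroe": 10}
estimated_visits = allocate_total(social_counts, official_total=200000)
```

This scheme preserves the official total by construction and takes only the spatial pattern from social media, so any origin- or destination-specific bias in review behavior carries through to the estimates.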

The two high-resolution data sources used in this study, social media and cellphone tracking, can both be used in visitation measurements. Notably, social media data have lower granularity, especially in determining visitor origins. However, we found that the seemingly high resolution of the cellphone data can result in significant errors in urban areas. In addition, the very high cost of cellphone data limits its primary use to validating social media data in key areas.