1 Introduction

Movement pattern analysis is a systematic approach that is widely used in decision-making processes in many fields, especially in the tourism industry. Understanding tourists’ movements between Points of Interest (POIs) plays a fundamental role in destination management and can directly benefit the restaurant and hotel industry in the local area [1].

Accurate and valuable analysis of movement patterns requires a large amount of high-quality data. Most previous research obtained people’s travel data by surveying individuals’ location history or by using automatic location-sensing devices [2], which were neither scalable nor cost-effective enough to cover large numbers of individuals [3]. With the development of Internet technologies, social media platforms have become increasingly popular, and users voluntarily generate an enormous number of photos and videos that also contain geographic information. Among social media platforms, Instagram, with over 500 million daily active users, contains a large amount of potentially valuable information, i.e. geographically tagged user-generated content, that can be mined for the tourism industry. However, sophisticated procedures are necessary to systematically retrieve and store this data, especially since the sheer volume of data and the access restrictions of the Instagram platform make ordinary data retrieval processes difficult.

Given this massive data volume, new approaches and novel learning techniques are needed to make sense of it. Machine learning, a technology that has developed rapidly over the past decade, provides potential solutions for mining the information and knowledge hidden in the data [4]. The goal of this paper is twofold. The first goal is to introduce an efficient crawling framework for collecting geo-tagged photos from Instagram. The second goal is to group photo-upload locations into popular POIs with the help of two different clustering algorithms and to explore association rules between clusters at different geographical scales in order to obtain the movement patterns of Instagram users. Compared to previous research, the main contributions of this work are the following:

  1. Develop an efficient high-performance crawling algorithm which extracts geographically tagged social media posts and activities from Instagram.

  2. Analyze the movement patterns of users in the Lake Constance area using uploaded geographic information from Instagram photos.

  3. Reproduce a state-of-the-art clustering algorithm, Noise Removal for k-means (NK-MEANS), and compare it with the commonly used Density Based Spatial Clustering of Applications with Noise (DBSCAN).

The rest of this paper is organized as follows: in Sect. 2, the up-to-date development of movement pattern research is reviewed; Sect. 3 briefly introduces the methodology involved in this study; clustering results and association rules obtained from the dataset are presented in Sect. 4; Sect. 5 draws conclusions from this study and outlines some possible lines of future work.

2 Literature Review

Research on the spatial movement patterns of tourists between destinations dates back to the 1990s [5]. However, the limitations of data volume, research methods, and computing power at that time meant that tourist mobility was treated as a black-box problem, which was difficult to explore and express [6]. Since movement within a destination plays a fundamental role in understanding tourist behaviour, research on tourists’ movement patterns received increasing attention. However, studies based on traditional survey approaches were always limited by issues of cost, scalability, data volume, and privacy.

With the development of web technologies, more and more people upload photos and videos to photo-sharing platforms to share their journeys with others. Thanks to various social media platforms, a large amount of openly available data with geographic information can be obtained through crawling and scraping mechanisms. Nevertheless, collecting social media data is challenging because of the mix of structured and semi-structured data. Erlandsson et al. [7] crawled Facebook data by using the API (Application Programming Interface) and defined major requirements regarding the crawling results; Jalal et al. [8] scraped Instagram data by using a keyword- and location-based approach. Both approaches were utilized in this paper as well. While many studies, such as Chu et al. [9], used the Python scraping framework Scrapy, a custom framework was created in this study because existing ones lacked the required functionality, as described later.

The huge amount of data with geographic information provides alternative data sources for many geospatial and social media applications. However, utilizing these data for analysis poses a new challenge. To address this problem, Arefieva et al. [10] used images from Instagram to cluster tourist destinations; Mukhina et al. [11] analyzed tourists’ attraction points using Instagram profiles. However, Instagram, the largest photo and video sharing platform, holds a wealth of information about movement patterns in tourist attraction areas, which has not been deeply explored. To bridge this gap, this paper investigates tourists’ movement patterns at and between POIs in the Lake Constance region using data crawled from Instagram, based on the framework from Höpken et al.’s study [12], and compares the performance of NK-MEANS with DBSCAN with regard to the geographic information clustering problem.

3 Methodology

3.1 Data Extraction and Preparation

The ETL (extract, transform, load) process forms the first step of this study. It describes how social media data of the geographical region of Lake Constance is retrieved, transformed and stored in a data warehouse [13].

Extract. Extraction of public social media data can be achieved by using web scrapers [8] or an API (Application Programming Interface) provided by the platform [7]. Both techniques were used in this study to obtain the best coverage and depth of data. Instagram provides an overview page for each city in Germany; these pages were iterated by an ordinal city identifier in order to collect the locations of the respective region. The crawling procedure first collected all published posts connected to a location by iterating through the paginated Instagram API, and then gathered deeper information, such as comments or an accessibility caption, by crawling each post one by one. Since this study is non-commercial and the crawled user information does not include any private personal information, no ethical or legal issues are involved.
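The first crawling stage, iterating through a paginated listing of posts per location, can be illustrated with a minimal sketch. It is not the crawler used in this study: `fetch_page`, the cursor format and the post dictionaries are hypothetical placeholders, and the in-memory `fake_fetch_page` only stands in for a real paginated endpoint.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# A page is (posts on this page, cursor of the next page or None).
Page = Tuple[List[Dict], Optional[str]]


def iter_location_posts(
    fetch_page: Callable[[str, Optional[str]], Page], location_id: str
) -> Iterator[Dict]:
    """Yield every post of one location by following pagination cursors."""
    cursor = None
    while True:
        posts, cursor = fetch_page(location_id, cursor)
        yield from posts
        if cursor is None:  # no further page
            break


# Hypothetical in-memory stand-in for a paginated endpoint.
_FAKE_PAGES = {
    None: ([{"id": "p1"}, {"id": "p2"}], "c1"),
    "c1": ([{"id": "p3"}], None),
}


def fake_fetch_page(location_id: str, cursor: Optional[str]) -> Page:
    return _FAKE_PAGES[cursor]


post_ids = [p["id"] for p in iter_location_posts(fake_fetch_page, "loc-42")]
```

Passing the fetch function as a parameter keeps the pagination logic independent of how a page is actually retrieved, which is one possible way to combine API-based and browser-based retrieval.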

Transform. The transformation step focuses on aggregating semi-structured JSON data which was retrieved by using the Instagram API, complemented by unstructured browser-based scraping data as well as image media.

Load. Several hundred parallel crawlers were used to retrieve the data efficiently and write it synchronously to the database by using the object-relational mapping software SQLAlchemy. To realize this immense crawling volume, rotating IPs, deployment of crawlers in a container infrastructure, multiple concurrent VPN connections, and redundancy across multiple server systems were introduced.

Since data preprocessing is essential to reduce computation time, a geo-based filtering approach was used to limit the dataset to the relevant POIs, and a minimum number of posts was required for a location to be classified as relevant. To ensure that the crawled information is tourism-relevant, posts from local residents and commercial accounts were distinguished and discarded by exploiting the typical upload behavior of tourists, who post at a high frequency within a short time period, mostly on weekends and holidays.
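The geo-based filtering step can be sketched as follows. The bounding box, the threshold `MIN_POSTS` and the field names are illustrative assumptions and are not the values used in this study.

```python
from collections import Counter

# Rough bounding box around Lake Constance (illustrative values only).
LAT_MIN, LAT_MAX = 47.3, 48.0
LON_MIN, LON_MAX = 8.7, 9.9
MIN_POSTS = 2  # hypothetical minimum number of posts per location


def filter_posts(posts, min_posts=MIN_POSTS):
    """Keep posts inside the box whose location has at least `min_posts` posts."""
    in_box = [
        p for p in posts
        if LAT_MIN <= p["lat"] <= LAT_MAX and LON_MIN <= p["lon"] <= LON_MAX
    ]
    counts = Counter(p["location_id"] for p in in_box)
    return [p for p in in_box if counts[p["location_id"]] >= min_posts]


posts = [
    {"location_id": "A", "lat": 47.65, "lon": 9.48},  # Friedrichshafen
    {"location_id": "A", "lat": 47.65, "lon": 9.48},
    {"location_id": "B", "lat": 47.66, "lon": 9.17},  # Konstanz, only one post
    {"location_id": "C", "lat": 52.52, "lon": 13.40},  # Berlin, outside the box
    {"location_id": "C", "lat": 52.52, "lon": 13.40},
]
kept = filter_posts(posts)
```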

3.2 Clustering

Clustering plays an important role in the realm of unsupervised learning [14]. Clustering the photos uploaded to Instagram by their geographic information is a prerequisite for further analysis. POI information predefined by the platform, i.e. city and location names, can be directly used as input for the cluster analysis. However, for roughly \(30\%\) of the photos the city name of the upload location is missing, and the same POI can have different name variants, which makes it difficult to group photos uploaded from the same location into the same cluster. Therefore, in this study, the precise geographic information of uploaded photos, i.e. latitude and longitude, was used to accurately identify POIs by a cluster analysis.

For the problem of clustering uploaded photos to corresponding POIs, two clustering algorithms based on different principles, namely DBSCAN and NK-MEANS, were implemented and compared for their suitability in identifying meaningful clusters. A brief description of these two algorithms follows:

DBSCAN. DBSCAN, proposed by Ester et al. in 1996, was the first density-based clustering algorithm. It was designed to cluster data of arbitrary shape in the presence of noise, both in 2D or 3D Euclidean space and in high-dimensional feature spaces [15]. Because DBSCAN identifies arbitrarily shaped clusters and automatically removes outliers, it is well suited for grouping uploaded photos into relevant POIs. However, DBSCAN has some drawbacks that cannot be ignored when dealing with data from Instagram. First, the algorithm requires a priori knowledge to obtain satisfying clustering results, namely Eps, the radius of the neighborhood around a point, and MinPts, the minimum number of points that must lie within Eps of a point for it to be a core point. For Instagram datasets, this a priori knowledge is usually not available. Second, the distribution of uploaded photos on the map is non-uniform, and on datasets with varying densities DBSCAN can struggle to identify meaningful clusters [16]. Finally, DBSCAN has a time complexity of \(O\left( n\log {n}\right) \) [14], which is higher than that of some other clustering algorithms. Therefore, for comparison, a partition-based algorithm with fast noise removal, namely NK-MEANS, was also employed in this study to group the uploaded photos.
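To make the roles of Eps and MinPts concrete, the following is a minimal, unoptimized sketch of DBSCAN on latitude/longitude pairs. It is not the implementation used in this study: the haversine distance, the parameter values and the sample coordinates are assumptions chosen for illustration, and the brute-force neighbor search is quadratic rather than index-based.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # The Eps-neighborhood includes the point itself.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:  # not a core point (may become a border point)
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:  # previously noise: now a border point
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


points = [
    (47.650, 9.480), (47.651, 9.481), (47.652, 9.479),  # near Friedrichshafen
    (47.660, 9.175), (47.661, 9.176), (47.659, 9.174),  # near Konstanz
    (47.900, 9.900),                                    # isolated upload
]
labels = dbscan(points, eps_km=0.5, min_pts=3)
```

In this toy example the two dense groups form separate clusters and the isolated upload is labeled as noise.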

NK-MEANS. NK-MEANS is an improved k-means clustering algorithm with automatic noise removal. K-means is one of the most well-known clustering algorithms. Its core idea is to iteratively build and improve clusters by assigning each data point to its nearest cluster centroid (the central point of the cluster) and recomputing the cluster centroids until a convergence criterion is met. Not all photos uploaded to Instagram are related to nearby POIs; such photos can be regarded as noise in the dataset. However, the k-means algorithm is highly sensitive to noise, and if it is applied directly, neither satisfying results nor a precise comparison with the results from DBSCAN can be obtained. Therefore, this study uses a method proposed by Im et al. in 2020 [17], which extends the k-means algorithm with a preprocessing step that removes outliers in a way suitable for k-means.
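The assignment and update steps of plain k-means described above can be sketched as follows. This is only the basic iteration on planar coordinates; the noise-removal preprocessing of NK-MEANS [17] is not shown, and the random initialization, the seed and the sample points are assumptions for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct data points as start
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

Because every point is always assigned to some centroid, a single far-away upload shifts the mean of its cluster, which illustrates why outliers need to be removed before k-means is applied.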

To quantify the improvement of clustering results after noise removal, this study employed the Hopkins statistic H to measure the clustering tendency of a dataset. A value close to 1 indicates that the data is highly clustered, whereas a value around 0.5 indicates randomly distributed data [18].
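One common formulation of the Hopkins statistic can be sketched as follows. The sketch assumes planar Euclidean distances, a sampling window equal to the bounding box of the data and plain (not dimension-powered) distances; the sample size `m`, the seed and the synthetic points are illustrative and not taken from this study.

```python
import math
import random


def nearest_dist(q, points, skip=None):
    """Distance from q to its nearest point, ignoring the index `skip`."""
    return min(math.dist(q, p) for i, p in enumerate(points) if i != skip)


def hopkins(points, m, seed=0):
    """H = sum(u) / (sum(u) + sum(w)); values near 1 suggest clustered data."""
    rng = random.Random(seed)
    xs, ys = zip(*points)
    # w: distances from m sampled data points to their nearest other data point.
    sample_idx = rng.sample(range(len(points)), m)
    w = [nearest_dist(points[i], points, skip=i) for i in sample_idx]
    # u: distances from m uniformly random locations to the nearest data point.
    u = [nearest_dist((rng.uniform(min(xs), max(xs)),
                       rng.uniform(min(ys), max(ys))), points)
         for _ in range(m)]
    return sum(u) / (sum(u) + sum(w))


# Two tight, well-separated groups of synthetic points.
blob_a = [(i * 0.001, (i % 5) * 0.001) for i in range(20)]
blob_b = [(10 + i * 0.001, 10 + (i % 5) * 0.001) for i in range(20)]
h = hopkins(blob_a + blob_b, m=10)
```

For such strongly clustered data the random locations are far from any data point while the data points are very close to each other, so H approaches 1.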

Regarding the parameters of NK-MEANS, the number of clusters k and the proportion of outliers z found by DBSCAN were used for NK-MEANS as well, to ensure the comparability of the two approaches.

Although the results obtained after clustering are already meaningful at the geographical level, they are not sufficient for the subsequent association rule analysis. Some clusters contain so few locations, or some locations have so few uploaded photos, that valuable movement patterns cannot be mined. They seem to appear randomly and, thus, cannot be considered POIs. Therefore, the popularity of clusters and locations was investigated. When a cluster or a location contained less than a certain percentage of uploads, it was discarded as noise as well.
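The popularity filter can be sketched as a share threshold on cluster sizes. The threshold `min_share` is a hypothetical value, since the percentage used in this study is not stated here, and the label -1 is assumed to mark noise.

```python
from collections import Counter


def popular_clusters(labels, min_share):
    """Return cluster ids holding at least `min_share` of all clustered uploads."""
    counts = Counter(l for l in labels if l != -1)  # -1 marks noise
    total = sum(counts.values())
    return {c for c, n in counts.items() if n / total >= min_share}


# Eight uploads in cluster 0, one each in clusters 1 and 2, three noise points.
labels = [0] * 8 + [1] + [2] + [-1] * 3
popular = popular_clusters(labels, min_share=0.2)
```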

3.3 Association Rule Mining

After popular POIs have been identified by the clustering algorithm, mining the users’ movement patterns among POIs is the final task in this study. Association rule mining has been one of the most popular pattern discovery methods in Knowledge Discovery in Databases (KDD) since its introduction in 1993 [19]. In this study, all uploads by one user were considered one transaction, and the visit of a popular POI was considered an item (multiple visits to the same popular POI were counted as one item).
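The mapping from uploads to transactions can be sketched as follows. The input format, a list of (user, POI) pairs, and all identifiers are assumptions for illustration.

```python
from collections import defaultdict


def build_transactions(uploads, popular):
    """Map each user to the set of popular POIs they uploaded from."""
    baskets = defaultdict(set)
    for user, poi in uploads:
        if poi in popular:
            baskets[user].add(poi)  # a set counts repeated visits only once
    return dict(baskets)


uploads = [
    ("u1", "harbour"), ("u1", "harbour"), ("u1", "museum"),
    ("u2", "museum"), ("u2", "backyard"),  # "backyard" is not a popular POI
]
transactions = build_transactions(uploads, popular={"harbour", "museum"})
```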

Based on a transaction matrix, spanned by transactions in one dimension and items (i.e. popular POIs) in the other, frequent itemsets (i.e. combinations of POIs often visited together by the same user) were generated by the FP-Growth algorithm, and the association rules were mined from these frequent itemsets. The criteria involved are briefly explained next:

If the set of all user-based transactions is given as \(\mathcal {T}=\{t_1, \dots , t_n\}\), then the frequency of itemset X, which is a combination of popular POIs, is defined as: \(\sigma (X) = \left| \left\{ t_i \mid X \subseteq t_i , t_i \in \mathcal {T} \right\} \right| \). Let X, Y be an antecedent and a consequent itemset \((X \cap Y = \emptyset )\); then \(rule\left( X \rightarrow Y\right) \) denotes an association rule from X to Y. Accordingly, three measures can be defined [20]: Support \(s(X \rightarrow Y)\), which indicates how frequently the itemsets appear together; Confidence \(c\left( X\rightarrow Y\right) \), which indicates how often the rule has been found to be true; and Lift \(lift\left( X\rightarrow Y\right) \), which indicates the ratio of the observed support to that expected if the itemsets were independent. Notably, because the Instagram database in this study is relatively large, even a valuable association rule does not have a significantly large Support. Therefore, Lift is used to filter valuable association rules. The larger the lift of a rule, the more useful the rule potentially is for predicting the consequent in future data sets.
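In their standard form, these measures can be written as \(s(X \rightarrow Y) = \sigma (X \cup Y) / \left| \mathcal {T}\right| \), \(c(X \rightarrow Y) = \sigma (X \cup Y) / \sigma (X)\) and \(lift(X \rightarrow Y) = c(X \rightarrow Y) / \left( \sigma (Y) / \left| \mathcal {T}\right| \right) \). The following sketch evaluates them for a single rule by direct counting. It does not implement FP-Growth, and the toy transactions are invented for illustration.

```python
def sigma(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(X, Y, transactions):
    """Return (support, confidence, lift) of the rule X -> Y."""
    n = len(transactions)
    n_xy = sigma(X | Y, transactions)
    support = n_xy / n
    confidence = n_xy / sigma(X, transactions)
    lift = confidence / (sigma(Y, transactions) / n)
    return support, confidence, lift


transactions = [
    frozenset({"A", "B"}), frozenset({"A", "B"}),
    frozenset({"A"}), frozenset({"A"}),
    frozenset({"C"}), frozenset({"C"}), frozenset({"C"}), frozenset({"C"}),
]
support, confidence, lift = rule_metrics(frozenset({"A"}), frozenset({"B"}),
                                         transactions)
```

In this toy example, B appears in a quarter of all transactions but in half of those containing A, so the lift of the rule from A to B is 2.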

To ensure that the mined association rules are indeed valuable, the p-value of each rule was calculated in this study as well. The null hypothesis is that the itemsets X and Y occur independently of each other. Under this null hypothesis, the p-value of \(rule\left( X\rightarrow Y\right) \) is the probability of obtaining test results, i.e. co-occurrence frequencies, at least as high as those actually observed. A very small p-value means that the observed frequency of an itemset would be quite unlikely if there were no association between X and Y.
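One possible way to obtain such a p-value is the one-sided Fisher's exact test, i.e. the upper tail of a hypergeometric distribution. This choice of test is an assumption made for illustration, as the test used in this study is not specified here; the function name and the counts in the example are hypothetical.

```python
from math import comb


def rule_p_value(n, n_x, n_y, n_xy):
    """P(K >= n_xy) for K ~ Hypergeometric(n, n_x, n_y).

    n    : total number of transactions
    n_x  : transactions containing X
    n_y  : transactions containing Y
    n_xy : transactions containing both X and Y (observed)
    """
    upper = min(n_x, n_y)
    tail = sum(comb(n_x, k) * comb(n - n_x, n_y - k)
               for k in range(n_xy, upper + 1))
    return tail / comb(n, n_y)


# 10 transactions, X and Y each occur 5 times and always together.
p = rule_p_value(n=10, n_x=5, n_y=5, n_xy=5)
```

Here only one of the \(\binom{10}{5} = 252\) equally likely placements of Y under independence reproduces the observed overlap, so the p-value is 1/252.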

4 Results and Discussion

4.1 Data Extraction and Visualization

The data extraction process proved able to capture several million posts published in the Lake Constance region. It extracted all Instagram posts tagged with a location within the analyzed region. The database consists of 9.6 GB of textual data and 146.6 GB of media files, covering 46,658 locations, 1,215,063 users and 2,490,640 posts from the period between May 2013 and September 2020. These objects carry valuable information such as geographic coordinates or image classification tags added by Instagram’s computer vision algorithm.

Looking in detail at the temporal distribution of the established Instagram database in Fig. 1, an exponential growth of Instagram usage in the analyzed region can clearly be observed. While fewer than 500 posts were published each week in early 2014, that number grew exponentially to over 65,000 by the end of 2019. Notably, more posts are published in summer than in winter, which is visible as a steep peak in the middle of each year. This result also confirms that the Lake Constance region is especially popular as a summer vacation spot.

When all locations are plotted in Fig. 2 as a heat map weighted by the number of media uploaded at each location, a distribution with a rough clustering trend can be observed. The heat map clearly demonstrates that the raw data does not constitute an appropriate input to association rule analysis due to a large amount of noise. As indicated above, this data was utilized for knowledge mining at two different geographical scales.

Fig. 1. Temporal distribution of the number of posts in the years 2014 to 2019

Fig. 2. Heat map of the spatial distribution of the raw data, weighted by the number of posts

4.2 Region-Level Association Rule Analysis

Based on the data obtained from Instagram, outliers were processed as a first step. Then, the retained data for the whole Lake Constance area were clustered, so that the users’ movement patterns could be analyzed on a large geographic scale. The two methods introduced in Subsect. 3.2 were implemented to remove noise from the raw data and to cluster single uploads into POIs. The comparison before and after noise removal is shown in Table 1. To ensure comparability of the clustering results, approximately the same amount of data was removed as noise by both methods. Clustering results are shown in Fig. 3, which presents 53 popular POIs (color-coded) derived from upload locations distributed in the Lake Constance region.

Table 1. Data information comparison before and after noise removal

Fig. 3. Distribution of 53 popular POIs in the Lake Constance region

To quantitatively compare the results of the two clustering algorithms, the Silhouette Coefficient, the Calinski-Harabasz Index and the Davies-Bouldin Index were used to evaluate the clustering performance without ground truth. The results in Table 2 demonstrate that DBSCAN performs better than NK-MEANS on all criteria except the Calinski-Harabasz Index. This is probably because the Calinski-Harabasz Index is generally higher for convex clusters, like those produced by k-means clustering, than for other concepts of clusters, such as the density-based clusters obtained from DBSCAN. As the distribution of clusters in Fig. 3 shows, DBSCAN achieves better clustering results than NK-MEANS, without over-clustering (e.g. in the Friedrichshafen region) or under-clustering (e.g. in the Ravensburg and Weingarten region) problems. Thus, for the region-level dataset from Instagram, DBSCAN outperforms NK-MEANS for clustering 2D geographic information, and therefore the clustering results from DBSCAN were used in the subsequent association rule mining.
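As an illustration of evaluation without ground truth, the Silhouette Coefficient can be computed as sketched below. The sketch assumes planar Euclidean distances and at least two clusters, and it is not the evaluation code of this study; the points and labelings are invented.

```python
import math


def silhouette(points, labels):
    """Mean silhouette value (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:  # convention: singleton clusters score 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
s_good = silhouette(points, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
s_bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # labels mix the two groups
```

A labeling that follows the two spatial groups scores close to 1, whereas a labeling that mixes them scores much lower.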

Table 2. Comparison of clustering performance by DBSCAN and NK-MEANS

Based on the 53 popular POIs derived from DBSCAN, association rule mining was performed on the frequent itemsets obtained with a minimum support threshold of \(s(X\rightarrow Y)=0.004\). Meaningful association rules (\(lift(X\rightarrow Y) > 1\)) between 16 region-level POIs are shown in Fig. 4. Since the association rules for the same POIs often appear in both directions, Table 3 only presents the rule between two POIs that has the higher Support, Confidence or Lift. Moreover, it was verified that the p-value of each underlying frequent itemset indicates significance (i.e. p-value \(< 10^{-5}\)). In conclusion, the retained clusters largely match the actual extent of popular tourist cities in the Lake Constance region, and the mined association rules reflect the movement patterns of users between cities around Lake Constance.

Fig. 4. Illustration of bi-directional movement patterns within the Lake Constance region

Table 3. Association rules in the Lake Constance region, sorted by Lift in descending order

4.3 City-Level Association Rule Analysis

To explore the movement patterns of users at a smaller geographical scale, this study selected data within the city of Friedrichshafen to mine the association rules between POIs. Due to the relatively small amount of data, there is little difference between the clustering algorithms. DBSCAN, which performed better in Subsect. 4.2, was used for clustering at the city level as well. Eventually, 499 locations were grouped into 28 clusters after noise removal. These clusters contain a total of 21,755 geo-tagged photos, and their distribution is shown in Fig. 5.

Fig. 5. Distribution of 28 popular POIs within Friedrichshafen

Fig. 6. Illustration of movement patterns within Friedrichshafen

Table 4. Association rules within Friedrichshafen, sorted by Lift in descending order

These 28 popular POIs, which are mainly located near the lakeshore, largely cover the tourist attractions in Friedrichshafen. Based on these clusters, a minimum support threshold of \(s(X\rightarrow Y)=0.001\) was set to filter frequent itemsets, and then meaningful association rules were identified, which are listed in Table 4.

The mined association rules are distributed between 13 popular POIs shown in Fig. 6, which are centered on the lakeside promenade (1) and span the railway station (6), the club house (12) and the shopping center (13). These results reflect the users’ movement trajectories among tourist attractions within Friedrichshafen.

5 Conclusion and Outlook

Against the background of the increasing popularity of social media, this study is the first to collect media containing geographic information from Instagram through an efficient crawler framework and to use this data to mine movement patterns of users within the scenic area of Lake Constance. It can be concluded that big data from social media contains valuable knowledge for the local tourism industry and should, therefore, be given more attention in the future.

During association rule mining, it was found that the volume of data plays a significant role in determining the reliability of association rules. The movement patterns at the city level are less reliable than those at the region level because of the relatively small amount of data. This means that future research on movement patterns may heavily depend on the availability of big data from various social media platforms.

However, it is worth noting that the data crawled from Instagram also contains plenty of noise, which must be cleaned before analysis, e.g. by removing geographic outliers or by manually labeling location names that are useless or incorrect for the study of movement patterns.

The database built from the social media data of Instagram still offers great potential for future research. By considering the temporal order of each user’s photo uploads, the sequential patterns of tourists can be explored in combination with geographic information, which can lead to more precise recommendations for local tourism. Furthermore, in addition to geographical information, the content of users’ uploaded photos, related comments and account profiles can be analysed with Natural Language Processing (NLP) or Computer Vision (CV) techniques to discover more feedback-based knowledge and, thus, to propose highly individualized travel advice.