1 Introduction

In recent years, tourists have become increasingly reliant on social media platforms, such as Instagram, to make travel decisions (Bangare et al. 2022; Hu et al. 2014). Through social media, tourists can discover popular tourist attractions and points of interest, enhancing the information provided by official tourism websites (Živković et al. 2014; Bonilla-Quijada et al. 2021; İştin 2020).

Tourists can benefit from the content posted on social media in several ways. They can obtain valuable insights, recommendations, and information, including updates on travel trends (Fotis et al. 2011). In addition, the experiences of other users can provide tourists with authentic information about destinations and help them discover lesser-known tourist destinations (Kim and Kim 2020). Users can also provide feedback by commenting, reviewing, and sharing tips.

However, there is a challenge in finding relevant tourism information on social media platforms due to the overwhelming amount of irrelevant or low-quality information shared (Bontcheva et al. 2013). For example, many Instagram photos shared using location-specific hashtags, such as those linked to tourist attractions, may not accurately represent the tourist attraction or location. The photos posted vary widely, ranging from photos of clothing or shoes to photos promoting local businesses, food, or art, or close-ups of people. Therefore, it may be difficult for tourists to find relevant photos associated with specific locations among the huge volume of content available, most of which is likely irrelevant to individual tourists making travel decisions.

This paper proposes VISTA: Visual Identification of Significant Travel Attractions, an innovative method aimed at helping tourists find relevant information about tourist destinations on Instagram. VISTA utilizes deep learning, a machine learning technique based on artificial neural networks, which learn patterns that enable photos that are related to tourism to be distinguished from those that are not. Our machine learning classifier employs active learning, a method that iteratively combines machine learning with manual labeling, to improve detection of the targeted photos by manually labeling a small number of photos about which the classifier is most uncertain.

By using VISTA, tourists can filter out irrelevant or low-quality photos that do not relate to tourist destinations and focus on photos that capture the essence of those destinations, streamlining the process of finding information about tourist destinations on social media and enhancing the travel planning process.

As part of this study, we created the Instagram Top-10 Israeli Cities dataset, a collection of 791,352 real-world photos mined from Instagram for the 10 most popular Israeli cities on the platform. We integrated active learning in an iterative machine learning process to produce an effective and robust classification model capable of differentiating between relevant and irrelevant photos for tourists. Our machine learning classifier’s performance was evaluated on a validation set of 804 photos generated from the Instagram Top-10 Israeli Cities dataset. The machine learning model obtained an accuracy score of 0.965 and a weighted F1 score of 0.964.

Additionally, to evaluate the trained classifier’s robustness and ability to generalize to photos of different cities in the world, we evaluated this machine learning classifier on the InstaCities1M dataset, an additional dataset of Instagram photos taken of the 10 most populous English-speaking cities in the world. When our classifier, which was trained on Israeli locations, was evaluated on a test set consisting of 500 photos taken of 10 cities worldwide, an accuracy score of 0.958 and a weighted F1 score of 0.959 were achieved.

To demonstrate the use of VISTA, we used our classifier to filter out irrelevant photos from the previously mentioned datasets: the Instagram Top-10 Israeli Cities and the InstaCities1M datasets. Manual validation of 200 photos from each dataset confirmed over 0.95 accuracy. The results demonstrate the proposed VISTA method’s ability to filter out irrelevant photos associated with specific locations and identify relevant tourist-related photos on Instagram. In addition, we performed a comparative analysis of the two filtered photo collections (tourism-related and non-tourism-related photos) based on photo proportion, user engagement, and object comparison.

The contributions of this paper are fourfold:

  • We present VISTA, a novel method that utilizes machine learning, and more specifically, deep learning, as well as active learning, to streamline tourists’ search for information about their chosen destination.

  • To the best of our knowledge, this is the first work to combine deep and active learning to differentiate between relevant and irrelevant photos associated with tourism.

  • We present three novel datasets: (1) the Instagram Top-10 Israeli Cities dataset, which was created for this study; (2) a dataset of tourism-related photos filtered from the Instagram Top-10 Israeli Cities dataset; and (3) a dataset of tourism-related photos filtered from the InstaCities1M dataset.

  • We demonstrate VISTA’s effectiveness by performing a comparative analysis of tourism-related and non-tourism-related photo collections with respect to photo proportion, user engagement, and object comparison.

The remainder of this paper is organized as follows: Sect. 2 presents a brief overview of studies that focused on elements of this study or addressed issues similar to those discussed in this study. In Sect. 3, we present the VISTA method, and in Sect. 4, we provide a detailed description of the datasets used in this study. In Sect. 5, we describe our evaluation of VISTA on two datasets: a dataset containing images of Israel and an additional dataset that contains images of cities around the world. We demonstrate the use of VISTA to analyze Instagram photo collections related to tourism in Sect. 6. Finally, our conclusions, limitations, and plans for future research are presented in Sect. 7.

2 Related work

This section explores related works combining tourism, social media, and photos. Section 2.1 provides an overview of existing tourism research analyzing social media photos. Section 2.2 focuses on machine learning and deep learning algorithms for analyzing tourism-related photos on social media. Section 2.3 covers filtering irrelevant information from social media data. Lastly, Sect. 2.4 discusses active learning approaches and relevant research.

2.1 Analysis of tourism-related photos collected from social media

Numerous studies have examined the interactions between tourism and social networks, with particular attention paid to the use of images shared on these platforms. Tourism-related photos on social media have emerged as an influential and rich source of information, significantly contributing to destination marketing campaigns’ effectiveness (Pan et al. 2021). Tourism-related data from social media is often analyzed using metadata embedded in the images, such as geoposition indicators and tags, to accomplish various tourism-related tasks, such as conducting market analyses (Wood et al. 2013), gaining insights into tourist behavior (Gallo et al. 2017; Arthan et al. 2021) and preferences (Su et al. 2016; Hausmann et al. 2018; Pickering et al. 2020), as well as identifying travel patterns (Torrisi et al. 2015; da Mota and Pickering 2021; da Mota et al. 2022) and popular attractions (Kim et al. 2019; Marine-Roig 2019; Paül i Agustí, D. 2020).

Geotagged images from social media platforms have also been utilized to assess and map cultural ecosystem services. For example, Oteros-Rozas et al. (2018) identified landscape features that possess value in terms of cultural ecosystem services, including recreation, aesthetics, and cultural heritage. Clemente et al. (2019) analyzed social preferences for cultural ecosystem services within a Portuguese Natural Park and demonstrated that social media photographs can be used as reliable indicators of cultural ecosystem services. Rossi et al. (2020) also used social media images to identify and map the social preferences for ecosystem services in a remote protected area in the Argentinean Andes. Schirpke et al. (2021) analyzed user-generated tags of Flickr photos to identify the cultural ecosystem services provided by 2,807 lakes in the European Alps. The authors identified 12 different cultural ecosystem services based on 418 unique tags.

Previous studies have mainly focused on extracting quantitative and qualitative information about tourists and their interests from the metadata associated with social media photos, rather than analyzing the visual content of the images directly. On the other hand, our research focuses on examining photos directly and utilizing deep learning techniques to analyze and interpret the images to achieve tourism-related tasks automatically.

Several studies have investigated the differences between the images of destinations presented by official tourism sources and those shared by tourists on social media. For instance, Paül i Agustí (2018) demonstrated that Instagram tourist images only partially overlapped with those in official brochures and travel guides, indicating distinct differences between the attractions tourists highlight and those promoted by official brochures and travel guides. Bernkopf and Nixon (2019) focused on Mexico City and found that Instagram images and random Google images enhanced the city’s image more effectively than content posted by destination marketing organizations (DMOs). Bhatt and Pickering (2022) compared official images of Chitwan National Park with those shared by tourists on Flickr. The study found that, although wildlife and landscapes were common themes, tourists also highlighted cultural attributes, indicating a divergence between official marketing and tourist interests. Whereas these studies compared images from official sources with those shared on social media, our study provides a systematic process for filtering and identifying tourism-oriented images.

2.2 Analysis of tourism-related photos collected from social media using machine learning and deep learning

Recently, the classification of tourism imagery and scene recognition in social media using deep learning has been the subject of several studies. Zhang et al. (2019) used the ResNet-101 architecture to categorize 35,356 Flickr photos of tourists from Beijing into 11 categories, including culture, entertainment, food, insects, animals, mountains, and natural phenomena. A year later, Zhang et al. (2020) performed imagery classification for scene recognition using deep learning, recognizing 78 scenes and demonstrating their model’s efficiency and robustness for scene recognition. Kim et al. (2020) classified sightseeing elements by analyzing Flickr photos uploaded by Seoul tourists and, based on these photos, identified 11 areas of interest in Seoul. Bhosale and Pushkar (2021) proposed a convolutional neural network (CNN) model that classified images as heritage, beaches, wildlife, and pilgrimage sites, obtaining 91.4% accuracy. Derdouri and Osaragi (2021) presented a method for categorizing tourists and residents in social media images from Flickr through a CNN that achieved 75.5% accuracy.

Several studies have examined social media images of parks and nature-based tourism, including Väisänen et al. (2021), which explored human-nature interactions in Finland’s national parks. They automatically analyzed photos taken in the parks by visitors both domestically and internationally by semantic clustering, scene classification, and object detection. Bhatt and Pickering (2022) explored Flickr images of Chitwan National Park. Images of this park include natural elements, landscapes, cultural elements, tourism activities, and facilities. Huai et al. (2022) used geotagged Flickr photos to assess tourists’ and locals’ spatial and landscape preferences in Brussels’ urban parks. This was done through scene recognition, image clustering, and image labeling. Their results suggested that tourists and locals have distinct spatial preferences for urban parks; tourists’ photos were spatially concentrated on well-known parks in the city center, while locals’ photos were spatially dispersed.

Numerous studies have used deep learning techniques to identify destinations, attractions and preferences using social media photos and then applied clustering techniques to those images. Using social media photos, Figueredo et al. (2017) classified tourists using techniques, such as CNNs and fuzzy logic, achieving an accuracy rate of 82%. Richards and Tunçer (2018) employed an online machine learning algorithm based on Google Cloud Vision and hierarchical clustering to group social media photos for automated ecosystem service assessment. Giglio et al. (2019) utilized social media to identify tourism attractiveness in six Italian cities. They employed artificial neural networks for object recognition in images and clustering techniques to analyze the behavior and attractiveness of various places and points of interest visited by tourists. Ma and Kirilenko (2020) also identified tourism activities in social media photos. They collected 13,875 photos from Instagram taken within a lake’s area. Google Vision AI software identified image elements. Using the Latent Dirichlet Allocation (LDA) algorithm, the photos were automatically assigned to different topics based on their content. Xiao et al. (2020) employed CNNs and LDA to detect visual content and major tourism topics from travel photos of Jiangxi, China. This revealed 65 scenes of Jiangxi from 531,629 travel photos and found major tourism topics. Kim and Kang (2022) proposed a method for automatically classifying tourist photos by attractions, applying image feature vector clustering and a deep learning model. They used the VGG16 network and HDBSCAN clustering to extract regional image categories and applied a Siamese network for photo classification. A study performed by Kirilenko et al. (2023) focused on image recognition of photos published by 140 Instagram travel influencers prior to and during the COVID-19 pandemic. 
The authors employed techniques such as natural language processing, Google Vision AI, and neural network embedding to analyze changes in content and topics during the pandemic. In contrast to the studies mentioned above, our study proposes a methodology for filtering the large amount of content available on social media to distinguish between relevant and irrelevant photos associated with the tourism domain. Our methodology harnesses deep and active learning to distill valuable information from large-scale photo datasets and extract meaningful insights related to travel experiences and destinations.

2.3 Filtering irrelevant information from social media

Due to the vast amount of text and imagery shared on social networks, much of which is redundant and may be irrelevant for certain users, researchers have developed techniques for effectively filtering out irrelevant information. Giannoulakis and Tsapatsoulis (2019), Reddy and Reddy (2022), and Janarthanan et al. (2022) focused on improving the tag–image pairs that can be trained for automated image annotation (AIA) systems. The studies used the modified version of the hyperlink-induced topic search (HITS) algorithm to filter irrelevant hashtags from Instagram photos. Xia et al. (2015) proposed a bi-layer clustering framework to locate relevant tags in social photos. Argyrou et al. (2018) used a latent Dirichlet allocation (LDA) model to retrieve the relevant Instagram hashtags that are related to the image’s content and can be used for AIA. With respect to photos, Nguyen et al. (2017) proposed a novel method for automatic damage assessment in images. To help human annotators focus their time and effort on analyzing useful image content, the researchers developed an image-processing pipeline based on deep learning. The pipeline automatically detects and filters out photos generated after a major disaster that are irrelevant or do not convey significant information for crisis response and management and eliminates duplicate or near-duplicate images that do not provide additional information. The authors used a CNN and trained a binary classifier that differentiates between images related to disasters and those that are not.

2.4 Active learning applications

Active learning is a subfield of machine learning, and more generally of artificial intelligence. The key idea behind active learning is that by allowing the machine learning algorithm to select the data from which it is trained, it can achieve enhanced accuracy with fewer labeled training examples (Settles 2009, 2011). It is particularly useful when labeling data is expensive or time-consuming since it reduces the amount of data to be labeled without compromising the model’s performance (Settles 2009). Active learning typically involves an iterative cycle, where a small, labeled dataset is used to train the model. In the next step, the model determines which unlabeled instances would be most beneficial to learn from. It is common for these instances to be those about which the model is most uncertain or those which are most representative of the unlabeled data. Following the selection of the next instances, they are labeled by human annotators and added to the training dataset. This training dataset is used to retrain the model, and the process is repeated. This process continues until the desired accuracy level is achieved or the labeling budget is exhausted.
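The iterative cycle described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited studies: the one-dimensional threshold "model", the `oracle` function standing in for human annotators, and the batch size and labeling budget are all assumptions made for this example.

```python
import random

def train(labeled):
    """Fit a toy 1-D model: the midpoint between the largest labeled
    negative and the smallest labeled positive example."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2

def most_uncertain(threshold, pool, batch_size):
    """Select the unlabeled instances closest to the decision threshold."""
    return sorted(pool, key=lambda x: abs(x - threshold))[:batch_size]

def active_learning(pool, seed, oracle, batch_size, budget):
    """Train, query the most uncertain instances, label them, retrain,
    and repeat until the labeling budget is exhausted."""
    labeled = list(seed)
    pool = list(pool)
    threshold = train(labeled)
    while budget > 0 and pool:
        batch = most_uncertain(threshold, pool, min(batch_size, budget))
        for x in batch:
            labeled.append((x, oracle(x)))  # human annotation step
            pool.remove(x)
        budget -= len(batch)
        threshold = train(labeled)          # retrain on the enlarged set
    return threshold, labeled

def oracle(x):
    """Hypothetical annotator: the true class boundary is at 0.5."""
    return int(x >= 0.5)

random.seed(0)
pool = [random.random() for _ in range(1000)]
seed = [(0.0, 0), (1.0, 1)]  # small initial labeled set
threshold, labeled = active_learning(pool, seed, oracle, batch_size=5, budget=20)
```

With only 20 queried labels out of 1,000 candidates, the learned threshold ends up close to the true boundary, which is the effect that motivates selecting instances by uncertainty instead of at random.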

Various studies have used active learning to address classification problems. For instance, Joshi et al. (2009) used an active learning method for multi-class image classification. The researchers applied their method to a variety of datasets, including letters, digits, objects, and natural scenes. The proposed method reduced the number of training records required. In addition, Gala et al. (2014) demonstrated that active learning can substantially reduce training time while minimizing generalization errors when employed in automated neurite tracing. Kim et al. (2020) applied active learning to semantic segmentation, reducing labeling effort and time and increasing training efficiency. Similarly, Yao et al. (2021) utilized active learning and support vector machine (SVM) algorithms to classify facial action units for human facial expression recognition. They found that active learning reduced both the number of non-support vectors in the training set and the labeling and training time. Li et al. (2022) applied active learning to PDF malware detection, effectively reducing the number of training sets while maintaining detection performance. Active learning has also been employed in several social network studies. Tran et al. (2017) combined active learning and self-learning to reduce the labeling effort required for named entity recognition from tweet streams. In experiments using Twitter data (both machine-labeled and manually labeled data), the proposed method was shown to significantly reduce human labeling effort and improve system performance. Elyashar et al. (2021) used active learning to train a classification model that detected 53K Twitter accounts of healthcare professionals. These studies demonstrate the effectiveness of active learning in addressing classification problems and its ability to increase the efficiency of machine learning in a wide range of applications.

Figure 1 compares the main studies in the field of social media and tourism that are related to this study. Each row corresponds to a numbered study group (with the exact study names provided in Table 1), while the last row represents our VISTA method. For each study, we compare whether it used text analysis, image analysis, deep learning, and active learning and, for image tasks, whether it performed image filtering, classification, or clustering. As can be observed, we are the first to combine an active learning method with photo filtering. Despite the similarity between Study 5 (Nguyen et al. 2017) and ours, Nguyen et al. focused on filtering images according to whether they were related to disasters, which differs from our domain and objectives.

Fig. 1 Comparative analysis of related works

Table 1 Related works by groups

3 Methods

In this section, we describe the proposed VISTA method which aims to improve tourists’ ability to search for information about tourist destinations using photos on Instagram. Our novel method consists of four phases: (1) data collection, in which Instagram photos are collected, (2) defining and manually labeling tourism-related photos, (3) training a machine learning classifier by using an iterative active learning process, and (4) using the machine learning classifier to filter out irrelevant photos from the large number of photos available on Instagram (see Fig. 2).

Fig. 2 Overview of VISTA’s steps

3.1 Data collection

The data collection phase consists of three sub-phases: (1) defining the country or countries of interest and the tourist destinations, (2) defining the hashtags, and (3) collecting posts from Instagram (see Fig. 3).

Fig. 3 Data collection sub-phases

3.1.1 Defining country or countries of interest

To collect photos associated with tourism-related destinations, we first must define the country or countries of interest. In this study, we focused on Israel. Next, tourist destinations of interest are selected. Tourism-related destinations in a given country can include, for example, national parks, cities and urban areas, historical sites, and natural wonders. To collect the photos without introducing bias, a list of tourism-related destinations from a reliable website is required. For Israel, we obtained a list of the largest cities in Israel (in terms of their population) using the Worldometers.info website (Worldometers.info 2023). This website was established to provide access to world statistics and is run by a team of developers, researchers, and volunteers from around the world. We retrieved a list of Israel’s 70 most populated cities from Worldometers.info. Among the cities listed are Jerusalem and Tel Aviv, which are the most populous cities in Israel, with 801K and 432K residents, respectively.

We recommend obtaining the list of destinations from a reliable website rather than from online travel agencies (e.g., Tripadvisor), which can be opportunistic.

3.1.2 Defining hashtags

Hashtags are used to tag and categorize content on social media platforms (Small 2011). They assist social media platforms in serving posts to relevant users and providing users with the ability to search for specific content. In general, hashtags can be used on any social media platform, but they are most commonly used on Instagram and Twitter.

Instagram users frequently use the hashtag symbol to tag photos of attractions they have visited, thereby creating a visual guide for others who may wish to visit the same locations (Giannoulakis and Tsapatsoulis 2016; Celuch 2021).

From the list of the 70 most populated Israeli cities, we focused only on the 10 cities with the highest number of Instagram posts. We manually searched for appropriate hashtags associated with each city on the list. To do this, we entered the name of each city, prefixed with the hashtag symbol, in the search box on Instagram. In response, Instagram retrieved a list of similar hashtags, along with the total number of posts associated with each hashtag. For example, over 6.7M posts are associated with #telaviv, and more than 614K posts are associated with #telavivcity. We also searched for similar hashtags by entering the official profile of each city (if one exists) in the search box on Instagram and searching for similar hashtags in the city’s recent posts. The manual search process described above resulted in a list of hashtags, along with the number of posts, for each of the 10 cities with the highest number of Instagram posts, which is presented in Table 2.

Table 2 The hashtags and number of Instagram posts of the selected Israeli cities

3.1.3 Data collection

As our methodology is data-driven, we collected photos from the 10 Israeli cities with the highest number of Instagram posts. To represent each city, we selected only one hashtag from its associated hashtags—the hashtag with the highest number of posts for the city, i.e., we focused on the most prevalent hashtag (the hashtags used appear in bold in Table 2). Doing so was efficient and less time-consuming than collecting data for all of the hashtags associated with each city.
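The selection rule described above (one hashtag per city, chosen by post count) amounts to a simple maximum over each city's candidate hashtags. The sketch below is illustrative only: the two Tel Aviv counts are the approximate figures quoted in Sect. 3.1.2, and the dictionary layout and function name are assumptions made for this example.

```python
def most_prevalent_hashtag(hashtag_counts):
    """Return the hashtag with the highest number of associated posts."""
    return max(hashtag_counts, key=hashtag_counts.get)

# Approximate post counts quoted in the text; the structure is hypothetical.
city_hashtags = {
    "Tel Aviv": {"#telaviv": 6_700_000, "#telavivcity": 614_000},
}

selected = {
    city: most_prevalent_hashtag(counts)
    for city, counts in city_hashtags.items()
}
```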

To collect the Instagram photos, we used the RapidAPI website to connect with the Instagram API. It is possible to attach a photo, a series of photos, or a short video to a post on Instagram. Therefore, to gather photos, we downloaded Instagram posts. In 49,037 cases, the post included a video, so its first frame was stored as a photo. For all other posts, the photo attached to the post was saved; if a post included several photos, just the first photo was saved. As a result, a single photo was saved for each post. Posts from November 1, 2011 to February 16, 2023 were collected. To ensure that we collected a proportional number of posts for each city, we randomly downloaded 5% of the posts for each city. A total of 791,352 Instagram posts with photos were collected. Each Instagram post includes the publication date, number of likes, location, text, and a photo.
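The two rules in this paragraph (a fixed 5% random sample per city and exactly one photo per post) can be sketched as follows. This is not the collection code used in the study: the post structure with `media`, `type`, and `url` fields is hypothetical and does not reflect the actual API response format, and extracting the first frame of a video is left to the caller.

```python
import random

def sample_posts(post_ids, fraction=0.05, seed=0):
    """Draw a uniform random sample containing `fraction` of a city's posts."""
    k = round(len(post_ids) * fraction)
    return random.Random(seed).sample(post_ids, k)

def representative_item(post):
    """Pick the single item kept for a post: its first attached media item.
    If that item is a video, flag that its first frame should be stored."""
    first = post["media"][0]
    return {"source": first["url"], "use_first_frame": first["type"] == "video"}

# Hypothetical posts illustrating the three cases described in the text.
single_photo = {"media": [{"type": "photo", "url": "p1.jpg"}]}
carousel = {"media": [{"type": "photo", "url": "c1.jpg"},
                      {"type": "photo", "url": "c2.jpg"}]}
video = {"media": [{"type": "video", "url": "v1.mp4"}]}

sampled = sample_posts(list(range(1000)))
kept = [representative_item(p) for p in (single_photo, carousel, video)]
```

Sampling the same fraction from every city keeps the per-city share of the collected data proportional to the city's share of posts under its hashtag.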

3.2 Defining tourism-related photos

To detect appropriate photos, we defined criteria for the detection of tourism-related photos. To do this, we sampled 100 Instagram posts randomly and manually reviewed their photos. During our photo review, we identified three types of tourism-related photos: (1) landscapes, (2) wildlife and animals, and (3) landmarks and monuments (see Fig. 4).

Fig. 4 Examples of the three types of tourism-related photos

  1. Landscapes - photos taken in natural settings, such as mountains, forests, beaches, or deserts. Also included in this category are flowers and other plants. Artificial landscapes, such as city skylines, panoramic views, gardens, or parks, are also included.

  2. Wildlife and Animals - photos related to wildlife and animals.

  3. Landmarks and Monuments - photos taken of landmarks, such as architectural sites, monuments, or iconic buildings.

In each case, tourism-related photos include photos in which a person or multiple people appear in the scene; for example, a photo of a person on the beach.

Tourism-related photos include photos created by both tourists and residents, for example, photos taken by tourists while visiting a destination as well as photos taken by residents. Several studies have used photos to examine the differences between tourists and residents (Derdouri and Osaragi 2021, 2021a; Gunter and Önder 2021). Tourist-generated content typically highlights popular attractions and entertainment, with a strong preference for amusement and object-focused photos, in contrast to resident-generated content, which is primarily nature-focused (Derdouri and Osaragi 2021a). This study adopts inclusive criteria that encompass both tourist-generated and resident-generated content to capture a wide range of tourism-related photos that represent both tourists and residents.

3.3 Using active learning to train a machine learning classifier

To distinguish tourism-related photos from non-tourism-related photos, we trained a machine learning classifier using an iterative process known as active learning (Olsson 2009). The iterative process of training the classifier, which is illustrated in Fig. 5, consists of four sub-phases: (1) selecting the tourist destination and using the previous machine learning classifier to classify the tourist destination photos, (2) photo inspection by annotators, (3) training a new machine learning classifier, and (4) evaluating the results.

Fig. 5 Iterative active learning process used to train our machine learning classifier

In active learning, a number of iterations are performed to train a machine learning classifier. The output of each iteration is a trained classifier. In this study, we used the VGG-19 architecture to train a classifier. Simonyan and Zisserman (2014) proposed a convolutional neural network (CNN) architecture called VGGNet, which consists of several convolutional layers with small \(3\times 3\) filters, followed by max-pooling layers and fully connected layers. The VGG-19 network has 19 weight layers, which determine the network’s depth.
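The figure of 19 weight layers can be checked by counting the layers in the VGG-19 configuration (configuration "E" in Simonyan and Zisserman 2014). The short sketch below only tallies layers; it does not build or train a network, and the variable names are chosen for this example. The 1000-way final layer reflects the original ImageNet network; how the output layer was adapted for the binary task is not specified in this section.

```python
# Integers are the output channels of 3x3 convolutional layers;
# "M" marks a max-pooling layer, which has no trainable weights.
VGG19_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

# Fully connected layers of the original ImageNet classifier.
VGG19_FC = [4096, 4096, 1000]

conv_layers = sum(1 for v in VGG19_CONV if v != "M")
pool_layers = VGG19_CONV.count("M")
weight_layers = conv_layers + len(VGG19_FC)
```

Only the convolutional and fully connected layers carry weights, so the 16 convolutional layers and 3 fully connected layers together give the network its name.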

To build accurate statistical models, supervised machine learning requires labeled training sets. Active learning reduces manual annotation efforts by focusing on photos that contribute most to improving the classification model. To train the machine learning classifier, we used the uncertainty sampling active learning strategy (Lewis and Gale 1994), manually annotating photos with the lowest confidence score. The confidence score of a machine learning classifier reflects the level of certainty assigned by the classifier to its predictions. The classifier assigns a confidence score to each photo prediction. The level of confidence is often expressed as a probability, with values closer to one reflecting high confidence and values closer to zero indicating low confidence.
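As a minimal sketch of uncertainty sampling, the following selects the photos whose predictions have the lowest confidence. It assumes, for illustration, a binary classifier that outputs the probability of the tourism-related class and defines confidence as the probability of the predicted class; the function names and example probabilities are hypothetical.

```python
def confidence(p_tourism):
    """Confidence of a binary prediction: probability of the predicted class."""
    return max(p_tourism, 1.0 - p_tourism)

def least_confident(probabilities, k):
    """Return the ids of the k photos with the lowest-confidence predictions."""
    ranked = sorted(probabilities, key=lambda pid: confidence(probabilities[pid]))
    return ranked[:k]

# Hypothetical classifier outputs: photo id -> P(tourism-related).
probs = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}
queries = least_confident(probs, k=2)
```

Here photos "b" and "d" lie closest to the decision boundary, so they are the ones sent to the annotators; in the study, k corresponds to the 500 photos selected per iteration.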

Each sub-phase presented in Fig. 5 is described in detail below.

3.3.1 Selecting the tourist destination and applying the previous machine learning classifier

The first sub-phase of the active learning process involves selecting the tourist destination for the current iteration (Fig. 5, sub-phase 1). In each iteration, we recommend selecting a tourist destination with different geographical characteristics. For example, if the first tourist destination selected is a coastal location, the next should have other geographical characteristics. After selecting the tourist destination, the machine learning classifier from the previous iteration is applied to classify all the photos of the tourist destination. In the first iteration, no classifier has been trained yet, so this step is skipped and the process moves directly to sub-phase 2.

3.3.2 Photo inspection by annotators

The 500 photos with the lowest confidence score according to the classifier are chosen to be manually classified (Fig. 5, sub-phase 2). Since the machine learning classifier has not been trained yet in the first iteration, 500 random photos of the tourist destinations were selected.

The 500 photos are manually labeled as tourism-related or non-tourism-related and are added to the training set. Each photo was labeled by two human annotators; this dual-annotation approach was employed to ensure accuracy and minimize individual bias. The annotators were familiar with the criteria for categorizing photos as tourism-related or non-tourism-related defined in Sect. 3.2, including the characteristics that define tourism-related photos: landscapes, wildlife, animals, historical landmarks, or monuments. Annotators reviewed each photo to identify key elements that mark it as a tourism-related or a non-tourism-related photo, and photos that did not fulfill the criteria were categorized as non-tourism-related. In cases of disagreement between the two annotators, a third annotator determined the label after discussion among the annotators. If the panel of three annotators could not agree on the label for a given photo, it was excluded from the training set. This procedure ensures that the final labeling is as accurate as possible and adheres to the defined criteria.
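The adjudication rule described above can be expressed compactly. The sketch below is an illustration under stated assumptions, not the annotation tooling used in the study: labels are represented as strings, and a third-annotator value of `None` is used here to encode the case in which the panel could not agree.

```python
def resolve_label(first, second, third=None):
    """Return the final label for a photo, or None if it should be excluded.
    `third` is the adjudicated decision and is consulted only on disagreement."""
    if first == second:
        return first
    return third  # None means the three annotators could not agree

def build_training_set(annotations):
    """annotations maps photo id -> (first, second, third_or_None)."""
    resolved = {pid: resolve_label(*labels) for pid, labels in annotations.items()}
    return {pid: label for pid, label in resolved.items() if label is not None}

# Hypothetical annotations for three photos.
annotations = {
    "p1": ("tourism", "tourism", None),          # agreement
    "p2": ("tourism", "non_tourism", "tourism"), # resolved by third annotator
    "p3": ("tourism", "non_tourism", None),      # no agreement: excluded
}
training_labels = build_training_set(annotations)
```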

3.3.3 Training a new machine learning classifier

In sub-phase 3, a new machine learning classifier is trained on the training set, which now includes the photos labeled in sub-phase 2.

3.3.4 Evaluation

In each iteration, we evaluated the classifier’s performance (based on accuracy and weighted F1 score) using a validation set (Fig. 5, sub-phase 4). The training phase concluded when there was no improvement in the classifier’s performance over successive iterations or the marginal increase in the accuracy was very low.
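A stopping rule of this kind could be sketched as follows. The function name `should_stop` and the default `min_gain` of one percentage point are assumptions made for illustration; the text above states only that training ended when the improvement became very small.

```python
def should_stop(accuracy_history, min_gain=0.01):
    """Return True when the latest iteration improved accuracy by less than min_gain.

    accuracy_history holds the validation accuracy after each iteration,
    oldest first. min_gain is an assumed cut-off, not a value from the study.
    """
    if len(accuracy_history) < 2:
        return False  # at least two iterations are needed to measure a gain
    gain = accuracy_history[-1] - accuracy_history[-2]
    return gain < min_gain
```

For example, with illustrative accuracies of 0.90, 0.93, and 0.935, the last gain is 0.005, which is below the assumed cut-off, so the rule would end the labeling process.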

3.4 Filtering photos

We used our final machine learning classifier to classify the unlabeled photos in our datasets. Each photo was assigned a confidence score, which indicates the classifier’s degree of certainty in its classification. Then, histograms were constructed to present the distribution of confidence scores for each dataset. This was done to determine a threshold value of confidence score that signifies the point at which the classifier exhibits high confidence, resulting in the minimum number of errors.

The purpose of applying the classifier to all the photos and collecting the photos with a confidence score higher than the threshold value is to obtain the tourism-related and non-tourism-related photo collections for each dataset that the classifier is most confident in.

To select the threshold value, we analyzed random samples from each dataset. Initially, we selected 100 photos with the highest confidence scores, 100 photos with scores above 0.8, and 100 photos with scores above 0.5. For each sample of 100 photos, we conducted a manual review to determine whether there were any errors in the classifier’s predictions. When determining the threshold, we considered the observed trends in the number of errors to ensure that the reduction in the threshold value did not result in a substantial increase in the number of errors. Therefore, the threshold value chosen represents a balanced compromise, maximizing accuracy while maintaining the selection of a sufficient number of photos. After applying the threshold, each filtered dataset should include all photos with a confidence score equal to or higher than the threshold.
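The filtering step and the manual error check can be sketched as follows. This is a simplified illustration under stated assumptions: predictions are represented as (photo ID, label, confidence) tuples, and the function names and label strings are hypothetical.

```python
def filter_by_confidence(predictions, threshold=0.8):
    """Split confident predictions into the two photo collections.

    predictions is an iterable of (photo_id, label, confidence) tuples,
    where label is "tourism" or "non-tourism" (hypothetical label names).
    """
    collections = {"tourism": [], "non-tourism": []}
    for photo_id, label, confidence in predictions:
        if confidence >= threshold:  # equal to or higher than the threshold
            collections[label].append(photo_id)
    return collections


def sample_accuracy(reviewed):
    """Share of manually reviewed predictions that were judged correct.

    reviewed is a list of booleans, one per sampled photo.
    """
    return sum(reviewed) / len(reviewed)


# Illustrative values only.
preds = [("p1", "tourism", 0.95), ("p2", "non-tourism", 0.80), ("p3", "tourism", 0.62)]
kept = filter_by_confidence(preds, threshold=0.8)
```

Repeating `sample_accuracy` on samples drawn at several candidate thresholds gives the error trend that is used to choose the final value.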

4 Data description

To evaluate the proposed VISTA method, we used two datasets: (1) the Instagram Top-10 Israeli Cities dataset, and (2) the InstaCities100K dataset.

4.1 Instagram top-10 Israeli cities dataset

Using the proposed VISTA method, we collected 791,352 photos from Instagram of the 10 Israeli cities with the most Instagram posts. The creation dates of the photos varied: the oldest photo was posted in November 2011, and the most recent photo was posted in February 2023. For each Israeli city, 5% of the photos with the associated hashtag were randomly selected. The number of posts collected for each city is presented in Fig. 6.

Fig. 6 The number of posts collected for each city

To train our machine learning classifier using active learning, a training set was constructed. During the iterations of active learning described in Fig. 5, the 500 photos with the lowest confidence scores were manually labeled and added to the training set at each iteration. In the final iteration, there was a total of 2571 photos in the training set, of which 1962 were labeled as non-tourism-related photos, and 609 photos were labeled as tourism-related photos.

In addition, a validation set was created from the Instagram Top-10 Israeli Cities dataset. The validation set consists of 804 photos that were removed from the training set. The number of photos we used for the validation set for each city is proportional to the number of photos collected for each city. We manually classified 524 and 280 photos as non-tourism-related and tourism-related, respectively.
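One way to allocate validation photos in proportion to city size is sketched below. The rounding scheme (largest remainder) and the function name `proportional_allocation` are assumptions; the text above states only that the allocation was proportional.

```python
def proportional_allocation(photos_per_city, total):
    """Split `total` validation photos across cities in proportion to city size."""
    grand_total = sum(photos_per_city.values())
    exact = {c: total * n / grand_total for c, n in photos_per_city.items()}
    quota = {c: int(x) for c, x in exact.items()}  # round down first
    leftover = total - sum(quota.values())
    # Give the remaining photos to the cities with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for city in by_remainder[:leftover]:
        quota[city] += 1
    return quota


# Placeholder city names and counts, for illustration only.
quota = proportional_allocation({"A": 600, "B": 300, "C": 100}, total=10)
```

The quotas always sum to the requested total, which keeps the size of the validation set fixed regardless of rounding.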

4.2 InstaCities100K dataset

To evaluate the ability of our classifier to generalize to photos of cities worldwide, we used the InstaCities1M dataset published by Gomez et al. (2018). This dataset contains photos from Instagram of the 10 most populous English-speaking cities in the world: London, New York, Sydney, Los Angeles, Chicago, Melbourne, Miami, Toronto, Singapore, and San Francisco. The dataset includes 100 K photos of each city. Since the evaluation of 1 M photos is challenging, we created a smaller dataset of 100 K photos (referred to as InstaCities100K), which includes 10 K randomly selected photos for each city.
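The per-city subsampling could be done along the following lines. This is a sketch only: the function name `subsample_per_city`, the dictionary layout, and the fixed seed are assumptions introduced for reproducibility of the illustration.

```python
import random


def subsample_per_city(photos_by_city, per_city, seed=0):
    """Draw `per_city` photos uniformly at random, without replacement, from each city.

    photos_by_city maps a city name to a list of photo IDs.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return {
        city: rng.sample(photos, per_city)
        for city, photos in photos_by_city.items()
    }


# Toy data; in the setting described above, per_city would be 10,000.
toy = {"x": list(range(100)), "y": list(range(100, 200))}
subset = subsample_per_city(toy, per_city=10)
```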

A test set was extracted from this dataset as follows: For each city in the dataset, 50 photos were randomly selected and manually labeled. Of the 500 photos in the test set, 399 photos were classified as non-tourism-related photos, and 101 photos were classified as tourism-related photos.

5 Results and discussion

5.1 Training machine learning classifier using active learning

At each active learning iteration, a new classifier was trained using the VGG-19 architecture. The sub-phases during active learning were described in Sect. 3.3. Each iteration included another city from the Instagram Top-10 Israeli Cities dataset.

To create an initial training set, iteration 0 involved downloading 500 random photos of the city of Eilat (a coastal location) and manually labeling them. Of these, 241 photos were categorized manually as non-tourism-related photos, and 177 photos were labeled as tourism-related photos. We discarded 82 photos for which we were unable to determine the type. In this iteration, we trained our first machine learning classifier on the 418 manually labeled photos.

In the first iteration (iteration 1), the trained machine learning classifier from iteration 0 was used to classify all of the photos of Eilat from our dataset. The 500 photos with the lowest confidence score were selected and manually labeled by the annotators who categorized 258 photos as non-tourism-related photos and 135 photos as tourism-related photos; for the remaining photos, the annotators could not agree on the type. Then, the photos manually labeled in this iteration (393 photos) were added to the training set from iteration 0 (which consisted of 418 photos) to form the training set for iteration 1 (consisting of 811 photos), and a new machine learning classifier was trained. After completing iteration 1, which only included photos of Eilat, we selected another city with different geographical characteristics. The city of Eilat is located on the eastern shore of the Red Sea. The landscape is characterized by rocky hills, canyons, and sand dunes, comprising an arid desert setting. Therefore, we chose photos of Nazareth for the next iteration. The city of Nazareth is located in the north of Israel in the Lower Galilee region, and it is known for its cultural and historical importance. Nazareth is characterized by its hilly landscape.

Photos of Netanya, Haifa, and Tel Aviv were used in iterations 3–5 (see Table 3). The same iterative active learning process was performed with photos of a different city in each iteration (as was described in Fig. 5). In each iteration, the trained classifier from the previous iteration was used to classify photos of the new city. Then, the photos with the lowest confidence scores were manually labeled by the annotators and added to the training set, which was used to train the classifier in this iteration.

In each iteration, we evaluated the classifier’s performance based on accuracy and a weighted F1 score using a validation set. Initially, the machine learning classifier accuracy increased by 1–3% in each iteration, but in the fifth iteration, the marginal increase in the accuracy dipped to 0.5%. Therefore, we ended the manual labeling process after five iterations (see Fig. 7).

Table 3 Training process

5.1.1 Performance evaluation

Instagram top-10 Israeli cities dataset

As described in Sect. 3.3, at the end of each iteration we evaluated the performance of the machine learning classifier on the validation set generated from the Instagram Top-10 Israeli Cities dataset (see Fig. 7). For additional information regarding the validation set’s characteristics, see Sect. 4.1.

Fig. 7 Evaluation of the machine learning classifier’s performance on the Instagram Top-10 Israeli Cities dataset. The graph presents the accuracy score obtained in each of the five iterations performed

As can be seen, in iteration 0 the machine learning classifier obtained an accuracy score of 0.9. With the active learning process (iterations 1–3), the classifier’s performance improved, and an accuracy score of 0.96 was obtained at the end of iteration 3. In iterations 4–5, the marginal increase in the accuracy dipped to 0.5%, so we stopped the training process. In the last iteration, the classifier obtained a weighted F1 score of 0.964. Based on the results of this experiment, we conclude that active learning has the potential to significantly reduce the number of training iterations required to produce a high-performing classifier, since only five iterations were needed for our classifier to achieve satisfactory performance. Other studies drew similar conclusions about active learning (Li et al. 2014; Yao et al. 2021; Gala et al. 2014; Li et al. 2022; Jain and Kapoor 2009; Kim et al. 2020).

InstaCities100K dataset

To assess the machine learning classifier’s generalization capabilities and its ability to perform well on photos of cities outside Israel, we evaluated its performance on the InstaCities100K test dataset (see Sect. 4.2). The InstaCities100K dataset includes 10 K photos for each of 10 cities worldwide, none of which are in Israel.

The machine learning classifier obtained an accuracy of 0.958 and a weighted F1 score of 0.959 on the InstaCities100K test dataset. These results demonstrate the ability of the machine learning classifier, which was originally trained on the Instagram Top-10 Israeli Cities dataset, to perform well on a dataset that contains photos of other locations worldwide. This impressive performance also highlights the classifier’s generalization capabilities.
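For reference, the two reported metrics can be computed as in the sketch below. We assume the usual definition of the weighted F1 score, in which the F1 score of each class is weighted by the number of true instances of that class; the function names and the short label strings are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of photos whose predicted label matches the manual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true photos in this class
        total += f1 * support
    return total / len(y_true)


# Toy labels for illustration: "t" = tourism-related, "n" = non-tourism-related.
y_true = ["t", "t", "n", "n"]
y_pred = ["t", "n", "n", "n"]
```

Because the test set is imbalanced (399 non-tourism-related versus 101 tourism-related photos), the weighted F1 score is dominated by the majority class, which is one reason to report it alongside accuracy rather than instead of it.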

5.2 Filtering photos

As described in Sect. 3.4, the machine learning classifier was applied to the unlabeled photos of the Instagram Top-10 Israeli Cities dataset and the InstaCities100K dataset, generating a confidence score for each photo. This was done in order to determine, for each dataset, a threshold that is used to select photos for which the classifier has high confidence in its predictions. For each dataset, histograms were constructed to present the distribution of confidence scores. The histograms helped us identify the threshold value above which photos should be selected. We sampled 300 random photos from each dataset at different thresholds. For each sample, we conducted a manual review to identify errors in the classifier’s predictions and computed an accuracy score.

5.2.1 The filtered Instagram top-10 Israeli cities dataset

Figure 8 presents a histogram of the distribution of the confidence scores, in which most photos’ predictions had high confidence scores (most were over 0.95). Based on this histogram, we selected three samples of photos. As can be seen in Fig. 8, no photos had a prediction score below 0.5; we therefore selected 100 photos with confidence scores above 0.5, 100 photos with confidence scores above 0.8, and 100 photos with the highest confidence scores. The manual review indicated that selecting photos with confidence scores above 0.8 yielded a low number of errors while retaining a substantial quantity of photos. To validate our results, we randomly selected two samples of 100 photos with confidence scores above 0.8, and an average accuracy score of 0.945 was obtained in our manual review. Therefore, we only selected photos for which the classifier was most confident, with a confidence score above 0.8. This resulted in a photo collection of 134,448 tourism-related photos and a collection of 559,561 non-tourism-related photos. We will refer to this dataset as the filtered Instagram Top-10 Israeli Cities dataset.

Fig. 8 Histogram of photos’ confidence scores

5.2.2 The filtered InstaCities100K dataset

The machine learning classifier, which was originally trained on the Instagram Top-10 Israeli Cities dataset, was also applied to classify the InstaCities100K dataset. We filtered out irrelevant photos by selecting the photos whose confidence score, according to the trained classifier, was above 0.8. Of these, 19,632 photos were classified as tourism-related, and 79,255 photos were classified as non-tourism-related. To manually validate these results, we randomly selected two samples of 100 photos with confidence scores above 0.8, and an average accuracy of 0.97 was obtained. The high accuracy scores achieved on the two random validation samples demonstrate that our proposed VISTA method effectively differentiates tourism-related photos from non-tourism-related photos, which are not beneficial to tourists. We will refer to this dataset as the filtered InstaCities100K dataset.

6 Application for the filtered Instagram top-10 Israeli cities dataset and the filtered InstaCities100K dataset

In this section, we demonstrate the use of the VISTA method to analyze photo collections associated with tourism destinations from Instagram, by performing a comparative analysis between tourism-related and non-tourism-related photo collections in terms of photo proportion, user engagement, and object comparison.

Applying the machine learning classifier and selecting only photos with a confidence score above 0.8 resulted in two filtered datasets: the filtered Instagram Top-10 Israeli Cities dataset and the filtered InstaCities100K dataset. Each filtered dataset consists of a tourism-related collection and a non-tourism-related collection.

6.1 Photo proportion analysis

Figures 9 and 10 present the proportion of tourism-related and non-tourism-related photos in the filtered Instagram Top-10 Israeli Cities and the filtered InstaCities100K datasets, respectively. We can see that the proportion of tourism-related photos is noticeably smaller than that of non-tourism-related photos across all cities, with an average of 84.17% of all photos unrelated to tourism.

Based on this, we can conclude that most photos associated with cities that are published on Instagram are actually irrelevant to tourists. This finding highlights the important role that VISTA can fulfill in helping tourists find useful information about tourist destinations on social media platforms like Instagram and discover points of interest that cannot be found on official tourism websites. VISTA’s ability to filter out irrelevant photos makes it an effective solution for tourists seeking to find relevant and authentic tourism-related photos among the high volume of irrelevant photos on Instagram. VISTA enables travelers to easily gain access to the wealth of valuable travel information available on social media platforms.

Our study’s results are aligned with other related studies. For instance, Dadgar and Neshat (2022) demonstrated that, in numerous cases, the hashtags employed on social networks do not accurately reflect or align with the contents of the posts they accompany. Among 6,494 posts related to brands, 39.57% were annotated as matches and 60.43% as mismatches. Among celebrity-related posts, only 38.36% of the samples were matched, while 61.63% were mismatched. Also, Zasina (2018) indicated that some geotags associated with Instagram photos do not accurately reflect the actual location in which the photos were taken, thus rendering the data useless for city-scale analysis.

Fig. 9 The proportion of photos classified as tourism-related and non-tourism-related for each city in the filtered Instagram Top-10 Israeli Cities dataset

Fig. 10 The proportion of photos classified as tourism-related and non-tourism-related for each city in the filtered InstaCities100K dataset

6.2 User engagement analysis

The information contained in the Instagram Top-10 Israeli Cities dataset included the number of likes each photo received, allowing us to examine user engagement with the photos. In our comparative analysis, we analyzed user engagement with tourism-related and non-tourism-related photos in the filtered Instagram Top-10 Israeli Cities dataset by comparing the average number of likes received by each type of photo. We expected that tourism-related photos would attract more attention than other photos given the visual appeal and attractiveness of such photos. Figure 11 presents the average number of user likes obtained for these two types of photos. As can be seen, the tourism-related photos received more likes than the non-tourism-related photos, as expected. The difference between the number of likes for each type of photo is statistically significant (Mann–Whitney U test with \(p <0.001\)).
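To make the test concrete, the following sketch computes the Mann–Whitney U statistic and an approximate two-sided p-value for two lists of like counts. It is a simplified illustration, not the code used in this study: it relies on the normal approximation with a tie correction and without a continuity correction, and the function name `mann_whitney_u` is hypothetical. In practice, a library routine such as SciPy’s `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U statistic for x, approximate p-value). Tied values receive
    average ranks; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1  # pooled[i:j] is a run of tied values
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p_value
```

The test compares the rank distributions of the two groups rather than their means, which makes it suitable for like counts, whose distribution is typically heavily skewed.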

To our knowledge, we are the first to compare the engagement between tourism-related and non-tourism-related photos. Existing work shows that user-generated content, particularly in the tourism sector, can enhance follower engagement and involvement on social media platforms (Santos et al. 2023). In addition, posts intended as direct advertising in tourism generate a lower level of engagement in terms of likes and comments than posts that do not advertise (Chugh et al. 2019; Bonilla-Quijada et al. 2021).

Fig. 11 The average number of likes for tourism-related and non-tourism-related photos

6.3 Object comparison using object detection

To explain the machine learning classifier’s success in differentiating between tourism-related and non-tourism-related photos, we applied an object detection model to the photos on the filtered Instagram Top-10 Israeli Cities dataset. A comparison of the objects detected in tourism-related photos and non-tourism-related photos can shed light on the machine learning classifier’s effectiveness and its ability to detect tourism-related photos.

To detect objects in the photo collections, we used the InceptionV3 model suggested by Szegedy et al. (2017). Inception, also known as GoogLeNet, is a deep convolutional neural network (CNN) architecture that Google developed. InceptionV3 was primarily designed for image classification and was trained on the ImageNet dataset,Footnote 4 which consists of 1000 categories of objects. Thus, this model can categorize objects into 1000 predefined categories. We applied the InceptionV3 model to the filtered Instagram Top-10 Israeli Cities dataset.

The InceptionV3 model assigns a confidence score to each photo for each object category; the confidence score represents the model’s confidence that an object appears in the photo. A higher confidence score indicates greater confidence that an object appears in the photo. Photos with a confidence score above 0.2 were included in our analysis. This allowed us to identify the main objects seen in photos of a given city, enhancing our understanding of the city’s spirit. For instance, the objects in photos taken of cities located on the coastal plain, such as Tel Aviv and Netanya, are mainly connected to the sea, whereas the objects in photos taken of Jerusalem and Nazareth are mainly monasteries, churches, castles, and domes. Consequently, our method can not only differentiate between tourism-related and non-tourism-related photos, but can also identify the main objects found in the photos, thereby providing tourists with more accurate information about tourist destinations.
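The thresholding and counting step could be sketched as follows. The sketch assumes that the model’s output for each photo is available as a mapping from object category to confidence score; the function name `count_objects` and the example categories and scores are illustrative only.

```python
from collections import Counter


def count_objects(photo_predictions, min_confidence=0.2):
    """Count how many photos contain each object category.

    photo_predictions is a list with one dict per photo, mapping an
    object category to the model's confidence score for that photo.
    """
    counts = Counter()
    for scores in photo_predictions:
        for category, confidence in scores.items():
            if confidence > min_confidence:  # keep only confident detections
                counts[category] += 1
    return counts


# Made-up scores for three photos.
example = [
    {"seashore": 0.7, "jean": 0.1},
    {"seashore": 0.3, "church": 0.25},
    {"jean": 0.9},
]
object_counts = count_objects(example)
```

Calling `object_counts.most_common(10)` on the result would then give the ten most frequent objects in a collection.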

A manual validation process was conducted to ensure that the model we used for object detection was classifying objects accurately. As part of the validation process, 50 photos were randomly sampled from the filtered Instagram Top-10 Israeli Cities dataset, and a human annotator identified the object in each photo; then, the object identified by the annotator was compared to the model’s prediction in order to validate the model. The object detection model was found to be 80% accurate. Sometimes the model misidentified the object in the photos and incorrectly considered it a related object. For example, in one case, the object in the photo was a dolphin, but the model identified it as a whale. This level of accuracy was sufficient for our purposes. We then manually grouped the objects identified into broader categories of objects. For the tourism-related photos, the objects were grouped into waterfront, religious, fortification, and geographical categories. The categories of clothing, stationery, entertainment, and technology were used to group objects in the non-tourism-related photos.

Figure 12 presents the top-10 objects, grouped by category, detected in the tourism-related and non-tourism-related photo collections. In this figure, it can be seen that there is no overlap between the top-10 objects in the two photo collections, and, as we would expect, the tourism-related photos contain a variety of items related to tourism, whereas the non-tourism-related photos are primarily composed of advertised items, such as clothing and accessories. We then delved deeper and examined the overlap between the top-100 objects identified in each photo collection. We found that 34 objects were detected in both groups, including person, jeans, umbrella, and sunglasses. Our findings that the objects detected in the two photo collections differ and are largely unrelated can be attributed to the success of our machine learning classifier. The features and visual elements contributing to a photo’s designation as a tourism-related photo or non-tourism-related photo are sufficiently distinct to enable our classifier to make accurate predictions.
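The overlap between the most frequent objects of the two collections can be computed with a simple set intersection. In the sketch below, the function name `top_k_overlap` and the example counts are hypothetical, and ties at the cut-off rank are broken arbitrarily.

```python
def top_k_overlap(counts_a, counts_b, k=100):
    """Return the object categories that appear in the top k of both collections.

    counts_a and counts_b map an object category to the number of photos
    in which it was detected.
    """
    def top_k(counts):
        ranked = sorted(counts, key=counts.get, reverse=True)
        return set(ranked[:k])

    return top_k(counts_a) & top_k(counts_b)


# Made-up counts for illustration.
tourism = {"seashore": 50, "church": 30, "person": 20, "dome": 5}
non_tourism = {"jean": 60, "person": 40, "sunglasses": 10, "church": 1}
shared = top_k_overlap(tourism, non_tourism, k=3)  # {"person"}
```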

Based on our results, the clothing category is most prevalent in non-tourism-related photos. This finding aligns with the fact that fashion brands utilize Instagram to promote products and brands, to communicate directly with consumers, to promote events and initiatives, and to build brand awareness (Tallent 2016; Ha et al. 2017; Yanuar et al. 2021; Castillo-Abdul et al. 2022).

Another study reinforces our findings (Acuti et al. 2018). Acuti et al. analyzed Instagram posts with the hashtags #London and #Florence, which revealed a significant amount of fashion-related content. One out of ten photos featured clothes, accessories, or shopping centers. They suggested that fashion accounts strategically use city hashtags to increase their visibility and gain more followers.

Fig. 12 Top-10 objects, grouped by category, detected in the tourism-related and non-tourism-related photo collections from the filtered Instagram Top-10 Israeli Cities dataset

7 Conclusion and future work

With the vast amount of content shared on social media platforms, finding relevant and meaningful information about tourism destinations can be overwhelming for tourists and tourism agents. This study effectively addresses the challenge of users extracting genuine insights and experiences related to their desired travel destination using social media. Here, we proposed VISTA: Visual Identification of Significant Travel Attractions, a novel method that combines deep and active learning techniques to differentiate between relevant and irrelevant tourism photos. Harnessing the power of these machine learning methodologies, VISTA streamlines the process of extracting valuable insights from the vast quantity of user-generated visual content related to travel experiences and destinations.

As part of this study, we created the Instagram Top-10 Israeli Cities dataset, a collection of 791,352 real-world photos mined from Instagram of the 10 most popular Israeli cities on Instagram. We evaluated the proposed machine learning classifier’s performance on a validation set consisting of 804 photos. The classifier obtained an accuracy score of 0.965 and a weighted F1 score of 0.964. Moreover, due to the active learning process inherent in the proposed method, we obtained an effective machine learning classification model based on a relatively small number of training iterations.

As part of our evaluation of the robustness and generalization of the machine learning classification model, which was trained exclusively on Israeli locations, we evaluated the classifier’s performance on 500 photos from 10 cities worldwide. The accuracy score was 0.958, and the weighted F1 score was 0.959, demonstrating its robustness and capability to capture global tourism attractions based on a small set of local examples.

Finally, we demonstrated VISTA’s effectiveness by performing a comparative analysis of the tourism-related and non-tourism-related photo collections with respect to photo proportion, user engagement, and object comparison. Based on the analysis, we concluded that most Instagram photos associated with cities are actually irrelevant to tourists.

Applying the classifier to the obtained datasets shows that most of the Instagram content associated with city destinations is irrelevant to tourists, including advertisements, promotional material, and other unrelated information.

Our analysis also shows that users engage more with tourism-related photos than non-tourism-related photos in terms of the average number of likes received. A low overlap between the objects detected in the two photo collections can explain the classifier’s high performance in distinguishing tourism-related photos from non-tourism-related photos by capturing tourism features and visual elements.

The automatic identification of tourism-related photos in social media by VISTA opens the door to many potential applications that can improve tourism in several respects. For example, by filtering tourism-related photos, a better understanding of tourist behavior and preferences can be achieved, primarily by analyzing the comments and likes of users. Based on the analyzed user behavior, targeted recommendations, effectively tailored marketing strategies, and user personalization can be proposed. Grouping social media users based on their published tourism-related photos can assist in improving tourism management based on their common needs, interests, and preferences. Moreover, by analyzing engagement metrics, such as likes, comments, and views on tourism photos from different destinations, tourism organizations can gauge relative popularity. They can also identify hotspots that generate buzz versus those that are underwhelming. Furthermore, monitoring trends and using the geolocation of tourism photography allows the tourism industry to track visitor demand, avoid overcrowding, and establish environmentally sustainable visitor limits. They can also identify undiscovered or underutilized attractions that could be promoted to redistribute tourist traffic.

In addition, applying sentiment and emotion detection algorithms to the comments of tourism-related photos can extract the emotions expressed by users with respect to the given tourism destination. This will allow tourism organizations to detect destinations with a negative reputation and determine where campaigns are needed to improve their reputation.

Moreover, destination perception can be assessed by analyzing tourism-related photos and identifying their sources, specifically distinguishing between photos taken by residents and tourists. Analyzing the distribution of these photos can indicate whether a destination is perceived more locally or globally.

While our data collection process was comprehensive, particularly because we incorporated diverse data sources and a wide range of content, its reliance on post collection based on hashtags introduces certain limitations. Specifically, our VISTA method may overlook relevant tourism-related photos when users fail to include location-specific hashtags in their posts. However, Instagram users can employ our trained classifier to filter photos, enhancing the relevance of the dataset.

In addition, the criteria we defined for detecting tourism-related photos were based on our subjective observations. It is crucial to note that as new trends in photo sharing emerge, there is a possibility that our classifier may not adequately capture the evolving patterns, potentially leading to misclassification. There may be a need for continuous monitoring and adaptation of the classifier in response to emerging trends, and this may be the subject of future research. Also, developing a novel multimodal approach combining visual and textual analysis can be very promising in uncovering rich contextual details and sentiments related to various tourism destinations and activities.