1 Introduction

Understanding a destination's “visual destination image” (the destination image as projected by visual media such as photography and video, a term probably first coined by Hunter 2012) is becoming ever more important to destination managers and marketers, as travellers' destination images are increasingly formed from online visual media (Kim et al. 2014). The shift in consumer behaviour towards exploring destinations online via visual platforms such as Instagram and TikTok, rather than visiting travel websites or browsing printed brochures, means that destinations must include those platforms' visual content in their marketing intelligence efforts, which have traditionally relied on text mining rather than photos or videos. Such intelligence should measure how a destination is represented online through the DMO's own marketing and branding activities and compare this with how it is represented by the public through analysis of their content. A significant body of e-tourism research has established the benefit for tourism stakeholders of measuring their destination's image and using it in their branding and marketing activities (see Sect. 2). The concept of destination image has been examined in the literature for several decades (Echtner and Ritchie 1991), with researchers interested both in how to define and quantify it and in how it affects consumers' intention to recommend or visit a destination, the latter being a motivation for its definition and measurement. Initial work on destination image measurement relied on text analysis, e.g. of online visitor reviews. However, the growing importance of publicly accessible digital photography in the formation of consumers' destination image has led to a shift in approaches to include the measurement of the visual component.

An accurate approach for the visual classification of tourist photography (labelling each photo with one or more concepts from a concept list) in order to measure the visual destination image has important implications for destination managers and marketers. The strategic management of communication on social media is a tool for influencing traveller decision making and trip planning (Huerta-Álvarez et al. 2020). Photos are a viable source from which to extract brand image on social media (Liu et al. 2020). Therefore, visual destination image forms part of destination branding efforts (Blain et al. 2005). Accurate determination of the destination image as projected by platforms like Instagram and TikTok will allow destination managers and marketers to gain new insights into how visitors project their destinations through photos and videos on those platforms, incorporating an important visitor demographic into their marketing intelligence. The image projected by travellers can be compared with the intended destination branding as a measure of the destination's marketing success (Govers et al. 2007). However, differences may be even more beneficial than similarities: if a feature of the destination which is important to its branding is lacking in the measured perceived image, this indicates that the feature needs to be emphasized more in marketing efforts (Mak 2017). Conversely, if a feature of the destination is strongly present in the perceived image yet is not part of the intended brand, managers should consider whether this feature's importance to visitors should be reflected in the branding efforts. The result will be better decisions and more visitors attracted to the destination. We will demonstrate the value of our new approach by comparing the visual destination image of different destinations through user-posted photography on Instagram and discussing the implications for the marketing of those destinations. This contributes to the ongoing research efforts in measuring visual destination image and its benefits for destination management and marketing.

E-tourism research initially relied on the manual annotation of photos (Stepchenkova and Zhan 2013), and the adoption of Artificial Intelligence (AI) and machine learning techniques is still in its infancy (Picazo and Moreno-Gil 2017). In recent years, advances in computer vision (the study of developing software that can interpret visual content, largely seeking to replicate human vision capability) through deep learning models (a subset of AI which uses complex neural network architectures with many layers to solve advanced problems; Voulodimos et al. 2018) have led e-tourism researchers to use such models to automatically classify large collections of tourist photography. However, the research reported to date has relied on results derived from models which have been trained to detect a broad, generic set of concepts on broad, general photo datasets (e.g. 1000 classes in ImageNet) and whose accuracy in visual destination image measurement has not been demonstrated (Nixon 2018). In this paper, we set out to test whether we could achieve a more accurate and appropriate classification of tourist photography for the purpose of measuring visual destination image. In our case, the deep learning model is additionally fine-tuned on tourism photography prior to its use for classification. We train the model on a set of labels which are directly aligned with cognitive attributes identified in past research as appropriately capturing the visual component of destination image.

The contributions of this paper are: a common set of cognitive attributes on which to train deep learning models for destination image measurement; a ground truth dataset of tourism photographs labelled according to those attributes for benchmarking models; a fine-tuned visual classifier which is available online and can be re-used by researchers and practitioners alike; and a proposal for representing the visual destination image acquired from our model so that destinations may be evaluated and compared for use in destination management and marketing. In Sect. 2, we look at the state of the art in destination image research and its measurement from photography (visual destination image). In Sect. 3, we introduce our methodology: defining the set of cognitive attributes, creating a training dataset, training and validating different deep learning models, and evaluating on the basis of a ground truth dataset. In Sect. 4, we describe how we benchmark the accuracy of visual destination image measurement, comparing our model with an approach equivalent to that in previous literature. In Sect. 5, we report initial results from using our model to measure visual destination images and compare various destinations' representation through UGC on Instagram. In Sect. 6, we discuss the implications of these results for destination management and marketing. We conclude in Sect. 7 with our contributions to the research field, limitations and opportunities for future research.

2 Theoretical background: visual destination image

2.1 Destination image

Destination image conceptualises the mental construct (Reynolds 1965) that is formed from the “beliefs, ideas and impressions that a person has of a destination” (Crompton 1979: 18), and which captures and retains potential customers' attention (Kavaratzis and Ashworth 2005). Insofar as that construct matches the expectations and wishes of the consumer, it can represent a basis for influencing travellers' intent to visit (Molinillo et al. 2018) or to revisit (Loi et al. 2017) the destination. Appropriate destination image management can increase the destination's attractiveness to potential tourists (Önder and Marchiori 2017) and as a result boost the economic profitability of a destination (Almeida-García et al. 2020). Through the Web and social networks, travellers now co-create destination image (Abbate et al. 2019), meaning that the experiences, feelings, contents, reviews etc. shared by visitors and travellers online contribute to a destination's image (Agrawal et al. 2015). Online UGC (user generated content) is regarded as an ideal source of marketing knowledge (Marine-Roig and Clavé 2016) as it reveals the travellers' own image of the destination. Xiang et al. (2015) found that social media is having an enormous impact on travel planning. Travellers are increasingly using social networks as a source of information about destinations (Xiang and Gretzel 2010; Leung et al. 2013). Several studies concluded that the use of social media is most extensive in the pre-travel stage of inspiration and planning (Almeida-García et al. 2020; Xiang et al. 2015; Öz 2015). In this stage, travellers turn to social media to find information about destinations (Almeida-García et al. 2020; Fotis et al. 2012; Öz 2015) and in trip decision making (Zeng and Gerritsen 2014). It has therefore become of the utmost importance for tourism managers to include social media in their data analysis for marketing knowledge about the travellers' destination image.

Destination image research has addressed two key questions: (i) how to model the destination image and (ii) how to measure it according to the chosen model. It is considered to be a complex, multi-faceted and composite concept (Stepchenkova and Morrison 2006) where perceptions of various attributes within a destination interact to form a composite or overall image (Gartner 1986). Destination image models rely on a finite set of attributes considered common to people's mental constructs of the destination (Tasci et al. 2007). The most researched and popular factors contributing to the interpretation of a destination are the cognitive and affective images (Beerli and Martín 2004), where “there is general agreement that the cognitive component is an antecedent of the affective component”. Baloglu and McCleary (1999) also found strong support for the hypothesis that “…cognitive evaluations significantly influence affective evaluations of a destination”. The cognitive attributes are the physical aspects of a destination to which a customer is actively exposed while searching for travel related information, and may be split along a ‘functional’ (having an external tangible representation, such as landscape) and ‘psychological’ (having an internal intangible representation, such as safety) axis (Gallarza et al. 2002). We explicitly consider the visual destination image to be the equivalent of the functional cognitive aspect of the broader destination image, which covers those attributes that have a visually recognizable representation.

The research methods for determining and measuring the different cognitive attributes of destination image vary widely (Stepchenkova and Mills 2010). The shared goal of these efforts has been to enable a destination image model to be formed that can capture all the attributes of the destination that could occur in the mind of a surveyed traveller. The authors have compared the attributes determined by several key research works, identified through a Google Scholar search for publications on “destination image” and ranked by number of citations among those papers which present a list of attributes for destination image. Each of these works justifies its attribute list as appropriately comprehensive for modelling the cognitive component of destination image by the authors' use of surveys and/or expert elicitation:

  • Echtner and Ritchie (1993) write that “A series of open-ended questions and scale items are developed and are shown to successfully capture all of the components of destination image”. (3012 citations).

  • Baloglu and McCleary (1999) developed their cognitive items from a literature review and a content analysis of four destinations’ guidebooks and validated the list with a sample of 60 students. (5854 citations).

  • Beerli and Martín (2004) refer to the lack of homogeneity in the attributes defined for the cognitive component of destination image and acknowledge the previous two papers as those which “effectively determined the reliability of the scales used”. The authors chose to develop a comprehensive set of attributes aggregating “all factors influencing the image assessments made by individuals” from all past lists and classifying them into nine dimensions. This set was validated by eight experts from the tourism industry. (4087 citations).

  • Stepchenkova and Zhan (2013) is notable as the only work in which the set of cognitive attributes was considered in the context of content analysis of photos. The attributes had to be restricted to visual characteristics of the destination in this case, so that annotators could agree on their presence in destination photos. The authors themselves examined around 100 photos each to develop the attribute list, acknowledging Echtner and Ritchie (1993) as well as Albers and James (1988) for the theoretical grounding, and finalised their list together through “a few iterations of comparative coding” (554 citations, the highest found for a paper which focuses on the visual destination image).

Table 1 compares their lists of attributes, each of which is claimed by the authors to be the most appropriate for their measurement of the cognitive component of destination image in the scope of their respective research. Since we want to consider subsequently the use of these attributes in the task of visual classification (which requires classes which are visually distinct), we repeat here only the functional cognitive attributes from the authors’ lists. Synonymous or suitably similar attributes are aligned onto the same row. It can be observed that indeed there is significant agreement across a majority of attributes in each list, as the alignment leads to 19 rows for all the attributes. In Sect. 3, we explain how our own list of 18 visual classes is derived from this table and hence can be considered suitably exclusive and exhaustive for modelling visual destination image through visual classification.

Table 1 Comparison of cognitive attributes of destination image in the literature

2.2 Visual component of destination image

Photography has become very significant in the formation of destination image (Feighey 2003). In the past decade, the interpretation of the content of tourist photography has emerged as an important area of destination image research. Within social media, the widespread sharing of travel photography has enhanced the effects of the visual component on destination image formation (Frías et al. 2012), as visual content has a more significant impact on people's memories of a destination than other forms such as text or audio (Kim et al. 2014). Travellers enable others to experience their visual memories of the destination by sharing their photos online (Tussyadiah 2010). A set of photographs can be considered the materialisation of a tourist's destination image (Pan et al. 2014). Photography serves as a pictorial representation of the cognitive attributes of the destination image in the mind of the photographer, and thus provides a means to construct the destination image (Hunter 2016) from the aggregation of photos taken at that destination (Stepchenkova and Li 2012). Tourist photography projects organic destination image which affects consumer perception of the destination (Kim and Stepchenkova 2015).

The measurement of the cognitive component of destination image from photography requires the specification of a set of visual classes which can be identified and disambiguated in a consistent manner by the human or computer annotator. The first works reported in the e-tourism literature manually annotated photos according to an attribute set determined by the authors (Tussyadiah 2010; Stepchenkova and Zhan 2013, 2016). Cross-annotator agreement was an important test for the appropriateness of the chosen attributes. These works also established a common approach to the measurement of destination image from visual media through the act of annotation, i.e. the representation of the non-textual content through a set of textual labels drawn from a controlled vocabulary, applied to each photo in the dataset, and the subsequent aggregation of the labels into a “destination image” through their frequency, co-occurrence, clustering etc. (Stepchenkova and Li 2014).

2.3 Computer vision and destination image

In recent years, the first research papers on the measurement of destination image from photography have emerged which adopt deep learning-based approaches. Different neural network architectures—software implementations of a process which mimics the way human brains operate in order to learn—have been developed, with Convolutional Neural Networks (CNNs)—neural networks whose layers can learn the spatial hierarchies of features from image data—proving to be the most effective for computer vision tasks. Various CNNs are available pre-trained with very large visual content datasets which have been labelled according to a broad and generic set of visual classes. The re-use of those neural network models enables the automatic annotation of digital photography at scale according to the visual classes the model was trained for, e.g. the 1000 categories of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The main publications in the area of e-tourism and computer vision for destination image are summarised in Table 2.

Table 2 Comparison of recent papers using deep learning based visual classification in the measurement of destination image

The deep learning models that have been used are CNNs pre-trained on datasets such as ImageNet-1k (1000 classes, published 2012), Places365 (365 categories, published 2017) and the SUN Attribute Database (102 attributes, published 2011). These all contain classes of a broad, general nature for the purpose of accurately identifying objects in photos or classifying them into “scenes”, whereas prior research annotating photos for destination image measurement using manual analysis restricted the set of cognitive attributes of destination image to between 7 and 20 (and assumed this set to be sufficient for describing the destination image; Stepchenkova and Zhan 2013). As a result, most papers undertook some form of post-processing step after the initial annotation with the deep learning model, in which the large, diverse set of output labels was reduced by clustering techniques into a smaller number of distinct categories. Only Sertkan et al. (2020a) performed fine-tuning as in this paper, using a small data set of 300 photos per category to train CNNs for seven categories termed the Seven Factor Model. They tested the resulting destination image measurements for profiling traveller preferences in Sertkan et al. (2020b). Different from our work, they focus on measuring the “personality” of the destination in order to profile travellers and make recommendations, while we align our destination image model to the cognitive attributes that form part of a traveller's mental image of a destination, based on the prior e-tourism studies, to support destination marketing and branding. It is notable that even the most recently published papers continue to use CNN-based computer vision models even though Transformer architectures are now the state of the art for computer vision tasks. Some papers have used the Google Cloud Vision API, which likely uses Google's Vision Transformer (ViT) architecture and reports better accuracy figures on the ImageNet benchmark, but which still returns labels from a broad, generic vocabulary not fine-tuned to the task of destination image measurement. In each of the papers, the set of input photos has been different, the selected deep learning model has been different, the set of labels in the output annotation has been different and the approach to cluster or otherwise reduce the resulting large set of labels into a smaller number of categories has varied. This makes it impossible to reproduce or compare these approaches, establish a benchmark for the computer vision approaches being used to measure destination image, or derive meaningful conclusions regarding the accuracy of the presented results. Unlike the general field of computer vision, where the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has become the de facto standard for evaluating and comparing computer vision models (the list of most accurate models is updated regularly online; at the time of writing, all top performing models use the Transformer architecture), no such benchmark exists to date for the measurement of visual destination image by deep learning models.

As will be shown next (Sect. 3), we have chosen to focus on fine-tuning a deep learning model for the task of measuring cognitive attributes of destination image (using a technique known as Transfer Learning). This requires the creation of a specific training dataset. In Sect. 4 we demonstrate greater accuracy with this approach compared to past work, aligned with the relevant cognitive attributes of destination image that we want to measure.

3 Methodology

Picazo and Moreno-Gil (2017) reflected that “new automated techniques based on intelligent software to analyse images seem needed. However, a conceptual classification of those images should be developed, prior to introducing intelligent algorithms to scan those images” (p. 13, emphasis by the authors). The accuracy of the pre-trained models when used for destination image measurement may also be questioned, as most works did not evaluate their chosen model against labelled photography. Nixon (2018) evaluated two classifiers with manually classified Instagram photographs of Vienna and both returned F1 measures (the harmonic mean of precision and recall) of 0.54 compared against the ground truth (expert annotation). Kim et al. (2020) acknowledged their use of a model pre-trained on generic data as a limitation, evaluating the InceptionV3 CNN model they had used with 38,691 manually labelled photos and finding an accuracy of just 28%. They concluded “Because 1000 categories in ImageNet are not intended for tourism analysis, …revised categories need to be created for tourism purposes. Secondly, it is necessary to create a new training data set and retrain…” (pp. 249, 251, emphasis by the authors). We take this approach and test whether a deep learning model that has been trained specifically with tourism photography labelled with predetermined cognitive attributes of destination image gives more appropriate and accurate results. The steps taken are:

  1. Define the set of cognitive attributes of visual destination image for training.

  2. Create a data set for training the deep learning model.

  3. Train the model, using train/test splits to validate the result.

  4. Repeat the training for different models and hyperparameters, until the best performing model is found.

  5. Validate this model using a ground truth dataset (not previously seen labelled data).

Pullman and Robson (2007) argue that attributes for content analysis should be (i) exhaustive, (ii) exclusive and (iii) analytically relevant. We have compared the differing attribute sets from past work which have been established as authoritative in the domain (Table 1) in order to establish our set of cognitive attributes—which will become the classes (or labels) to be supported by our visual classifier. Whereas “exhaustive” originally meant covering all the possible mental images the consumer can have (of the destination), we aim for our list to sufficiently cover all of the attributes specified in past research, which their authors determined to be sufficiently comprehensive for the task of measuring the cognitive destination image. However, our list can also not simply be an aggregation of everyone else's lists (note that this was the approach already taken in Beerli and Martín 2004). Since our classes are to be used for the task of visual classification, the visual classes must also be exclusive, meaning that there is sufficient distinction between the visual details of each class. This is critically important for training an accurate classifier. For example, having two classes such as “parks” and “gardens” would not be exclusive enough: even a human annotator would have difficulty differentiating between the two when choosing a label for a photo with outdoor greenery. For us, this meant creating new classes—Beerli and Martín's (2004) “water, mountains and desert” (a single attribute), for example, was mapped to three classes exclusive from one another. Exclusivity also meant we excluded attributes like “tourist activities/tourist facilities” from our list, as the broad meaning of this attribute would make it difficult to determine an exclusive visual appearance to map to this class. Finally, analytically relevant means that we selected classes that were relevant to the task for which the classification is done. We are classifying destination photography in order to capture the dominant features or attributes of the destination reflected in the visual content. So unlike available visual classifiers for generic tasks (such as the Google Cloud Vision API), which return labels for the objects visible in the photo, such as “chair” and “table”, we focus our classes on the domains of interest for a touristic visitor such as “accommodation” or “gastronomy” (note how a photo with a chair and table could be either, based on other visual features that the off-the-shelf classifier would not use, such as whether it is a meal or a laptop which is visible on the table). We therefore also removed “climate/weather” since it could apply to every outdoor photo, to the detriment of classifying the photo according to the actual focus of the photographer (a landscape or a sports event, for example). Our resulting list has 18 visual classes for which our visual classifier will be trained. The list is reproduced in Table 3, shown alongside the lists from Table 1 to illustrate how our list covers the majority of the attributes in each of the previous papers.

Table 3 Author’s list of visual classes for destination image

The importance of the right training data is emphasised repeatedly in the literature on deep learning and neural networks. The training data must be balanced and avoid bias. In computer vision, this means there must be a set of sufficient, representative visual content for each label to be trained for; in our case, tourism photography for each attribute. To collect data, we used the Fatkun batch image download plugin for Google Chrome to download photos from Google Image Search, using the search term ‘tourism AND (attribute label)’. Collecting photos via Web search has been a standard means of creating datasets for computer vision, including ImageNet. In our case, the search query matches photos associated with the terms of tourism and the cognitive attribute's label, so we consider them potentially relevant for training. For each attribute, we downloaded 400–600 photos which we manually filtered, discarding all photos which were not relevant as tourism photography (clip art, line drawings, photos with text overlays, low quality etc.). Our final training data set consists of 4949 photos (an average of 275 photos per attribute). We can train different deep learning models on this training dataset using train/test splits to validate the outcome of the training, i.e. for each training task, we split our dataset randomly into 80% of the data which is used to train the network and 20% which is used to test the accuracy of the trained model (training runs over multiple cycles, or epochs, with the test split used after each epoch to check how well the model generalises to photos it was not trained on). The result is reported as val_acc (validation accuracy) on the test dataset in the last cycle of training. To establish a baseline, the first model followed a standard architecture used in computer vision tutorials: a three-layer convolutional neural network (CNN), which achieved no more than 66% accuracy. Adding data augmentation techniques (slightly modifying photos as they are input so that the model learns variations such as rotations or flips) achieved at most 68% accuracy.
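To make the baseline concrete, the following is a minimal sketch (not the authors' exact implementation) of such a three-layer CNN with an 80/20 train/test split and simple data augmentation in Keras; the directory layout, image size and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: a three-layer CNN baseline with an 80/20 split and on-the-fly augmentation.
# "training_photos/" is assumed to contain one subfolder of photos per visual class.
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (224, 224)  # assumed input resolution
NUM_CLASSES = 18       # our cognitive attributes (visual classes)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "training_photos/", validation_split=0.2, subset="training",
    seed=42, image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "training_photos/", validation_split=0.2, subset="validation",
    seed=42, image_size=IMG_SIZE, batch_size=32)

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    # data augmentation: random flips/rotations, active only during training
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# the val_accuracy of the final epoch corresponds to the reported val_acc
model.fit(train_ds, validation_data=val_ds, epochs=10)
```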

Many neural network architectures that are state of the art for computer vision tasks are available via code libraries (Keras, TensorFlow, PyTorch). As benchmarked against the ImageNet dataset, it is currently EfficientNet- and Transformer-based architectures that achieve over 90% top-1 accuracy (correctly predicting photo labels), as opposed to the original CNNs used for this task. Computer vision research has also confirmed significant improvements in accuracy through the technique of Transfer Learning (Cui et al. 2018). Here, the deep learning model does not start uninitialised but is initialised with the weights learnt from a previous training of the model with a very large data set (for a more general set of labels) and is then trained (fine-tuned) with our smaller data set (for a smaller, more specific set of labels). We found that this did improve our model's reported accuracy. Figure 1 illustrates our approach with transfer learning in comparison to other work (Table 2) which uses the computer vision model directly and post-processes the results (the aggregation of broad, generic labels). Our model output (Destination Image vectors) is further explained in Sect. 5.

Fig. 1 Illustration of our approach compared to other e-tourism research (Images used according to CC license https://creativecommons.org/licenses/by/4.0/. Sources: vectorportal.com, openclipart.org, commons.wikimedia.org)

Table 4 below summarises the validation accuracy measured for a number of different deep learning model architectures which were fine-tuned with our training dataset: a CNN (InceptionResNetV2), EfficientNets, Vision Transformers (ViT) and, the most recent state of the art, BEiT (Bao et al. 2021), a Transformer architecture with a different pre-training strategy. Each model had been pre-trained on ImageNet-1k. We can load the weights of a pre-trained model via libraries like Keras. To fine-tune the weights for our visual classes using the training dataset, the initial layers are frozen (their weights do not change) and only the fully-connected layers at the head of the neural network learn from the new data for the new set of visual classes. We report the results from training for 10 epochs: the training time is based on using one GPU at Google Colab, the parameter count is a measure of the size and complexity of the network, and the resulting accuracy scores are an average over 5 runs (first the validation accuracy based on the training/test data split, then the accuracy on the ground truth dataset which was fully unseen in training). As a baseline, results from a self-implemented three-layer CNN (without Transfer Learning) are included in the first row. It can be seen that this model does not generalise well to unseen data (the ground truth), dropping from 68 to 41% accuracy. Our experiment confirms that Transfer Learning with the pre-trained weights of a very complex deep learning network contributes a significant accuracy improvement, with training feasible on a single GPU; the ViT-L/16 model reached the highest validation accuracy (98.6%) while BEiT-L scored slightly better on the ground truth. The accuracy results are in line with the ImageNet benchmark list, where Vision Transformer-based networks also currently report the highest accuracy scores.
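As an illustration of this Transfer Learning setup, the following sketch fine-tunes the Keras InceptionResNetV2 weights pre-trained on ImageNet by freezing the base and training only a new classification head for the 18 visual classes; the head size and learning rate are assumptions, and the Transformer models (ViT, BEiT) would be fine-tuned analogously using other libraries (e.g. HuggingFace Transformers).

```python
# Minimal sketch of Transfer Learning with a Keras model pre-trained on ImageNet.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionResNetV2

NUM_CLASSES = 18

base = InceptionResNetV2(include_top=False, weights="imagenet",
                         input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # freeze the pre-trained layers; only the new head will learn

model = tf.keras.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # one output per visual class
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds built as in the earlier sketch (with image_size=(299, 299))
model.fit(train_ds, validation_data=val_ds, epochs=10)
```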

Table 4 A comparison of deep learning models trained on our cognitive destination image attributes

The high validation accuracy (up to 98.6%) can suggest overfitting, which refers to the danger in deep learning that a network is so well tuned for accuracy on its training data that it does not generalise well to other data (does not achieve the same accuracy levels as reported in the training). To check for this, a separate data set with previously unseen photographs labelled with the same visual classes has been prepared and used to test the accuracy of the model independently of the training data. This dataset could be used as ground truth by any model trained for the same set of visual classes. To build the ground truth, we take photography from a different source than our training data so that there can be no overlap. Yahoo has made available a dataset of 100 million photos from Flickr (“YFCC100M”) together with their user-provided tags. Those tags can be considered the users' own annotation of their photos. A Web interface was developed to browse the photos by tag (http://projects.dfki.uni-kl.de/yfcc100m/). A check for the labels of our 18 visual classes showed that the dataset contained sufficient matches for each label. We downloaded the dataset metadata and then individual photos using the Python library yfcc100m. To avoid too many irrelevant photos, we retrieved them based on a conjunctive query on the tags of “class label AND ‘travel’”. We manually reviewed the downloaded photos and reduced the set to 100 photos for each visual class which were deemed representative of each cognitive attribute. We chose 100 because the smallest total response to our query for any one label was slightly higher than that number, establishing the “highest minimum” value we could use if we wanted every label to have the same number of labelled photos. To deem a photo “representative”, we created a set of criteria for each visual class which had to be met for a photo to be considered a member of that class. The criteria were based on a common-sense determination of what characteristics would cause a human viewer to associate the photo with the given visual class (and not another class, or none of them). For example, for ‘accommodation’, the photo must clearly show the interior of a hotel room (as perceived by the viewer of the photo). For ‘beach’, a beach had to be clearly visible in the photo and in greater focus within the photo frame than any other visual class such as water. Following our selection of the photos by these criteria, we shared the same criteria (the annotation guideline) with an external, independent expert (a research colleague who works in textual annotation and is therefore familiar with following annotation guidelines) to assess whether all photos in the ground truth dataset meet the criteria for their given visual class (label). The result was disagreement on 6 photos labelled as Water, 4 photos labelled as Trees, 2 photos each labelled as Beach and Mountains, as well as 1 photo each labelled as Historical Building and Animals. This indicates 99.11% agreement. Cohen's kappa, a statistic which represents agreement between two annotators beyond what would be expected by chance, is 1.0, which suggests that the annotation is reliable and supports the quality of the labelling process used.
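For reference, the agreement figures could be computed as in the following sketch, assuming the two annotators' labels are available as parallel lists (author_labels and expert_labels are hypothetical variable names):

```python
# Minimal sketch of the inter-annotator agreement check.
from sklearn.metrics import cohen_kappa_score

# author_labels[i] / expert_labels[i]: the visual class each annotator assigned to photo i
observed = sum(a == b for a, b in zip(author_labels, expert_labels)) / len(author_labels)
kappa = cohen_kappa_score(author_labels, expert_labels)  # chance-corrected agreement
print(f"observed agreement: {observed:.4f}, Cohen's kappa: {kappa:.4f}")
```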

The final column in Table 4 shows the accuracy metrics of our models on the ground truth dataset. While the first models with lower validation accuracy also did not generalise well to the ground truth data (the accuracy on this data is significantly lower), the Vision Transformers in particular lost only marginally on accuracy when evaluated with this previously unseen dataset. Based on the average of 5 runs, the BEiT-L model achieved marginally higher accuracy (94.4%) than ViT-L/16 and its weights will be used subsequently for the validation of our approach against the prior state of the art.

4 Evaluation

We use our ground truth dataset to benchmark the chosen BEiT-L model against implementations used in past research papers, to determine whether our fine-tuned approach is indeed more accurate. We encounter the problem that the comparison models do not share the same set of labels, nor is there any single approach to mapping those labels to the final set of destination image cognitive attributes. However, we can reproduce the same methodology followed in past work (first, a more general classification of the photos with a generic vocabulary of visual classes; second, clustering applied to this broad classification to derive a smaller, more focused set of destination attributes). As applied to our ground truth photos, we can compare the derived set of destination image attributes from the comparison model to our set of 18 visual classes (Table 3). Most pre-trained models for image classification (including those available to future researchers) were trained on ImageNet, and as mentioned before, ImageNet-based classification is the standard benchmark for model accuracy. Therefore, where a researcher uses a pre-trained model for image classification, the most likely case is that it will use the ImageNet set of labels (as seen in Table 2, ImageNet has indeed been one of the vocabularies used in lieu of fine-tuning on the researchers' own visual classes). While others have occurred in the literature—Places365 or SUN, for example—they share the characteristic of labelling images for broad, generic classification purposes and not specifically for the attributes of a visual destination image. We therefore consider it valid to annotate our 1800 ground truth photos (each labelled with one of our 18 visual classes) with an ImageNet pre-trained model, cluster the resulting set of labels as the previous studies did (these labels will appear quite arbitrary, since ImageNet-based annotation is focused on a rather inconsistent set of 1000 concepts to be detected), and determine whether the final clusters align with our 18 visual classes.

While different pre-trained CNNs were used in past work to acquire the initial set of labels, we compare our model with the more recent EfficientNetV2L pre-trained on ImageNet-1k. We choose this model since the model weights are available from the Keras Applications library and it is currently the best-performing model available there on the ImageNet benchmark (85.7% top-1 accuracy, compared to CNNs used in earlier research such as 77.9% for InceptionV3 or 76.4% for ResNet101). For each of our 18 visual classes, we annotate the 100 ground truth photos with EfficientNetV2L and count the number of times each unique ImageNet label is returned as the top-1 result for a photo (to reduce outliers, the label must have a minimum confidence score of 0.1 and be the result for more than a single photo). To decide in an unbiased manner whether a label is relevant to the ground-truth visual class, we note that ImageNet labels are aligned to WordNet synsets (a grouping of synonymous terms for the same concept in the WordNet lexical database). We can use the hierarchical tree structure of WordNet synsets (connecting a synset to its hypernyms and hyponyms) to calculate the path similarity (we use the Wu–Palmer similarity measure) between the synset of the ImageNet label returned by the comparison model (EfficientNetV2L) and the synsets closest to our 18 visual classes, selecting the visual class which is most similar (with a threshold of 0.2 to discount labels which are dissimilar to all of our classes). Since ImageNet annotation is focused on the objects in the photos, for our more abstract visual classes like accommodation, entertainment or sport we used synsets for typical objects in those photos such as ‘bed’, ‘stage’ and ‘stadium’. Table 5 shows the results, with the accuracy score for the comparison model calculated as the ratio of labels returned which were most similar to the ground truth visual class divided by the total number of labels returned (after the removal of outliers). For brevity, only the three most common ImageNet labels returned for each set of 100 ground truth photos (one visual class) are shown in the table below.
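The mapping from a returned ImageNet label to the closest visual class via Wu–Palmer similarity can be sketched with NLTK's WordNet interface as follows; the representative synsets per class are illustrative assumptions and only a subset of the 18 classes is shown:

```python
# Minimal sketch: map an ImageNet label to the most similar visual class via
# Wu-Palmer similarity in WordNet (NLTK).
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

CLASS_SYNSETS = {
    "beach": wn.synset("beach.n.01"),
    "mountains": wn.synset("mountain.n.01"),
    "accommodation": wn.synset("bed.n.01"),      # typical object for an abstract class
    "gastronomy": wn.synset("restaurant.n.01"),
}

def closest_class(imagenet_label, threshold=0.2):
    """Return the visual class most similar to the label, or None if all are dissimilar."""
    synsets = wn.synsets(imagenet_label.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return None
    scores = {c: (synsets[0].wup_similarity(s) or 0.0) for c, s in CLASS_SYNSETS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(closest_class("seashore"))  # expected to resolve to 'beach'
```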

Table 5 Ground truth annotation results between our deep learning model and a ‘state of the art’ model without fine-tuning

A significant variation in accuracy can be seen across our visual classes. While a paired t-test can be used for significance testing (our null hypothesis being that the accuracy of our model and that of the comparison model are not significantly different from one another), the resulting p-value (p < 0.0001) does not indicate the size or practical significance of the difference between the two result sets. We therefore also provide an effect size index, a quantitative measure of the magnitude of the difference. Calculating Cohen's d as a standard effect size index for the t-test, we obtain d = 1.81, which indicates a very large effect size and suggests a substantial difference between the two models.

We identify an inherent bias in the distribution of label semantics in ImageNet as the root cause. While animals, types of shops and also monumental structures are part of the set of ImageNet-1k labels, and for these the comparison model's annotation proves to be as accurate as, if not more accurate than, our own fine-tuned model, labels are missing in ImageNet-1k for visual classes such as desert, landscape, museums or water. Photos which are visually obvious members of those classes are labelled with other objects that are visible or with labels that have a similar appearance (so desert photos are labelled as valley, landscape photos are labelled as alp (mountain), museum rooms are confused with altars or thrones, and photos of bodies of water are labelled according to the visible shoreline). In general, it can be observed that a model trained on a predetermined set of labels (ImageNet or otherwise) will preserve gaps in the distributional semantics of the labels (i.e. if certain objects or concepts are missing) as well as any bias learnt in the training. For example, in ImageNet's training data, photos are given a single label for one object visible in them, so many of our gastronomy photos were annotated as different objects or types of food or drink rather than the whole concept such as “restaurant”. While a clustering step might still identify the relationships here (e.g. between a plate and a glass), some labels returned for visual classes lose their semantic relevance when clustering, e.g. a café photo was labelled ‘laptop’ as this was visible in the centre of the photo. Many photos which showed primarily trees or plants/flowers were similarly annotated with other visible objects such as the (plant) pot or (park) bench. Furthermore, since the training data labels each photo with one label even if multiple objects are visible, a photo showing a certain object can be “correctly” annotated by the model with the label for another object (Yun et al. 2021); some labels are misleading due to an annotator's decision (Beyer et al. 2020); and some labels have biased or skewed training data (e.g. ‘programmer’ tends to show young white males, ‘mobile phone’ is trained with older touchpad phones in photos) (Yang et al. 2020).
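The paired t-test and effect size reported at the start of this section could be computed as in the following sketch, assuming the per-class accuracy scores of the two models are held in two parallel arrays (hypothetical variable names); the formula used is one common definition of Cohen's d for paired samples:

```python
# Minimal sketch of the significance test and effect size between the two models.
import numpy as np
from scipy import stats

ours = np.asarray(acc_finetuned)      # accuracy per visual class, fine-tuned model
baseline = np.asarray(acc_pretrained) # accuracy per visual class, comparison model

t_stat, p_value = stats.ttest_rel(ours, baseline)  # paired t-test over the 18 classes

diff = ours - baseline
cohens_d = diff.mean() / diff.std(ddof=1)  # one common definition for paired samples
print(f"p = {p_value:.6f}, Cohen's d = {cohens_d:.2f}")
```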

Having used models pre-trained on a large set of diverse labels, most past research has then used clustering as a technique to derive the smaller set of destination-specific characteristics. Clustering techniques such as k-means are unsupervised, meaning the resulting clusters have no labels per se, but may be distinguished by the top-occurring labels within them. We also test whether the set of labels acquired from the ImageNet-trained model can be clustered effectively into destination image cognitive attributes. We have 363 unique ImageNet labels from the annotation of our ground truth dataset of 1800 photos. As the WordNet synset tree proved restrictive for semantic similarity measurement (e.g. ‘bees’ and ‘flowers’ may be related by co-occurrence in text, but WordNet places them in two distinct ‘animal’ and ‘flora’ subtrees), we now use GloVe—high-dimensional word embeddings learnt from Wikipedia and a large news corpus (Pennington et al. 2014). Clustering can then use the distance between the 100-dimensional vectors, which allows similarity to be measured along different dimensions. Since GloVe uses single words, we convert multi-word ImageNet labels to single words either by taking the last word (when the semantics does not change, e.g. ‘worm fence’ and ‘picket fence’ both map to ‘fence’) or by a manual mapping to a single-word synonym or hypernym (‘traffic light’ → ‘stoplight’, ‘four-poster’ → ‘bed’). We use the k-means++ algorithm in scikit-learn with 2 centroid initialisations and a maximum of 500 iterations. We plot the elbow curve for between 10 and 32 clusters (past literature has varied from 10 to 32 unique cognitive attributes for destination image) and find the elbow (which also has a slightly higher silhouette score) at 13 clusters, so we use k = 13. Table 6 reports the clusters, with again for brevity only the first five labels shown for each cluster, and a determination by the authors of how each cluster would align with any of their destination image cognitive attributes.
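A sketch of this clustering step is shown below, assuming pre-trained 100-dimensional GloVe vectors (e.g. the glove.6B.100d.txt file) and the list of single-word ImageNet labels derived above (imagenet_labels is a hypothetical variable); we interpret the 2 centroid initialisations as scikit-learn's n_init parameter:

```python
# Minimal sketch of clustering ImageNet labels via GloVe embeddings and k-means++.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

words = [w for w in imagenet_labels if w in glove]  # single-word labels from the annotation
X = np.stack([glove[w] for w in words])

# inertia (elbow curve) and silhouette score for candidate k between 10 and 32
for k in range(10, 33):
    km = KMeans(n_clusters=k, init="k-means++", n_init=2, max_iter=500, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))

# final clustering with the chosen k = 13
km = KMeans(n_clusters=13, init="k-means++", n_init=2, max_iter=500, random_state=0).fit(X)
clusters = {i: [w for w, l in zip(words, km.labels_) if l == i] for i in range(13)}
```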

Table 6 Results of k-means clustering of ImageNet labels from the comparison model

Nine of the 13 clusters could be associated through manual examination with eight of our visual classes (accommodation, animals, building (hist.), entertainment, gastronomy, monument, roads traffic, water), although ‘animals’ and ‘entertainment’ were each relevant to two clusters and one cluster contained terms relevant to both ‘monument’ and ‘building (hist.)’. Two clusters contained terms relevant to two other visual classes—‘shops markets’ and ‘sports’—but mixed with terms associated with other classes. The remaining eight visual classes could not be differentiated at all by the clustering, as a result of inaccurate labelling by the classifier to begin with (e.g. for ‘desert’ photos, ‘camel’ would be clustered with animals). We conclude that a clustering step cannot produce a more accurate representation of a destination's image when it preserves the errors or imbalances in the original labelling of the photos.

This comparative evaluation suggests that previous approaches which rely directly on pre-trained models using broad label sets are not adequate to capture correctly all relevant cognitive attributes of the destination image, providing skewed or incomplete insights. Even if clustering is done as a subsequent step, it cannot be assumed that the model has been trained on an adequately comprehensive set of labels, nor that appropriate training data was used to train for those labels. The annotation of the training data may also have been done to fulfil a different purpose, less relevant to visual destination image, such as correctly recognizing specific objects in a photo rather than what the presence of those objects represents (such as ‘plate’ rather than ‘gastronomy’). Our experiment suggests it is necessary to fine-tune the deep learning model, using a specifically prepared training dataset for a comprehensive set of relevant visual classes (the cognitive attributes of destination image), as a prerequisite to accurately measuring the visual destination image. Only Sertkan et al. (2020a, b) also fine-tuned CNNs for a smaller set of visual classes, but focused on seven “personality” factors rather than a set of cognitive attributes, meaning their annotations can remain ambiguous as to the main (visible) features of a destination (e.g. a “Sun and Chill-out” destination does not necessarily feature a beach; it is unclear whether a “Nature and Recreation” destination is appealing due to the presence of mountains, lakes or forest). Their purpose for annotation was different from ours, as they compared destinations with travellers (personalities) for recommendation while we focus on the measurement of the (functional) cognitive component of destination image in order to inform the desirable content in destination marketing.

5 Results

We have considered which deep learning model can most accurately annotate photos according to our pre-selected list of cognitive attributes for destination image and have selected Transfer Learning with the BEiT-L neural network (Sect. 3). We have shown that our model, fine-tuned with a specifically prepared training data set for the cognitive attributes, can outperform pre-trained models (without fine-tuning) as used in past research on visual destination image measurement (Sect. 4). Now we consider how our deep learning model can be used to accurately measure visual destination image and enable the comparison of different destinations for the purposes of destination management and marketing.

The most recent studies have selected single destinations for visual destination image measurement (cf. Table 2). However, since multiple destinations may be measured according to the same set of visual classes, a comparison is also feasible, e.g. to identify which destinations share a similar visual destination image or how they differ (along which attributes). For an initial exploration of visual destination image measurement by our model, we selected a number of popular destinations and downloaded the most recent photos posted on Instagram (in August 2021) using the official tourist board hashtag (removing any photos from the DMO itself). Instagram is regarded as an important factor in trip planning among younger people (Varkaris and Neuhofer 2017) and a significant platform for destination branding by DMOs (Fatanti and Suyadnya 2015). To choose a set of destinations, we took some of the most popular destinations from the “travel bucket list” announced by the website Big 7 Travel, which was based on surveying 1.5 million Instagram users in 2019. Post-pandemic, these destinations can expect heightened interest in their travel content on Instagram from potential visitors. Analysing the UGC photos from tourists at the destination can reveal the “perceived” destination image, which can be used by DMOs to inform their destination management and marketing, e.g. to invest in or promote more of those aspects that are most popular with visitors, since these are also more present in the photos posted online and hence inform the expectations of potential future visitors as well (Song and Kim 2016).

Table 7 shows the destinations, the hashtag used for data collection and the number of photos collected. The visual destination image of one destination is considered the aggregate of the labels annotated to the photos in the dataset for that destination. While the sets of labels produced in past work could be distinct for each destination, precluding direct comparison, our model measures each destination with the same set of visual classes. We have seen how the similarity between words can be measured through multidimensional word embeddings (vectors). The visual destination image can likewise be represented as a vector, where each feature is the measurement for one visual class, resulting in each destination being represented by an 18-dimensional vector. For comparability, we calculate the value of each feature as the number of photos annotated with the corresponding visual class divided by the total number of photos annotated. This ensures each feature is scaled to a value between 0 and 1. For example, the resulting vector for the Maldives produced by our model is:

$$\begin{gathered} [{\mathsf{0.01663366}}\quad {\mathsf{0.00871287}}\quad {\mathsf{0.3829703}}\quad {\mathsf{0.00435644}}\quad {\mathsf{0.02336634}}\quad {\mathsf{0.02336634}} \hfill \\ {\mathsf{0.02930693}}\quad {\mathsf{0.02534653}}\quad {\mathsf{0.02336634}}\quad {\mathsf{0.01029703}}\quad {\mathsf{0.0150495}}\quad {\mathsf{0.00475248}} \hfill \\ {\mathsf{0.03960396}}\quad {\mathsf{0.01584158}}\quad {\mathsf{0.01346535}}\quad {\mathsf{0.01029703}}\quad {\mathsf{0.00990099}}\quad {\mathsf{0.34336634}}] \hfill \\ \end{gathered}$$
Table 7 Data collection from Instagram for nine destinations

The dominant features are the visual classes beach and water (in the 3rd and last positions, respectively, with values of 0.383 and 0.343). The model suggests that these are the dominant visual attributes of the Maldives in UGC Instagram photography.
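A minimal sketch of how such a destination vector can be built from the model's per-photo labels and compared between destinations is shown below (variable names such as CLASSES_18 and the *_labels lists are assumptions):

```python
# Minimal sketch: normalised 18-dimensional destination image vector and cosine similarity.
import numpy as np
from collections import Counter

def destination_vector(photo_labels, classes):
    """Share of photos annotated with each visual class, in a fixed class order."""
    counts = Counter(photo_labels)
    return np.array([counts[c] / len(photo_labels) for c in classes])

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# e.g. maldives = destination_vector(maldives_labels, CLASSES_18)
#      dubai    = destination_vector(dubai_labels, CLASSES_18)
#      print(cosine_similarity(maldives, dubai))
```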

The distance between vectors along different dimensions may be calculated using Euclidean or cosine distance; cosine is preferable as it measures the angle between two vectors rather than the absolute distance, placing more emphasis on the similarity of the distribution of the visual classes in the photography across all cognitive attributes. We first make a comparison on the complete visual destination image made up of the 18 visual classes. We can calculate the cosine similarity between all destination pairs, but the value alone is not informative enough, e.g. Dubai's vector is equidistant from both Paris and the Maldives. Instead, we perform dimensionality reduction and transform the 18-dimensional vectors to two dimensions, retaining some of the meaningful properties of the original data. The distributional semantic hypothesis assumes that data items which share similar values in the same dimensions of a vector space tend to have a similar meaning, and the reduction in dimensions makes possible the exploration and visualisation of the destinations in that vector space. We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm to approximate the distances between multidimensional objects inside the lower-dimensional space. Using the sklearn Python library, we reduce the destination vectors to two dimensions and plot them on a scatter plot (Fig. 2).

Fig. 2 Scatter plot of the nine visual destination images (Instagram UGC)
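The t-SNE reduction and scatter plot can be reproduced with scikit-learn and matplotlib roughly as follows, assuming the nine 18-dimensional destination vectors from the previous sketch; perplexity is set to 3 as noted below:

```python
# Minimal sketch of the t-SNE reduction of the nine destination vectors and its plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

names = ["Maldives", "Bora Bora", "Dubai", "Paris", "New York",
         "Dubrovnik", "Marrakesh", "Bali", "New Orleans"]
X = np.stack([maldives, bora_bora, dubai, paris, new_york,
              dubrovnik, marrakesh, bali, new_orleans])  # nine 18-dim vectors

emb = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1])
for (x, y), name in zip(emb, names):
    plt.annotate(name, (x, y))
plt.xlabel("tsne1")
plt.ylabel("tsne2")
plt.show()
```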

While dimensionality reduction techniques can be non-deterministic (i.e. produce different results depending on the parameters used—we settled on a perplexity score of 3 for Fig. 2), this approach can uncover latent features in the data, e.g. discovering previously unknown shared characteristics of destinations' visual representations. Along the two reduced dimensions of our figure, the visual similarity of the Maldives and Bora Bora is clearly demonstrated by a high tsne1 value (x-axis), suggesting a beach/water feature weighting (Dubrovnik is also plotted towards the same corner of the plot). The tsne2 value (y-axis) appears to weight urban features strongly, with New York and Dubai at the top. A latent semantic similarity in visual representation is suggested by the proximity of Marrakesh and Bali, which on investigation is predicated on a comparatively strong presence of plants flowers and shops markets in the Instagram UGC of both destinations. Marrakesh visitors photograph the plants outside private houses in the city streets, while Bali visitors share visuals of the shops/market stalls selling local artisanal products such as jewellery.

While dimensionality reduction can aid the visualisation of how different destinations relate to each other in terms of their entire visual destination image, the loss of detail in the data means it can be difficult to determine the precise reason for the alignments between the destinations. For example, what is Dubai's visual destination image in Fig. 2? Its high y-axis value, aligned with New York, suggests a modern urban image, but what is the significance of its alignment with Bali and Dubrovnik on the x-axis? Destination images may also be compared along a smaller number of dimensions, avoiding the information loss of dimensionality reduction. A subset of the visual classes can be used to measure destination image according to a specific typology, e.g. we select the three classes of “historical building”, “modern building” and “roads and traffic” to reflect in combination the ‘urbanity’ of a destination (a focus on photos showing buildings and urban infrastructure as opposed to other aspects). Figure 3 shows a 3D plot of the vector feature values for these three classes for all the destinations.

Fig. 3 The “urban” aspect of the visual destination images

New Orleans, New York and Dubai form a cluster in one part of the plot: their visual image is urban and modern, with less reflection of their historical buildings. Dubrovnik has the strongest urban-historical image. Paris stands out as having a strong image in both historical and modern buildings, the only destination in our experiment not to tend towards one aspect or the other. The Maldives and Bora Bora are naturally the least urban destinations, although Marrakesh also appears in their cluster, indicating that its offer of historical buildings is not an important focus for visitors. Bali stands alone with a comparatively high value for roads and traffic in its visual destination image. While it is a beach and sea destination, visitors do also go to urban areas (to shop, for example), and we found others photograph the many motorcycles on Bali's roads. This may explain Bali's image being closer to Dubai in Fig. 2 than to other classical beach and sea destinations like the Maldives and Bora Bora.

6 Implications for destination marketing

Through the measurement of visual destination image and its representation as a multidimensional vector, it becomes possible both to learn how individual destinations are being presented visually, in terms of which of their attributes are reflected more or less in the tourist photography, and to compare and cluster multiple destinations according to similarities and differences in the attributes reflected in the respective photo datasets.

This has important implications for destination marketers. Social media photography is a significant source of insights into travellers' perceptions of a destination (Lo et al. 2011), contributes to consumers' destination image formation (Kim et al. 2017) and influences attitudes towards the destination (Lund et al. 2018). While marketing success could be assessed by comparing how close the consumers' destination image is to the intended destination branding, we believe differences could be of more significance than similarities. An earlier study by the authors looked at how Instagram photography could change an observer's image of a destination (Nixon et al. 2017) and found that while photos which backed up existing impressions had little effect (e.g. Jordan and desert), those which highlighted new attributes positively affected destination image (e.g. Jordan and food). By knowing which aspects of a destination capture more of the attention of visitors outside of its established (dominant) attributes, the destination may invest more in providing or promoting those aspects in order to attract more visitors (new or returning). This could apply to greenery in Marrakesh or shopping in Bali, based on the results reported at the end of Sect. 5. Destinations may also discover undervalued aspects that are worth highlighting more: our Dubai visual destination image was comparatively strong in the animals class (thanks to desert safaris and the aquarium) despite Dubai being overall a highly urban and modern destination, suggesting a visitor interest that could be promoted more strongly in the destination's marketing activities.

Comparisons between the visual destination images of different destinations can aid destination managers in understanding where the destination fits in the mental considerations of the typical consumer, especially in comparison with other destinations. Branding research has established that brand choice is based on perceived branding that matches consumers' actual, ideal and social images (Ataman and Ülengin 2003). Marketing research has shown that the closeness of a mental image to a person's expectations of a trip can lead to a destination choice (Leisen 2001). Consumers today are well aware of the wide range of destinations available to them. Their destination choice at any moment will depend on how well a destination within their consideration set (a subset of all available destinations, constrained by many other factors such as distance, price, etc.) fits their mental expectation of its ability to fulfil the travel purpose. Based on our analysis above, a traveller wishing for a pure "sea and beach" trip may be more willing to consider the Maldives than Bali, based on how the destinations are being presented visually to them. Their colleague, who prefers to mix beach time with more 'urban' activities, could therefore prefer Bali. A traveller who prioritizes an urban experience, where water can be present but is not prioritized, may find their answer in Dubai. We acknowledge that the similarities and differences between visual destination images presented by our analysis still lack empirical testing with travellers, particularly to test whether the attributes measured by our visual classifier are those that dominate travellers' mental images. Destinations which share a similar visual destination image can consider the destinations closest to them as their most significant competitors, such as the Maldives and Bora Bora in Fig. 2. Latent similarities can also be uncovered which suggest new directions for destination marketing: it may be that Marrakesh and Bali share a similar composite image in travellers' minds, so that consumers who are interested in one of those destinations are a valid target audience for marketing from the other.

Some initial additional studies have been completed with our fine-tuned visual classification model to compare DMO and UGC Instagram photography for several popular destinations, uncovering further similarities and differences which can guide and inform destination marketing efforts (Nixon 2024). We will continue our efforts in exploring the visual destination image and what may be learnt from it for destination management and marketing. For example, the visual destination image of a destination could also be compared across platforms or time periods to understand how the destination is experienced differently by different demographic groups (e.g. Facebook vs Instagram) or at different times (e.g. summer vs winter). Further studies are needed to test whether the hypotheses derived from visual destination image measurement (e.g. that Marrakesh and Bali have similar composite images that make them appeal to the same type of traveller) can be confirmed. We would need empirical evidence through human evaluation (surveys, focus groups or similar) or data analysis (e.g. overlap in Website visitors, social media likes or similar), particularly to test whether the focus on certain features of destinations (e.g. as reflected in Instagram photography) can be associated with consumers' intention to visit (or recommend or revisit) the destination.

7 Conclusion

In this paper, we present our deep learning model, trained specifically on training data prepared by the authors for an appropriate set of cognitive attributes that comprehensively represent visual destination image. Given the importance for destination management and marketing of accurately measuring how a destination is being represented visually to potential and actual visitors, DMOs and destination managers need to have sufficient confidence in the destination image derived from automated analyses. An evaluation against a state of the art model trained on ImageNet demonstrates that the prior work which focused on measuring the cognitive component of destination image from photography (clustering labels directly acquired from pre-trained models without fine-tuning) would repeat biases and gaps in the original training of the selected model. We demonstrate that our fine-tuned model provides a more accurate and complete measurement of visual destination image than these approaches.

Our approach reduces potential errors in destination image measurement by introducing a new training dataset for a set of visual classes which are aligned with the sets of cognitive attributes identified in prior e-tourism research on destination image. We also validate the accuracy of our approach with a new ground truth dataset for the same set of cognitive attributes. We make a comparative evaluation against past approaches by aligning the labels produced by a state of the art model (without fine-tuning) for our ground truth photos with our 18 visual classes which represent destination image's cognitive attributes. We found that the state of the art model without fine-tuning carries over the biases and imbalances of its training (which was done for a generic visual classification task) when re-used for visual destination image measurement. Finally, we show how our deep learning model can be used to measure the visual destination image of different destinations (using UGC photos on Instagram) and, by representing destinations as multidimensional vectors, apply vector distance measures to determine similarities and differences between destinations' visual representations and discover latent features which can be used to support destination management and marketing activities. For example, DMOs can discover features of their destination which are given importance by visitors through their Instagram photography and emphasize those features more in the content of their Instagram marketing.

We hope that this work will encourage further exploration of the optimal approach to measuring the characteristics of destinations from photos (and, in an extension, from video) and provide a means to compare within and between destinations. With the increased use of photography and video in communicating about destinations online, deep learning approaches to automatically analyse media content at large scale are appealing, but issues of accuracy and validation remain. The correct set of cognitive attributes to express the mental construct we call 'destination image' can still be debated. In training, clear distinctions need to be made between the classes, yet human cognitive understanding could include overlaps and even confusions between the same classes. User surveys would be needed to determine whether computer-facilitated labelling of photos is consistent with human understanding of the visual content. Even if the objective label of the photo is accurate (e.g. the visible object), human interpretation of the photo (for the subsequent formation of destination image) may be less based on tangible concepts and more subjective than a 'computer vision' model can anticipate. For example, we have kept our cognitive attributes as globally relevant as possible: there is no consideration of cultural differences in the interpretation of a photo, nor any region-specific differences in our model training (e.g. a classical hotel room photo should be annotated by our model as 'accommodation', but alternative forms of accommodation, such as a hammock in a jungle tree house, would not be).

Future work would include benchmarking other models (trained with different data, other neural networks and other hyperparameters) against ours using the ground truth data (as long as those models also support the same visual classes). New classes for destination image's cognitive attributes could be added through the provision of adequate training data, in so far as they do not overlap visually with existing classes. We believe that destination marketing and management in particular will benefit from the accuracy of our fine-tuned deep learning model. We will continue to use it to measure and compare the image of different destinations in digital photography, as we have already begun (Nixon 2024), in order to uncover further what can be learnt now that accurate visual destination image measurement is available. Other future work would be empirical tests with users to determine the relationship between the measured visual destination image and aspects important to destination management and marketing, such as visitors' intention to visit and recommend the destination as well as post-visit tourist satisfaction.

To support the re-use of our Transfer Learning approach as well as benchmarking against other approaches, our deep learning model (weights) as well as the ground truth dataset are available publicly online on HuggingFace (https://bit.ly/destinationclassifier) and Google Drive (https://bit.ly/visualdestination) respectively.
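As a hypothetical usage sketch (not taken from the paper), the published weights could be loaded for inference roughly as follows, assuming they are packaged for the Hugging Face transformers image-classification pipeline. The repository id and photo filename are placeholders and should be replaced with the actual values behind the shortened link above; if the weights are instead published as plain framework checkpoints, a framework-specific loading step would be needed in place of the pipeline call.

from transformers import pipeline

# Placeholder repository id: resolve the real id via https://bit.ly/destinationclassifier
classifier = pipeline("image-classification", model="example-user/visual-destination-classifier")

# Label a single (hypothetical) Instagram photo with its top visual classes
predictions = classifier("example_photo.jpg", top_k=3)
for p in predictions:
    print(f"{p['label']}: {p['score']:.2f}")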