1 Introduction

The shift in digital consumer behaviour, particularly among Generation Z and millennials, towards visual content on the Web for destination discovery and inspiration raises a new challenge for tourism stakeholders. Insights into how destinations are perceived by online consumers are necessary for managers to make sound decisions about managing the destination's brand online, but such insights are only meaningful if the whole story is considered, i.e. if the online digital photography of the destination is included in marketing data analytics alongside textual content and statistical analytics. Advances in computational media understanding (computer vision), particularly through the application of neural networks and deep learning, have enabled significant progress in computer systems that can accurately classify visual content (photos) into predetermined categories. Such visual classification capabilities are now publicly available as pre-trained models, meaning that functionality which was long accessible only to the few with highly complex and expensive computer systems is now a possibility for any business that identifies a need for it. However, these off-the-shelf solutions have certain limitations: the available models are trained on highly generic visual data (not specifically touristic imagery) for a very broad set of fine-grained labels (not specifically destination related). As a result, users still need to train these systems for specific use cases, both determining the appropriate set of (visual) classes to be supported and ensuring sufficient training data for those classes so that the system can learn to distinguish between them in previously unseen content. In this paper, after motivating the need for a new approach to extracting destination brand from imagery (Sect. 2), we present our visual classifier, which was trained and fine-tuned specifically on tourist photography of destinations to classify into classes aligned with those accepted in past e-tourism research as capturing the various attributes of a destination's brand (Sect. 3). This offers the opportunity for a shared, consistent approach to destination brand measurement from visual media. This is demonstrated in an exploratory study extracting the brands of nine different destinations from user generated content on Instagram and comparing them (Sect. 4). The results of this study offer an opportunity to discuss how visual destination branding could be used by destination marketers to adapt and improve their destination marketing (Sect. 5). We close with a conclusion summarising the contributions of this work to research applying computer vision approaches in the e-tourism domain and ideas for future studies making use of visual destination branding (Sect. 6).

2 Background

Destination management includes the management of the mental image consumers hold when they think of the destination, promoting desirable attributes that create an intention to visit and differentiating the destination from others; this managed image then becomes the destination brand. Success here increases the destination's attractiveness to potential tourists [1] and boosts the economic profitability of a destination [2]. Destination marketers have always wanted to understand the image that their target audience has of their destination (perceived image) and how to influence that image, through their own marketing activities, to bring it closer to the intended brand (projected image). With the rise of the Web, social networks, mobile devices and digital photography, a destination's image is determined less and less by the destination's own marketing and more and more by traveller UGC (user generated content) posted and consumed globally on platforms such as Facebook and Instagram. While marketers suffer a loss of control over the destination's own branding, UGC also provides the opportunity to discover what the audience itself associates with the destination [3] and to adapt marketing strategies to emphasize the positively perceived aspects more strongly and mitigate negative aspects [4]. Multiple studies have confirmed the increasingly important role of social media and UGC in travel inspiration and planning [5,6,7,8].

While destination brand is an aggregate of multiple sources of information and messaging, both online and offline, social media and UGC play a significant role. Instagram is considered to be the most influential social network for travel (Footnote 1), particularly among millennials and Generation Z, for whom "Instagrammability" has become a criterion for travel. It is said that 70% of photos posted on Instagram are travel related (Footnote 2). 48% of Instagram users rely on the images and videos that they see on the platform to inform their travel decision making and 35% of users use the platform to discover new places (Footnote 3). Visual content is found to be a significant contributor to the formation of a destination image [9], enhanced by the public global sharing of photographs on social networks [10]. Both [11] and [12] have indicated that Instagram content can positively influence a person's perception of a destination. Therefore, destination marketing needs to include creating and publishing images and videos that will positively catch the attention of a potential consumer [13]. The emergence of digital photography and its global distribution on the Web led to studies comparing the contents of online destination photos, e.g. those on DMO Websites with UGC on Flickr [14, 15]. Tourism management research has only recently started to explore the use of image sharing social networks in destination marketing [16]. However, a limitation of these studies has been the need to manually annotate each image and ensure inter-annotator agreement.

The inaccessibility of automated analytics for image data stands in contrast to text, where text mining has been able to draw on well-established natural language processing techniques for decades. This has changed in the last decade due to new advances in computer vision made possible through neural networks with multiple processing layers (deep learning) [17]. Year on year, new architectures are achieving ever better accuracy scores on computer vision tasks, usually benchmarked against the ImageNet-1k dataset (1.2 million photos labelled with 1000 visual classes) (Footnote 4). At the time of writing, 9 of the 10 top models use the Transformer architecture. Many top performing computer vision models are available publicly and can be used in "transfer learning" [18]: models are pre-trained with large generic image datasets such as ImageNet-1k and are then fine-tuned with new data for more specific tasks.

To overcome the limitations of manual annotation, e-tourism research has recently begun to use such computer vision models to automatically annotate larger sets of photos. [19] compared tourist perceptions of Beijing, using the ResNet-101 model to classify 35,356 tourist photos into one of 103 scenes, which were then manually reclassified into 11 categories. [20] analysed 58,392 Flickr photos geolocated to Hong Kong, comparing perceived destination image between residents and tourists, following the same approach as [19]. [21] used the DenseNet161 model to identify 365 scenes in 531,629 photos of Jiangxi, then LDA to determine the five major tourism topics. [22] used the InceptionV3 model to assess the destination image of Seoul: 39,157 Flickr photos were classified into 858 classes and then grouped into 14 categories. [23] measured the destination image of Austria presented by Instagram photography using the Google Cloud Vision API (Footnote 5). On average 8–12 labels were returned per image (with a confidence threshold of ≥ 0.5), resulting in 5290 unique labels. Machine learning was then used with the textual labels to produce 15 clusters, which were labelled by their most significant associated labels; e.g. Vienna photography was most related to the clusters 'random travel photography', followed by 'urban feelings', 'historical perspectives' and 'cathedral views'. In this literature, no computer vision model has been specifically trained for the tourism domain (e.g. with destination photography). Rather, a generic model trained on a broad range of visual categories (e.g. the 1000 labels of ImageNet) has been employed to annotate a set of tourism photographs, with a subsequent clustering step used to reduce the large number of unique labels to a smaller set of characteristics.

3 Methodology

One consequence of the previous approaches to the classification of destination photography has been the lack of any consistent vocabulary for the visual characteristics of a destination. This means that different approaches cannot be compared, and even the same approach would produce a different set of cluster labels for different destinations. E-tourism research has long considered what would be an appropriate set of characteristics to measure the destination image, or "the sum of beliefs, ideas and impressions that a person has of a destination" [24]. The visual or cognitive component of the destination image, the mental picture of the location [25], relates to the physical aspects of a destination that a person is actively exposed to when searching for travel related information [26]. Destination image research has mostly focused on capturing and measuring the cognitive aspect through a finite set of disambiguated attributes considered common to people's mental constructs of a destination [27]. However, there has been no agreement on these cognitive attributes nor on the research methods to determine them [28]. The authors compared the attributes used in four key research works in the literature (based on the number of citations provided by Google Scholar for the term "destination image") [15, 29,30,31] and aligned them into a list consistent with the task of visual classification. Table 1 compares the attributes in these works and presents the authors' own list (rightmost column). The result (18 classes) covers the brand attributes used in past research as determined by surveys or expert elicitation; climate/weather and tourist activities/facilities were excluded by the authors as they were identified as too visually generic.

With the set of 18 visual classes, we prepared a new, specific dataset of destination photography for training our visual classifier. The deep learning literature has repeatedly highlighted the importance of appropriate training data regardless of the complexity or power of the neural network to be trained (imbalances in training data have led to widely reported cases of "AI bias"). After all, a neural network learns visual classification by building up an idea of the common features of a visual category over its multiple layers, from simpler features (lines etc.) to more complex ones (shapes etc.) as the network gets deeper. Off-the-shelf models are usually trained on ImageNet, which has been criticized for the quality of its image annotations [32]. We fine-tune an existing model with our training dataset for the 18 classes that represent a destination brand. Our training data was collected via Google Images search and results in a dataset of 4949 photos (ca. 275 photos per class; further details in [33]).

Table 1. Lists of cognitive attributes of destination image used in the research.

As is usual in the deep learning domain, we use a train-test split of the dataset (80%/20%) to improve the training of the model over multiple cycles (epochs) through evaluation against the test dataset. To also validate the resulting model and provide a basis for benchmarking different implementations, we created a new ground truth dataset from the YFCC100M dataset (100 million tagged Flickr photos), using matching tags to collect 100 previously unseen photos per class (1800 photos in total). Our initial classifier was fine-tuned from the InceptionNetV2 model trained on ImageNet and scored 0.91 accuracy on the test dataset and 0.75 accuracy on the ground truth dataset. We then compared this implementation with other models, focusing on the best performing Transformer architectures. ViT-L/16 has the best test accuracy score to date (0.986), but near-perfect test accuracy can also be a consequence of overfitting (fitting the model so closely to the training data that it performs much worse on new, unseen data), so we focus on the model with the best accuracy score on the ground truth data, which is BEiT-L (0.944) [34].
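A minimal sketch of such a fine-tuning setup, using the HuggingFace transformers and datasets libraries, is shown below; it is not our exact training pipeline, and the folder path, checkpoint name and hyperparameters are illustrative assumptions.

```python
# Sketch: fine-tune a pre-trained BEiT checkpoint on an image folder with one
# sub-folder per visual class (e.g. destination_photos/beach/, .../water/).
import numpy as np
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/beit-large-patch16-224"        # assumed public checkpoint
processor = AutoImageProcessor.from_pretrained(checkpoint)

data = load_dataset("imagefolder", data_dir="destination_photos")["train"]
data = data.train_test_split(test_size=0.2, seed=42)  # 80%/20% train-test split
class_names = data["train"].features["label"].names

def transform(examples):
    # Resize and normalise the photos exactly as the pre-trained model expects.
    examples["pixel_values"] = [
        processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
        for img in examples["image"]
    ]
    del examples["image"]
    return examples

data = data.with_transform(transform)

def collate(batch):
    return {"pixel_values": torch.stack([x["pixel_values"] for x in batch]),
            "labels": torch.tensor([x["label"] for x in batch])}

def accuracy(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(class_names),        # replace the 1000-class ImageNet head
    ignore_mismatched_sizes=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments("beit-destination-brand", num_train_epochs=5,
                           per_device_train_batch_size=16,
                           remove_unused_columns=False),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=collate,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())               # accuracy on the held-out 20%
```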

4 Results

Tourism destination marketers aim to align the perceived destination image of the consumer with their projected destination image (the image they want consumers to have of their destination according to their marketing strategy, i.e. their intended destination brand). [15] state in their paper that "DMOs need to know what images dominate the internet and whether these images are consistent with the information projected by the destination itself, so that they can reinforce positive images or counter unfavorable images, if necessary". Instagram is the most significant visual medium for sharing destination imagery today. Therefore, as an exploratory study for the use of our classifier, we chose to compare the extracted destination brand of nine different destinations using UGC on Instagram as the data source. The destinations were chosen from the top of the Forbes list of the top 50 destinations (Footnote 6), so that we can trust that significant Instagram photography is available and that these destinations have a clearer brand in the mind of travellers. Using the Python library instaloader with the hashtag(s) promoted by the tourist board, we downloaded the most recent photos posted on Instagram with those hashtags (cf. Table 2). After classification, the data was deleted. We did not perform any further preparation of the dataset: we kept the potential randomness that can occur with "real world" data.

Table 2. Chosen destinations, hashtags and number of photos downloaded.
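As an illustration of this collection step, the following is a minimal sketch assuming the instaloader library; the hashtag name and the 1000-post cap are placeholders, the actual hashtags and photo counts being those listed in Table 2.

```python
# Sketch: download the most recent photo posts for one destination hashtag.
from itertools import islice
import instaloader

L = instaloader.Instaloader(download_videos=False, download_comments=False,
                            save_metadata=False)
hashtag = instaloader.Hashtag.from_name(L.context, "balitourismboard")  # placeholder tag
for post in islice(hashtag.get_posts(), 1000):      # cap at the most recent 1000 posts
    if not post.is_video:                           # keep photos only
        L.download_post(post, target="bali")
```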

The classifier labels each photo with a single visual class: a confidence score is calculated for all 18 classes and the class with the highest confidence is chosen as the label, as long as it crosses a threshold (after some heuristic testing, we chose 0.5). To provide a representation of the destination brand (based on the labels of all the photos), we sum the number of photos labelled with each class. Since the number of photos differs for each destination, that sum is divided by the total number of photos from the destination to produce a part-to-whole ratio. This means values can be compared across destinations. Mathematically, the list of 18 numbers which represent the ratios for the 18 visual classes can be represented as a vector. The resulting vector for Bali's destination brand, for example, is: [0.07124011, 0.04309587, 0.10202287, 0.01319261, 0.02022867, 0.03605981, 0.06068602, 0.06508355, 0.09146878, 0.01055409, 0.01319261, 0.01055409, 0.09234828, 0.05013193, 0.08531223, 0.0351803, 0.02726473, 0.17238347]. The highest ratio is the last number (in 18th position), which is the value for the visual class of water, followed by the third number, which is the visual class of beach. However, how does Bali's destination brand compare to others? Vectors can be compared and manipulated mathematically. Cosine similarity is a common choice: it measures the angle between two vectors, and the smaller the angle, the more similar the vectors are with regard to the distribution of their features (values). Table 3 shows the cosine similarity between four selected destinations paired with one another (New Orleans, New York, Maldives, Bora Bora). The two most similar destinations are the Maldives and Bora Bora (0.99) and the most dissimilar are New Orleans and Bora Bora (0.2).

Table 3. Cosine similarity between selected destination pairs.
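The ratio vectors and their cosine similarity can be computed in a few lines of Python; the following is a minimal sketch in which CLASSES_18, bali_labels and maldives_labels are hypothetical names for the fixed class list and the per-photo labels returned by the classifier.

```python
# Sketch: build the part-to-whole ratio vector per destination and compare
# two destinations with cosine similarity.
from collections import Counter
import numpy as np

def brand_vector(photo_labels, classes):
    """Ratio of photos per visual class, in a fixed class order."""
    counts = Counter(photo_labels)
    return np.array([counts[c] / len(photo_labels) for c in classes])

def cosine_similarity(u, v):
    """Cosine of the angle between two brand vectors (1.0 = identical direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# usage with hypothetical inputs:
# bali = brand_vector(bali_labels, CLASSES_18)
# maldives = brand_vector(maldives_labels, CLASSES_18)
# print(cosine_similarity(bali, maldives))
```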

To visualise all nine destinations according to their overall destination brand, we need to perform dimensionality reduction. We use t-Distributed Stochastic Neighbor Embedding (t-SNE), which approximates the distances between multidimensional objects inside a lower dimensional space. Using the sklearn Python library, we reduce the 18-dimensional vectors representing the destination brands to two dimensions and plot them on a scatter plot (Fig. 1; created with the matplotlib, mpl_toolkits and pylab Python libraries; t-SNE with perplexity = 5); a minimal sketch of this step follows the list below. While the reduction in dimensionality hides in which individual features (visual classes) the different destinations are more or less similar, it can be seen how the Maldives and Bora Bora are closest to one another in one corner of the 2D space, and New York and New Orleans form a pair in the opposite corner. These two pairs are furthest away from one another, i.e. most dissimilar, which fits with our cosine similarity measurements (Table 3). Such visualisations reveal latent features in the data, i.e. previously unknown similarities in the destination brands. For example, Marrakesh and Bali are close to one another, and a vector subtraction to find the difference shows that both accommodation and entertainment have a similar presence in their brands. To explore the destinations along specific brand attributes, we can choose a subset of visual classes. We use sets of three classes, since three-dimensional data is easier to visualise and understand than higher-dimensional data:

  • the three dimensions of beach, trees and water represent a “seaside” brand.

  • the three dimensions of entertainment, gastronomy and shops & markets represent a “tourism services” brand.

  • the three dimensions of historical buildings, modern buildings and roads & traffic represent an “urban” brand.
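The 2D reduction and plot can be sketched as follows, assuming brand_vectors is the (9, 18) array of ratio vectors and names the nine destination names; random placeholders are used here so the snippet runs on its own.

```python
# Sketch: reduce the nine 18-dimensional brand vectors to 2D with t-SNE and plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
brand_vectors = rng.random((9, 18))                 # placeholder for the real ratios
names = [f"destination {i + 1}" for i in range(9)]  # placeholder destination names

# perplexity must be smaller than the number of samples (here 5 < 9)
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(brand_vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, names):
    plt.annotate(name, (x, y))                      # label each destination point
plt.title("Destination brands reduced to 2D with t-SNE")
plt.show()
```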

Fig. 1. 2D visualisation of the destination brand vectors.

Figure 2 (created with the matplotlib, mpl_toolkits and pylab Python libraries) shows the seaside brand of the destinations. Bora Bora and the Maldives clearly have the strongest brand (plotted on top of one another on the right-hand side). Bali and Dubrovnik have an equally strong brand association with water but a weaker association with beaches. The remaining destinations show weak seaside branding, with only New York and Dubai being slightly more associated with water than the others.
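A minimal sketch of such a three-class sub-brand plot is shown below, reusing the placeholder brand_vectors and names from the previous sketch; the column indices for beach, trees and water are assumptions (only water in 18th position and beach in 3rd position are stated in the text).

```python
# Sketch: 3D scatter of three selected brand dimensions (the "seaside" subset).
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3D projection)

rng = np.random.default_rng(0)
brand_vectors = rng.random((9, 18))                 # placeholder for the real ratios
names = [f"destination {i + 1}" for i in range(9)]  # placeholder destination names
BEACH, TREES, WATER = 2, 15, 17                     # TREES index is an assumption

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(brand_vectors[:, BEACH], brand_vectors[:, TREES], brand_vectors[:, WATER])
for vec, name in zip(brand_vectors, names):
    ax.text(vec[BEACH], vec[TREES], vec[WATER], name)  # label each destination point
ax.set_xlabel("beach"); ax.set_ylabel("trees"); ax.set_zlabel("water")
plt.show()
```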

Fig. 2. 3D plot of the destination brand vectors for the "seaside" branding.

Figure 3 shows the tourism service brand of the destinations. Bora Bora and Maldives are least focused on these services, i.e. the brand is more about relaxation. Dubrovnik, Paris and Dubai UGC show significant focus on gastronomy. Bali and Marrakesh are well balanced along all three dimensions, with Marrakesh having the stronger association with shops & markets. New York has the strongest brand in the entertainment and gastronomy dimensions, yet it is New Orleans with the strongest tourism services brand overall, particularly from the shops & markets perspective.

Fig. 3. 3D plot of the destination brand vectors for "tourism service" branding.

Figure 4 shows the urban brand of the destinations. Maldives, Bora Bora and Marrakesh have the least urban branding. New Orleans, New York and Dubai are the most urban destinations with strong associations with both modern buildings and roads/traffic. Paris shows the strongest association with historical buildings. Interestingly, Dubrovnik and Bali have brands suggesting low urbanity but with one strong dimension: historical buildings for Dubrovnik and roads & traffic for Bali. The latter suggests that while Bali is also a “seaside” destination, it has a more urban image than Maldives and Bora Bora.

Fig. 4. 3D plot of the destination brand vectors for "urban" branding.

5 Discussion

Our exploratory study has shown how the visual classifier can be used to extract a destination brand from image data, in this case Instagram UGC photography. The results indicate that this is an accurate approach for gaining insights into how destinations are being presented visually to consumers, in effect the target audience of that destination. UGC is particularly important to destination marketers, as photography is used by visitors to capture the most salient attributes of the destination from their experience [35]. A set of photos from a destination can be seen as a materialisation of the tourist's perceived destination image [36], and thus is a valid data source from which to construct the existent destination brand [37]. For the marketer, this destination brand can be used to determine their marketing success (how much of their branding is reflected in UGC imagery). Comparing the UGC branding with the DMO's intended brand can help identify desirable aspects not in their brand strategy which are being focused on in visitor photography, as well as undesirable aspects that may need mitigation in an adapted marketing strategy. For example, it seems Bali UGC shows more urban infrastructure (roads & traffic) than may be desirable, whereas beaches are underrepresented. Of course, it may be that Bali does not wish to be seen solely as a beaches-and-water destination like the Maldives. On the other hand, New Orleans, a destination with a strong entertainment brand, has a significant UGC representation of shopping and gastronomy attributes as well, which may indicate to the DMO new aspects to focus its marketing activities on. Finally, destinations can identify other destinations which present a similar brand to the same audience, suggesting targeting opportunities for their social media marketing (i.e., people who visited X may also want to visit…); e.g. Marrakesh and Bali present a similar brand in certain aspects.

6 Conclusion

This paper has presented our visual classifier, trained and fine-tuned specifically on destination photography to classify photos into one of 18 brand attributes. Since the same set of visual classes is used in each classification task, this approach can be used to compare different destinations' brands or even track changes in a destination's brand over time or across platforms. Our current version uses the BEiT-L Transformer architecture, scoring 94% accuracy on previously unseen Flickr photography. It is available on the HuggingFace platform (Footnote 7), where it is possible to classify individual photos via the Web interface or to classify larger photo sets via the API. Our ground truth data is available on Google Drive (Footnote 8) for others who would like to benchmark their visual classifiers. We believe destination marketers and other tourism stakeholders will benefit from the classifier extracting the destination brand from visual UGC. Just as text mining has long helped marketers to understand what visitors associate with their destination through the analysis of blogs and reviews, visual classification makes it possible for marketers to gain equivalent insights into the most important aspects of their destination for visitors through the analysis of their photography. While we have focused on 18 visual classes which cover differing sets of attributes established in past research on destination image, the visual classifier can be trained on different or additional classes as needed by the user; e.g. we did not include attributes like traditions or arts and crafts in our classifier, as their visual representations would vary greatly across the world. While we focused on attributes with a globally consistent representation, a classifier for a specific purpose (e.g. a specific destination) could be additionally fine-tuned to unique aspects that we could not include, e.g. a hammock in a jungle might represent accommodation. Having explored the measurement of a destination's brand 'in and of itself' as well as comparing different brands, we are interested in extending this work by correlating measured destination brands with other metrics of marketing success, e.g. social media engagement, visitor numbers or overnight stays. We hope that future research can use the measurement of the destination brand as a starting point to ask fundamental questions such as "is my brand distinct from my competitors?" or "what is the optimal branding to maximise my marketing success and gain visitors?".
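For readers who wish to apply the classifier to their own photo sets, the following is a minimal usage sketch assuming the HuggingFace transformers library; the model identifier is a placeholder, since the actual repository name is given only in the footnote.

```python
# Sketch: classify a single photo with the published classifier.
from transformers import pipeline

classifier = pipeline("image-classification",
                      model="<huggingface-repo-of-the-classifier>")  # placeholder id
predictions = classifier("my_destination_photo.jpg")   # local path or URL
best = max(predictions, key=lambda p: p["score"])
if best["score"] >= 0.5:            # same confidence threshold as in Sect. 4
    print(best["label"], best["score"])
```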