Understanding tourists’ urban images with geotagged photos using convolutional neural networks

This study aims to identify the representative images and elements of sightseeing attractions by applying image mining to the photos uploaded to Flickr by tourists in Seoul. For this purpose, we crawled the photos uploaded to Flickr and classified users into residents and tourists; derived 11 regions of attraction (RoA) in Seoul by analyzing the spatial density of the photos; classified the photos into 1000 categories with the Inception v3 model and then grouped those categories into 14 categories; and analyzed the characteristics of the photo images by RoA. The key findings are that tourists are interested in old palaces, historical monuments, stores, food, etc., and that these key elements differ among the major sightseeing attractions in Seoul. More specifically, tourists are more interested in palaces and cultural assets in Jongno and Namsan; food and restaurants in Shinchon, Hongdae, Itaewon, Yeouido, Garosu-gil, and Apgujeong; war monuments or specific artifacts in the War Memorial and the National Museum of Korea; facilities, temples, and pictures of cultural properties in Samsung Station; and toyshops in Jamsil. This study contributes in three ways. First, it analyzes the urban image through the photos posted on SNS by tourists. Second, it uses a deep learning technique to analyze the photos. Third, it classifies and analyzes all the photos posted by Seoul tourists, whereas most other studies focus only on specific objects. However, this study is limited in that the Inception v3 model used here is a pre-trained model created by training on ImageNet data. Future research needs to define photo categories according to the purpose of tourism and retrain the model with a new training data set that focuses on elements of Korea.


Introduction
Today people readily share posts such as texts, images, and videos with others via social network services (SNS), regardless of time and location. Moreover, the geotagged photos that tourists upload to such sites reveal tourists' perceptions and behavior as well as the images they hold of sightseeing attractions [1]. As the images of tourist sites are closely associated with tourists' attraction and intention, they serve as a reference for other tourists who plan to travel to those sites [2]. In addition, because tourist images on SNS are continually produced and reproduced, analyzing the images uploaded to SNS makes it possible to ascertain perceptions and trends regarding representative sightseeing elements and locations. Furthermore, this process contributes to basic tourism research on discovering, developing, and improving sightseeing attractions [3].
Because geotagged photos contain locational information, we consider it possible to conduct in-depth analysis by combining the extracted information with existing methods of spatial data analysis. Flickr data are especially useful because location and time are recorded automatically in the photo metadata. Previous studies that utilized geotagged SNS data have mostly explored the locations that users occupied [4][5][6], patterns of movement [7,8], and the texts attached to uploaded photos [9][10][11][12][13][14][15][16]. However, as image analysis using deep learning has become available, studies using the photos posted on SNS have been increasing. Examples include the classification of food [11], a comparison of bird observations between experts and ordinary people [17], and the estimation of weather preferences for visiting specific places [18]. Most of these studies focus on photos that contain specific objects. No study has analyzed tourists' image of an area by classifying all the photos posted by the tourists who visit that area.
The purpose of this study is to identify representative images and elements of sightseeing attractions by analyzing the photos uploaded to Flickr by tourists in Seoul. For this purpose, first, we crawled the photos uploaded to Flickr, an SNS platform on which people can share geotagged photos, and classified users into residents and tourists. Second, we derived 11 regions of attraction (RoA) in Seoul by analyzing the spatial density of the photos uploaded by tourists. Third, we classified the photos into 1000 categories using the Inception v3 model, a convolutional neural network (CNN) used in deep learning, and then grouped those categories into 14 categories. Finally, we analyzed the characteristics of the photo images by RoA.

Research on image data mining via convolutional neural networks
Image data mining is the process of extracting information or knowledge from image data [19]. Recently, with the increase in the volume of image data and the improvement of training algorithms, image data mining techniques based on artificial neural networks have been applied to various fields such as medicine, environmental studies, information science, and computer graphics [20]. The convolutional neural network (CNN), a type of artificial neural network, was developed on the basis of neurological knowledge about the visual cortex of humans and animals [21]. As CNNs have proven effective in distinguishing and categorizing photo images, they are now used in most image data mining research. A CNN is basically composed of three types of layers: convolutional layers, pooling layers, and fully connected layers. One can produce a variety of models by changing the CNN configuration and train the CNN by having it scan the features of images.
Research on classifying images by category with CNNs has been actively conducted in the field of medicine. Krishnan et al. [22] categorized liver diseases appearing in ultrasound images. Sawant et al. [23] detected brain cancer in MRI scans, and Motlagh et al. [24] identified breast cancer in images of histopathological samples. CNNs have also been applied in other fields of image mining. Park and Shim [25] built a model that discerns genre from movie posters, inspired by the idea that elements such as title font and chroma can differ according to the genre of the movie. Lee and Lee [26] created a model that recognizes the characters in the animation 'The Simpsons', and Xu et al. [27] classified geotagged land images by land cover.
Studies that have applied image data mining to images on SNS include the following. Kaneko and Yanai [12] detected photos of events such as festivals, sports games, earthquakes, and fires by analyzing geotagged photos on Twitter. Deng et al. [16] compared the images of Shanghai held by tourists from the East and the West through the tags of photos uploaded to Flickr. These studies did not analyze the images themselves but categorized them through the tags attached to the photos. Meanwhile, Okuyama and Yanai [14] selected representative images of designated locations after extracting the locations from photos uploaded to Flickr. That study applied the Speeded-Up Robust Features (SURF) technique, one of various image data mining techniques. SURF is good at recognizing a specific object because it extracts and matches the features of an image. However, it has been reported to have difficulty recognizing photographic backgrounds and to yield low classification accuracy [9,28].
Recently, CNNs have become the main method for analyzing photos posted on SNS. Jang and Cho [9] proposed a method of automatically extracting tags from images posted on Instagram. Hong and Shin [10] proposed a method of recommending users to follow (information providers) by categorizing the images posted by Instagram users and extracting the categories with the largest numbers of uploaded images. Kagaya and Aizawa [11] distinguished the images that actually contained food from those that did not among the photos returned by a search for ''#food'' on Instagram. Koylu et al. [17] analyzed the spatial distribution of bird photos posted on Flickr, comparing photos uploaded by ordinary people with those uploaded by bird experts, and found differences in how the two groups observe birds. Chu et al. [18] estimated the weather at the time a user visited each landmark from the appearance of the sky or clouds in landmark photos posted on Flickr. Studies that use CNNs to analyze photos posted on SNS thus continue to increase. However, while many studies focus on specific objects, only a few analyze the image of a region by classifying all the photos posted on SNS.

Data collection and extraction of RoA
We crawled photos from Flickr using its open API within the spatial range of 37.4°-37.8° latitude and 126.8°-127.2° longitude, which covers Seoul, and the temporal range from January 1, 2015 to December 31, 2017. We collected a total of 86,304 photos uploaded by 1974 users. We then distinguished residents from tourists among the 1974 users. We divided them into 868 users who had filled in the owner location in the Flickr metadata and 1106 users who either had not filled it in or whose location was difficult to determine accurately. Of the 868 users, 689 were classified as tourists and 179 as residents of Seoul. Of the 1106 users, 319 were classified as residents of Seoul and 787 as tourists. If the time difference between the first and the last photograph taken in the Seoul area during the study period exceeds 30 days, the user is categorized as a resident; otherwise the user is categorized as a tourist. As a result, a total of 1476 users were identified as tourists: 689 who had filled in their location and 787 who had not. Finally, we analyzed the image of Seoul based on a total of 39,157 photos uploaded to Flickr by these 1476 tourists [29].
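The 30-day rule stated above can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the text, not the authors' script; the function names, the record layout of (user id, time taken) pairs, and the sample dates are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Threshold from the text: a span of MORE than 30 days marks a resident.
STAY_THRESHOLD = timedelta(days=30)


def classify_user(taken_times):
    """Classify one user from the timestamps of their photos taken in Seoul."""
    span = max(taken_times) - min(taken_times)
    return "resident" if span > STAY_THRESHOLD else "tourist"


def classify_users(photos):
    """Group (user_id, taken_time) records by user and classify each user."""
    by_user = defaultdict(list)
    for user_id, taken_time in photos:
        by_user[user_id].append(taken_time)
    return {user_id: classify_user(times) for user_id, times in by_user.items()}


# Hypothetical records for illustration only.
sample = [
    ("user_a", datetime(2016, 5, 1)),
    ("user_a", datetime(2016, 5, 4)),
    ("user_b", datetime(2015, 1, 10)),
    ("user_b", datetime(2017, 3, 2)),
]
print(classify_users(sample))
```

A span of exactly 30 days is treated as a tourist here, because the text says the difference must exceed 30 days for a user to count as a resident.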
Based on the collected data, we extracted RoA from the 39,157 Flickr photos uploaded by tourists using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. DBSCAN is a density-based clustering algorithm that can identify clusters of any shape in a data set containing noise and outliers [30]. It groups together points that are closely packed and marks as outliers the points that lie alone in low-density regions. DBSCAN requires two parameters: the search radius ε (eps) and the minimum number of points required to form a dense region. To find the optimal combination, we experimented with various pairs of parameters. The minimum number of points was set between 200 and 2000 and the search radius between 300 and 1000 m. As a result of the experiment, 11 RoA were derived by adopting the combination of a minimum number of points of 350 and a search radius of 250 m, which appropriately include the major tourist attractions in Seoul [29]. We named the RoA by referring to ''The survey of the current state of foreign tourists'' conducted by the Korea Tourism Organization in 2017 [31]. We found that the 11 RoA derived in this study coincide with the major tourist attractions in Seoul reported by that survey. Table 1 and Fig. 1 show the information on each RoA, and Fig. 2 illustrates the analytical method and procedure of this study.
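To make the roles of the two DBSCAN parameters concrete, the following is a minimal pure-Python sketch of the algorithm on (latitude, longitude) points. It is not the implementation used in the study: the haversine distance, the function names, and the toy coordinates are assumptions for illustration, and the quadratic neighbour search is only suitable for small samples.

```python
import math
from collections import deque

EARTH_RADIUS_M = 6_371_000.0
NOISE = -1


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Brute-force radius query; the point itself counts as a neighbour.
        return [j for j, q in enumerate(points)
                if haversine_m(points[i], q) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels


# Toy coordinates (hypothetical): two dense groups and one isolated photo.
photos = [
    (37.5796, 126.9770), (37.5800, 126.9772), (37.5793, 126.9768),
    (37.5512, 126.9882), (37.5515, 126.9880), (37.5510, 126.9885),
    (37.7000, 127.1000),
]
print(dbscan(photos, eps_m=250, min_pts=3))
```

With the study's reported setting, the call would use eps_m=250 and min_pts=350 on the full set of tourist photos; the toy example lowers min_pts to 3 so that the two small groups form clusters.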

Method of image data mining
We conducted image data mining with 38,691 photos among 39,156 photos posted by 1476 users, excluding 465 photos deleted by users. For the analysis we used Python version 3.6 and TensorFlow, an open-source machine learning library. We applied the Inception v3 model of GoogLeNet, one of various CNN models, to the photo data mining. Inception v3 is a model ''trained'' with ImageNet's image data set, which comprises 14,197,122 images divided into 1000 categories [32]. The images in ImageNet are divided into 27 primary categories and 1000 secondary categories. When an image is classified with the Inception v3 model, the model returns the name of the category that most closely resembles the input image among the 1000 categories, together with its accuracy value. Besides GoogLeNet, CNN architectures such as LeNet-5, AlexNet, and ResNet are variations on the basic structure of neural networks. The Inception module, a subnetwork included in GoogLeNet, has a deep structure and allows GoogLeNet to use parameters more effectively than other models [20]. Among the GoogLeNet models that use the Inception module, the Inception v3 model provides low error rates and its source code is widely available.
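The classification step described above, in which the model returns the best-matching category and its score, might look roughly as follows. This sketch assumes the Keras application interface bundled with current TensorFlow releases, which is not necessarily what the authors ran under Python 3.6; the function names and the file path in the usage note are hypothetical.

```python
def best_prediction(decoded):
    """Pick the (class_name, score) pair with the highest score.

    `decoded` is a list of (wordnet_id, class_name, score) triples, the
    per-image shape returned by Keras' decode_predictions.
    """
    _, name, score = max(decoded, key=lambda triple: triple[2])
    return name, float(score)


def classify_image(path, model=None):
    """Classify one image file with an ImageNet-pretrained Inception v3.

    TensorFlow is imported lazily and is assumed to be installed; using the
    Keras application API here is an assumption, not the authors' code.
    """
    import tensorflow as tf

    iv3 = tf.keras.applications.inception_v3
    if model is None:
        model = iv3.InceptionV3(weights="imagenet")  # 1000 output classes
    # Inception v3 expects 299 x 299 RGB input.
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    batch = tf.keras.utils.img_to_array(img)[None, ...]  # add batch axis
    preds = model.predict(iv3.preprocess_input(batch))
    return best_prediction(iv3.decode_predictions(preds, top=5)[0])
```

A call such as classify_image("photo.bmp") would return a pair like a category name and its score; when many photos are processed, passing one preloaded model avoids rebuilding the network for every image.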
As the Inception v3 model runs on TensorFlow, the photos must be pre-processed into an appropriate format before analysis. As the data crawled through Flickr's API are image URLs, we downloaded the images in BMP format and then converted them to 299 × 299 RGB, the input size used by the Inception v3 model. It is not easy to derive meaning by comparing 1000 categories when each image is assigned to one of them. Moreover, the 27 primary categories in ImageNet were not easily applicable to tourism. Given these constraints, we generated 14 new categories suitable for the field of tourism (Table 2).

Results of analysis

Image of Seoul
By classifying the 38,691 photos uploaded by Seoul tourists, we obtained 858 of the 1000 ImageNet categories. The categories with a proportion of 1% or above among the 858 categories are shown in Fig. 3. When looking at the category into details, there are and neon signage decorating the outside of buildings. The ''prison'' and ''monastery'' categories include images of crowded residential areas, museums, and the like. ''Lakeside,'' on the other hand, includes images of natural views with not only lakes or rivers but also trees or sky, while ''pier'' includes images of rivers, streams, ponds, college campuses, and so on. In sum, we can deduce that tourists perceive Seoul as a city of palaces, food, buildings, and facilities. Table 3 shows the results of assigning the 858 categories to the 14 primary categories for analysis by subject. Figure 4 shows the secondary categories extracted from the top five primary categories. We can see that tourists who visit Seoul are generally interested in palaces, historical monuments, cultural properties, objects, food, facilities, natural views, and flora and fauna. More specifically, within ''palace/historical monuments/cultural properties'', ''palace'' and ''bell cote'' contain images of palaces, tile-roofed houses, and Korean-style houses. ''Patio and terrace'' contains images of courtyards and ''tile roof'' contains images of rafters. From this we can deduce that a good number of tourists seem to consider palaces and traditional houses to be representative images of Seoul. ''Umbrella'', a subcategory of ''objects/miscellaneous'', includes images of not only actual umbrellas but also silhouettes that resemble the shape of an umbrella. Similarly, ''tray'' contains not only images of food on trays but also images of objects that resemble a tray.
''Book jacket'' has images of historical monuments and exhibits. As mentioned before, this is probably due to the lack of adequate categories for the images taken by tourists. ''Plate'', a subcategory of ''food'', has numerous images of food such as traditional Korean cuisine and sashimi. ''Restaurant'' has images of restaurants and coffee shops. ''Food market'' has images of big supermarkets and traditional street markets. ''Hot pot'' has images of food such as rice cake in hot sauce, soups, and teppanyaki. ''Pier'', a subcategory of ''facilities'', contains images of Cheonggye Stream and the ECC building of Ewha Womans University. ''Planetarium'' contains images of landmarks such as Dongdaemun Design Plaza, while the subcategories of ''natural views/flora and fauna'' contain mostly images of the sky, the Han River, and mountains.
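The two tabulations used in this section, the share of each ImageNet category and the regrouping into primary categories, can be sketched as below. The mapping covers only the handful of categories named in the text and is therefore a partial, illustrative stand-in for Table 2; the function names and the sample labels are assumptions.

```python
from collections import Counter

# Partial mapping taken from the categories named in the text; the full
# assignment of 858 categories to 14 primary categories is not reproduced.
PRIMARY_OF = {
    "palace": "palace/historical monuments/cultural properties",
    "bell cote": "palace/historical monuments/cultural properties",
    "tile roof": "palace/historical monuments/cultural properties",
    "umbrella": "objects/miscellaneous",
    "tray": "objects/miscellaneous",
    "book jacket": "objects/miscellaneous",
    "plate": "food",
    "restaurant": "food",
    "hot pot": "food",
    "pier": "facilities",
    "planetarium": "facilities",
}


def category_shares(labels, min_share=0.0):
    """Share of each label among all photos, keeping shares >= min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total
            for label, n in counts.most_common()
            if n / total >= min_share}


def to_primary(labels, mapping=PRIMARY_OF, default="unmapped"):
    """Replace each ImageNet label by its primary category."""
    return [mapping.get(label, default) for label in labels]


# Hypothetical top-1 labels for ten photos.
labels = ["palace"] * 5 + ["plate"] * 3 + ["pier"] * 2
print(category_shares(labels, min_share=0.25))
print(category_shares(to_primary(labels)))
```

Setting min_share=0.01 on the full label list would correspond to the 1% cut-off used for Fig. 3.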

Comparison of image by RoA
We categorized the photos by the 11 RoA in Seoul to compare their different characteristics (Table 4). More specifically, there were photos of temples in Samsung Station, Bongeunsa Temple, and Coex Mall. The photos of Jamsil, Gangnam Station, and Garosu-gil/Apgujeong included ponds and amusement parks, urban scape, and food, respectively. Meanwhile, the photos of Yeouido included not only food and restaurants but also the Han River. Figure 5 shows the results of assigning the 858 categories to the 14 primary categories for every RoA. We can see that the tourists who visit Jongno, Namsan, the War Memorial of Korea, and the National Museum of Korea are usually interested in ''palace/historical monuments/cultural properties,'' ''facilities,'' and ''objects/miscellaneous.'' As the images for the National Museum of Korea categorized as ''objects/miscellaneous'' are mostly of historical monuments or cultural properties, we can see that the tourists who visit these areas share images of palaces, historical monuments, and cultural properties. Meanwhile, the tourists who visit Shinchon, Hongdae, Itaewon, Gangnam Station, Garosu-gil, and Apgujeong have images of ''food''; those who visit Samsung Station, Bongeunsa Temple, Coex Mall, Jamsil, and Yeouido have images of ''facilities''; and those who visit Garosu-gil, Jamsil, Gangnam Station, Itaewon, Shinchon, Hongdae, and Apgujeong have images of ''shopping''. While the images of Gangnam Station are related to ''urban scape,'' the images of Jongno, Namsan, Samsung Station, Bongeunsa Temple, Coex Mall, and Yeouido are related to ''natural views/flora and fauna.'' Figure 6 shows a map with the 14 primary categories and representative photos of the 11 RoA.
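A comparison of this kind amounts to a cross-tabulation of RoA against primary category. The sketch below shows one way to compute per-RoA shares and the dominant category of each RoA; the function names and the sample records are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def shares_by_roa(records):
    """Per-RoA share of each primary category.

    `records` is an iterable of (roa_name, primary_category) pairs,
    one pair per photo.
    """
    counts = defaultdict(Counter)
    for roa, category in records:
        counts[roa][category] += 1
    return {
        roa: {cat: n / sum(ctr.values()) for cat, n in ctr.items()}
        for roa, ctr in counts.items()
    }


def dominant_category(records):
    """Primary category with the largest share in each RoA."""
    return {roa: max(shares, key=shares.get)
            for roa, shares in shares_by_roa(records).items()}


# Hypothetical photo records for two RoA.
records = [
    ("Jongno", "palace/historical monuments/cultural properties"),
    ("Jongno", "palace/historical monuments/cultural properties"),
    ("Jongno", "facilities"),
    ("Hongdae", "food"),
    ("Hongdae", "food"),
    ("Hongdae", "food"),
    ("Hongdae", "shopping"),
]
print(dominant_category(records))
```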

Accuracy assessment
The Inception v3 model is a pre-trained model trained on the 14,197,122 images of ImageNet. Out of the total 38,691 photos, 10,807 photos matched the manual labeling and 27,884 photos did not; that is, the overall accuracy ratio was about 27.93%. Table 5 shows the accuracy ratio between the Inception v3 model and manual labeling for the categories that account for more than 1% of the photos. The highest matching categories are 'plate', 'tile roof', 'restaurant', and 'hot pot', while the lowest matching categories are 'monastery', 'prison', 'bell cote', and 'movie theater'. For 'plate', 'tile roof', 'restaurant', and 'hot pot', the accuracy ratio is high because these objects differ little from country to country. On the other hand, for 'monastery', 'prison', 'bell cote', and 'movie theater', the accuracy ratio is low because building types differ by country and culture. Figure 7 shows an example of the category 'palace', which has a high matching ratio. While the photos of Gyeongbok Palace, Changdeok Palace, and Deoksugung are correctly classified as 'palace', the photos of the Yongsan War Memorial, a university building, Hanok, a residential area in front of Cheonggyecheon, and a pavilion are misclassified as 'palace'. Western-style buildings in Korea were misclassified as 'palace' because the model is trained on ImageNet data. Figure 8 shows examples of photos classified as 'pier'. The photos of Han River Park, Hangang Bridge, and Cheonggyecheon are correctly classified as 'pier', while the Ewha Womans University ECC building, the Jongno Tower cloud building, a Myeongdong shopping mall building, and the Dongdaemun Design Plaza building are misclassified as 'pier'. If the sky or the base of a building takes up a large part of the photo, the photo is misclassified as 'pier' because that area can be recognized as a river.
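The overall ratio reported above follows directly from the two counts, and the per-category ratios in Table 5 can be computed the same way. The sketch below reproduces the overall figure and shows a per-category tally; representing each photo as a (predicted category, matched) pair is an assumption about how the manual check was recorded, and the sample pairs are hypothetical.

```python
from collections import defaultdict


def accuracy_ratio(matched, total):
    """Percentage of photos whose model label matched the manual label."""
    return 100.0 * matched / total


def per_category_accuracy(checks):
    """Matching ratio (0..1) per predicted category.

    `checks` is an iterable of (predicted_category, matched) pairs, where
    `matched` is True when the manual label agrees with the model.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, matched in checks:
        totals[category] += 1
        hits[category] += bool(matched)
    return {category: hits[category] / totals[category] for category in totals}


# Counts reported in the text: 10,807 matched out of 38,691 photos.
print(round(accuracy_ratio(10_807, 38_691), 2))

# Hypothetical manual checks for illustration.
checks = [("plate", True), ("plate", True), ("plate", False), ("prison", False)]
print(per_category_accuracy(checks))
```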
Through the accuracy evaluation process, we could find some implications. First, it is necessary to classify the photos posted by tourists for the purpose of tourism. Because 1000 categories in ImageNet are not intended for

Conclusion
In this study we aimed to analyze tourists' images of Seoul by analyzing the photos uploaded to Flickr by tourists in Seoul. We found that tourists have a strong image of palaces, historical monuments, traditional food, restaurants, etc. These characteristics differ from one RoA to another. The images that tourists hold of Jongno and Namsan are palaces and cultural properties, while the images of Shinchon, Hongdae, Itaewon, Yeouido, Garosu-gil, and Apgujeong are food and restaurants. The images of the War Memorial of Korea and the National Museum of Korea are the monuments that can be photographed on site and the artifacts displayed in the museum. Moreover, the images of Samsung Station are a combination of facilities, temples, and cultural properties, while the images of Jamsil are toyshops and amusement parks. This study contributes in three ways. First, it analyzes the urban image with the photos posted on SNS by tourists. Second, it uses a deep learning technique to analyze the photos. Third, it classifies and analyzes all the photos posted by Seoul tourists, whereas most studies focus only on specific objects.
On the other hand, this study also points to topics that need further research. The Inception v3 model applied in this study has a limitation because it is a model pre-trained on ImageNet's data set, which does not reflect Korea's characteristics. It was not possible to accurately categorize certain iconic Korean landmarks and items, such as Namsan Tower, Dongdaemun Design Plaza, and Hanbok, that are not widely known in the world. The photos related to palaces and Hanok villages were scattered across categories such as 'palace', 'bell cote', and 'terrace'. Future research needs to create a training data set from the photos posted by tourists who visit Seoul, retrain the Inception v3 model on it, and define categories suited to the purpose of tourism.