1 Introduction

With the advent of geo-referenced digital photography, photos are now associated with metadata describing the geographic location at which they were taken. The rapid growth of content sharing websites has resulted in a large volume of location data stored on social media websites. Geo-location specific multimedia content is accompanied by user generated content (UGC) such as titles, tags, comments and descriptions. These annotations take the form of free keywords, also known as social tags, through which users express their personal opinions when describing items. Social annotations can therefore be used to understand users' interests and the context of an image.

Various photo sharing websites allow users to act as owner, tagger or commentator, or to mark community contributed images as favorites. They also allow users to interact and collaborate (share, chat) with each other in a social media dialog. The typical structure of photo sharing social media involves three types of interrelated entities: user, image and tag. The largest repositories of user location histories are in fact photo sharing websites like Flickr. Viewing and interacting with such collections has broad social and practical importance. However, these collections are inherently difficult to organize, browse and search because of their size and the inability of computers to understand the content of the images. This has led to interesting tasks such as geographically organizing photos and location visualization [3, 5, 8-10]. Geo-referenced photos can be organized geographically, but the collection must also be filtered, sorted and summarized.

The work proposed here addresses the task of organizing geo-referenced media on Flickr to generate a visual and thematic summarization of a specified geo-location. Content and context information of geo-location images and a modified PLSA algorithm are used for location visualization and theme generation. Content information for images is obtained using visual analysis. Context information for geo-location images is generated using users' potential annotations. To improve the underlying associations between images and tags, user affinity, image affinity and tag affinity are modeled as graphs that represent the intra-relations among them. Multiple intra-relations and interrelations among User, Image and Tag are incorporated into a third-order tensor, and a tensor factorization framework [4, 7, 11] is used for tag refinement. We introduce the problem of location visualization using content, context and geo-reference information of social images. The unstructured and unrestricted community contributed images and annotations are utilized to generate knowledge, and annotation refinement, i.e. recommendation of tags for untagged geo-referenced images, is achieved. We propose a modified PLSA model to discover location themes by combining the textual and visual content of images. To summarize, the major contributions include:

  • An outline of a new approach for generating a visual and textual summarization of a geo-location using content, context and geo-reference information of social images (Sect. 3).

  • An implementation of an algorithm for refining image-tag associations by tag recommendation, using multiple inter- and intra-relations of U, I, T and HOSVD (Sect. 4).

  • An implementation of a modified (extended) PLSA for generating image clusters. The additional conditions to be satisfied by the extended PLSA are visual cluster connectivity and user coverage (Sect. 5).

  • Tag cleaning, tag scoring and tag selection steps that select representative tags and demote personal and general tags (Sect. 5).

  • An evaluation of the geo-location summarization algorithm, including a comparison with baseline methods (Sect. 6).

2 Related Work

The rich User Generated Content (UGC) on social sharing websites has opened up great possibilities for novel multimedia research and applications. Geo-referenced multimedia data mining is an attractive and important field of research. Recent research and applications on online geo-referenced media [6] can be grouped into classes such as: (a) organizing media resources geographically, (b) social knowledge extraction from geo-referenced media, (c) learning landmarks in the world, and (d) estimating the geographic location of a photo.

A visual summary generated by any algorithm should contain a representative and diverse set of images. The most intuitive approach to organizing social media resources is to perform image clustering and then select a representative image from each cluster. State-of-the-art clustering approaches can be divided into several categories according to the resources they use: visual content, text associated with images, automatically generated metadata such as geo-tags, users' activity statistics, or a combination of these resources.

Rudinac et al. [10] developed an approach for automatic visual summarization of a geographic area that exploits user contributed images and related explicit and implicit metadata. It is based on the random walk with restarts over a graph that models relations between images, visual features extracted from them, associated text, as well as the information on the uploader and commentators.

Kennedy et al. [5] proposed a multimodal approach for selecting diverse and representative image search results for landmarks. They also rely on both the visual information in the images and the user-contributed tags for these images. The two step method extracts the representative tags first, which are further utilized to automatically find place and event semantics.

Fang et al. [9] investigated the problem of location visualization from multiple themes. They identify highly photographed places (POIs) and discover their distributed themes.

Hao et al. [8] proposed an approach for location overview generation, which first mines location representative tags from travelogues and then uses these tags to retrieve images from the web.

Despite its importance and benefits, social tagging suffers from reduced recommendation accuracy. The reasons are: (a) Free nature: the assigned tags are free keywords. They are subject to multiple interpretations, which results in polysemy and synonymy problems. (b) Cold start or tag sparsity: users often participate only rarely in the tagging process, which results in untagged photos. (c) Visual-textual relevance: it is unclear how to interpret the relevance of a user contributed tag with respect to the visual content. (d) Ternary relation: social tagging data forms a ternary relation between users, resources and tags.

In order to capture the ternary relation among users, tags and items in social tagging systems, previous methods like [7] focused on generating recommendations based on tensor factorization (TF) techniques. Such methods are able to (1) solve problems like polysemy and synonymy, (2) preserve the ternary relation, (3) reveal the latent associations among users, tags and items, and (4) provide more accurate recommendations. However, these methods do not solve the sparsity problem or the cold start problem. To deal with these problems, multiple inter- and intra-relations among U, I and T are utilized by the method proposed in [4].

Fig. 1. System architecture

3 System Architecture

Figure 1 outlines the framework of the system. The input to the first module is a location, which is specified by its geo-tag. The set of images taken at the given geo-location and the associated metadata are used as input to the system. In the second module, the interrelations and multiple intra-relations among U, I and T are incorporated into a tensor factorization framework for tag refinement, and potential annotations for images are generated. In the third module, context and content information of the images is incorporated into a probabilistic generative model to generate a visual summary of the location. Unlike the standard probabilistic model, the extended PLSA proposed here combines image-tag associations, location context information represented by annotations, and image content information to generate themes. Finally, representative tags are selected for each visual theme generated in the third module.

4 Image Tag Refinement Using HOSVD

4.1 HOSVD Algorithm Steps

Let U, I and T denote the sets of users, images and tags. The set of observed tagging data is denoted as \(O \subset {U \times I \times T}\). The ternary interrelations then constitute a three-dimensional tensor \(Y \in \mathbb {R}^{\left| U \right| \times \left| I \right| \times \left| T \right| }\), which is defined as:

$$\begin{aligned} {Y}_{u,i,t}=\left\{ \begin{array}{ll} 1 &{} \text {if } (u,i,t) \in O\\ 0 &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(1)
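As an illustration of Eq. (1), the following minimal sketch builds the binary tagging tensor from a list of observed triplets. It assumes that users, images and tags have already been mapped to integer indices; the function name and the toy data are hypothetical and not part of the described system.

```python
import numpy as np

def build_tagging_tensor(triplets, n_users, n_images, n_tags):
    """Return the binary tensor Y of Eq. (1) from observed (user, image, tag) index triplets."""
    Y = np.zeros((n_users, n_images, n_tags), dtype=np.float64)
    for u, i, t in triplets:
        Y[u, i, t] = 1.0  # (u, i, t) belongs to the observed set O
    return Y

# Hypothetical toy data: 3 users, 2 images, 3 tags.
observed = [(0, 0, 1), (1, 0, 1), (2, 1, 0), (2, 1, 2)]
Y = build_tagging_tensor(observed, n_users=3, n_images=2, n_tags=3)
```

For realistic collections the tensor is extremely sparse, so a sparse representation would probably be preferable to the dense array used here.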

To jointly model the three factors of user, image, and tag, we employ the general tensor factorization model, Tucker decomposition. In Tucker Decomposition [4, 7, 11], the tagging data Y are estimated by three low rank matrices and one core tensor.

$$\begin{aligned} \hat{Y}=C \times _{u} U \times _{i} I \times _{t} T \end{aligned}$$
(2)

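where \(\times _n\) denotes the mode-n product of a tensor with a matrix.

The core tensor C governs the interactions among the user, image and tag entities. In HOSVD it is obtained by projecting Y onto the \(c_n\) leading left singular vectors \({U_{cn}}^{(n)}\) of each mode-n unfolding of Y: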

$$\begin{aligned} C = Y\times _{1}{{{U}_{c1}}^{(1)}}^{T}\times _{2} {{{U}_{c2}}^{(2)}}^{T}\times _{3} {{{U}_{c3}}^{(3)}}^{T} \end{aligned}$$
(3)

The tensor decomposition problem is reduced to minimizing a point-wise loss on \(\hat{Y}\), defined as:

$$\begin{aligned} \min _{U,I,T,C}\sum _{(u,i,t)\in U \times I \times T} (\hat{Y}_{u,i,t}-Y_{u,i,t})^{2} \end{aligned}$$
(4)

where, Y is the original observed tagging data and \(\hat{Y}\) is the result of tensor factorization.
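The decomposition in Eqs. (2) and (3) can be sketched with a truncated HOSVD in NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the helper names are hypothetical, the ranks are chosen arbitrarily, and a truncated HOSVD only approximates the minimizer of Eq. (4).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows index the chosen mode, columns all remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Mode-n product: multiply `matrix` onto the given mode of `tensor`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(Y, ranks):
    """Return (core, factors); factors[n] holds the leading left singular vectors of mode n."""
    factors = []
    for mode, rank in enumerate(ranks):
        left, _, _ = np.linalg.svd(unfold(Y, mode), full_matrices=False)
        factors.append(left[:, :rank])
    core = Y
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)   # projection of Eq. (3)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the estimate Y_hat of Eq. (2) from the core tensor and factor matrices."""
    Y_hat = core
    for mode, factor in enumerate(factors):
        Y_hat = mode_product(Y_hat, factor, mode)
    return Y_hat

# Hypothetical dense test tensor (4 users, 3 images, 5 tags).
rng = np.random.default_rng(0)
Y = rng.random((4, 3, 5))
core, factors = truncated_hosvd(Y, ranks=(2, 2, 2))
Y_hat = reconstruct(core, factors)
```

Entries of `Y_hat` that are large for unobserved triplets can then be read as candidate tag recommendations; how the ranks are chosen is left open here.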

4.2 Multiple Intra Relations of User, Image and Tag

To handle the sparsity problem and to increase the quality of recommendations, multiple intra-relations [4] between U, I and T are incorporated into the tensor factorization. These intra-relations are modeled as graphs, on which a graph based clustering algorithm is then applied. The tag refinement problem addressed here is divided into two steps: incorporation of the intra-relations of U, I and T into the tensor, and tensor factorization using Tucker decomposition.

User Affinity. Each image is owned, shared, annotated or commented on by users. For example, if an image owned by user \(U_1\) is commented on by user \(U_2\), then \(U_1\) and \(U_2\) are associated. Therefore, we measure the affinity between two users using the number of images shared, annotated, commented on and liked or marked as favorite by them.

$$\begin{aligned} {{Sim}^{U}}_{i,j}=\frac{n({u}_{i},{u}_{j})}{n({u}_{i})+n({u}_{j})} \end{aligned}$$
(5)
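A minimal sketch of Eq. (5) follows. It assumes, as one possible reading of the definition above, that \(n(u_i)\) is the number of images user \(u_i\) has interacted with and \(n(u_i,u_j)\) the number of images both users have interacted with; the function name and data are hypothetical.

```python
def user_affinity(images_by_user):
    """Pairwise affinity of Eq. (5); images_by_user maps a user id to a set of image ids."""
    users = sorted(images_by_user)
    sim = {}
    for a in users:
        for b in users:
            if a == b:
                continue
            shared = len(images_by_user[a] & images_by_user[b])        # n(u_i, u_j)
            total = len(images_by_user[a]) + len(images_by_user[b])    # n(u_i) + n(u_j)
            sim[(a, b)] = shared / total if total else 0.0
    return sim

# Hypothetical interaction sets (owned, commented on, or marked as favorite).
interactions = {"u1": {"img1", "img2"}, "u2": {"img2", "img3"}, "u3": {"img4"}}
affinity = user_affinity(interactions)
```

Under this reading the affinity is at most 0.5, which is reached when two users interacted with exactly the same images.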

Image Affinity. For each image we extract three types of features: an 81-dimensional global color feature, a 120-dimensional texture feature and a 512-dimensional GIST feature [1]. The visual similarity between two images is defined using an RBF kernel.

$$\begin{aligned} {{Sim}^{I}}_{i,j}=\exp (-\frac{{\parallel {x}_{i}-{x}_{j}\parallel }^{2}}{2{\sigma }^{2} }) \end{aligned}$$
(6)
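The kernel of Eq. (6) can be computed for all image pairs at once. The sketch below assumes each image is already described by a single feature vector; the 2-D vectors and the value of \(\sigma\) are hypothetical stand-ins for the color, texture and GIST descriptors.

```python
import numpy as np

def rbf_affinity(features, sigma):
    """Pairwise RBF similarity of Eq. (6); `features` has one row per image."""
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Hypothetical feature vectors: images 0 and 2 are identical, image 1 is at distance 5.
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
S = rbf_affinity(X, sigma=5.0)
```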

Tag Affinity. The tag affinity graph is constructed based on tag context and semantic relevance [4]. The context relevance of two tags is encoded by their weighted co-occurrence in the image collection. Tag semantic relevance information is obtained using WordNet. The contextual similarity between two tags \(t_i\) and \(t_j\) is computed as:

$$\begin{aligned} {{CSim}^{T}}_{i,j}=\frac{n({t}_{i},{t}_{j})}{n({t}_{i})+n({t}_{j})} \end{aligned}$$
(7)

The semantic similarity between two tags \(t_i\) and \(t_j\), represented as \({SSim}^{T}_{i,j}\), is computed using WordNet. If two tags share a semantic relation such as synonym, hypernym, hyponym, meronym or holonym, the words are considered related. Let \({\lambda }_{c}\) and \({\lambda }_{s}\) be the weights of context relevance and semantic relevance. The overall similarity \({Sim}^{T}_{i,j}\) between two tags \(t_i\) and \(t_j\) is then:

$$\begin{aligned} Sim_{i,j}^{T}=\lambda _{c}CSim_{i,j}^{T}+\lambda _{s}SSim_{i,j}^{T} \end{aligned}$$
(8)
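Eqs. (7) and (8) can be sketched as follows. The sketch uses plain (unweighted) co-occurrence counts from a binary image-by-tag matrix and takes the semantic similarity as a precomputed matrix; the WordNet lookup itself is not shown, and the equal weights are an arbitrary assumption.

```python
import numpy as np

def contextual_similarity(image_tag):
    """Eq. (7) computed from a binary image-by-tag matrix."""
    image_tag = np.asarray(image_tag, dtype=np.float64)
    co_occurrence = image_tag.T @ image_tag            # n(t_i, t_j)
    counts = np.diag(co_occurrence)                    # n(t_i)
    denom = counts[:, None] + counts[None, :]
    return np.divide(co_occurrence, denom,
                     out=np.zeros_like(co_occurrence), where=denom > 0)

def tag_affinity(image_tag, semantic_sim, lambda_c=0.5, lambda_s=0.5):
    """Eq. (8): weighted sum of contextual and semantic similarity."""
    return (lambda_c * contextual_similarity(image_tag)
            + lambda_s * np.asarray(semantic_sim, dtype=np.float64))

# Hypothetical data: 3 images, 3 tags; tags 1 and 2 are assumed to be WordNet-related.
image_tag = [[1, 1, 0],
             [1, 0, 1],
             [1, 1, 0]]
semantic = [[1, 0, 0],
            [0, 1, 1],
            [0, 1, 1]]
csim = contextual_similarity(image_tag)
sim_t = tag_affinity(image_tag, semantic)
```

In this toy example tags 1 and 2 never co-occur, yet they still receive a non-zero affinity through the semantic term, which is the effect the combination is meant to achieve.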

Regeneration of Tensor by Incorporating Multiple Intra Relations. The multiple intra-relations of user, image and tag are incorporated into the tensor factorization framework [2]. If there are 5 users, 4 images and 4 tags, the initial tensor has size \(5\times 4\times 4\), i.e. \(Y \in \mathbb {R}^{5\times 4\times 4}\). Suppose that, based on the tag affinity definition, the clustering result is \(\{T_1,T_2,T_3\}\) and \(\{T_4\}\). Then we regenerate the initial triplets in the form of tag clusters: \(CT_1=\{T_1,T_2,T_3\}\), \(CT_2=\{T_4\}\). Based on the regenerated triplets, the initial tensor is reconstructed as \(Y \in \mathbb {R}^{\left| U \right| \times \left| I \right| \times \left| CT \right| }\). Similarly, based on image affinity and user affinity, the initial tensor is reconstructed as \(Y \in \mathbb {R}^{\left| U \right| \times \left| CI \right| \times \left| T \right| }\) and \(Y \in \mathbb {R}^{\left| CU \right| \times \left| I \right| \times \left| T \right| }\), respectively.
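The regeneration step for the tag mode can be sketched as below, with the tensor modes ordered as in Eq. (1) (users, images, tags). The sketch assumes that a cluster entry is set to 1 whenever any tag of that cluster was observed; the function name and the three observations are hypothetical, while the sizes and clusters follow the example in the text.

```python
import numpy as np

def collapse_tag_mode(Y, tag_to_cluster, n_clusters):
    """Replace the tag mode of Y (users x images x tags) by tag clusters."""
    collapsed = np.zeros(Y.shape[:2] + (n_clusters,), dtype=Y.dtype)
    for tag, cluster in enumerate(tag_to_cluster):
        # A (user, image, cluster) entry is 1 if any tag of the cluster was observed.
        collapsed[:, :, cluster] = np.maximum(collapsed[:, :, cluster], Y[:, :, tag])
    return collapsed

# 5 users, 4 images, 4 tags; CT_1 = {T_1, T_2, T_3}, CT_2 = {T_4}.
Y = np.zeros((5, 4, 4))
Y[0, 0, 1] = 1.0   # user 0 tags image 0 with T_2
Y[0, 0, 2] = 1.0   # user 0 tags image 0 with T_3
Y[3, 2, 3] = 1.0   # user 3 tags image 2 with T_4
Y_ct = collapse_tag_mode(Y, tag_to_cluster=[0, 0, 0, 1], n_clusters=2)
```

The same idea applies to the image and user modes by collapsing along the corresponding axis.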

5 Geo-Location Summarization

5.1 Visual Cluster Generation

For document clustering and theme generation, we propose the use of a probabilistic generative model which extends standard PLSA [12]. Assume that, for a given geo-location, a set of N images \(D=\{d_{1}, d_{2}, \ldots , d_{N}\}\) is retrieved. Each image d is represented as a vector of occurrences of the words \(W=\{w_{1}, w_{2}, \ldots , w_{M}\}\), which are collected from the associated tags. By considering each image as a virtual document and the tags as terms, we obtain a Document-Term matrix. We apply PLSA to model the generation of location images and tag occurrences. The Document-Term matrix \(DT \in \mathbb {R}^{\left| I \right| \times \left| T \right| }\) is constructed as follows:

$${DT}_{i,j}=\left\{ \begin{array}{ll} 1 &{} \text {if tag } j \text { belongs to image } i\\ 0 &{} \text {otherwise} \end{array}\right. $$
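The Document-Term matrix can be built directly from the tag lists of the retrieved images. The following sketch uses hypothetical photo tags and orders the vocabulary alphabetically, which is an arbitrary choice.

```python
import numpy as np

def document_term_matrix(tags_per_image):
    """Binary image-by-tag matrix; returns (matrix, vocabulary)."""
    vocabulary = sorted({tag for tags in tags_per_image for tag in tags})
    column = {tag: j for j, tag in enumerate(vocabulary)}
    DT = np.zeros((len(tags_per_image), len(vocabulary)), dtype=np.int64)
    for i, tags in enumerate(tags_per_image):
        for tag in tags:
            DT[i, column[tag]] = 1  # tag j belongs to image i
    return DT, vocabulary

# Hypothetical tag lists for three photos.
photos = [["bridge", "river"], ["bridge", "night"], ["museum"]]
DT, vocab = document_term_matrix(photos)
```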

PLSA associates an unobserved class variable \(z \in Z=\{z_{1}, z_{2}, \ldots , z_{K}\}\) with each occurrence of a word \(w \in W=\{w_{1}, w_{2}, \ldots , w_{M}\}\) in a document \(d \in D=\{d_{1}, d_{2}, \ldots , d_{N}\}\).

Standard Expectation-Maximization Approach:

Expectation (E) Step: We compute the posterior probabilities of latent variables from the previous estimate of the model parameters (randomly initialized).

$$\begin{aligned} P(z_k|d_i,w_j ) = \frac{P(z_k )P(d_i |z_k )P(w_j |z_k )}{\sum _{t=1}^{K}P(z_t )P(d_i |z_t )P(w_j |z_t)} \end{aligned}$$
(9)

Maximization (M) Step: Here model parameters are updated for given posterior probabilities (computed in previous E-step).

$$\begin{aligned} P(d_i |z_k )=\frac{\sum _{j=1}^{M}n(d_i,w_j )P(z_k |d_i,w_j)}{\sum _{i'=1}^{N}\sum _{j=1}^{M}n(d_{i'},w_j )P(z_k|d_{i'}, w_j )} \end{aligned}$$
(10)
$$\begin{aligned} P(w_j |z_k )=\frac{\sum _{i=1}^{N}n(d_i,w_j )P(z_k |d_i,w_j)}{\sum _{i=1}^{N}\sum _{j'=1}^{M}n(d_i,w_{j'} )P(z_k|d_i, w_{j'} )} \end{aligned}$$
(11)
$$\begin{aligned} P(z_k )=\frac{\sum _{i=1}^{N}\sum _{j=1}^{M}n(d_i,w_j )P(z_k |d_i,w_j)}{\sum _{i=1}^{N}\sum _{j=1}^{M}n(d_i,w_j)} \end{aligned}$$
(12)

By following the likelihood principle, \(P(z_k)\), \(P(d_i |z_k)\) and \(P(w_j |z_k)\) are determined by maximization of the log-likelihood function, where \(n( d_i,w_j )\) denotes the number of times the word \(w_j\) occurs in document \(d_i\).

$$\begin{aligned} L=\sum _{i=1}^{N}\sum _{j=1}^{M}n( d_i,w_j )\log \sum _{k=1}^{K}P(z_k )P(d_i |z_k )P(w_j|z_k ) \end{aligned}$$
(13)
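The EM iteration for standard PLSA can be sketched in NumPy as below, using the symmetric parameterization \(P(z)\), \(P(d|z)\), \(P(w|z)\) of the E-step in Eq. (9). This is a minimal illustration only: the function name, the number of iterations, the smoothing constant and the toy count matrix are assumptions, and the extended conditions of Sect. 5.2 are not included.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit symmetric PLSA with EM; `counts` is the N x M matrix n(d_i, w_j)."""
    counts = np.asarray(counts, dtype=np.float64)
    n_docs, n_words = counts.shape
    rng = np.random.default_rng(seed)
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))        # random initialization of P(d|z)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))       # random initialization of P(w|z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12                                   # guards against division by zero
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: posterior[k, i, j] = P(z_k | d_i, w_j)
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        mixture = joint.sum(axis=0)               # P(d_i, w_j)
        log_likelihoods.append(float(np.sum(counts * np.log(mixture + eps))))
        posterior = joint / (mixture + eps)
        # M-step: re-estimate P(d|z), P(w|z) and P(z) from the weighted counts
        weighted = counts[None, :, :] * posterior
        topic_mass = weighted.sum(axis=(1, 2))
        p_d_z = weighted.sum(axis=2) / (topic_mass[:, None] + eps)
        p_w_z = weighted.sum(axis=1) / (topic_mass[:, None] + eps)
        p_z = topic_mass / counts.sum()
    return p_z, p_d_z, p_w_z, log_likelihoods

# Hypothetical 4-image, 4-tag count matrix with two visible tag groups.
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 0, 0],
                   [0, 0, 2, 1],
                   [0, 0, 1, 2]])
p_z, p_d_z, p_w_z, history = plsa(counts, n_topics=2)
```

Each image can then be assigned to the topic with the largest posterior \(P(z_k|d_i) \propto P(z_k)P(d_i|z_k)\). The recorded log-likelihood should not decrease from one iteration to the next, which is a useful sanity check.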

5.2 Extended PLSA for Representative Cluster Generation

Standard PLSA results in K image clusters. The likelihood function above considers only the textual associations between images and tags and does not include content level information for the images. However, images with similar content should share common topics. We can therefore use image similarity as a constraint over images to learn the latent topics of interest more accurately. In addition, an image cluster generated by standard PLSA may contain images from a single user only, whereas the image cluster for a specified geo-location should ideally cover a broad range of interests. The extended PLSA proposed here aims to maximize the following two conditions in addition to the log-likelihood.

(1) Coverage of users' interest: the aim is to maximize the number of users \(|U_k|\) that are represented in the photos of cluster \(z_k\). (2) Intra-cluster connectivity: if a cluster's photos are linked to many other photos in the same cluster, then the cluster is more likely to be representative. A link between two photos indicates that they are visually similar and share the same context (set of tags). The aim is to maximize the average number of links per photo in the cluster. Visual similarity between photos is computed with the method described for image affinity. Image context similarity is computed as follows: assume \(T_u\) and \(T_v\) are the sets of tags assigned to the two images \(D_u\) and \(D_v\), respectively. Using the Jaccard similarity measure:

$$\begin{aligned} ConxtSim=\frac{n({T}_{u} \cap {T}_{v})}{n({T}_{u} \cup {T}_{v})}\end{aligned}$$
(14)

where \(n(T_u \cap T_v)\) is the number of tags common to images u and v, and \(n(T_u \cup T_v)\) is the number of distinct tags assigned to at least one of the two images.
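The Jaccard measure of Eq. (14) and the resulting link count can be sketched as follows. The two thresholds that decide when a pair of photos counts as linked are hypothetical parameters introduced for illustration; the text does not fix their values.

```python
def context_similarity(tags_u, tags_v):
    """Jaccard similarity of two tag sets, Eq. (14)."""
    tags_u, tags_v = set(tags_u), set(tags_v)
    union = tags_u | tags_v
    if not union:
        return 0.0
    return len(tags_u & tags_v) / len(union)

def average_links_per_photo(tag_sets, visual_sim, visual_threshold, context_threshold):
    """Average number of links per photo; a link needs visual AND context similarity."""
    n = len(tag_sets)
    links = 0
    for a in range(n):
        for b in range(a + 1, n):
            if (visual_sim[a][b] >= visual_threshold
                    and context_similarity(tag_sets[a], tag_sets[b]) >= context_threshold):
                links += 1
    return 2.0 * links / n if n else 0.0   # each link touches two photos

# Hypothetical cluster of three photos with a precomputed visual similarity matrix.
tag_sets = [{"bridge", "river"}, {"bridge", "river", "night"}, {"museum"}]
visual = [[1.0, 0.9, 0.2],
          [0.9, 1.0, 0.1],
          [0.2, 0.1, 1.0]]
avg_links = average_links_per_photo(tag_sets, visual, 0.5, 0.5)
```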

Ranking of Cluster. The score of the cluster is decided for ranking of results. The score is computed as [9]:

$$\begin{aligned} score(z_k )=\sum _{i=1}^{\left| U \right| }{\log (N_{images}(u_i)+1)} \end{aligned}$$
(15)

where \(N_{images}(u_i)\) is the number of images in the cluster \(z_k\) that (a) are owned, tagged, commented on, shared, annotated or marked as favorite by the ith user, and (b) have more links than the average number of links per image in the cluster. Once the generated clusters are ranked according to cluster score, representative images are selected from each cluster.
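Eq. (15) can be evaluated directly once the per-user image counts of a cluster are known. The sketch below assumes these counts have already been filtered by conditions (a) and (b); the cluster names and counts are hypothetical.

```python
import math

def cluster_score(images_per_user):
    """Eq. (15); images_per_user[i] is N_images(u_i) for the cluster."""
    return sum(math.log(count + 1) for count in images_per_user)

# Hypothetical clusters: the same total number of images, different user coverage.
clusters = {"z1": [4, 0, 0], "z2": [2, 2, 0], "z3": [3, 1, 0]}
ranking = sorted(clusters, key=lambda name: cluster_score(clusters[name]), reverse=True)
```

Because of the logarithm, a cluster whose images are spread over several users scores higher than a cluster of equal size dominated by one user, which matches the user coverage condition.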

5.3 Location Representative Tag Generation

Tag Cleaning. Once the clusters have been determined, a large number of tags is collected from all the images in each cluster. Many of these tags are noisy and need cleaning: (1) Very frequent general tags and very rarely occurring tags are removed, following Luhn's idea. (2) Tags are identified as irrelevant if they are stop words, meaningless words, time and number related words, camera related words, abbreviations or acronyms, words with hyphens or dashes, or misspellings. (3) Tags which are not used by more than one user are removed. (4) Suffix removal is done using Porter's algorithm.
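A simplified sketch of cleaning rules (1) to (3) is given below. The word lists, the frequency cut-off and the example assignments are hypothetical, only the upper frequency bound of rule (1) is implemented, and suffix removal with Porter's algorithm (rule 4) is omitted.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and"}      # illustrative subset only
CAMERA_WORDS = {"canon", "nikon", "iphone", "dslr"}     # illustrative subset only

def clean_tags(tag_assignments, min_users=2, max_image_ratio=0.9):
    """Return the set of tags kept; tag_assignments is a list of (user, image, tag)."""
    users_by_tag, images_by_tag, all_images = {}, {}, set()
    for user, image, tag in tag_assignments:
        tag = tag.lower()
        users_by_tag.setdefault(tag, set()).add(user)
        images_by_tag.setdefault(tag, set()).add(image)
        all_images.add(image)
    kept = set()
    for tag, users in users_by_tag.items():
        if tag in STOP_WORDS or tag in CAMERA_WORDS:
            continue                                   # rule (2): irrelevant words
        if not re.fullmatch(r"[a-z]+", tag):
            continue                                   # rule (2): digits, hyphens, dashes
        if len(users) < min_users:
            continue                                   # rule (3): used by a single user
        if len(images_by_tag[tag]) / len(all_images) > max_image_ratio:
            continue                                   # rule (1): overly general tag
        kept.add(tag)
    return kept

# Hypothetical (user, image, tag) assignments.
assignments = [
    ("u1", "i1", "Bridge"), ("u2", "i2", "bridge"),
    ("u1", "i1", "2014"), ("u2", "i2", "2014"),
    ("u1", "i1", "canon"), ("u2", "i3", "canon"),
    ("u3", "i3", "sunset"),
    ("u1", "i1", "london"), ("u2", "i2", "london"), ("u3", "i3", "london"),
    ("u1", "i4", "night-shot"), ("u2", "i4", "night-shot"),
]
kept = clean_tags(assignments, max_image_ratio=0.7)
```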

Tag Score. The system computes a score for each cluster's tags in order to extract representative tags. In other words, we consider each cluster \(z_k\) and the set of tags \(T_k\) that appear with photos from the cluster. We assign a score to each tag \(t \in T_k\) according to three factors: (a) Term Frequency (TF), (b) Inverse Document Frequency (IDF) and (c) User Frequency (UF). The TF-IDF metric assigns a higher score to tags that have a larger frequency within a cluster compared to the rest of the area under consideration. The assumption is that the more unique a tag is for a specific cluster, the more representative the tag is for that cluster. To avoid tags that appear only a few times in the cluster, the term frequency element prefers popular tags. The term frequency \(tf(z_k,t)\) for a given tag t within a cluster \(z_k\) is the number of times t was used within the cluster.

$$\begin{aligned} tf(z_k,t)=n(z_k,t) \end{aligned}$$
(16)

The inverse document frequency of a tag t is the ratio of the total number of photos \(\left| D \right|\) in the geo-location region under consideration to the number of those photos \(n(D,t)\) that carry the tag t:

$$\begin{aligned} idf(t)= \frac{\left| D \right| }{ n(D,t) } \end{aligned}$$
(17)

While the tag weight is a valuable measure of the popularity of the tag, it can often be affected by a single user who applies the same tag to a large number of photos. To guard against this scenario, we include a user element in our scoring, which also reflects the heuristic that a tag is more valuable if a number of different users use it. In particular, we factor in the percentage of users in the cluster \(z_k\) that use the tag t:

$$\begin{aligned} uf(z_k,t)= \frac{ n(U_k,t) }{\left| U_k \right| } \end{aligned}$$
(18)

where \(U_k\) is the set of users represented in cluster \(z_k\) and \(n(U_k,t)\) is the number of those users who use the tag t. The final score for tag t in cluster \(z_k\) is computed by

$$\begin{aligned} score(z_k,t)=tf(z_k,t)\cdot idf(t)\cdot uf(z_k,t) \end{aligned}$$
(19)
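The scoring of Eqs. (16) to (19) can be sketched as follows. Each photo is assumed to be a (user, tag set) pair and the cluster is assumed to be a subset of all photos of the region; the function name and the toy data are hypothetical.

```python
def tag_scores(cluster_photos, all_photos):
    """Score the tags of one cluster; each photo is a (user, set_of_tags) pair."""
    cluster_users = {user for user, _ in cluster_photos}
    cluster_tags = {tag for _, tags in cluster_photos for tag in tags}
    scores = {}
    for tag in cluster_tags:
        tf = sum(1 for _, tags in cluster_photos if tag in tags)            # Eq. (16)
        n_region = sum(1 for _, tags in all_photos if tag in tags)
        idf = len(all_photos) / n_region                                    # Eq. (17)
        tag_users = {user for user, tags in cluster_photos if tag in tags}
        uf = len(tag_users) / len(cluster_users)                            # Eq. (18)
        scores[tag] = tf * idf * uf                                         # Eq. (19)
    return scores

# Hypothetical region with six photos, three of which form the cluster.
cluster = [("u1", {"bridge", "london"}), ("u2", {"bridge", "london"}), ("u1", {"bridge", "me"})]
others = [("u3", {"museum", "london"}), ("u4", {"park", "london"}), ("u3", {"london"})]
scores = tag_scores(cluster, cluster + others)
```

In this example the cluster-specific tag "bridge" receives the highest score, while the region-wide tag "london" is penalized by its low IDF and the personal tag "me" by its low UF.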

The higher the TF-IDF score and the UF score, the more distinctive the tag is within a cluster. For each cluster, we retain only the tags that score above a certain threshold. The threshold is needed to ensure that the selected tags are meaningful and valuable for the aggregate representation. We use an absolute threshold for all computed clusters to ensure that tags that are picked are representative of the cluster.

Tag Selection. The goal of the tag selection algorithm is that (a) important textual concepts related to the specific location are selected and (b) unimportant or highly personal tags are demoted. For a user specified geo-location, let \(D_k=\{d_{1}, d_{2}, \ldots , d_{N_k}\}\) be the set of \(N_k\) images for topic k. We aim to extract representative tags for this topic from the complete set of tags \(W_k\) associated with topic k. We consider the following three conditions for the extraction of representative tags. Condition 1: if a tag t is a representative tag for theme k, then the probability of observing t among the images \(D_k\) of theme k is larger than the probability of observing it among all images in D. Condition 2: a tag t is a visually representative tag if its annotated images are visually similar to each other. Condition 3: a tag t is representative of users' interest if it is used by a large number of users. Conditions 2 and 3 are already covered by the ranking of clusters. The occurrence probability of tag \(t_i\) in the set of images \(D_k\) of theme k is computed as:

$$\begin{aligned} p(t_i|D_k )=\frac{n(t_i\cap D_k)}{n(D_k)} \end{aligned}$$
(20)

In the same way, occurrence probability of tag \(t_i\) in the complete set of images D is computed as:

$$\begin{aligned} p(t_i|D )=\frac{n(t_i\cap D)}{n(D)} \end{aligned}$$
(21)

Rank of tag \(t_i\) for topic k is decided by the condition:

$$\begin{aligned} p(t_i|D_k )- p(t_i|D)> 0 \end{aligned}$$
(22)
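Condition 1 and Eqs. (20) to (22) can be sketched as below. Each photo is represented by its tag set, and sorting the selected tags by the size of the probability difference is an assumption added for illustration; the data are hypothetical.

```python
def representative_tags(theme_photos, all_photos):
    """Return the tags with p(t|D_k) - p(t|D) > 0, largest difference first."""
    def occurrence_probability(tag, photos):
        return sum(1 for tags in photos if tag in tags) / len(photos)

    candidates = {tag for tags in theme_photos for tag in tags}
    gains = {
        tag: occurrence_probability(tag, theme_photos) - occurrence_probability(tag, all_photos)
        for tag in candidates
    }
    selected = [tag for tag, gain in gains.items() if gain > 0]   # Eq. (22)
    return sorted(selected, key=lambda tag: (-gains[tag], tag))

# Hypothetical theme with three photos inside a region of six photos.
theme = [{"bridge", "london"}, {"bridge", "london"}, {"bridge"}]
region = theme + [{"museum", "london"}, {"park", "london"}, {"london"}]
selected = representative_tags(theme, region)
```

Here "london" is dropped because it is more likely in the whole region (5 of 6 photos) than in the theme (2 of 3 photos), whereas "bridge" is kept.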

6 Experimental Work

The goal of our system is to generate a set of location representative images and a set of location descriptive tags for a user specified geo-location. Our implementation utilizes three features: location (geo-tags), user, and tags. The system clusters the set of images using extended PLSA. These clusters are formed on the basis of visual-textual relation, visual connectivity, and users' interest. Representative tags are selected by first ranking them on the basis of TF, IDF and UF. The selection of representative tags then uses three conditions: separation, cohesion and users' interest.

The goals of the evaluation are to: (a) verify the performance of the proposed extended PLSA algorithm for image clustering, (b) determine the representativeness of the selected location tags, and (c) test whether the visual and textual location summary generated by our algorithm is satisfactory. These goals depend directly on subjective judgments. Therefore, we performed our evaluation through user tests, executing three experiments to accomplish these goals.

6.1 Dataset

Our dataset was collected by crawling images and associated metadata using the Flickr API. To test our approach we selected only those locations for which we could retrieve at least 100 CC-licensed Flickr images. For the practical implementation of our algorithm we constrained this selection to a radius of 1 km around the input location. Together with each image, the accompanying metadata, i.e. tags and user information, are also collected. The system is highly dependent on the geo-tagged photos uploaded by different users on Flickr and relies on the following assumptions: (a) More photographs are taken at locations that provide views of some interesting object or landmark. (b) Photos are taken, or accessed on the social sharing website, by a large number of users. (c) The assigned textual tags reflect the presence of interesting landmarks in a location.

6.2 Clustering Performance Test

We evaluate the effectiveness of our extended PLSA based clustering algorithm with respect to four baseline methods. Visual Clustering: K-means clustering is applied to visual features to cluster images into K clusters. This visual clustering mechanism follows earlier clustering approaches such as [5]. Geo-tag Based Clustering: image clustering based on GPS coordinates or geo-tags is used by Fang et al. [9], Jaffe et al. [3], and Liu [13]. Mean shift clustering is employed to cluster the photos based on GPS coordinates. Tag Based Clustering: the semantic and contextual meaning of tags is utilized for clustering the associated images, using a graph based clustering algorithm. Users Interest Based Clustering: if a user uploads, comments on, annotates, shares, likes, or marks an image as favorite, the user is considered to be interested in the image. Images are clustered based on users' interest.

Fig. 2. Clustering performance test

In this test, we showed our subjects the clusters of images formed by each of the four baseline methods and by our method. We performed a within-subject evaluation with a set of 20 subjects. Subjects were asked to rate each cluster on criteria such as visual-textual relevance of the images in the cluster, coverage of users' interest, and visual coherence. Figure 2 shows the results. From these experiments we conclude the following: (a) Visual features alone are poor at capturing the content and context of an image, which makes visual relevance insufficient for generating a summarization. (b) Clustering on the basis of GPS coordinates or geo-tags alone cannot cover content, contextual information or users' interest. Moreover, GPS coordinates may be inaccurate (e.g. users may take photos of a sight from a long distance). (c) User provided tags for community contributed images are far from perfect for image clustering. Tags suffer from ambiguity and from the knowledge and terminology limits of users, and the assigned tags may not be the actual descriptive words for the image. (d) Clustering based on users' interest may be misleading and may skew the selections towards generally insignificant subjects.

6.3 Summary Relevance Test

The goals of summary relevance test are (a) to confirm that the visual and textual summary generated by the system outperforms the baseline methods, (b) to verify that the generated visual summary for given geo location is representative, but still diverse and precise, (c) to confirm the user satisfaction by evaluating user feedback for a set of questions, (d) to confirm that the tags selected to generate textual summary of geo location are descriptive and representative for the location and personal or unimportant tags are ignored while summarizing the location.

Fig. 3. Precision for images and tags

The compared methods are: Random (B1): random selection of images and tags. View Count (B2): images are sorted by view count in descending order, and the top ranking images with no more than one photo per user are selected to generate the visual summary of the geo-location. Tags assigned to these images are selected using the TF-IDF method to generate the textual summary. Recent (B3): the most recent photos with no more than one photo per user are selected to generate the visual summary. Tags assigned to these images are selected using the TF-IDF method to generate the textual summary. PLSA Based Summarization (B4): images and tags are selected by our system without applying the extended PLSA conditions. Extended PLSA Based Summarization (B5): images and tags are selected by our full system.

Using each of these methods, we select ten representative images and ten representative tags for five different geo-locations. The ground truth judgments of image and tag representativeness are defined by a human evaluator. Using the ground truth judgments, we evaluate precision for both the visual and the textual representation. The precision metric measures the percentage of selected images and tags that are indeed representative of the location. Figure 3 shows the precision values for images and tags for the five locations.
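The precision metric described above reduces to a simple ratio. The sketch below uses hypothetical image identifiers and ground truth labels purely to show the computation; they are not results from the experiments.

```python
def precision(selected, relevant):
    """Fraction of selected items that the ground truth marks as representative."""
    selected = list(selected)
    if not selected:
        return 0.0
    return sum(1 for item in selected if item in relevant) / len(selected)

# Hypothetical summary of four images, two of which are judged representative.
summary_images = ["img1", "img2", "img3", "img4"]
ground_truth = {"img1", "img3", "img9"}
p_images = precision(summary_images, ground_truth)
```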

We observe that the performance of the Random, Recent and View Count methods is not consistent over time and location. Because of its probabilistic nature, the performance of PLSA based summarization is also not consistent over time. In the case of extended PLSA, the probabilistic nature is governed by the two conditions of cluster connectivity and user coverage. As a result, the extended PLSA approach outperformed the baseline methods, which proved to be less effective, less consistent and less robust in the face of changing data and time.

Fig. 4. Location summarization results

A limitation of precision based evaluation is that precision does not capture all the aspects that could impact the perceived quality of a set of representative images. In the precision based evaluation, each image is identified as representative (1) or not representative (0), but representativeness is not binary. Repetition of similar or nearly identical images in the summary could also affect its quality. These issues of relative representativeness can be evaluated by human judges. Next we describe a wider evaluation that was designed to measure the relative representativeness of images.

6.4 User Survey

We conducted a small-scale user study to evaluate the effectiveness of the proposed method and the user experience of the novel visualization form. The experiment was conducted with two different locations which were known to the users. For each location, images are grouped into four different themes as shown in Fig. 4. Four criteria are considered: (a) Representativeness: the level of representativeness of the visual cluster (0: Worst Representative, 10: Best Representative). The representativeness of a visual cluster is decided by the total number of images in the cluster and the number of representative images in the cluster. (b) Coverage: the extent to which the mined visual themes and representative tags provide sufficient information about the location (0: Insufficient, 10: Best). (c) Uniqueness: the extent of uniqueness of the images in the visual cluster, decided as the number of representative images minus the number of redundant photos in the cluster (0: Not Unique, 10: Unique). (d) Satisfaction: how satisfactory the aggregated multiple themes are for location visualization (0: Not Satisfied, 10: Very Satisfied). We invited 20 participants who are familiar with the given geo-locations to the user study. The eight themes depicted in Fig. 4 were selected for evaluation. The results are averaged over all participants for each theme and shown in Fig. 5. The participants gave positive feedback on the novel location visualization scheme.

Fig. 5. User survey results

7 Conclusion

The phenomenal growth of personal and shared digital photo collections presents considerable challenges in building navigation and summarization applications. By utilizing our method for location visualization, we enable users to view the most relevant samples from large-scale photo collections. We have presented a novel location visualization scheme using extended PLSA for geographically and thematically organizing photos into multiple themes. The proposed HOSVD tensor factorization using intra-relations between user, image and tag helps deal with the tagging problems of visual-textual relevance, cold start, and the ternary relationship among user, image and tag. Experiments on Flickr datasets for various known locations show that the proposed framework outperforms the baselines and demonstrate its advantage in deriving compact location visualizations and themes for improving user experience.