8.1 Introduction

The popularity of social media, and location-based services in particular, has led to a vast increase in the number of georeferenced resources on the web. Examples are the large numbers of Flickr photos, Twitter posts (tweets), and Wikipedia articles for which explicit geographic coordinates are currently available. This trend has also led to an upsurge in research into methods for automatically assigning coordinates to web content. Being able to associate coordinates with web content is important in applications such as geographic information retrieval [16], where search results are adapted to the location of the user, and in applications which rely on characterizing places, e.g., for offering personalized travel recommendations [13].

Although there are differences between georeferencing Flickr photos, tweets, and general web documents, the most useful input is usually in textual form, viz. the tags associated with Flickr photos and the contents of tweets and web documents. As a result, methods that have been proposed for georeferencing these types of resources tend to be similar. Broadly speaking, there are three ways coordinates can be assigned to a text document. First, named-entity tagging has traditionally been used to identify mentions of place names in documents, with gazetteers subsequently being used to map these place names to coordinates. The geographic scope of a web document can then be identified with the centroid of those coordinates, or, more commonly, with a probability density of locations. This method is difficult to apply to Flickr photos, since tags lack the context needed to effectively identify named entities. This is exacerbated by the fact that Flickr tags are converted to lower case, which makes it challenging to resolve ambiguities between place names and common nouns (e.g., nice vs. Nice). Second, we can treat the problem of georeferencing text documents as a classification problem. This approach has been adopted in, among others, [19] for Flickr photos and [29] for Wikipedia articles. Essentially, in this approach, the locations on Earth are discretized into a finite set of areas, using clustering, a fixed-grid representation, or administrative boundaries of geographic regions. Using standard text classification methods, such as Naive Bayes or support-vector machines, the most plausible area is identified, and the centre-of-gravity of that area is used as the estimated location. Third, we can try to find resources with known coordinates which are similar to the resource to be georeferenced, and the location of these resources can be used to estimate the location. 
A combination of the latter two approaches was advocated in [25], which showed that a two-step approach which first uses a text classifier to find an appropriate area and then uses a similarity search to find a plausible location within that area consistently outperforms either of the two individual methods.

In this chapter, we focus on the case of Flickr photos, and in particular on the question of how the aforementioned text-based methods could be improved by taking into account visual information as well as information from the user profile of the photo uploader. Taking visual information into account efficiently has generally proved challenging in the field. When easily identifiable landmarks such as the Eiffel tower are shown in a photo, methods based on, e.g., SIFT features can be very effective in determining the correct coordinates from the photo alone. However, such cases are rare in practice, and in most of these cases, the correct location can also be obtained from available tags, at a much lower computational cost. We argue that the most effective way to utilize visual features of photos is by extending the aforementioned two-step approach from [25] to a three-step approach. Specifically, an approximate location is first identified using textual information alone (possibly also with some evidence from the owner’s user profile), and this location is then refined by comparing the photo to be georeferenced with nearby photos with known coordinates. Given that only nearby photos are considered, lower-level visual features can be especially effective, even though such features are too general to be of use at a worldwide scale.

This chapter is structured as follows. In the next section, we discuss the peculiarities of Flickr photos and describe preprocessing methods which make the subsequent analysis more reliable. Section 8.3 then gives an overview of text-based methods, and visual methods are presented in Sect. 8.4. Finally, in Sect. 8.5, we discuss how the two types of approaches can be integrated.

8.2 Data Selection and Preprocessing

Flickr contains more than 300 million photos with associated geographical coordinates. By analyzing correlations between the locations of these photos on the one hand, and the visual features and textual metadata of the photos on the other hand, we can train a system that estimates the location of previously unseen Flickr photos or videos.

For each uploaded photo, Flickr maintains several types of metadata, which can be obtained via a publicly available API. In this chapter, three types of metadata will be relevant. First, photos may be associated with descriptive tags provided by the photos’ owners. Many of these tags provide evidence about where the photo was taken (e.g., because they refer to a place name or the name of an event, because they are in a particular language, or because they refer to regional cuisine), so modeling the spatial distribution of photos which have been assigned a particular tag will play a particularly important role in our framework. Second, photos are associated with an owner, whose user profile contains a free-text field describing their home location (e.g., “Ghent, Belgium”). It turns out that this home location is particularly helpful when dealing with photos that do not contain any location-specific tags, or only tags which are ambiguous. All things being equal, photos are more likely to have been taken close to the owner’s home location than at the other side of the world. Finally, the photos we consider are also associated with a geographical coordinate, which is considered as the ground truth for the purposes of this work. Note, however, that this is a simplifying assumption, as, for example, photos of a landmark may be associated with the position at which the photo was taken or with the location of the landmark. Indeed, while a small percentage of coordinates come from GPS devices, most coordinates are manually provided by users. For each pair of coordinates, Flickr provides information about its precision, encoded as a number between 1 (world-level) and 16 (street-level), reflecting the zoom level of the map the owner used to assign the coordinates (in the case of manually assigned locations).

In most approaches for georeferencing Flickr photos, a number of preliminary filtering steps are carried out to clean the training data:

  1. Photos that do not have any tags or have invalid coordinates are removed.

  2. Photos whose location precision is too low for the task (e.g., 11 or lower) are discarded, retaining only those photos that provide meaningful location information at the sub-city scale.

  3. If there are multiple photos with the same upload date, an identical tag set, and identical coordinates, only one of the photos is retained. Users on Flickr can upload content in bulk, i.e., upload multiple photos at once and tag them with the same information; this can skew the analysis, as was first pointed out in [20].

The photos that remain after these filtering steps are used for obtaining clusters of locations and for estimating language models.
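As an illustration, the three filtering steps can be sketched as follows. The photo fields (`tags`, `lat`, `lon`, `accuracy`, `upload_date`) and the precision threshold are hypothetical names chosen for this sketch, not the fields of any particular system.

```python
def filter_photos(photos, min_accuracy=12):
    """Apply the three training-data filtering steps described above."""
    seen = set()
    kept = []
    for p in photos:
        # Step 1: drop photos without tags or with invalid coordinates.
        if not p.get('tags') or not (-90 <= p['lat'] <= 90 and -180 <= p['lon'] <= 180):
            continue
        # Step 2: drop photos whose precision is too coarse (11 or lower).
        if p['accuracy'] < min_accuracy:
            continue
        # Step 3: keep one photo per (upload date, tag set, coordinates)
        # combination, to neutralize the effect of bulk uploads.
        key = (p['upload_date'], frozenset(p['tags']), p['lat'], p['lon'])
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept
```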

The tags associated with the photos can also be preprocessed. Flickr normalizes the user-provided tags by converting them to lowercase, removing spaces within tags, and then replacing commas between tags with spaces. For example, the set of tags “Trip 2010, Sagrada Familia, Barcelona” becomes “trip2010 sagradafamilia barcelona”. Some approaches use a number of additional preprocessing steps, e.g., removing diacritics and numbers, separating numerical characters from alphanumeric tags, and/or discarding words from a problem-specific list of stop words (e.g., camera brands and lens types). It should be noted that such forms of preprocessing do not always improve results. For example, removing numbers would discard the tag “911” (referring to the attacks of 11 September 2001), which we found to be strongly correlated with locations in New York City. To give an idea of how much usable data tends to remain after these preprocessing steps: in the MediaEval 2012 training set, approximately 40% of the photos contained at least one tag.
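Flickr’s normalization of a comma-separated tag list, as in the example above, can be sketched with a minimal approximation (Flickr’s actual normalization handles additional cases not shown here):

```python
def normalize_tags(raw):
    """Lowercase each tag, remove internal spaces, split on commas."""
    return [t.strip().lower().replace(' ', '') for t in raw.split(',')]
```

For example, `normalize_tags("Trip 2010, Sagrada Familia, Barcelona")` yields the three tags from the example in the text.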

Fig. 8.1 Distribution of the MediaEval Placing Task 2012 data set for Europe

8.2.1 Clustering the Training Data

Most approaches to georeferencing Flickr resources interpret it as a classification problem. To this end, the locations of the photos in the training data are clustered into sets of disjoint areas, which are then interpreted as the class labels. Once a classifier has identified the most likely area where a photo was taken, we may use other techniques to find the most likely location within that area, as detailed in Sects. 8.3 and 8.4.

A number of techniques are available for obtaining a clustered representation of locations. Some of these methods are compared experimentally in [9, 11, 27]. Here, we summarize the main advantages and disadvantages of several popular methods. We will also illustrate each of these methods by using them to cluster the MediaEval Placing Task data set (Fig. 8.1 shows the distribution of this data set over Europe).

\(k\)-medoids clustering is closely related to the well-known \(k\)-means clustering algorithm, differing only in how the center of each cluster is determined. While \(k\)-means uses the center-of-gravity for this purpose, in \(k\)-medoids, the center is selected as the medoid of the cluster, i.e., the element which minimizes the sum of the distances to the other elements of the cluster. The main advantage of using \(k\)-medoids is that the selection of the medoid is more robust to outliers than the center-of-gravity; this is why \(k\)-medoids is more commonly used than \(k\)-means for clustering sets of coordinates. However, it should be noted that selecting the medoid of a cluster has a time complexity which is quadratic w.r.t. the size of the cluster, in comparison with the linear complexity of selecting the center-of-gravity. For very large training sets, this means that the number of clusters chosen must be sufficiently high (such that the number of photos per cluster is manageable), or that only a sample of the training data can be used to obtain the clusters. Distances are usually calculated using the Haversine metric instead of the Euclidean metric. An example clustering with \(k=1,000\) clusters (worldwide) is shown in Fig. 8.2.
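To make the quadratic cost of medoid selection concrete, the following sketch computes the Haversine distance between two (lat, lon) pairs and selects the medoid of a cluster by comparing every point to every other point:

```python
import math

def haversine(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def medoid(points):
    """Quadratic in the cluster size: every point is compared to every other."""
    return min(points, key=lambda p: sum(haversine(p, q) for q in points))
```

In contrast, the center-of-gravity used by \(k\)-means requires only a single pass over the points.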

Fig. 8.2 Sample clustering for Europe using \(k\)-medoids, \(k\) = 1,000 (worldwide)

Grid clustering uses a fixed grid of square (or sometimes hexagonal) cells over the surface of the Earth. The main advantage of this method is that it is computationally inexpensive. In the example in Fig. 8.3, grid cells correspond to 4.375 degrees of latitude and longitude; this value results in 1001 clusters worldwide, making it easily comparable with Fig. 8.2. An important distinction from \(k\)-medoids clustering is that the size of the grid cells does not depend on the amount of available training data. However, this means we cannot attempt a more accurate classification in areas of the world for which we have abundant training data while being more cautious in areas where training data is sparse. Experimental results described in [27] confirm that using a grid leads to worse performance than using \(k\)-medoids.
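Assigning a coordinate to a fixed grid cell can be sketched as follows, using square cells of 4.375 degrees as in the example above (the function name and (row, col) convention are ours):

```python
def grid_cell(lat, lon, cell_deg=4.375):
    """Map a coordinate to a fixed (row, col) grid cell. The cell size is
    independent of how much training data falls inside the cell."""
    row = int((lat + 90) // cell_deg)
    col = int((lon + 180) // cell_deg)
    return row, col
```

The computation is constant-time per photo, which is what makes grid clustering so inexpensive compared to \(k\)-medoids.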

Fig. 8.3 Sample clustering for Europe using a grid of 1,001 cells over the world

Mean shift clustering does not require predefining the number of clusters, but instead relies on a scale parameter \(h\). The number of resulting clusters emerges from the choice of the scale factor. An example with \(h=150\) is given in Fig. 8.4. Two points are worth noting here. First, outliers tend to end up in separate clusters, leading to a large number of small clusters. Second, as with the grid clustering, the granularity of the clusters does not reflect the amount of training data. As a result, models for georeferencing Flickr photos (see Sect. 8.3) perform worse when using mean shift clustering than when using \(k\)-medoids clustering, even if small clusters with outliers are merged with other clusters [27].
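To illustrate how the number of clusters emerges from the bandwidth rather than being fixed in advance, the following toy sketch runs flat-kernel mean shift on one-dimensional values; a real implementation would operate on coordinates with a geographic distance:

```python
def mean_shift_1d(points, h, iters=30):
    """Toy flat-kernel mean shift on 1-D values: each mode repeatedly moves
    to the mean of all data points within bandwidth h of it."""
    modes = list(points)
    for _ in range(iters):
        new_modes = []
        for m in modes:
            window = [q for q in points if abs(q - m) <= h]
            new_modes.append(sum(window) / len(window))
        modes = new_modes
    # Merge converged modes closer than h / 2; the number of clusters is
    # not chosen in advance but emerges from the bandwidth h.
    clusters = []
    for m in sorted(modes):
        if not clusters or m - clusters[-1] > h / 2:
            clusters.append(m)
    return clusters
```

Note how isolated points far from any other data end up as their own modes, mirroring the observation above that outliers tend to form separate clusters.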

Fig. 8.4 Sample clustering for Europe using the mean shift algorithm (\(h\) = 150, resulting in 2,349 clusters worldwide)

Finally, as an alternative to clustering, considered in [9, 10], we could also define the set of areas based on national boundaries (as depicted in Fig. 8.5). This method has the advantage that the definition of the areas is likely to be better aligned with the distribution of tags (e.g., the distribution of toponyms and of languages used for tags). As with grids and mean-shift clustering, a disadvantage is that the size of the clusters does not reflect the amount of available training data. However, we could combine the best of both worlds by considering a two-level hierarchical clustering, where the first level is based on national boundaries and the second level corresponds to \(k\)-medoids-based clusterings of photos within the same country. Note that such a two-level approach requires that we can accurately find out the correct country for a given test photo or video. In many cases, this is a realistic assumption, since determining the country is usually less problematic than disambiguating the name of a landmark or a city. For example, for a given photo, each associated term (e.g., tag or title word) may potentially refer to a place name. We can use a gazetteer (i.e., a dictionary of named places, usually linking place names to geographic locations and sometimes a semantic type) to find out which terms may be place names, and where on Earth places with that name occur. One of the most popular gazetteers is the GeoNames database [1], which contains over 10 million geographical names corresponding to over 7.5 million unique features and provides a web-based search engine which returns a list of entries ordered by relevance. The approach from [9, 10] uses GeoNames to create a ranking of the possible countries with which a photo or video can be associated (based on the possible interpretations of the place names associated with its tags). Then, the boundary of the most likely country is determined by querying the Google Maps API [2].

Fig. 8.5 Two-level hierarchical clustering [9]

8.2.2 Term Selection

Many of the tags associated with photos and videos on Flickr are not useful for estimating geographic location. To prevent overfitting, it is useful to apply a term selection step, in which all tags that are not deemed geographically relevant are removed. There is a wide selection of methods that can be used for this purpose.

If we consider the areas obtained after clustering, term selection methods that have proven effective for text classification could be applied [30, 31]. Examples of popular methods include \(\chi ^2\) and information gain. The advantage of such methods is that they are easy to implement and they are based on well-known statistical and information theoretic principles. However, such methods effectively ignore the spatial dimension of the problem. For this reason, [5] introduced a heuristic, location-aware method for selecting terms. The method proposed in [5], called geographic spread, first clusters adjacent grid cells in which a particular tag occurs. Tags are deemed geographically relevant if the number of clusters is sufficiently small and the largest cluster contains a sufficiently large number of tag occurrences. Despite the heuristic nature of this measure, it substantially outperforms methods such as \(\chi ^2\) and information gain. Finally, methods from the field of geospatial statistics could be considered. For example, using kernel density estimation (KDE) we may model each tag as a smooth probability distribution over the set of locations on Earth. We could then select tags whose associated distribution diverges from a background distribution (reflecting the distribution of Flickr photos and videos). Another possibility is to use methods from epidemiology, such as Ripley’s K statistic, to identify tags whose pattern of occurrences deviates significantly from a uniform sampling. Finally, methods for measuring spatial autocorrelation could be used to identify tags whose occurrences tend to be clustered in space.
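As an illustration of the geographic spread idea from [5], the following sketch (our own simplified reading of the method, not the authors’ code) merges adjacent occupied grid cells into connected components and reports their number together with the size of the largest one; a tag would then be kept if the first value is small and the second is large:

```python
def geographic_spread(cells):
    """`cells` maps each (row, col) grid cell in which a tag occurs to its
    occurrence count. Adjacent cells are merged into connected components;
    a tag is geographically focused when few components exist and the
    largest one holds many occurrences."""
    unvisited = set(cells)
    components = []
    while unvisited:
        stack = [unvisited.pop()]
        comp = 0
        while stack:
            r, c = stack.pop()
            comp += cells[(r, c)]
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in unvisited:
                    unvisited.remove(nb)
                    stack.append(nb)
        components.append(comp)
    return len(components), max(components)
```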

In [23], an analysis is presented of how different term selection techniques perform in the context of georeferencing Flickr resources. The geographical spread measure, KDE-based methods and a method based on Ripley’s K statistic performed comparably. Of these three approaches, the geographical spread measure has the clear advantage of being the easiest to implement and the least expensive computationally. However, it was found to be more sensitive to the number of selected features. Out of a total of about 300K features, all methods performed optimally when about 50–100K features were selected. However, while the KDE and Ripley’s K-based approaches still performed well when only 10K or 25K features could be selected, the geographic spread measure was not competitive in that range. Overall, the experiments in [23] strongly suggest that measuring spatial autocorrelation is not sufficient; for a term selection method to be effective in this context, it needs to favor terms which only occur in a small number of areas around the world. For example, while the term beach is clearly geographically relevant (i.e., spatially autocorrelated), it is not a useful term for deciding in which area a photo was most likely taken.

8.3 Textual Approach

There are two main text-based approaches for estimating the geographic location of a Flickr photo or video. First, we may view the problem as a classification task, in which the classes are geographic areas. For a given photo, the most likely area is determined and the center of that area is used as the estimated location. This approach was proposed in [20]. Second, we may treat the task of assigning a location to a Flickr photo or video as a retrieval task. In this case, we first identify the photos in the training data which are most similar to the considered resource, and use their locations to determine the estimated location. As proposed in [24], we can combine these two approaches by first using a classifier to find the most likely area where the photo or video was taken, and then find similar photos in the training set from that area. Experimental results in [22, 27] show that such a hybrid two-step approach outperforms either of the individual methods. A key factor is the number of clusters considered: the fewer clusters are used, the more emphasis there is on the retrieval step. It turns out that, in general, the more training data is available, the smaller the optimal number of clusters.

In the remainder of this section, we explain in more detail how both approaches operate.

8.3.1 Variations on the Classification Approach

As we explained in Sect. 8.2.1, various methods exist to segment the training data into a set of disjoint areas. We can see each of these areas as a class, and treat the problem of georeferencing Flickr photos as a standard text classification problem. Some authors have used support vector machines [4] or Kullback-Leibler divergence [29] for georeferencing social media documents. The most popular approach, however, seems to be to use a naive Bayes classifier [20, 27], based on a multinomial bag-of-words language model.

In this model, the probability that a Flickr photo \(d\) with tags \(t_1,...,t_n\) was taken in area \(a\) is estimated as:

$$\begin{aligned} P(a | d) \sim P(a) \cdot \prod _{i=1}^n P(t_i | a) \end{aligned}$$
(8.1)

As usual, taking logarithms replaces the multiplication by a summation and prevents floating-point underflow:

$$\begin{aligned} \log P(a | d) \sim \log P(a) + \sum _{i=1}^n \log P(t_i | a) \end{aligned}$$
(8.2)

We now discuss in more detail how the likelihood \(P(t_i | a)\) and the prior probability \(P(a)\) can be estimated from Flickr.

8.3.1.1 Estimation of Term Location Distribution

The probability \(P(t | a)\) reflects the likelihood that a photo taken in area \(a\) has been assigned tag \(t\). The simplest way of estimating this likelihood would be to choose \(P(t | a) = \frac{N_{t,a}}{N_a}\), where \(N_{t,a}\) is the number of photos with tag \(t\) in area \(a\) and \(N_a\) is the total number of photos in area \(a\). However, this would lead to \(P(a | d)= 0\) as soon as \(d\) has one tag which has not previously been seen in area \(a\). To cope with this problem, usually some form of smoothing is applied.

One possibility for smoothing the probability \(P(t|a)\) is Laplace smoothing:

$$\begin{aligned} P\left( t| a \right) =\frac{ N_{ t,a }+1 }{ N_a + |V|} , \end{aligned}$$
(8.3)

where \(V\) is the set of all tags (that have been retained after feature selection). Another possible smoothing method is Bayesian smoothing with Dirichlet priors, in which case we have (\(\mu >0\)):

$$\begin{aligned} P(t| a) = \frac{N_{ t,a } + \mu \;P(t|V)}{N_a + \mu } \end{aligned}$$
(8.4)

where the probabilistic model of the vocabulary \(P(t|V)\) is defined using maximum likelihood:

$$\begin{aligned} P(t|V) = \frac{\sum _{a} N_{ t,a }}{\sum _{t' \in V} \sum _{a}N_{ t',a }} \end{aligned}$$
(8.5)

A final possibility is to use Jelinek-Mercer smoothing, in which case we have (\(\lambda \in [0,1]\)):

$$\begin{aligned} P(t|a) = \lambda \frac{N_{ t,a }}{N_a} + (1-\lambda )\;P(t|V) \end{aligned}$$
(8.6)

with \(P(t|V)\) defined as in (8.5). For more details on these smoothing methods for language models, we refer to [32]. A comparison between these smoothing methods in the context of georeferencing Flickr photos is given in [27]. Figure 8.6 shows the term-location probabilities \(P(t|a)\) of terms \(t\) occurring in a video located at Ground Zero, Manhattan, New York, USA. As shown, no single term indicates the correct location on its own; only their combination maximizes the likelihood of the correct spatial segment. We can thus conclude that generic terms like ‘911’ do help to specify the location, even though they do not have a geographical relation in the sense of being an entry in a gazetteer.

Fig. 8.6 Probabilities \(P(t|a)\) of a video containing the terms: ‘usa’, ‘manhattan’ and ‘911’

Note that all the above formulas describe our probabilistic model when using a multinomial distribution with term frequency (tf) weighting. Other weighting schemes are possible, as we will explain in Sect. 8.3.1.3.
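Putting (8.2), (8.4) and (8.5) together, the log-space scoring of an area can be sketched as follows. The data-structure names are ours, and the sketch assumes every tag of the photo occurs somewhere in the vocabulary:

```python
import math

def log_posterior(tags, area, counts, area_totals, priors, p_vocab, mu=1000.0):
    """Log of (8.2) with Dirichlet smoothing (8.4):
    counts[(t, a)] = N_{t,a}, area_totals[a] = N_a,
    priors[a] = P(a), p_vocab[t] = P(t|V) as in (8.5)."""
    score = math.log(priors[area])
    for t in tags:
        num = counts.get((t, area), 0) + mu * p_vocab.get(t, 0.0)
        score += math.log(num / (area_totals[area] + mu))
    return score
```

Classification then amounts to evaluating `log_posterior` for every area and returning the area with the highest score.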

8.3.1.2 Estimation of the Prior Probability

There are at least three different ways of choosing the prior probability \(P(a)\) that a photo is taken in area \(a\). In some situations, we want to refrain from introducing any bias, and choose \(P(a)\) as a uniform probability. However, if photos are randomly sampled from Flickr, some areas of the world are much more likely than others. We can use maximum likelihood estimation to include this evidence, in which case we choose:

$$\begin{aligned} P(a) = \frac{N_a}{\sum _{a'}N_{a'}} \end{aligned}$$
(8.7)

A third possibility is to take into account prior knowledge about the home location of the owner of the photo, using the intuition that areas closer to where the owner lives are more likely. In [26], the following prior probability was proposed:

$$\begin{aligned} P(a) \sim \left( \frac{1}{d(a,h)}\right) ^{\theta } \end{aligned}$$
(8.8)

where \(h\) is the estimated home location of the user (obtained by georeferencing the corresponding text field in their profile) and \(d(a,h)\) is the distance between \(h\) and the centre of area \(a\). Along similar lines, [6] uses a prior probability which is based on population density and a prior probability which is based on climate data (assigning a higher prior probability to more temperate climates). Additionally, it could be worth considering a different prior probability for each user, if sufficient information is available. This was proposed in [22], using a method called User History. In particular, when the training set contains previously uploaded photos or videos made by the same user, computing the prior probability based on these locations may help to improve the estimation. Furthermore, it has been investigated in [22] how the user’s social network could be used to extend the user-based prior location, following the assumption that the location of a user might be related to the locations of her social connections.
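A sketch of the distance-based prior of (8.8), normalized over a hypothetical set of area centres, could look as follows; the clamp at 1 km is our own addition to avoid division by zero for areas containing the home location itself:

```python
import math

def haversine(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def home_prior(area_centres, home, theta=1.5):
    """Prior proportional to (1 / d(a, h))^theta as in (8.8), normalized
    over all areas: areas closer to the owner's home get a higher prior."""
    raw = {a: (1.0 / max(haversine(c, home), 1.0)) ** theta
           for a, c in area_centres.items()}
    z = sum(raw.values())
    return {a: w / z for a, w in raw.items()}
```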

8.3.1.3 Weighting of Term Occurrences

So far, the term-location probability and the prior probability have been estimated by counting the frequency of each term, as in (8.3)–(8.6). In this section, we explain how term weighting can be used to improve these estimations. To this end, the count \(N_{t,a}\) is replaced with a weight \(W(t,a)\) which may reflect other aspects than just the frequency of occurrence of the term. Term weighting is a well-known technique in the information retrieval domain [15].

The first method, term frequency (tf) weighting, is the one that has been used so far: it distinguishes between areas that contain a given tag only a few times and those that contain the tag many times. The use of term frequency corresponds to the assumption that areas which contain more mentions of a given tag are more likely to contain the location of a photo or video with that tag. The number of occurrences of a tag \(t\) in area \(a\) is usually normalized by dividing it by the number of occurrences of the most frequent tag among the tags of photo or video \(d\):

$$\begin{aligned} W_{\mathrm{tf}}(t,a)=\frac{N_{t,a}}{\max _{t'\in d}N_{t',a}} \end{aligned}$$
(8.9)

Experiments in [9] have shown that it is not always useful to weight tags by their frequency. Under the assumption that all that matters is whether a tag occurs at least once, we can instead use term occurrence (to) weighting:

$$\begin{aligned} W_{{to} }(t,a)=\left\{ \begin{matrix} 1,&{} N_{t,a} \ge 1 \\ 0,&{} \text {otherwise} \end{matrix} \right. \end{aligned}$$
(8.10)

In the context of geo-location recognition, certain tags have little or no discriminating power. For instance, tags such as ‘video’ or ‘photo’ occur in almost every area. The idea is therefore to discount the tf weight of a given tag as the number of areas in which the tag occurs increases. For this reason, the inverse document frequency (idf) of a tag \(t\) is defined as follows:

$$\begin{aligned} W_{\mathrm{idf}}(t, a)=\log \frac{ N }{ \sum _a W_{{to} }(t,a) } \end{aligned}$$
(8.11)

where \(N\) denotes the total number of areas.

The term frequency-inverse document frequency (tf-idf) weighting scheme is one of the best known in information retrieval. It combines the term frequency (8.9) and the inverse document frequency (8.11):

$$\begin{aligned} W_{\mathrm{tf-idf}}(t,a)=W_{\mathrm{tf}}(t,a) \cdot W_{\mathrm{idf}}(t,a) \end{aligned}$$
(8.12)

The three weighting schemes, term occurrence (to), term frequency (tf), and term frequency-inverse document frequency (tf-idf), can be used for estimating the term-location probability (Sect. 8.3.1.1) and the prior probability (Sect. 8.3.1.2) by replacing the term count \(N_{t,a}\) with one of the introduced weights, and can then be compared against each other. Table 8.1 displays the percentage of correctly predicted cells (i.e., areas) in a large grid of \(360\times 180\) segments (see Sect. 8.2.1), aligned with the meridians and parallels of the world map. In particular, this table shows how frequently the correct location is within the \(N\) highest-ranked grid cells. The textual model with tf-idf weighting predicts the correct grid cell for half of the Placing Task 2012 data set; in 66% of all cases, the correct location is among the 10 most likely grid cells.

Table 8.1 Correct decision for spatial segments
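The three weighting schemes (8.9)–(8.12) can be sketched as follows, with `counts[(t, a)]` standing for \(N_{t,a}\) (the data-structure layout is ours):

```python
import math

def w_to(counts, t, a):
    """Term occurrence (8.10): 1 if the tag occurs in the area at all."""
    return 1 if counts.get((t, a), 0) >= 1 else 0

def w_tf(counts, t, a, photo_tags):
    """Term frequency (8.9), normalized by the count of the most frequent
    tag of the photo in the same area."""
    m = max(counts.get((t2, a), 0) for t2 in photo_tags)
    return counts.get((t, a), 0) / m if m else 0.0

def w_idf(counts, t, areas):
    """Inverse document frequency (8.11) over the set of areas."""
    df = sum(w_to(counts, t, a) for a in areas)
    return math.log(len(areas) / df) if df else 0.0

def w_tfidf(counts, t, a, photo_tags, areas):
    """tf-idf (8.12): product of term frequency and inverse doc. frequency."""
    return w_tf(counts, t, a, photo_tags) * w_idf(counts, t, areas)
```

Note how a tag occurring in every area (such as ‘photo’) gets an idf of zero and therefore a tf-idf weight of zero, regardless of its frequency.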

Okapi bm25 is not merely a term scoring method, but rather a method for scoring documents in relation to a query. Introduced in the early 1990s [17], the bm25 formula has been widely adopted and has repeatedly proved its value across a variety of search domains. One of the first comparisons with tf-idf was carried out in [28], showing that the Okapi bm25 weighting scheme performed better in many cases. In the context of geo-location recognition, it can be used in line with a query-document retrieval approach, such as the one we discuss in Sect. 8.3.2. Alternatively, the term weight can also be pre-computed independently of a given query, in which case the Okapi bm25 weights are defined as follows:

$$\begin{aligned} W_{\mathrm{bm}}(t, a) = {\mathrm{idf}}(t) \times \frac{w^\prime _{t,a} \times (k + 1)}{w^\prime _{t,a} + k \times (1 - b + b \times \frac{|D|}{\mathrm{avg}_{dl}})} \end{aligned}$$
(8.13)

where \(w^\prime _{t,a}\) is the weight of the term \(t\) associated with the area \(a\), \(|D|\) is the number of tags associated with area \(a\), \(\mathrm{avg}_{dl}\) is the average number of tags per training sample, and \(k\) and \(b\) are free parameters usually chosen as \(k \in [1.2, 2.0]\) and \(b = 0.75\) (see [15]).

The idf component is given by

$$\begin{aligned} \text {idf}(t) = \log \frac{M - N_{t} + 0.5}{N_{t} + 0.5} \end{aligned}$$
(8.14)

where \(M\) is the total number of training photos, and \(N_{t}\) is the number of training photos containing tag \(t\). We refer to [9, 10, 22] for an experimental comparison of these methods.
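A sketch of (8.13) and (8.14) using the notation above (parameter defaults follow the ranges quoted from [15]; argument names are ours):

```python
import math

def bm25_idf(m_total, n_t):
    """(8.14): m_total training photos, n_t of them containing tag t."""
    return math.log((m_total - n_t + 0.5) / (n_t + 0.5))

def bm25_weight(w_ta, n_tags_area, avg_dl, idf_t, k=1.5, b=0.75):
    """(8.13): w_ta is the pre-weight of tag t in area a, n_tags_area the
    number of tags of the area 'document' (|D|), avg_dl the average
    number of tags per training sample."""
    return idf_t * (w_ta * (k + 1)) / (
        w_ta + k * (1 - b + b * n_tags_area / avg_dl))
```

The saturation in the denominator means that, unlike raw tf, repeated occurrences of the same tag yield diminishing returns.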

8.3.2 Variations on the Retrieval Approach

This approach is inspired by the standard information retrieval setting, where the goal is to retrieve the documents which are most closely related to a given input query. In a geo-location context, our query is formed by the text attached to the test photo or video. The text could be the title, description, tags, or any other associated text. In this section we consider only the tags, but similar considerations apply if other text fields are used. In order to interpret georeferencing as a document retrieval problem, we also need to define the document collection. One possibility is to identify documents with geographic areas. In particular, each document then consists of the tags associated with all training photos of the corresponding area. Section 8.3.2.2 will discuss a number of alternative approaches that could be considered. First, in Sect. 8.3.2.1, we discuss a way to compute the geo-descriptiveness of a word. This measure of how much a word is related to a specific location is important to reduce the impact of noise (i.e., tags which are not indicative of a particular location, but whose distribution is not entirely uniform due to chance).

8.3.2.1 Geo-Relevance Filtering

Describing a geographic area or a specific coordinate as a weighted set of tags requires a weighting scheme that reflects the relationship between tags and coordinates. Since we want to consider only tags that are geographically relevant, we first need a way to determine the level of geographic spread (or geo-descriptiveness) of a tag. This is similar to the problem of term selection, which was discussed in Sect. 8.2.2. However, whereas term selection requires a hard decision, here we are interested in measuring the degree to which a tag is geographically relevant, so that we can discount, rather than ignore, geographically less relevant tags in the retrieval model.

We can measure the relevance of a tag by jointly considering its frequency of occurrence (or term frequency) and the average distance between the locations where it occurs. Therefore, for each tag \(t\) we compute its term frequency \({tf}_t\) in the training data and the average Haversine distance \(d_t\) between the locations of the images or videos which contain \(t\). For example, using the data of MediaEval 2013 [7], the following heuristic approach was applied in [22] for measuring the degree of geo-descriptiveness of a tag \(t\):

$$\begin{aligned} w_t = {\left\{ \begin{array}{ll} -1 &{} \quad \text {if }\mathrm{tf}_t > 100\,\mathrm{K}\,\,\text {or}\,\,d_t < 0.2\\ 10 &{} \quad \text {if }\mathrm{tf}_t \ge 200\,\,\text {and}\,\,10 \le d_t \le 50\\ 5 &{} \quad \text {if }\mathrm{tf}_t \ge 150\,\,\text {and}\,\,d_t \le 70 \\ 1 &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$

This weighting was designed to assign higher weights to tags representing geographic information, i.e., not only place names but also references to locations such as monuments (e.g., Eiffel, Colosseum) or famous people with a geographical correlation (e.g., Gaudi and Barcelona). Table 8.2 illustrates the effect of using these weights.
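Assuming \(d_t\) is measured in kilometres and \(100\,\mathrm{K}\) denotes 100,000 occurrences, the heuristic can be sketched as follows (the Haversine helper and the exhaustive pairwise average are illustrative; in practice the average distance would be computed more efficiently):

```python
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) pairs in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def geo_weight(locations):
    """Weight w_t of a tag, from the (lat, lon) of each training photo containing it."""
    tf = len(locations)                              # term frequency tf_t
    pairs = list(combinations(locations, 2))
    d = sum(haversine_km(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0
    if tf > 100_000 or d < 0.2:                      # too frequent overall, or point-like
        return -1
    if tf >= 200 and 10 <= d <= 50:
        return 10
    if tf >= 150 and d <= 70:
        return 5
    return 1
```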

Table 8.2 On the left, we list the most frequently occurring tags. On the right, we list tags \(t\) which maximize the weight, i.e., for which \(w_t = 10\), ordered by term frequency

8.3.2.2 Frequency Matching

When interpreting the task of georeferencing Flickr photos as a document retrieval task, it is intuitive to treat the set of tags associated with a Flickr photo or video that we want to localise as the query. However, there are several possibilities for defining the document collection.

A simple solution is to consider each photo in the training data as an individual document, where the tags associated with the photo represent the terms of the document. However, this approach is very noisy in the sense that it results in many documents whose text is identical but whose locations differ. For example, the set of tags “france”, “pompidou”, and “paris” appears in many photos with slightly different coordinates, e.g., (48.8611, 2.3521) and (48.6172, 2.213), whereas we are interested in assigning a single location to a given query. This issue can be tackled by collecting all the coordinates associated with the same set of tags and counting how often each pair of coordinates appears. For the set of tags above, we obtain the coordinates (48.8611, 2.3521) and (48.6172, 2.213) with frequencies \(12\) and \(3\), respectively. The frequency count suggests that the former is more reliable than the latter, since it appears more often. Once the most frequent coordinate has been obtained, we associate it with the set of tags. Alternatively, we could also select the medoid of the set of coordinates. Each document in the collection is then a set of tags with an associated coordinate, such that no two documents correspond to the same set of tags:
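The grouping step described above can be sketched in a few lines (function and variable names are ours):

```python
from collections import Counter, defaultdict

def build_tagset_documents(photos):
    """photos: iterable of (tags, (lat, lon)) pairs. Returns one document per
    distinct tag set, mapped to its most frequently observed coordinate."""
    by_tags = defaultdict(Counter)
    for tags, coord in photos:
        by_tags[frozenset(tags)][coord] += 1
    return {tags: coords.most_common(1)[0][0] for tags, coords in by_tags.items()}
```

For the example above, twelve photos at (48.8611, 2.3521) and three at (48.6172, 2.213) sharing the same tag set yield a single document mapped to the first coordinate.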

$$ <\mathrm{france, pompidou, paris}>\quad \rightarrow \quad (48.8611, 2.3521) $$

A second way to define documents in our context is by grouping all tags that are assigned to documents at the same location. In other words, there is a document for each pair of coordinates, whose text contains all tags of all photos in the training data that have these coordinates:

$$ (48.8611, 2.3521)\quad \rightarrow \quad <\mathrm{france, centre, pompidou, centrepompidou, paris, }\dots > $$

However, we have to consider the presence of “duplicates”, as photos at slightly different coordinates are often associated with similar tags (as in the example above). To obtain a cleaner and more reliable collection of documents, it is possible to merge documents if their locations are sufficiently close and their texts are sufficiently similar. Alternatively, we could use a clustering method to obtain a large set of areas and then associate one document with each cluster (Fig. 8.7).
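A minimal sketch of such a merge, assuming illustrative thresholds of 1 km for location proximity and a Jaccard similarity of 0.5 for tag overlap (neither value is taken from the chapter):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) pairs in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def merge_documents(docs, max_km=1.0, min_sim=0.5):
    """docs: list of ((lat, lon), tag_set). Greedily fold a document into an
    earlier one when the locations are within max_km and the tag sets are
    similar enough; otherwise keep it as a new document."""
    merged = []
    for coord, tags in docs:
        for m in merged:
            if haversine_km(m["coord"], coord) <= max_km and jaccard(m["tags"], tags) >= min_sim:
                m["tags"] |= set(tags)
                break
        else:
            merged.append({"coord": coord, "tags": set(tags)})
    return merged
```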

In each case, we obtain a collection of documents, each defined by a bag of tags and a unique location. To find the location of a test photo, we interpret its tags as the query and then identify the most similar document in the collection, taking its location as the most probable location of the test photo. In particular, we retrieve all documents which have at least one word in common with the query and rank these documents by relevance, using a standard weighting scheme such as tf-idf or Okapi BM25 (see Sect. 8.3.1.3). According to [25], choosing the top-ranked document outperforms taking a weighted average of the top-\(k\) documents, although these experiments used a term-frequency weighting rather than tf-idf or Okapi BM25. However, in case of a tie (i.e., when several documents have the maximal relevance score) we could apply other approaches to select a pair of coordinates. In [21, 22], for example, the medoid of all the top-ranked documents, weighted by the number of occurrences of each location, is taken as the most representative location.
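As an illustration, a bare-bones tf-idf retrieval over such a document collection might look as follows (tie-breaking by medoid is omitted, and BM25 would be a drop-in replacement for the scoring function):

```python
from collections import Counter
from math import log

def tfidf_locate(query_tags, documents):
    """documents: dict doc_id -> (tag_list, (lat, lon)). Scores each document
    that shares a tag with the query by a simple tf-idf sum and returns the
    location of the top-ranked document (None if nothing matches)."""
    n = len(documents)
    df = Counter()                       # document frequency of each tag
    for tags, _ in documents.values():
        df.update(set(tags))
    best, best_score = None, 0.0
    for tags, coord in documents.values():
        tf = Counter(tags)
        score = sum(tf[t] * log(n / df[t]) for t in query_tags if t in tf)
        if score > best_score:
            best, best_score = coord, score
    return best
```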

Fig. 8.7
figure 7

Coordinates of three tags plotted on the world map. The first and third figures show the spread of tags that are not bound to specific locations, whereas the middle one shows a tag related to a specific country. At first glance it is clear which tag most meaningfully describes a specific location: a beach, b italy, c iphone

8.4 Visual Approach

The georeferencing of photos or video recordings with visual features is useful for fine-tuning the result of the textual approaches and for georeferencing Flickr resources without any tags.Footnote 3 Several authors have investigated the usefulness of visual content for predicting geographical tags [11]. As expected, visual information is more challenging to use than textual information. Nevertheless, visual georeferencing is also useful for other vision tasks, narrowing down the possibilities for further processing (e.g., visual landmark recognition) or refining textual approaches. Imagine a photo depicting a coastal scene or open water for which the geo-coordinates are not known. Many locations, such as hinterlands or metropolitan areas, become very unlikely and can be excluded from further processing. As in the case of textual methods, we can use classification approaches and retrieval approaches to take visual information into account.

8.4.1 A Classification Approach to Using Visual Information

The classification approach is about modelling the characteristics of a spatial region in terms of visual features. These features can be described by various descriptors, grouped into colour and edge/texture descriptors. The most prominent are Colour and Edge Directivity (CEDD), Gabor (GD), Scalable Colour (SCD), Tamura (TD), Edge Histogram (EHD), Autocorrelogram (ACC), and Colour Layout (CLD). Colour histograms, in which the colour distribution is represented in quantised colour component bins, are used to differentiate images on the basis of human colour perception. Edge features, as another class of visual features, help distinguish between natural scenes and scenes containing man-made structures, while texture features can help discriminate properties such as different terrain types.

These features are computed on each single image, or on key frames in the case of video sequences in order to reduce their temporal dimensionality. With these descriptors, a wide spectrum of colour and texture features within images is covered. We are aware that some descriptors address similar image features. If computation time matters, e.g., during a machine learning step, dimensionality reduction techniques such as principal component analysis or cross-correlation analysis can be applied. In this way, more or less sophisticated machine learning algorithms can be used to generate a model of each spatial area. To this end, all features can be included in the learning step, or several models can be generated, one for each feature. In the latter case, the classification performance of the individual models can be used to identify the most geo-related visual feature, as shown in Table 8.3. This table contains the results of a nearest-neighbour classification for each descriptor and two hierarchy levels (see Sect. 8.2.1). Here, the classification result in terms of accuracy at selected error margins is evaluated per descriptor. Details about the experimental set-up can be found in [12]. The table contains the classification accuracy not only for different features and different margins of error, but also for different area sizes (here labelled as ‘Block size’). Block sizes labelled ‘large’ stand for a spatial level in which each area is as big as the surface spanned by one degree of latitude and longitude, resulting in a grid of 360 by 180 cells. Areas at the spatial level labelled ‘small’ are only a quarter of that size, which means the world map is segmented into a grid of 720 by 360 cells. This method can iteratively determine the most visually similar spatial area by calculating the Euclidean norm of the respective visual descriptor values.
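The two block sizes can be realised with a simple coordinate-to-cell mapping (a sketch; the indexing convention, with cell (0, 0) at the south-west corner, is ours):

```python
from math import floor

def grid_cell(lat, lon, level="large"):
    """Map a (lat, lon) coordinate to a grid cell index.
    'large': 1-degree cells, a 360 x 180 grid; 'small': 0.5-degree cells,
    a 720 x 360 grid. The north-pole edge case (lat == 90) would need
    clamping in practice."""
    size = 1.0 if level == "large" else 0.5
    row = int(floor((lat + 90.0) / size))    # 0 at the southernmost row
    col = int(floor((lon + 180.0) / size))   # 0 at the antimeridian
    return row, col
```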

Table 8.3 Accuracies on selected error margins (in km) of the visual approach with different descriptors

As can be seen in Table 8.3, the scalable colour descriptor (SCD) consistently outperforms the other descriptors. Although textual approaches perform much better overall than visual approaches, it should be noted that the best visual model, using the scalable colour descriptor, achieves a result that is three times more accurate than the random baseline (12 % at 1,000 km). The nearest-neighbour classification can be drastically sped up by organising the descriptor values in tree structures. A k-d tree has the advantage that the subsequent search for nearest neighbours is faster, because the search proceeds recursively among the branches of the tree. Finding the nearest neighbour is then an \(O(\log N)\) operation instead of \(O(N)\) in a naive implementation. Here, the problem is the generation of a proper ‘model’ for each area. The simple but fast method of averaging is quite successful, but more sophisticated techniques, such as classification methods like support vector machines, can be applied, which may enhance the robustness of the model. Applying such a method, a decision \(a_{i,j}\) and a probability score \(w_{i,j}\) are generated for each geographic area (i.e., cluster or grid cell) \(j\) and each photo or key frame \(i\). Since videos consist of multiple images, the single frames can be classified into different areas, so a subsequent step is necessary to determine the final location decision for the whole sequence. Here, voting leads to a single decision for the complete video sequence. Two voting methods that have previously been used are consensus voting and weighted voting. Weighted voting is based on the idea that not all decisions at the key frame level are equally accurate. The decision for the predicted area is weighted with the probability score \(w_{i,j}\), such that more accurate decisions on key frames contribute more to the final video result.
This is in contrast to consensus voting, which assumes that each key frame’s decision carries equal weight, i.e., the scores \(w_{i,j}\) are effectively ignored. The decision for a spatial area of a video sequence, when using weighted voting, is:

$$\begin{aligned} a = \arg \max _{j} \sum _{i} w_{i,j} \cdot a_{i,j}, \end{aligned}$$
(8.15)

where the decision \(a_{i,j}\) is set to \(1\), if key frame \(i\) is classified to be in the \(j\)th spatial area, otherwise it is \(0\). The decision rule for consensus voting is obtained by replacing the weights \(w_{i,j}\) by \(1\) in the previous formula.
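Both voting rules can be sketched in a few lines (the decision \(a_{i,j}\) is represented implicitly by which area each key frame votes for):

```python
from collections import defaultdict

def vote(frame_decisions, weighted=True):
    """frame_decisions: one (area, score) pair per key frame, where score is
    the classifier's probability w_ij for the frame's predicted area.
    weighted=True implements Eq. (8.15); weighted=False gives consensus
    voting, where every frame counts equally."""
    totals = defaultdict(float)
    for area, score in frame_decisions:
        totals[area] += score if weighted else 1.0
    return max(totals, key=totals.get)
```

For example, one confident key frame can outvote two uncertain ones under weighted voting, while consensus voting follows the majority.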

8.4.2 A Retrieval Approach to Using Visual Information

The retrieval approach is closely related to the field of content-based image retrieval (CBIR). In CBIR, visually similar media items are returned for a given query image. These visually similar media items are taken from the training data and are therefore geo-tagged; hence, their location can be propagated to the query image or video. We now explain the basic technique for georeferencing media using their visual content in more detail.

The starting point is an image depicting an outdoor scene for which the geo-coordinates are not known. A CBIR system retrieves images that capture the gist of the query image and returns georeferenced images of similar-looking outdoor scenes. These can be used to predict the location of the query image, based on the assumption that similar scenes in images (or key frames from video sequences) correlate with similar geographic areas. In its simplest form, this approach looks for the nearest neighbour of a query item by comparing relatively low-dimensional feature vectors, which is faster than running a sophisticated classification or object recognition algorithm over many object models. Estimating location with this nearest-neighbour method requires a large dataset that covers the entire world. The retrieval approach also has some restrictions: the required database should not only be large, but should also densely cover the world, which is often not the case. As a result, in practice there is a bias towards North America, Europe, and popular tourist destinations, due to the large number of photos taken in these areas. However, the greatest limitation is due to visual ambiguity. Images depicting coastal scenes but captured at different places can look very similar. This restricts the ability of approaches based on visual similarity to georeference locations that do not look distinctive. Georeferencing based on visual features often does not lead to accurate locations, but it can constrain the search space and may thus be used effectively together with other methods (e.g., based on textual features or context information such as the home location of the owner). However, using visual information is not recommended as a stand-alone approach.
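In its simplest form, this nearest-neighbour propagation might look as follows (feature extraction is assumed to have happened elsewhere; Euclidean distance over descriptor vectors is one common choice):

```python
from math import dist  # Euclidean distance, Python 3.8+

def propagate_location(query_vec, references):
    """references: (feature_vector, (lat, lon)) pairs from geo-tagged training
    images. Return the location of the visually nearest neighbour."""
    _, coord = min(references, key=lambda r: dist(query_vec, r[0]))
    return coord
```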

8.5 Centroid-Based Candidate Fusion

We are interested in addressing cases where the training dataset is sparse (e.g., to apply the methods discussed in this chapter to regions of the world where the uptake of Flickr is limited). To this end, we explored the possibility of using unlabelled instances, i.e., photos or videos whose coordinates are unknown. In particular, we used the videos in the test data for this purpose. The Placing Task dataset benefits from a high density of photographs at popular locations and well-known places (see Fig. 8.1) across the Earth. Sections 8.3 and 8.4 show that the retrieval approaches perform very poorly when asked to locate images from undersampled regions. In [8] and [14] it has been shown that the problem of a sparse dataset cannot be solved by simply increasing the size of the dataset: a greater number of training images does not lead to a uniform coverage of the world, but mainly to a better description of popular places. Choi et al. [3] proposed a graphical model framework in which geo-tagging is interpreted as a graph inference problem, and showed that performance improvements can be achieved by smart processing of the test dataset. The node potentials in this graph-based framework are modelled as a product of the term-location distributions (see Sect. 8.3.1.1), given each tag individually.

Next, we describe a centroid-based candidate fusion method to solve the problem of data sparsity and to enhance the distributions of single candidates in a multimodal manner. The framework also facilitates the fusion of textual and visual features that can further improve the localization performance. One of the biggest problems in the fusion of multimodal features is the different range of features from each of the domains.

This centroid-based candidate fusion approach is based on the sum rule in decision fusion [18]. Since our textual location model produces logarithmic confidence scores, these are first mapped back to probabilities \(P(l_n|d)\). Subsequently, these scores are used to generate normalized weights for the candidate fusion:

$$\begin{aligned} w_n= \frac{P(l_n|d)}{\sum _{i=1}^{N}P(l_i|d)}, \end{aligned}$$
(8.16)

where \(w_n\) is the weight of the \(n\)th candidate of a test video \(d\). The equation implies that the weights sum to \(1\). Then, the candidate locations GPS(\(\cdot \)) are combined in a weighted fashion using the sum rule, such that the most likely location receives the highest weight. The location centroid \(\mathbf {x}^{v|t}\) arising from all visual (v) or textual (t) candidates is calculated as follows:

$$\begin{aligned} \mathbf {x}^{v|t}=\sum _{n=1}^N w_n \cdot \mathrm{GPS}(l_n). \end{aligned}$$
(8.17)

This forms the most likely location for a given video, although the centroid need not coincide with an existing item in the training data. In this way, we determine a location for each specific feature. Since we want to include several different features from different modalities in our framework, a confidence score for each calculated centroid is needed for the subsequent multimodal fusion. Here, we choose weights that are inversely proportional to the standard deviation: the more a feature correlates with a specific spatial location, the closer together the likely candidates are located, which implies a small deviation and therefore a high-valued weight. The calculation of the spatial deviation is shown in the following equation:

$$\begin{aligned} \sigma ^{v|t}=\sqrt{\sum _{n=1}^N\left( \mathrm{GPS}\left( l_n \right) - \mathrm{GPS}\left( \mathbf {x}^{v|t} \right) \right) ^2 \cdot w_n} \end{aligned}$$
(8.18)

Using these formulas, a centroid is determined for each feature. The final decision for the video location \(X\) is specified as shown in Fig. 8.8 using the following multimodal fusion:

$$\begin{aligned} \mathbf {X}= w^t \cdot \mathbf {x}^t + w^v \cdot \mathbf {x}^v, \end{aligned}$$
(8.19)
Fig. 8.8
figure 8

Illustration of fusion of textual and visual candidates

where the weights \(w^{v|t}\) are calculated according to (8.16), but with \(\frac{1}{\sigma }\) in place of \(P(l_i|d)\). Figure 8.9 shows the confidence scores of both modalities for an example videoFootnote 4 depicting a Formula One scene captured in Montreal, Canada. The confidence score is colour-coded as follows: very unlikely areas are depicted in black, and the colour gets lighter with increasing likelihood of the corresponding areas. As can be seen, areas around Montreal are more likely than most other areas. The scores of the visual approach, using scalable colour as the feature, are depicted in Fig. 8.9a. Here there are many likely regions in the world: based on the visual information alone, this video

Fig. 8.9
figure 9

Confidence scores (in log scale) of the textual (a) and visual approach (b) for North America

Fig. 8.10
figure 10

Confidence scores of the visual approach (SCD) restricted to be in the most likely spatial segment determined by the textual approach (tf-idf)

Fig. 8.11
figure 11

Accuracy achieved by single methods within the hierarchy for selected margins of error

sequence may have been recorded at many locations in the world, hence we need textual metadata to reduce the number of possible candidates.

Figure 8.10 shows such a restriction. The tf-idf text model is based on the hierarchical spatial segmentation method explained in Sect. 8.2.1; it predicts the most likely segment at the highest hierarchy levels, and the visual SCD model predicts locations within these segments. As shown, the previous example is correctly assigned to the city of Montreal, Canada. Here, the fusion of textual and visual methods is important to eliminate geographical ambiguities. The candidates of both modalities are combined using our centroid-based fusion as described in Sect. 8.5. Since the videos of the Placing Task dataset are well tagged, the textual model produces strong candidates, and the combination with the visual candidates yields a location gain at small scale. In general, the fusion of several candidates from both modalities is important to eliminate geographical ambiguities. As depicted in Fig. 8.11, the centroid-based fusion improves the results, especially at smaller margins of error, overcoming the sparse nature of the dataset. The strongest gain is achieved by the gazetteer-based national border detection, which eliminates geographical ambiguity in conjunction with the probabilistic models.
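The fusion pipeline of Eqs. (8.16)–(8.19) can be sketched as follows for a single test video (a simplification: scores are assumed to already be in linear scale, and plain Euclidean geometry on latitude/longitude replaces the GPS(\(\cdot\)) notation):

```python
def fuse(candidates):
    """candidates: (score, (lat, lon)) pairs for one modality, with scores in
    linear (non-logarithmic) scale. Returns the weighted centroid, Eq. (8.17),
    and the spatial deviation, Eq. (8.18)."""
    total = sum(s for s, _ in candidates)
    ws = [s / total for s, _ in candidates]                    # Eq. (8.16)
    lat = sum(w * c[0] for w, (_, c) in zip(ws, candidates))
    lon = sum(w * c[1] for w, (_, c) in zip(ws, candidates))
    var = sum(w * ((c[0] - lat) ** 2 + (c[1] - lon) ** 2)
              for w, (_, c) in zip(ws, candidates))
    return (lat, lon), var ** 0.5

def multimodal(text_cands, vis_cands):
    """Final location X, Eq. (8.19): each modality's centroid is weighted by
    the inverse of its spatial deviation, normalised to sum to 1."""
    (ct, st), (cv, sv) = fuse(text_cands), fuse(vis_cands)
    wt = (1 / st) / (1 / st + 1 / sv)
    wv = 1 - wt
    return (wt * ct[0] + wv * cv[0], wt * ct[1] + wv * cv[1])
```

A modality whose candidates cluster tightly (small deviation) thus pulls the final location \(\mathbf{X}\) towards its centroid, mirroring the behaviour illustrated in Fig. 8.8.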

8.6 Conclusion

In this chapter, we discussed a variety of techniques for estimating the geographical location of a Flickr photo or video. We first discussed several methods for clustering the world into a finite number of disjoint geographic areas, as well as the idea of using national borders for this purpose. We also noted that such a spatial segmentation step can be done in a hierarchical fashion, which can have computational advantages (e.g., computationally more expensive methods may be feasible once the range of possible locations has been narrowed down to a particular country, or even to a particular city). Once a set of disjoint geographical areas has been identified, we can treat the problem of georeferencing Flickr photos or videos as a text classification problem, where the geographic areas are the classes and each photo is a document (with its tags as terms). Alternatively, georeferencing can be treated as a retrieval problem. Experimental evidence suggests that optimal performance is achieved by combining both methods, i.e., using a text classification method to find the most likely geographic area and then using a retrieval method to find the most likely location within that area. Subsequently, we discussed the value of visual features. We argued that such features can be very valuable for refining the locations predicted from textual features, although visual features are not sufficiently powerful to come up with likely locations by themselves (except in very particular cases, such as when the photo contains an easily recognizable landmark). Finally, we discussed a method for centroid-based candidate fusion, which ameliorates the problem of data sparsity and enhances the distributions of single candidates using information from multiple modalities.