On the Role of Geometry in Geo-Localization

Humans can build a mental map of a geographical area to find their way and recognize places. The basic task we consider is geo-localization: finding the pose (position & orientation) of a camera in a large 3D scene from a single image. We aim to experimentally explore the role of geometry in geo-localization in a convolutional neural network (CNN) solution. To do so, we deliberately ignore the often available texture of the scene as well as rich geometric details, and use images projected from a simple 3D model of a city, which we term lean images. Lean images contain solely information that relates to the geometry of the area viewed (edges, faces, or relative depth). We find that the network is capable of estimating the camera pose from the lean images, and it does so not by memorization but by some measure of geometric learning of the geographical area. The main contributions of this paper are: (i) providing insight into the role of geometry in the CNN learning process; and (ii) demonstrating the power of CNNs for recovering camera pose using lean images.


Introduction
Recently, several works in the field focused on trying to understand how neural networks represent data and tackle their limits [33]. Our paper's main goal is to study the role of geometry in a CNN solution to the geo-localization task rather than propose a working system for application purposes.
What is the geo-localization task? Imagine you are brought blindfolded to a street corner of a city you know well. Now, you remove the blindfold. Can you tell where you are? In the computer vision field, this amounts to estimating the position (and sometimes the orientation) of a camera given its current view. Although localization devices such as Global Positioning Systems (GPS) have improved significantly in recent years, they often do not work well in city scenes and do not provide highly accurate results. Autonomous cars, drones, and IoT devices are expected to benefit tremendously from the ability to determine their pose (position & orientation) accurately in their environment.
A solution for geo-localization, either by a human or by a machine, can use appearance cues (e.g., texture of a unique building), geometric cues (e.g., a unique shape of a building), or both. In the '80s and '90s, many computer vision tasks were based on edge images (i.e., mostly geometry). More recently, significant improvements were obtained in object recognition, scene recognition, and localization tasks, largely by exploiting the appearance of the scene (e.g., color and texture and image features such as SIFT [26,15,14]). Later, these methods were improved by adding coarse geometric constraints to the image features (e.g., [20,2]). Nowadays, methods are based on machine learning, in particular convolutional neural networks (CNNs), where the input is the unprocessed image. Clearly, both appearance and geometry play an important role in these methods.
We aim to explore the role of geometry alone in geo-localization using an end-to-end deep neural network, while ignoring the often available texture of the scene. To do this, we consider the geo-localization task using lean images. Lean images contain mostly information that relates to the geometry of the scene while lacking texture or rich geometric details. In particular, we use a city scene and consider two types of binary images that consist of the edges of the buildings' outline and the buildings' facades. In addition, we also consider depth images that contain more geometric information. Examples of the three types of lean images are shown in Figure 1 (top). Note that in the first row, the view contains dominant landmarks, while the second row shows very little distinct information that might be expected to assist localization. Such non-distinct views are very common in large environments such as a city, making localization with lean images very challenging. Further note that we deliberately do not consider real images or synthetic images with texture, since our goal is to study only the information available from purely geometric information.
We use an untextured 3D mesh model of Berlin [3] to generate our data. A bird's-eye view of one of the areas is shown in Figure 2. Using such a model allows us to study the role of geometry for geo-localization in a controlled manner and on a larger scale than ever before, both in terms of the area covered (many city streets) and in terms of the number of images (up to hundreds of thousands). Our images are obtained simply by projecting the model from various positions in the scene. Each image is defined by four parameters: (x, y) represents the camera position on the ground plane and (θ, φ) represents the yaw and pitch angles of the camera, respectively. We assume for simplicity that the picture is taken at a fixed height, and the roll angle is fixed as horizontal.
A typical geo-localization solution will return either the pose from which an image was taken or the most similar image from a database. We consider two variants of the geo-localization task. The first is recognizing a previously seen view of the scene, which we refer to as Geo-Matching. The second is determining the camera pose in a previously unseen view, which we refer to as Geo-Interpolation.
The geo-matching task can be regarded as image retrieval from a database of all available views of the city. A naïve solution would store all images and then perform a brute-force search in the database. However, this is inefficient and can become infeasible as the database gets larger. A compact representation and an efficient search are clearly desired. These are often obtained with a manually engineered image representation (e.g., a dictionary of image features) and an image retrieval approach, including the metric between the stored representation and a target one (e.g., [27,15,21,34,18,25,7,14]). Neural networks were shown to be effective in geo-localization tasks (e.g., [12,31,17]). They may be used to perform both functions: provide storage and learn the metric. The questions we address here are (i) whether CNNs can also be used to solve the geo-matching task from lean images and (ii) whether geometric information is used by the CNN to solve the task. In the geo-interpolation task the image query is not part of the training set. In this case we ask (iii) whether the CNN can generalize and support geo-interpolation in such large environments using only geometric and spatial data.
As discussed in the results section (Sec. 6), we found positive answers to all these questions, but the results depend on the number of images and their sampling density. We believe this indicates that networks can learn some sort of a spatial map for an area using only geometric data, since no colors or textures are available in our data. The success of geo-localization also depends on the specific position. Figure 1 (bottom) shows how certain positions in the streets of a city are more recognizable than others. The paper presents an empirical study regarding the information that can be used by CNNs; we do not propose a practical solution based on lean images. The main contributions of our study are: (i) proposing a systematic method to study the role of geometry in CNNs when trained to solve geo-localization tasks; (ii) demonstrating the power of CNNs to use the geometric information efficiently; and (iii) showing that lean images contain sufficient information for solving the geo-matching and geo-interpolation tasks.

Related Work
Place recognition (e.g., recognizing the Eiffel Tower in an image) can be regarded as a coarse geo-localization task. Finding images of the same place is a basic tool for solving this task. Classic approaches use visual features to represent each image in a set of images of a given location (e.g., by a bag of words) and then match a target image with the stored representations (e.g., [26,27,15,21]). Hays & Efros [7] were the first to address the place recognition task using millions of geo-tagged images. In their study they compare the results obtained when various visual features are used (tiny images, color histograms, texton histograms, line features, gist descriptors with color and geometric context).
In our study, we consider the geo-localization task, where both position and orientation of a camera with respect to a scene should be estimated. A possible solution can be obtained using triangulation with images with known pose (e.g., [34]). In most studies, 3D models of the scene are used by means of point-clouds (e.g., [9,23,16,29]), Digital Elevation Maps (DEM) (e.g., [1,2]), or full 3D models (e.g., [20]). One of the main challenges of these works is to develop an efficient computation of 2D to 3D feature matching. The matching can then be used to determine the query image pose with respect to the model. Computing the matching requires dealing with a very large search space, and outliers must also be discarded. Works that deal with these challenges include classic studies of outlier removal (e.g., [5,6,29,13,24]).
New 3D feature representations have also been developed (e.g., [9,23]). Bansal & Daniilidis [2] introduce a feature more closely related to the lean images we consider. It consists of 3D corners and direction vectors extracted from a DEM, to be matched geometrically to the corners and roof-line edges of buildings visible in a street-level query image.
Efficiency and robustness become even more important when dealing with a city-scale 3D model. A fast method for inlier detection that enables solving the correspondence problem on such a scale was suggested by Svärm et al. [28]. A recent survey of existing localization methods can be found in [19].
One of the key ideas that bypasses the challenge of defining an efficient and robust 2D-3D feature matching required by the abovementioned methods is to use an end-to-end CNN solution that performs both feature extraction and matching. PoseNet [12] is an impressive CNN-based approach for estimating the camera pose of real images. A dataset of images was used for training GoogLeNet [30], where the 6-DoF pose of the camera was used as ground truth. The final softmax layer of GoogLeNet, which was used for an object classification task, was replaced by a vector for a regression task. GoogLeNet was pre-trained on the ImageNet dataset for the object recognition challenge [4,22]. Walch et al. [31] suggested an improvement to the PoseNet CNN architecture by adding an LSTM, which reduces the dimensionality of the feature vector. Melekhov et al. [17] used ResNet34 [8] within an encoder-decoder structure to improve model accuracy. Kendall & Cipolla [10] improved their earlier work [12] by applying an uncertainty framework to the CNN pose regressor. In another work, Kendall & Cipolla [11] studied the effect of various loss functions on the result of PoseNet.
In our study we assume a 3D model of a city is given. Our setup is very challenging since the model and the images consist of only the coarse 3D structure of the scene, without texture for computing image features. On the other hand, our images are noise-free and there are no object-level occlusions such as trees, cars and people. Our method uses a CNN in a similar manner to PoseNet. However, we use the ResNet50 architecture, also modified for regression, which produced better results for our task. We trained our network from scratch since we use lean images, which are projections of an untextured 3D model; pre-training on textured images is therefore irrelevant. In addition, we were not limited by data size, as we projected as many images as we chose.
Most importantly, our goal differs from that of the aforementioned methods: whereas they focus on obtaining a better and faster solution for geo-localization, we focus on trying to understand the role of geometry alone in geo-localization, by systematically training and testing the same neural network on controlled datasets.

Data
"Your network is as good as your data" is a common phrase in the world of neural networks. Our case is no different. In this respect, using a 3D model as the data source is highly advantageous: we can sample as many images as necessary from the 3D model in any position, orientation and resolution.
All images used in our study are projections of a 3D model of Berlin [3] without textures. This model is very simple: it contains only the geometry of building walls and rooftops, and does not contain any fine geometric details such as window frames or doors (see the bird's-eye view in Figure 2). We consider three types of images: edge, face, and depth map (see Figure 1, top), and we call them lean images since they contain no texture or structural details. Face images are actually the buildings' facades.
We generated several image datasets that are sampled uniformly along a 4D grid, where each image is defined by its camera pose, that is, (x, y, θ, φ), where (x, y) is the position on the ground plane and (θ, φ) is the camera orientation. The density in the (x, y) domain varies between the datasets but is fixed in the (θ, φ) domain. Each set of images is created in a defined area of the city. The number of images in the set is determined by the size of the area and the grid sampling density. The three types of lean images were generated for each sample pose.
When dealing with lean images, care must be taken not to include empty images, for example when the camera is facing a building wall from a short distance. Such images contain almost no visual information and do not contribute to the learning process. We define an invalid image as an image that has fewer than 8 edges, or an image that does not contain a skyline (we require at least 50% of its top-most pixel row to be sky). Moreover, images associated with a pose inside a building are irrelevant to geo-localization, and are also defined as invalid. We remove invalid images from both the training and the test sets (see Figure 3).
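The validity rules above can be sketched as a simple filter. This is only an illustration under stated assumptions: the function name and its inputs are hypothetical, and the edge count, the sky mask of the top pixel row, and the inside-building flag are assumed to be computed elsewhere (e.g., by the renderer), since the text does not specify how they are obtained.

```python
def is_valid_lean_image(num_edges, top_row_is_sky, pose_inside_building,
                        min_edges=8, min_sky_fraction=0.5):
    """Return True if a lean image passes the three validity checks.

    num_edges            -- number of edges visible in the image (assumed precomputed)
    top_row_is_sky       -- non-empty list of booleans, one per pixel of the top row
    pose_inside_building -- True if the camera position lies inside a building
    """
    # Poses inside buildings are irrelevant to geo-localization.
    if pose_inside_building:
        return False
    # Images with fewer than `min_edges` edges carry almost no information.
    if num_edges < min_edges:
        return False
    # Skyline check: at least half of the top-most pixel row must be sky.
    sky_fraction = sum(top_row_is_sky) / len(top_row_is_sky)
    return sky_fraction >= min_sky_fraction
```

Filtering with a predicate of this kind would be applied identically to the training and test samples, so that both sets follow the same definition of a valid view.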
Although the representation of Euler angles using (θ, φ) is natural and easier to comprehend, it suffers from ambiguity and gimbal lock. Therefore, in practice we use quaternions, which offer stability, efficiency and compactness (see [12]). Each image sampled from the 3D model was defined by a 6D pose vector of the form (x, y, q1, q2, q3, q4), which in fact represents 4 degrees of freedom.
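As an illustration of this parameterization, the following minimal sketch converts a yaw/pitch pair into a unit quaternion and assembles the 6D pose vector. The axis convention is an assumption and is not stated in the text: z is taken as the vertical axis, yaw rotates about z, pitch rotates about y, roll is zero, and the quaternion is stored scalar-first. All function names are hypothetical.

```python
import math

def quat_multiply(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )

def yaw_pitch_to_quaternion(yaw_deg, pitch_deg):
    """Unit quaternion for yaw about z followed by pitch about y (assumed convention)."""
    half_yaw = math.radians(yaw_deg) / 2.0
    half_pitch = math.radians(pitch_deg) / 2.0
    q_yaw = (math.cos(half_yaw), 0.0, 0.0, math.sin(half_yaw))
    q_pitch = (math.cos(half_pitch), 0.0, math.sin(half_pitch), 0.0)
    return quat_multiply(q_yaw, q_pitch)

def pose_vector(x, y, yaw_deg, pitch_deg):
    """6D pose (x, y, q1, q2, q3, q4) with only 4 degrees of freedom."""
    return (x, y) + yaw_pitch_to_quaternion(yaw_deg, pitch_deg)
```

Because the quaternion always has unit norm and the roll is fixed, the four quaternion components encode only the two orientation angles, which is why the 6D vector has 4 degrees of freedom.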
Our geo-matching task could have been defined as a classification task where each (x, y, θ, φ) is considered as a class. However, this would involve learning a huge number of classes (∼ 170K). In addition, a classification setup loses the spatial relations among the images because each class is considered completely unrelated to others. This prevents the network from exploiting the geometric structure and information, and can preclude an answer to one of our main research questions: Can a network use geometric information?
Thus, a CNN that solves a regression task is more suitable for our problem. This also allows us to use the same trained CNN for the geo-interpolation task, by directly returning the pose of unseen images in the test set. If a classification CNN were used instead, it would require post-processing of the result, e.g., by classic methods such as averaging the K-nearest classification matches. Because the considered CNNs were designed for classification tasks, we follow [12] and modify them to solve a regression task by simply removing the last softmax layer and replacing it by a fully connected layer that outputs our result vector (x, y, q1, q2, q3, q4). Although position and orientation are considered as different tasks [12] which should have some weighting factor during the learning process, we noticed that normalizing (x, y) with respect to the total area size eliminates the need for such weighting. Our loss function is the ℓ2 loss for the position (x, y) and the ℓ2 loss for the orientation (q1, q2, q3, q4).
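A minimal sketch of such a loss for a single sample is given below. It assumes that both terms are Euclidean (ℓ2) norms of the residual, that the position is normalized by dividing x by the area width and y by the area height, and that the two terms are summed without a weighting factor. The exact normalization and the way the terms are combined are not specified in the text, so these choices, like the function name, are illustrative only.

```python
import math

def pose_loss(pred, target, area_width, area_height):
    """Unweighted sum of position and orientation l2 errors for one sample.

    pred, target -- pose vectors (x, y, q1, q2, q3, q4)
    area_width, area_height -- size of the area of interest, used to
    normalize the position so that no extra weighting factor is needed
    (assumed normalization scheme).
    """
    dx = (pred[0] - target[0]) / area_width
    dy = (pred[1] - target[1]) / area_height
    position_term = math.sqrt(dx * dx + dy * dy)
    orientation_term = math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred[2:], target[2:]))
    )
    return position_term + orientation_term
```

With the position scaled to the unit square, both terms have a comparable range (quaternion components lie in [-1, 1]), which is one plausible reading of why the explicit weighting becomes unnecessary.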
In a set of preliminary experiments we found that ResNet50 had the combination of smallest network size in terms of parameters and the best training and testing results. Therefore, we report our experiments using only the ResNet50 architecture. We decided not to use transfer learning using pre-trained weights, since the networks we tested were trained with ImageNet, which contains real photographs. Our assumption is that the pre-trained models are tuned for texture information that is not available in lean images. Hence, in our experiments, we trained the CNNs from scratch (note that we did use transfer learning within our setup; see Section 6.4).

Tasks & Hypothesis
We considered two localization tasks: retrieving the camera pose of an image from the training set (geo-matching), and recovering the camera pose of an image not in the training set (geo-interpolation).
Our goal was to answer the following questions: (i) Does geometry play a role when training the CNN for localization? (ii) Can a CNN be trained to solve these localization tasks from lean images?

Geo-Matching Task
Given an image from the training set, we tested whether the correct camera pose could be determined. In a sense, the network is trained to overfit. However, this would mean that the network managed to encode the entire set of images in some feature space as well as compute an efficient matching function between the features of the images to find the right pose.
(A) Geometrically Correlated: We examined whether a CNN can solve the geo-matching task using lean images. In this test, the camera pose for generating the image was used as ground truth for training. Hence, the pose of nearby images is correlated and the network has access to this geometric information.
(B) Geometrically Decorrelated: An alternative interpretation of the geo-matching task is that the network solves a simple indexing task, where the image's pose serves as a 4D label. Under this interpretation, the CNN does not use the available geometric information. Hence, an arbitrary labeling of the images would work just as well as in task A. To test this, we used arbitrary poses as the image ground truth for task B. We randomly shuffled the pose information between images so that poses were not spatially correlated with respect to the images. If no geometry is used by the CNN, the results on this training data are expected to be similar to those obtained with the real pose as ground truth.
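The decorrelation step can be sketched as a random permutation of the pose labels. The snippet below is an illustration, not the procedure actually used: the function name and the fixed seed are assumptions. The set of labels is unchanged; only the assignment of labels to images is randomized.

```python
import random

def decorrelate_poses(poses, seed=0):
    """Return the pose labels in a random order.

    The i-th image keeps its pixels but receives the i-th entry of the
    returned list as ground truth, so nearby views no longer have
    nearby labels.
    """
    rng = random.Random(seed)   # fixed seed (an assumption) for reproducibility
    shuffled = list(poses)      # copy, so the original labels stay intact
    rng.shuffle(shuffled)
    return shuffled
```

Since the label distribution is identical in tasks A and B, any difference in matching accuracy can be attributed to the loss of spatial correlation between images and labels rather than to a change in the labels themselves.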
Evaluation: Since our network is a regression network, the computed pose does not necessarily match exactly a pose of an image from the training set (see Figure 4a). We used the nearest neighbor (nn) grid sample to the computed pose as the retrieved pose. We report the percentage of images whose correct pose is the nearest neighbor (1nn) and also report the percentage of images whose correct pose is among the three nearest neighbors (3nn) of the computed pose. These evaluations were used for both geo-matching tasks, (A) and (B). An additional advantage of using this measure is that it is given in grid steps and not in meter/angle units, circumventing the difficulty of comparing distances and angles and enabling a comparison of results with different grid densities (we do provide numerical ℓ2 errors in Table 2).
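One way to realize the 1nn and 3nn measures is sketched below. It assumes that poses are compared as (x, y, θ, φ) tuples, that each coordinate is divided by its grid step so that distances are expressed in grid steps, that the Euclidean distance is used, and that yaw wraps around at 360°. None of these details are given in the text, and the function names are hypothetical.

```python
import math

def grid_distance(pose_a, pose_b, steps, yaw_period=360.0):
    """Euclidean distance, in grid-step units, between two (x, y, yaw, pitch) poses."""
    total = 0.0
    for axis, (a, b, step) in enumerate(zip(pose_a, pose_b, steps)):
        diff = abs(a - b)
        if axis == 2:  # yaw is periodic, so take the shorter way around
            diff = diff % yaw_period
            diff = min(diff, yaw_period - diff)
        total += (diff / step) ** 2
    return math.sqrt(total)

def in_k_nearest(predicted, true_index, train_poses, steps, k=1):
    """True if the correct training pose is among the k grid samples nearest to the prediction."""
    ranked = sorted(
        range(len(train_poses)),
        key=lambda i: grid_distance(predicted, train_poses[i], steps),
    )
    return true_index in ranked[:k]
```

The 1nn and 3nn scores are then the fractions of training images for which `in_k_nearest` returns True with k=1 and k=3, respectively. A brute-force ranking is used here for clarity; on a regular grid the nearest sample could also be found by rounding each coordinate.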

Geo-Interpolation Task (C)
We tested whether the network can estimate the pose of an image that is not in the training set. To avoid over-fitting and allow generalization, the network was trained until the best result was achieved on a validation set. We do not expect the network to return a correct position that is outside the learned area. Thus, this task is viewed as an interpolation task from known samples on the grid to unknown positions. For this reason, our test set consists of images sampled at midpoints of the training grid. These are the images that are farthest from the training set samples.

Evaluation:
We considered the hyper-cube of the computed pose. A computed pose is considered correct if it lies within the same grid hyper-cube as the test sample. We report the number of correctly computed poses (D < 1). In addition, we considered the Manhattan distance between the hyper-cubes of the computed pose and the test sample (see Figure 4b). We report the number of images for which this distance is smaller than 3 (D < 3). Note that these measurements are invariant to the sampling step size. Thus, we are able to compare results of experiments that were sampled with different step sizes. For completeness, we also provide the standard ℓ2 errors in Table 2.
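The hyper-cube measure can be sketched as follows, assuming that the cell of a pose is obtained by flooring each coordinate, measured from the grid origin, in units of the grid step. Yaw wrap-around is ignored for simplicity, and the function names and the origin parameter are illustrative assumptions.

```python
import math

def cell_index(pose, origin, steps):
    """Integer index of the grid hyper-cube that contains a (x, y, yaw, pitch) pose."""
    return tuple(math.floor((p - o) / s) for p, o, s in zip(pose, origin, steps))

def hypercube_distance(predicted, test_pose, origin, steps):
    """Manhattan distance D between the hyper-cubes of the computed pose and the test sample."""
    cell_pred = cell_index(predicted, origin, steps)
    cell_true = cell_index(test_pose, origin, steps)
    return sum(abs(i - j) for i, j in zip(cell_pred, cell_true))
```

Under this sketch a computed pose counts as correct when D = 0 (that is, D < 1), and the relaxed measure accepts D < 3. Because D is counted in cells rather than in meters or degrees, it does not depend on the sampling step size.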

Experiments & Results
We tested and evaluated the ResNet50 network for the three tasks described in Section 5. The datasets, which are described in Section 3, are defined by the following parameters:
1. Area of interest (AOI): (x, y, width, height).
2. Grid-step, δ: the distance between adjacent (x, y) positions of the sampling grid. That is, adjacent to (x, y, θ, φ) are (x ± δ, y, θ, φ) and (x, y ± δ, θ, φ). The grid density in the (θ, φ) domain was fixed.
3. Input type: edges, faces, depth, edges + faces, edges + faces + depth. For the last two input types the images were fed to the network by stacking them channel-wise.
4. Validation set: created by randomly choosing 10% of the training samples.
5. Test set: created from images that were sampled at midpoints of the training grid.
We used various step sizes for the camera position on the grid: δ = 10, 20, 40 in model units (1 unit ∼ 1 meter). θ (yaw) was sampled at 5° steps between 0° and 360°, and φ (pitch) was sampled at 3° steps between 0° and 15°. The height was set to a fixed value of z = 1.7 above ground (human height) and no roll was used.
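The sampling scheme can be sketched as a Cartesian product over the four pose axes. The snippet is illustrative only: the function names are hypothetical, and whether the end points of the position and pitch ranges are included is an assumption (yaw stops before 360° since that value duplicates 0°). The midpoint helper covers only the (x, y) position of the test samples; the same half-step offset could be applied to the orientation axes. Invalid images would still have to be removed afterwards, so actual dataset sizes are smaller than the raw grid size.

```python
import itertools

def axis_samples(start, stop, step, include_stop=False):
    """Values start, start + step, ... below stop (or up to stop if include_stop)."""
    values = []
    v = start
    while v < stop or (include_stop and v == stop):
        values.append(v)
        v += step
    return values

def training_grid(x0, y0, width, height, delta,
                  yaw_step=5, pitch_step=3, pitch_max=15):
    """All (x, y, yaw, pitch) poses of the 4D training grid for one AOI."""
    xs = axis_samples(x0, x0 + width, delta, include_stop=True)
    ys = axis_samples(y0, y0 + height, delta, include_stop=True)
    yaws = axis_samples(0, 360, yaw_step)  # 360 would duplicate 0
    pitches = axis_samples(0, pitch_max, pitch_step, include_stop=True)
    return list(itertools.product(xs, ys, yaws, pitches))

def midpoint_positions(x0, y0, width, height, delta):
    """(x, y) centers of the grid cells, i.e. the points farthest from training positions."""
    xs = axis_samples(x0, x0 + width, delta)   # lower corner of each cell
    ys = axis_samples(y0, y0 + height, delta)
    return [(x + delta / 2, y + delta / 2) for x in xs for y in ys]
```

For example, under these assumptions a 40 × 40 area with δ = 20 yields 3 × 3 positions, 72 yaw values and 6 pitch values, and four cell centers for the test positions.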
We report the main results in Table 1, for tasks (A)-(C). Each block of three rows consists of a different dataset, defined by the area size and δ. For each block we considered the different types of lean images and evaluated them on the three tasks as described in Section 5. Each entry is an average of three different AOIs. For completeness, Table 2 shows an example of the mean and median ℓ2 errors of the pose estimation for the edges+faces experiment. Similar results were obtained in other experiments. Table 4 and Table 3 show the results of testing the limitations of the CNN with respect to the sparsity of the grid (δ = 40) and the size of the datasets (> 630K images). We next discuss the obtained results.

Geo-Matching
Very poor results were obtained for the geo-matching task when arbitrary poses were used as ground truth (Table 1-Task (B)), i.e., when no geometric correlation between the images and their ground truth was available. The highest percentage of correct matches (45%) was obtained for the smallest set of considered images (37K images). For the largest set (170K images), the percentage of correct matches was less than 10%. As can be seen, the quality of the results decreases as the number of training samples increases. This is expected because, for a fixed number of network parameters, memorization becomes impossible when more and more training samples are added. Note that we do not report the 3nn measure, since it is meaningless for this randomized pose task.
Table 2: Examples of the ℓ2 errors for an experiment with Edges+Faces image types in each sub-space: spatial (x, y) errors in (approx.) meters, and orientation (θ, φ) errors in degrees. Similar to this example, the errors usually show a long-tail distribution: many images have small errors and a few have very large errors.
In contrast, when the correct poses were used as ground truth (Table 1-Task (A)), the CNN succeeded in 1nn localization of more than 92% of the training samples in all cases. These results show that a CNN with around 8.5 million parameters is able to exploit the geometric structure of the scene and match an image with ∼ 170K images with accuracy of ∼ 90%. That is, an average of 42.5 parameters are used per image for images of size 160 × 120 = 19200 pixels. Our interpretation is that using a CNN makes it possible to avoid the direct storage of the images (or their edges) and their labels. Given the trained network, the matching is much faster than with any traditional search algorithm on such a large dataset of images.
We believe the significant difference between the two geo-matching tasks (A) and (B) is due to the network exploiting the geometric correlations when learning a metric between images.
We also considered much larger datasets with more than 600K images (Table 3). The percentage of correct matches dropped to 82% for a dense grid, δ = 10, and to 56% for a sparser grid, δ = 20. For δ = 20 and > 600K images, the network capacity is probably saturated. A comparison of these results to those reported in Table 1 for the same δ values indicates that both the number of images and the grid size determine how successfully the CNN models the data. In addition, we tested datasets with sparser sampling in the position domain (Table 4, top two blocks), and in both the position and the orientation domains (Table 4, bottom two blocks). For sparse sampling only in the position domain, the percentage of correct matches is reduced marginally. However, when the sampling is also reduced in the orientation domain, the percentage of correct matches drops dramatically. This indicates that it is easier for the CNN to model a denser grid (probably because of higher geometric correlation between images), and it is easier to model fewer images (probably because of network capacity).
Table 3: Testing network learning capacity. These results are from a single experiment where the image input type is only edges. The network's ability to learn drops when the number of images grows beyond a certain point.

Geo-Interpolation
Here we tested whether the pose estimation by the CNN generalizes to unseen images. We used the same training as in geo-matching with the correct pose as ground truth, and we tested it on images sampled from the midpoint of each grid cell. We report our results with respect to the 2D position as well as with respect to the 4D parameters of a pose (Table 1).
The network was able to generalize image position with good accuracy: ∼ 70% of images are correctly positioned in their grid cell, and above 80% of the computed poses are within three cells of the correct one. As expected, this task achieves better results on a tighter grid (δ = 10, ∼ 88%) than on a sparse grid (δ = 20, ∼ 70%). The 4D pose error is bounded from below by the 2D position error, and hence is at least as large. Moreover, the sampling rate in the orientation domain is much higher than in the position domain. Hence, a small error in orientation estimation has a greater effect on the 4D errors. Still, the accuracy in 4D for δ = 10 is ∼ 87%.
We further tested the effect of the grid density. It is clear from Table 1-Task(C) that for δ = 10 the results are better than for δ = 20, even if the number of images is larger. We further explored this observation for a sparser grid, δ = 40, where the percentage of correct estimations dropped significantly, to below 50% and 30% for 61K and 174K images, respectively (Table 4-Task(C)). For δ = 10 with 636K images, 80% of the estimations were correct (Table 3-Task(C)). For this task, sparse sampling is more detrimental than for the geo-matching task, as can be seen in Table 4. For very sparse sampling of the 4D space the network cannot really generalize to positions not seen before. Here again we believe that not only the number of images plays a role but also their density. The denser the grid, the higher the correlation between images, and hence better generalization can be obtained.
A nice application of our results is the ability to rate the distinctiveness of positions in the city. In Figure 1 we illustrate how certain places can be easily recognized (high geo-interpolation success rate) while others are more difficult. Note, for instance, how open spaces are more distinct than narrow streets.
Table 4: Low grid density results. Datasets (single experiment each) with sparser spatial sampling (top two blocks), and sparser spatial and orientation sampling (bottom two blocks), where pitch ∈ [6, 12] and yaw ∈ {45i : i = 0, ..., 7}. The sparser the data, the worse the results. Geo-interpolation could not succeed in very sparse and very small datasets.

Effect of Data Type
We compared the results on several types of lean images separately, and on their combinations. Faces alone provide the least geometric information, and indeed in most cases were inferior to edges or depth. Surprisingly, edges alone provide better information than depth alone.
When combining edges with faces or with faces+depth, we expect the results of all tasks to improve with respect to the results obtained when edges alone are used. For the geo-matching task (A) with δ ≤ 20 (Table 1), similar results were obtained for all data types. However, for a very sparse grid (Table 4, top two blocks), richer geometric information improves the results. We believe this is because the results for δ ≤ 20 were already very high with edges alone.
For the geo-interpolation task (C), adding the faces information significantly improved the results, as expected. Surprisingly, the depth information did not show any significant performance gain when δ ≤ 20. This may indicate that edges+faces provide sufficient information for these cases. However, for a very sparse grid, δ = 40, with a relatively small number of images, adding the depth significantly improves the results (Table 4, 174K images).
For the data with geometrically decorrelated pose (Task B) and for the very sparse sampling (Table 4, bottom two blocks), the more information we added, the worse the results were. The reason for this is still unclear to us. A possible explanation is that as the problem becomes more of a memorization task, the increase of information makes it harder for the CNN to find discriminant features.

Transfer Learning
Once we had trained a CNN for some AOI, we applied transfer learning to a new AOI by using the learned weights as initialization values for the new area. As can be seen in Figure 5, doing so improved the rate at which the network learned. This indicates that the network managed to learn features of lean images that assist in other, similar experiments, and that it does not depend only on memorization of the area for learning.

Conclusions
In this work we showed that (i) a CNN can achieve good results in geo-localization tasks using only lean images taken from a very simple 3D model, and (ii) geometry plays an important role in geo-localization, as good results are achieved while ignoring texture and scene details. The results indicate that noise-free lean images are sufficient for solving the geo-matching task using a CNN, and that assigning poses that are uncorrelated with the images makes it nearly impossible. In addition, our results indicate that (iii) geo-interpolation, which is a generalization task, can also be solved by CNNs when using lean images.
Figure 5: Transfer learning: learning from scratch vs. starting with pre-trained weights. These graphs are from one experiment where the input type was Edges+Faces, but similar behavior appeared in other experiments.
From a more practical perspective, it is of interest to explore whether geometric information can be used for real-life geo-localization tasks, particularly because 3D models (e.g., from the OpenStreetMap project [32]) are readily available.