Abstract
This work investigates the impact of the loss function on the performance of Neural Networks, in the context of a monocular, RGB-only, image localization task. A common technique used when regressing a camera’s pose from an image is to formulate the loss as a linear combination of positional and rotational mean squared error (using tuned hyperparameters as coefficients). In this work we observe that changes to rotation and position mutually affect the captured image, and that, in order to improve performance, a pose regression network’s loss function should include a term which combines the error of both of these coupled quantities. Based on task-specific observations and experimental tuning, we present such a loss term, and create a new model by appending it to the loss function of the pre-existing pose regression network ‘PoseNet’. We achieve improvements in the localization accuracy of the network for indoor scenes, with reductions of up to 26.7% and 24.0% in the median positional and rotational error respectively, when compared to the default PoseNet.
1 Introduction
In Convolutional Neural Networks (CNNs) and other Neural Network (NN) based architectures, a ‘loss’ function is provided which quantifies the error between the ground truth and the NN’s prediction. This scalar quantity is used during the backpropagation process, essentially ‘informing’ the NN on how to adjust its trainable parameters. Naturally, the design of this loss function greatly affects the training process, yet simple metrics such as mean squared error (MSE) are often used in place of more intuitive, task-specific loss functions. In this work, we explore the design and subsequent impact of a NN’s loss function in the context of a monocular, RGB-only, image localization task.
The problem of image localization, that is, extracting the position and rotation (herein referred to collectively as the ‘pose’) of a camera directly from an image, has been approached using a variety of traditional and deep learning based techniques in recent years (Fig. 1).
The problem remains highly relevant as it lies at the heart of numerous technologies in Computer Vision (CV) and robotics, e.g. geotagging, augmented reality and robotic navigation.
More colloquially, the problem can be understood as trying to find out where you are, and where you are looking, by considering only the information present in an RGB image.
CNN-based approaches to image localization, such as PoseNet [4], have found success in recent years due to the availability of large datasets and powerful training hardware, but the performance gap between these systems and the more accurate SIFT feature-based pipelines remains large. For example, the SIFT-based Active Search algorithm [12] serves as a reminder that significant improvements need to be made before CNN techniques can be considered competitive when localizing images.
However, CNN-based approaches do possess a number of characteristics which qualify them to handle this task well. Namely, CNNs are robust to changes in illumination and occlusion [9], they can operate in close to real time [7] (\(\sim \)30 frames per second) and can be trained from labelled data (which can easily be gathered via Structure from Motion (SfM) for any arbitrary scene [13, 14]). CNN-based systems also tend to excel in textureless environments where SIFT-based methods would typically fail [1]. They have also been shown to operate well using purely RGB image data, making them an ideal solution for localizing small, cheap, robotic devices such as drones and unmanned ground vehicles. The major concern of this work is to extend existing pipelines whilst ensuring that the benefits provided by CNNs are preserved.
A key observation when considering existing CNN approaches is that position and rotation are treated separately in the loss function. Altering either a camera’s position or its rotation affects the image produced, and hence the error in the regressed position and the error in the regressed rotation cannot be decoupled: each affects the other. In order to optimize a CNN for regressing a camera’s pose accurately, a loss term should be used which combines both distinct quantities in an intuitive fashion.
This publication thus offers the following key contributions:

1. The formulation of a loss term which considers the error in both the regressed position and rotation (Sect. 3).
2. Comparison of a CNN trained with and without this loss term on common RGB image localization datasets (Sect. 5).
3. An indoor image localization dataset (the Gemini dataset) with over 3000 pose-labelled images per scene (Sect. 4.1).
2 Related Work
This work builds chiefly on the PoseNet architecture (a camera pose regression network [4]). PoseNet was one of the first CNNs to regress the 6 degrees of freedom in a camera’s pose. The network is pretrained on object detection datasets in order to maximize the quality of feature extraction, which occurs in the first stage of the network. It only requires a single RGB image as input, unlike other networks [11, 17], and operates in real time.
Notably, PoseNet is able to localize traditionally difficult-to-localize images, specifically those with large textureless areas (where SIFT-based methods fail). PoseNet’s end-to-end nature and relatively simple ‘one-step’ training process make it well suited to modification, which in the case of this work comes in the form of changing its loss function.
PoseNet has had its loss function augmented in prior works. In [3] it was demonstrated that changing a pose regression network’s loss function is sufficient to improve performance. The network was similarly ‘upgraded’ in [18] using LSTMs to correlate features at the CNN’s output. Additional improvements to the network were made in [2], where a Bayesian CNN implementation was used to estimate relocalization accuracy.
More complex CNN approaches do exist [8,9,10]. For example, the pipeline outlined in [5] uses a CNN to regress the relative poses between a set of images which are similar to a query image. These relative pose estimates are coalesced in a fusion algorithm which produces an estimate for the camera pose of the query image.
Depth data has also been incorporated into the inputs of pose regression networks (to improve performance by leveraging multimodal input information). These RGB-D input pipelines are commonplace in the image localization literature [1], and typically boast higher localization accuracy at the cost of requiring additional sensors, data and computation.
A variety of non-CNN solutions exist, with one of the more notable solutions being the Active Search algorithm [12], which uses SIFT features to inform a matching process. SIFT descriptors are calculated over the query image and are directly compared to a known 3D model’s SIFT features. SIFT and other non-CNN learned descriptors have been used to achieve high localization accuracy, but these descriptors tend to be susceptible to changes in the environment, and they often necessitate systems with large amounts of memory and computational power (compared to CNNs) [4].
The primary focus of this work is quantifying the impact of the loss function when training a pose regression CNN. Hence, we do not draw direct comparisons between the proposed model and significantly different pipelines, such as SIFT-based feature matching algorithms or PoseNet variations with highly modified architectures. Moreover, for the purpose of maximizing the number of available benchmark datasets, we consider pose regressors which handle purely RGB query images. In this way, this work deals specifically with CNN solutions to the monocular, RGB-only image localization task.
3 Formulating the Proposed Loss Term
When trying to accurately regress one’s pose based on visual data alone, the error in the two terms which define pose—position and rotation—obviously needs to be minimized. If these error terms were entirely minimized, the camera would be in the correct location and would be ‘looking’ in the correct direction.
Formally, pose regression networks, such as the default PoseNet, are trained to regress an estimate \(\hat{\vec {p}}\) for a camera’s true pose \(\vec {p}\). They do this by calculating the loss after every training iteration, which is formulated as the MSE between the predicted position \(\hat{\vec {x}}\) and the true position \(\vec {x}\), plus the MSE between the predicted rotation \(\hat{\vec {q}}\) and the true rotation \(\vec {q}\). Note that rotations are encoded as quaternions, since the space of rotations is continuous, and results can be easily normalized to the unit sphere in order to ensure valid rotations. Hyperparameters \(\alpha \) and \(\beta \) control the balance between positional and rotational error, as illustrated in Eq. (1). In practice, RGB-only pose regression networks reach a maximum localization accuracy when minimizing these error terms independently.
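As a sketch of the form described above, the default loss can be written as a weighted sum of the two error terms. The symbol \(\mathcal{L}_{\text{default}}\) and the operator \(\operatorname{MSE}\) are notation assumed here for illustration; the exact norm and any quaternion normalization follow the original publication [4] and are not restated:

\[
\mathcal{L}_{\text{default}} = \alpha \, \operatorname{MSE}\left(\hat{\vec{x}}, \vec{x}\right) + \beta \, \operatorname{MSE}\left(\hat{\vec{q}}, \vec{q}\right)
\]

Written this way, the positional and rotational errors enter the loss through separate terms, and no term depends on both quantities at once.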
Rather than considering position and rotation as two separate quantities, we consider them together as a line in 3D space: the line travels in a direction defined by the rotation, and must pass through the point defined by the position \(\vec {x}\). We then introduce a ‘line-of-sight’ term which constrains our predictions to lie on this line. The line-of-sight term considers the cosine similarity between the direction of the pose \(\vec {p}\) and the direction of the difference vector \(\vec {d}= \vec {x} - \hat{\vec {x}}\), as per Eq. (2) and Fig. 2. This term is zero only when the predicted position lies on the line defined by the ground truth pose, hence constraining the pose regression objective further. In the context of image localization, this ensures that the predicted poses lie on the line-of-sight defined in the ground truth image.
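The geometric idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's Eq. (2): it assumes quaternions are ordered \((w, x, y, z)\), that the camera looks along its local \(+z\) axis, and that the cosine similarity is mapped to a penalty as \(1 - |\cos \theta |\) so that the value is zero exactly when the difference vector is parallel to the viewing direction. The function names are hypothetical.

```python
import math

def rotate_by_quaternion(q, v):
    """Rotate the 3-vector v by the quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    # Normalize so that slightly non-unit quaternions still give a valid rotation.
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + q_vec x t
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )

def line_of_sight_term(x_true, q_true, x_pred, forward=(0.0, 0.0, 1.0), eps=1e-12):
    """Penalty that is zero when x_pred lies on the ground-truth line of sight.

    Assumption: the penalty is 1 - |cos(theta)|, where theta is the angle between
    the ground-truth viewing direction and the difference vector d = x_true - x_pred.
    """
    direction = rotate_by_quaternion(q_true, forward)
    d = tuple(a - b for a, b in zip(x_true, x_pred))
    d_norm = math.sqrt(sum(c * c for c in d))
    if d_norm < eps:
        # The prediction coincides with the true position, which is on the line.
        return 0.0
    dir_norm = math.sqrt(sum(c * c for c in direction))
    cos_sim = sum(a * b for a, b in zip(direction, d)) / (dir_norm * d_norm)
    return 1.0 - abs(cos_sim)
```

With an identity rotation, a prediction displaced along the viewing axis gives a penalty of zero, while a prediction displaced sideways gives the maximum penalty of one.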
We modify the default loss function presented in Eq. (1) by adding a weighted contribution of the line-of-sight loss term, producing the proposed loss function in Eq. (3). In practice, the value of \(\gamma \) is chosen to roughly reflect the scale of the scene being considered, and is found via a hyperparameter grid search. Note that the line-of-sight term could instead contribute to the loss through multiplication, higher order terms, etc., but it was determined that weighted addition produced the best performing networks.
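Weighted addition, as described above, can be sketched as follows, where \(\mathcal{L}_{\text{default}}\) denotes the position-plus-rotation loss of Eq. (1) and \(\mathcal{L}_{\text{LOS}}\) the line-of-sight term of Eq. (2). Both symbols are notation assumed here for illustration:

\[
\mathcal{L}_{\text{proposed}} = \mathcal{L}_{\text{default}} + \gamma \, \mathcal{L}_{\text{LOS}}
\]

Setting \(\gamma = 0\) recovers the default loss, so the grid search over \(\gamma \) includes the baseline as a special case.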
In short, the final loss function used to train the proposed model (Eq. (3)) is the result of an exploration in the space of possible loss terms, and the term’s design was informed by task-specific observations and experimentation.
4 Experiments
Our experiments are naturally centred around testing the performance of the proposed model (defined in Sect. 3). This performance is defined with respect to the following criteria:

Accuracy: the system should be able to regress a camera’s pose with a level of positional and rotational accuracy that is competitive with similar classes of algorithms. Accuracy is reported using per-scene and average median positional and rotational error (see Sect. 5.1).

Robustness: the system should be robust to perceptual aliasing, motion blur and other challenges posed by the considered datasets (see Sect. 5.2 and Fig. 8).

Time performance: evaluation should occur in real time (\(\sim \)30 frames per second), such that the system is suitable for hardware-limited real-time applications, or for platforms with RGB-only image sensors, e.g. mobile phones (see Sect. 5.3).
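The accuracy criterion can be made concrete with a short sketch of how median positional and rotational errors might be computed. The conventions below are assumptions rather than details taken from this work: positional error is the Euclidean distance, rotational error is the angle between two quaternions in degrees, and the function names are hypothetical.

```python
import math
from statistics import median

def position_error(x_true, x_pred):
    """Euclidean distance between the true and predicted positions."""
    return math.dist(x_true, x_pred)

def rotation_error_deg(q_true, q_pred):
    """Angle in degrees between two rotations given as quaternions."""
    def unit(q):
        n = math.sqrt(sum(c * c for c in q))
        return [c / n for c in q]
    a, b = unit(q_true), unit(q_pred)
    # q and -q encode the same rotation, hence the absolute value.
    dot = abs(sum(u * v for u, v in zip(a, b)))
    # Clamp to guard against rounding slightly above 1.0.
    dot = min(1.0, dot)
    return math.degrees(2.0 * math.acos(dot))

def median_errors(poses_true, poses_pred):
    """Median positional and rotational error over a list of (x, q) poses."""
    pos = [position_error(t[0], p[0]) for t, p in zip(poses_true, poses_pred)]
    rot = [rotation_error_deg(t[1], p[1]) for t, p in zip(poses_true, poses_pred)]
    return median(pos), median(rot)
```

The median is less sensitive than the mean to a small number of badly localized frames, which is one reason it is a common summary statistic for this task.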
We compare our proposed model against the default PoseNet and other PoseNet variants.
4.1 Datasets
The following datasets are used to benchmark model performance. Each scene’s recommended train and test split is used throughout the following experiments (Figs. 3, 4, 5 and 6).
7Scenes [15]. 7 indoor locations in a domestic office context. The dataset features large training and testing sets (in the thousands). The camera paths move continuously while gathering images in distinct sequences. Images include motion blur, featureless spaces and specular reflections (see Fig. 8), making this a challenging dataset, and one that has been used widely in the image localization literature. The ground truth poses are gathered with KinectFusion, and the RGB-D frames each have a resolution of \(640\times 480\) px.
Cambridge Landmarks [2, 4]. 6 outdoor locations in and around Cambridge, United Kingdom. The larger spatial extent and restricted dataset size make this a challenging dataset from which to learn to regress pose: methods akin to the one presented in this work typically only deliver positional accuracy on the scale of metres. However, the dataset does provide a common point of comparison, and also includes large expanses of textureless surfaces. Ground truth poses are generated by a SfM process, so some comparison can be drawn between this dataset and the one created in this work.
University [5]. 5 indoor scenes in a university context. Ground truth poses are gathered using odometry estimates and “manually generated location constraints in a pose-graph optimization framework” [5]. The dataset, similarly to 7Scenes, includes challenging frames with high degrees of perceptual aliasing, where multiple frames (with different poses) give rise to similar images [20]. Although the scenes are registered to a common coordinate system in the University dataset and thus a network could be trained on the full dataset, the models created in this work are trained and tested scene-wise for the purpose of consistency.
Gemini (see Note 1). 2 indoor scenes in a university lab context. This dataset was created for the purpose of studying the effect of texture and colour on pose regression networks: both scenes survey the same environment, with one scene including decor (posters, screensavers, paintings etc.) and the other deliberately not including visually rich, textured, and colorful decor. As such, the two scenes are labelled Decor and Plain. A photogrammetry pipeline (COLMAP [14]) was used to generate the ground truth poses. Images were captured in 15 separate video sequences using a Fujifilm X-T20 with a 23 mm prime autofocus lens (in order to ensure a fixed calibration matrix between sequences). Visualizations of the Decor scene are provided in Fig. 7.
4.2 Architecture and Training
As stated, we primarily experiment with the PoseNet architecture (using TensorFlow). For the purpose of brevity we redirect the reader to the original publication [4], as here we only describe crucial elements of the network’s design and operation.
The PoseNet architecture is itself based on the GoogLeNet architecture [16], a 22-layer-deep network which performs classification and detection. PoseNet extracts GoogLeNet’s early feature-extracting layers, and replaces the final three softmax classifiers with affine regressors. The network is pretrained using large classification datasets such as Places [21].
Strictly, the default loss function used is not exactly as defined in Eq. (1). Instead, PoseNet uses the predictions from all three affine regressors (hence there are three predictions for each quantity). We label the \(i^{th}\) affine regressor’s hyperparameters and predictions using a subscript i, as per Eq. (4). All three affine regressors’ predictions are used in the loss function, but each has different hyperparameter weightings: \(\alpha _{1}=\alpha _{2}=0.3\), \(\alpha _{3}=1\), \(\beta _{1}=\beta _{2}=150\) and \(\beta _{3}=500\).
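A minimal sketch of how the three regressors' predictions might be combined is given below. It assumes that the per-regressor losses are simply summed and that the MSE is averaged over vector components; neither detail is confirmed by the text above, and the function names are hypothetical. The weights are the values quoted above.

```python
# Hyperparameter weightings quoted in the text, one entry per affine regressor.
ALPHAS = (0.3, 0.3, 1.0)
BETAS = (150.0, 150.0, 500.0)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def multi_regressor_loss(x_true, q_true, predictions, alphas=ALPHAS, betas=BETAS):
    """Sum the weighted position and rotation errors over all regressors.

    predictions is a sequence of (x_pred, q_pred) pairs, one per regressor.
    Assumption: the total loss is the plain sum of the per-regressor losses.
    """
    total = 0.0
    for (x_pred, q_pred), alpha, beta in zip(predictions, alphas, betas):
        total += alpha * mse(x_pred, x_true) + beta * mse(q_pred, q_true)
    return total
```

Because \(\alpha _{3}\) and \(\beta _{3}\) are the largest weights, errors made by the third regressor dominate the total under this formulation.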
In order to demonstrate the consistency and generalization of the proposed network, we train against all scenes in all datasets using the same experimental setup. For each scene we train PoseNet using the default loss (Eq. (4)) and the proposed loss (Eq. (3)) with the contribution from all three affine regressors. Each model is trained per scene over 300,000 iterations with a batch size of 75 on a Tesla K40c, which takes \(\sim \)10 h to complete.
5 Results
We compare our proposed model to PoseNet and one of its variants, Bayesian PoseNet [2], in Table 1. This is to show the proposed model’s performance when compared to other variants of PoseNet with modified loss functions. We then provide results specifically comparing the default PoseNet to our proposed model in Table 2. A discussion of our system’s performance regarding the criteria outlined in Sect. 4 follows.
5.1 Accuracy
It is observed that the proposed model outperforms the default version of PoseNet in approximately half of the 7Scenes scenes, particularly the Stairs scene. In the Stairs scene, repetitious structures, e.g. staircases, make localization harder, yet the proposed model is robust to such challenges. The network is outperformed in other scenes, namely those from outdoor datasets with large spatial extents, but in general, performance is improved for the indoor datasets 7Scenes, University and Gemini.
A set of cumulative histograms for six of the evaluated scenes is provided in Table 3, where we compare the distribution of the positional errors and rotational errors. Median values (provided in Tables 1 and 2) are plotted for reference.
The proposed model’s errors are strictly less than the default PoseNet’s throughout the majority of the Chess and Coffee Room distributions. However, the default PoseNet outperforms our proposed model with respect to rotational accuracy in the \(10^\circ \)–\(30^\circ \) range in the Coffee Room scene.
Note the weaker performance of the proposed model on the King’s College scene, where the positional error distributions for the two networks are nearly aligned. Moreover, the default PoseNet more accurately regresses rotation in this outdoor scene. See Sects. 5.2 and 6 for further discussion.
5.2 Robustness
The robustness of our system to challenging test frames (that is, images with motion blur, repeated structures or demonstrating perceptual aliasing [6]) can be determined via the cumulative histograms in Table 3. For the purpose of visualization, some difficult testing images from the 7Scenes dataset are displayed in Fig. 8.
The hardest frames in the test set by definition produce the greatest errors. Consider the positional error for the Meeting scene: our proposed model reaches a value of 1.0 on the y-axis before the default PoseNet does, meaning that the hardest frames in the test set have their position regressed more accurately. This analysis extends to each of the cumulative histograms in Table 3, suggesting that our proposed loss function is robust to difficult test scenarios, as the frames of greatest error consistently have errors less than or comparable to those of the default PoseNet.
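The reading of the cumulative histograms described above can be sketched as follows. The helper names are hypothetical, and the sketch makes no claim about how the plots in this work were produced; it only shows that a cumulative curve reaches 1.0 at the largest error in the set, so comparing where two curves saturate compares their worst-case frames.

```python
import bisect

def cumulative_fraction(errors, thresholds):
    """Fraction of test frames whose error is at most each threshold."""
    ordered = sorted(errors)
    n = len(ordered)
    return [bisect.bisect_right(ordered, t) / n for t in thresholds]

def saturation_point(errors):
    """Smallest threshold at which the cumulative curve reaches 1.0."""
    return max(errors)

def has_better_worst_case(errors_a, errors_b):
    """True if model A's curve reaches 1.0 at a smaller error than model B's."""
    return saturation_point(errors_a) < saturation_point(errors_b)
```

For example, a model whose largest positional error is 0.8 m saturates before a model whose largest error is 1.2 m, even if their medians are similar.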
Moreover, the proposed model significantly exceeds the default PoseNet’s performance throughout the Gemini dataset. The performance gap in the Plain scene indicates that our model is more robust to textureless spaces than the default PoseNet.
5.3 Efficiency
Training Time. The training durations of our implementation and the default PoseNet are, by design, very similar, and highly competitive when compared to the other systems analyzed in Table 1. This is due to the relatively low computational cost of introducing a simple line-of-sight loss term into the network’s overall loss function. The average training time for the default PoseNet and for our augmented PoseNet over the University dataset is 10:21:31 and 10:23:33 respectively (HH:MM:SS), where both tests were run on the same hardware.
Testing Time. Network operation at test time is naturally not affected by the loss function augmentation. The time performance when testing is similar to that of the default PoseNet and in general is competitive amongst camera localization pipelines (especially feature-based matching techniques). We observe a total elapsed time of 16.04 s when evaluating the entire Coffee Room scene testing set, whereas it takes 16.03 s using the default PoseNet. In other words, both systems take \(\sim \)16.8 ms to complete a single inference on our hardware.
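The timing figures reported above can be put on a common footing with some simple arithmetic. The sketch below uses only the numbers quoted in this section; the helper name is hypothetical.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds

# Average training times over the University dataset.
default_train_s = hms_to_seconds("10:21:31")
proposed_train_s = hms_to_seconds("10:23:33")

# Absolute and relative training overhead of the added loss term.
overhead_s = proposed_train_s - default_train_s
overhead_pct = 100.0 * overhead_s / default_train_s

# Throughput implied by the per-inference latency of about 16.8 ms.
latency_ms = 16.8
frames_per_second = 1000.0 / latency_ms

print(f"training overhead: {overhead_s} s ({overhead_pct:.2f}%)")
print(f"throughput: {frames_per_second:.1f} frames per second")
```

The added loss term therefore costs about two minutes over a ten-hour training run (roughly 0.3%), and the reported latency corresponds to a throughput above the \(\sim \)30 frames per second target stated in Sect. 4.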
Memory Cost. Memory cost in general for CNNs is low: only the weights for the trained layers and the input image need to be loaded into memory. When compared to feature matching techniques, which need to store feature vectors for all instances in the test set, or SIFT-based matching methods with large memory and computational overheads, CNN approaches are in general quite desirable, especially in resource-constrained environments. Both the proposed model and the default PoseNet take 8015 MiB and 10947 MiB to train and test respectively (as reported by nvidia-smi). For interest, the network weights for the proposed model’s TensorFlow implementation total only 200 MB.
6 Discussion and Future Work
Experimental results confirm that the proposed loss term has a positive impact on accuracy and robustness (to textureless spaces and so forth), whilst maintaining speed and memory usage.
The network is outperformed by the SIFT-based image localization algorithm ‘Active Search’ [12], indicating that there is still some work required until the gap between SIFT-based algorithms and CNNs is closed (in the context of RGB-only image localization). However, SIFT localization operates on a much longer timescale, and can be highly computationally expensive depending on the dataset and pipeline being used [19].
Ultimately, the loss function described in this work illustrates that intuitive loss terms, designed with respect to a specific task (in this case image localization) can positively impact the performance of deep networks.
Possible avenues for future work include extending this loss function design methodology to other CV tasks in order to achieve higher performance, or considering RGB-D pipelines. An investigation into the effect that such loss terms have on the convergence rate and upper performance limit of NNs could also be carried out.
7 Conclusion
In summary, the effect of adding a line-of-sight loss term to an existing pose regression network is investigated. The performance of the proposed model is compared to other similar models across common image localization benchmarks and the newly introduced Gemini dataset. Improvements to performance in the image localization task are observed, without any drastic increase in evaluation or training time. In particular, the median positional accuracy is, on average, increased for indoor datasets when compared to a version of the model without the suggested loss term.
This work suggests that mean squared error between the ground truth and the regressed predictions, although often used as a measure of loss for many Neural Networks, can be improved upon. Specifically, loss functions designed with the network’s task in mind may yield better performing models. For pose regression networks, the distinct and coupled nature of positional and rotational quantities needs to be considered when designing a network’s loss function.
Notes
1. This dataset has been made available at https://github.com/anondatasets/gemini.
References
1. Brachmann, E., Rother, C.: Learning less is more - 6D camera localization via 3D surface regression. In: Conference on Computer Vision and Pattern Recognition (CVPR) abs/1711.10228 (2017). http://arxiv.org/abs/1711.10228
2. Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: International Conference on Robotics and Automation (ICRA) abs/1509.05909 (2015). http://arxiv.org/abs/1509.05909
3. Kendall, A., Cipolla, R.: Geometric loss functions for camera pose regression with deep learning. In: Conference on Computer Vision and Pattern Recognition (CVPR), April 2017. http://arxiv.org/abs/1704.00390
4. Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: International Conference on Computer Vision (ICCV), May 2015. http://arxiv.org/abs/1505.07427
5. Laskar, Z., Melekhov, I., Kalia, S., Kannala, J.: Camera relocalization by computing pairwise relative poses using convolutional neural network. In: International Conference on Computer Vision (ICCV) (2017)
6. Li, X., Ylioinas, J., Kannala, J.: Full-frame scene coordinate regression for image-based localization. Robot.: Sci. Syst. (2018). http://arxiv.org/abs/1802.03237
7. Massiceti, D., Krull, A., Brachmann, E., Rother, C., Torr, P.H.S.: Random forests versus neural networks - what’s best for camera relocalization? In: International Conference on Robotics and Automation (ICRA) abs/1609.05797 (2016). http://arxiv.org/abs/1609.05797
8. Melekhov, I., Kannala, J., Rahtu, E.: Relative camera pose estimation using convolutional neural networks. In: Advanced Concepts for Intelligent Vision Systems (ACIVS) abs/1702.01381 (2017). http://arxiv.org/abs/1702.01381
9. Melekhov, I., Ylioinas, J., Kannala, J., Rahtu, E.: Image-based localization using hourglass networks. In: International Conference on Computer Vision Workshops (ICCVW) abs/1703.07971 (2017). http://arxiv.org/abs/1703.07971
10. Purkait, P., Zhao, C., Zach, C.: SPPNet: deep absolute pose regression with synthetic views. In: British Machine Vision Conference (BMVC) abs/1712.03452 (2017). http://arxiv.org/abs/1712.03452
11. Radwan, N., Valada, A., Burgard, W.: VLocNet++: deep multitask learning for semantic visual localization and odometry. Robot. Autom. Lett. (RAL) 3 (2018). http://arxiv.org/abs/1804.08366
12. Sattler, T., Leibe, B., Kobbelt, L.: Efficient and effective prioritized matching for large-scale image-based localization. Trans. Pattern Anal. Mach. Intell. (PAMI) 39(09), 1744–1756 (2017). https://doi.org/10.1109/TPAMI.2016.2611662
13. Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_31
14. Schönberger, J.L., Frahm, J.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4104–4113, June 2016. https://doi.org/10.1109/CVPR.2016.445
15. Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2930–2937 (2013). https://doi.org/10.1109/CVPR.2013.377
16. Szegedy, C., et al.: Going deeper with convolutions. In: Conference on Computer Vision and Pattern Recognition (CVPR) abs/1409.4842 (2014). http://arxiv.org/abs/1409.4842
17. Valada, A., Radwan, N., Burgard, W.: Deep auxiliary learning for visual localization and odometry. In: International Conference on Robotics and Automation (ICRA) abs/1803.03642 (2018). http://arxiv.org/abs/1803.03642
18. Walch, F., Hazirbas, C., Leal-Taixé, L., Sattler, T., Hilsenbeck, S., Cremers, D.: Image-based localization using LSTMs for structured feature correlation. In: International Conference on Computer Vision (ICCV), November 2016. http://arxiv.org/abs/1611.07890
19. Wu, C.: Towards linear-time incremental structure from motion. In: International Conference on 3D Vision (3DV), pp. 127–134, June 2013. https://doi.org/10.1109/3DV.2013.25
20. Zaval, L., Gureckis, T.M.: The impact of perceptual aliasing on exploration and learning in a dynamic decision making task. In: Proceedings of the Annual Meeting of the Cognitive Science Society (2010)
21. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 487–495. Curran Associates, Inc. (2014)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ward, I.R., Jalwana, M.A.A.K., Bennamoun, M. (2020). Improving Image-Based Localization with Deep Learning: The Impact of the Loss Function. In: Dabrowski, J., Rahman, A., Paul, M. (eds) Image and Video Technology. PSIVT 2019. Lecture Notes in Computer Science, vol 11994. Springer, Cham. https://doi.org/10.1007/978-3-030-39770-8_9
DOI: https://doi.org/10.1007/978-3-030-39770-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39769-2
Online ISBN: 978-3-030-39770-8
eBook Packages: Computer Science, Computer Science (R0)