1 Introduction

In Convolutional Neural Networks (CNNs) and other Neural Network (NN) based architectures, a ‘loss’ function is provided which quantifies the error between the ground truth and the NN’s prediction. This scalar quantity is used during the backpropagation process, essentially ‘informing’ the NN on how to adjust its trainable parameters. Naturally, the design of this loss function greatly affects the training process, yet simple metrics such as mean squared error (MSE) are often used in place of more intuitive, task specific loss functions. In this work, we explore the design and subsequent impact of a NN’s loss function in the context of a monocular, RGB-only, image localization task.

The problem of image localization, that is, extracting the position and rotation (herein referred to collectively as the ‘pose’) of a camera directly from an image, has been approached using a variety of traditional and deep learning based techniques in recent years (Fig. 1).

Fig. 1. A sample of the predicted pose positions (purple) generated for the ground truth poses (orange) in the 7Scenes Heads scene using our proposed model. The scene’s origin (white) and SfM reconstruction are rendered for reference. The Heads scene has been rendered in blue to contrast with the plotted data points. Image best viewed in color. (Color figure online)

The problem remains exceedingly relevant as it lies at the heart of numerous technologies in Computer Vision (CV) and robotics, e.g. geo-tagging, augmented reality and robotic navigation.

More colloquially, the problem can be understood as trying to find out where you are, and where you are looking, by considering only the information present in an RGB image.

CNN based approaches to image localization, such as PoseNet [4], have found success in recent years due to the availability of large datasets and powerful training hardware, but the performance gap between these systems and the more accurate SIFT feature-based pipelines remains large. For example, the SIFT-based Active Search algorithm [12] serves as a reminder that significant improvements need to be made before CNN techniques can be considered competitive when localizing images.

However, CNN-based approaches do possess a number of characteristics which qualify them to handle this task well. Namely, CNNs are robust to changes in illumination and occlusion [9], they can operate in close to real time [7] (\(\sim \)30 frames per second) and can be trained from labelled data (which can easily be gathered via Structure from Motion (SfM) for any arbitrary scene [13, 14]). CNN based systems also tend to excel in textureless environments where SIFT based methods would typically fail [1]. They have also been shown to operate well using purely RGB image data, making them an ideal solution for localizing small, cheap, robotic devices such as drones and unmanned ground vehicles. The major concern of this work is to extend existing pipelines whilst ensuring that the benefits provided by CNNs are preserved.

A key observation when considering existing CNN approaches is how position and rotation are treated separately in the loss function. It can be observed that altering a camera’s position or rotation both affect the image produced, and hence the error in the regressed position and the regressed rotation cannot be decoupled—each mutually affects the other. In order to optimize a CNN for regressing a camera’s pose accurately, a loss term should be used which combines both distinct quantities in an intuitive fashion.

This publication thus offers the following key contributions:

  1. The formulation of a loss term which considers the error in both the regressed position and rotation (Sect. 3).

  2. Comparison of a CNN trained with and without this loss term on common RGB image localization datasets (Sect. 5).

  3. An indoor image localization dataset (the Gemini dataset) with over 3000 pose-labelled images per scene (Sect. 4.1).

2 Related Work

This work builds chiefly on the PoseNet architecture (a camera pose regression network [4]). PoseNet was one of the first CNNs to regress the 6 degrees of freedom in a camera’s pose. The network is pretrained on object detection datasets in order to maximize the quality of feature extraction, which occurs in the first stage of the network. It only requires a single RGB image as input, unlike other networks [11, 17], and operates in real time.

Notably, PoseNet is able to localize traditionally difficult-to-localize images, specifically those with large textureless areas (where SIFT-based methods fail). PoseNet’s end-to-end nature and relatively simple ‘one-step’ training process make it well suited to modification, which in this work takes the form of changing its loss function.

PoseNet has had its loss function augmented in prior works. In [3] it was demonstrated that changing a pose regression network’s loss function is sufficient to improve performance. The network was similarly ‘upgraded’ in [18] using LSTMs to correlate features at the CNN’s output. Additional improvements to the network were made in [2], where a Bayesian CNN implementation was used to estimate re-localization accuracy.

More complex CNN approaches do exist [8,9,10]. For example, the pipeline outlined in [5] uses a CNN to regress the relative poses between a set of images which are similar to a query image. These relative pose estimates are coalesced in a fusion algorithm which produces an estimate for the camera pose of the query image.

Depth data has also been incorporated into the inputs of pose regression networks (to improve performance by leveraging multi-modal input information). These RGB-D input pipelines are commonplace in the image localization literature [1], and typically boast higher localization accuracy at the cost of requiring additional sensors, data and computation.

A variety of non-CNN solutions exist, with one of the more notable being the Active Search algorithm [12], which uses SIFT features to inform a matching process. SIFT descriptors are calculated over the query image and are directly compared to a known 3D model’s SIFT features. SIFT and other non-CNN learned descriptors have been used to achieve high localization accuracy, but these descriptors tend to be susceptible to changes in the environment, and they often necessitate systems with large amounts of memory and computational power (compared to CNNs) [4].

The primary focus of this work is quantifying the impact of the loss function when training a pose regression CNN. Hence, we do not draw direct comparisons between the proposed model and significantly different pipelines—such as SIFT-based feature matching algorithms or PoseNet variations with highly modified architectures. Moreover, for the purpose of maximizing the number of available benchmark datasets, we consider pose regressors which handle purely RGB query images. In this way, this work deals specifically with CNN solutions to the monocular, RGB-only image localization task.

3 Formulating the Proposed Loss Term

When trying to accurately regress one’s pose based on visual data alone, the error in the two terms which define pose—position and rotation—obviously needs to be minimized. If these error terms were entirely minimized, the camera would be in the correct location and would be ‘looking’ in the correct direction.

Formally, pose regression networks—such as the default PoseNet—are trained to regress an estimate \(\hat{\vec {p}}\) for a camera’s true pose \(\vec {p}\). They do this by calculating the loss after every training iteration, which is formulated as the MSE between the predicted position \(\hat{\vec {x}}\) and the true position \(\vec {x}\), plus the MSE between the predicted rotation \(\hat{\vec {q}}\) and the true rotation \(\vec {q}\). Note that rotations are encoded as quaternions, since the space of rotations is continuous, and results can be easily normalized to the unit sphere in order to ensure valid rotations. Hyperparameters \(\alpha \) and \(\beta \) control the balance between positional and rotational error, as illustrated in Eq. (1). In practice, RGB-only pose regression networks reach a maximum localization accuracy when minimizing these error terms independently.

$$\begin{aligned} \mathcal {L}_{default}&= \alpha \cdot \Vert \hat{\vec {x}}- \vec {x}\Vert + \beta \cdot \Vert \hat{\vec {q}}- \vec {q}\Vert \end{aligned} \tag{1}$$
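As a minimal sketch of Eq. (1), the default loss can be written in plain Python as below. The function names and the default values of `alpha` and `beta` are illustrative assumptions rather than part of the original implementation (which uses TensorFlow). The `normalize_quaternion` helper shows one way a regressed quaternion could be projected back onto the unit sphere.

```python
import math


def l2_norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(c * c for c in v))


def normalize_quaternion(q):
    """Scale a quaternion to unit length so that it encodes a valid rotation."""
    n = l2_norm(q)
    return [c / n for c in q]


def default_loss(x_hat, x, q_hat, q, alpha=1.0, beta=500.0):
    """Eq. (1): alpha * ||x_hat - x|| + beta * ||q_hat - q||.

    alpha and beta are hyperparameters; the defaults here are placeholders.
    """
    pos_err = l2_norm([a - b for a, b in zip(x_hat, x)])
    rot_err = l2_norm([a - b for a, b in zip(q_hat, q)])
    return alpha * pos_err + beta * rot_err
```

With identical rotations, the loss reduces to the weighted positional distance, so a prediction 5 m from the true position with `alpha=1.0` yields a loss of 5.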

Rather than considering position and rotation as two separate quantities, we consider them together as a line in 3D space: the line travels in the direction defined by the rotation, and must pass through the position \(\vec {x}\). We then introduce a ‘line-of-sight’ term which constrains our predictions to lie on this line. The line-of-sight term considers the cosine similarity between the direction of the pose \(\vec {p}\) and the direction of the difference vector \(\vec {d}= \vec {x}- \hat{\vec {x}}\), as per Eq. (2) and Fig. 2. This term is zero only when the predicted position lies on the line defined by the ground truth pose, hence constraining the pose regression objective further. In the context of image localization, this encourages the predicted poses to lie on the line-of-sight defined in the ground truth image.

$$\begin{aligned} 1 - \cos {\theta } = 1 - \frac{\vec {p}\cdot \vec {d}}{\Vert \vec {p}\Vert \cdot \Vert \vec {d}\Vert } \end{aligned} \tag{2}$$

Fig. 2. The important quantities required in the calculation of the proposed loss term in 2D. This process naturally extends to 3D. The Euclidean dot product formula is used to calculate a value for \(\theta \).
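The line-of-sight term of Eq. (2) can be sketched as follows. Here `direction` is assumed to be a 3-vector giving the ground truth viewing direction (how it is derived from the quaternion depends on the camera axis convention and is left to the caller). Returning zero when the denominator vanishes is also an assumption, since the cosine is undefined in that case.

```python
import math


def line_of_sight_term(direction, x, x_hat, eps=1e-12):
    """Eq. (2): 1 - cos(theta) between the pose direction p and d = x - x_hat."""
    d = [a - b for a, b in zip(x, x_hat)]
    dot = sum(p * c for p, c in zip(direction, d))
    norm_p = math.sqrt(sum(p * p for p in direction))
    norm_d = math.sqrt(sum(c * c for c in d))
    if norm_p * norm_d < eps:
        # Assumption: treat a zero-length d (or direction) as contributing no penalty.
        return 0.0
    cos_theta = dot / (norm_p * norm_d)
    # Clamp to guard against floating point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return 1.0 - cos_theta
```

As written, the term lies in the range 0 to 2. It is 0 when \(\vec {d}\) points along the pose direction, 1 when \(\vec {d}\) is perpendicular to it, and 2 when \(\vec {d}\) points in the opposite direction.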

We modify the default loss function presented in Eq. (1) by adding a weighted contribution of the line-of-sight loss term, producing the proposed loss function in Eq. (3). In practice, the value of \(\gamma \) is chosen to roughly reflect the scale of the scene being considered, and is found via a hyperparameter grid search. Note that the line-of-sight term could instead contribute to the loss through multiplication, higher order terms and so on, but it was determined that weighted addition produced the best performing networks.

$$\begin{aligned} \mathcal {L}_{proposed}&= \mathcal {L}_{default} + \gamma \cdot (1 - \cos {\theta }) \end{aligned} \tag{3}$$
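A self-contained sketch of Eq. (3) is given below, combining the default loss with the weighted line-of-sight term. The helper names, the `direction` argument (an assumed 3-vector for the ground truth viewing direction) and all default hyperparameter values are illustrative assumptions. In the paper \(\gamma \) is selected per scene by grid search.

```python
import math


def _sub(a, b):
    """Element-wise difference a - b."""
    return [p - q for p, q in zip(a, b)]


def _norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in v))


def proposed_loss(x_hat, x, q_hat, q, direction,
                  alpha=1.0, beta=500.0, gamma=1.0):
    """Eq. (3): L_default + gamma * (1 - cos(theta))."""
    # Eq. (1): weighted positional and rotational error.
    default = alpha * _norm(_sub(x_hat, x)) + beta * _norm(_sub(q_hat, q))

    # Eq. (2): angle between the pose direction and d = x - x_hat.
    d = _sub(x, x_hat)
    denom = _norm(direction) * _norm(d)
    if denom == 0.0:
        # Assumption: no line-of-sight penalty when the cosine is undefined.
        line_of_sight = 0.0
    else:
        cos_theta = sum(p * c for p, c in zip(direction, d)) / denom
        line_of_sight = 1.0 - max(-1.0, min(1.0, cos_theta))

    return default + gamma * line_of_sight
```

Setting `gamma=0.0` recovers the default loss, which makes it straightforward to train both variants with the same code path.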

In short, the final loss function used to train the proposed model (Eq. (3)) is the result of an exploration in the space of possible loss terms, and the term’s design was informed by task specific observations and experimentation.

4 Experiments

Our experiments are naturally centred around testing the performance of the proposed model (defined in Sect. 3). This performance is defined with respect to the following criteria:

  • Accuracy: the system should be able to regress a camera’s pose with a level of positional and rotational accuracy that is competitive with similar classes of algorithms. Accuracy is reported using per-scene and average median positional and rotational error (see Sect. 5.1).

  • Robustness: the system should be robust to perceptual aliasing, motion blur and other challenges posed by the considered datasets (see Sect. 5.2 and Fig. 8).

  • Time performance: evaluation should occur in real-time (\(\sim \)30 frames per second), such that the system is suitable for hardware-limited real-time applications, or for platforms with RGB-only image sensors, e.g. mobile phones (see Sect. 5.3).
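The accuracy criterion above can be sketched as follows. The exact definition of rotational error is not stated in this section, so the quaternion angle formula \(2\arccos |\langle \hat{\vec {q}}, \vec {q}\rangle |\) used here is an assumption (it is a common choice). The function names and the pose representation, a `(position, quaternion)` tuple, are likewise hypothetical.

```python
import math
from statistics import median


def positional_error(x_hat, x):
    """Euclidean distance between predicted and true positions (metres)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x)))


def rotational_error_deg(q_hat, q):
    """Angle in degrees between two quaternions (assumed error definition)."""
    n_hat = math.sqrt(sum(c * c for c in q_hat))
    n = math.sqrt(sum(c * c for c in q))
    # abs() makes q and -q, which encode the same rotation, equivalent.
    dot = abs(sum(a * b for a, b in zip(q_hat, q))) / (n_hat * n)
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def median_errors(predicted, truth):
    """Median positional (m) and rotational (deg) error over a test set.

    Each pose is a (position, quaternion) tuple.
    """
    pos = [positional_error(p[0], t[0]) for p, t in zip(predicted, truth)]
    rot = [rotational_error_deg(p[1], t[1]) for p, t in zip(predicted, truth)]
    return median(pos), median(rot)
```

The median is less sensitive than the mean to a small number of badly localized frames, which is one reason it is a convenient summary for per-scene results.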

We compare our proposed model against the default PoseNet and other PoseNet variants.

4.1 Datasets

The following datasets are used to benchmark model performance. Each scene’s recommended train and test split is used throughout the following experiments (Figs. 3, 4, 5 and 6).

Fig. 3. Sample images from each of the 7 scenes in the 7Scenes dataset.

Fig. 4. Sample images from each of the 6 scenes in the Cambridge Landmarks dataset.

Fig. 5. Sample images from each of the 5 scenes in the University dataset.

Fig. 6. Sample images from the 2 scenes in the Gemini dataset.

7Scenes [15]. 7 indoor locations in a domestic office context. The dataset features large training and testing sets (in the thousands). The camera paths move continuously while gathering images in distinct sequences. Images include motion blur, featureless spaces and specular reflections (see Fig. 8), making this a challenging dataset, and one that has been used extensively in the image localization literature. The ground truth poses are gathered with KinectFusion, and the RGB-D frames each have a resolution of \(640\times 480\) px.

Cambridge Landmarks [2, 4]. 6 outdoor locations in and around Cambridge, United Kingdom. The larger spatial extent and restricted dataset size make this a challenging dataset from which to learn to regress pose; methods akin to the one presented in this work typically only deliver positional accuracy on the scale of metres. However, the dataset does provide a common point of comparison, and also includes large expanses of textureless surfaces. Ground truth poses are generated by a SfM process, so some comparison can be drawn between this dataset and the one created in this work.

University [5]. 5 indoor scenes in a university context. Ground truth poses are gathered using odometry estimates and “manually generated location constraints in a pose-graph optimization framework” [5]. The dataset, similarly to 7Scenes, includes challenging frames with high degrees of perceptual aliasing, where multiple frames (with different poses) give rise to similar images [20]. Although the scenes are registered to a common coordinate system in the University dataset and thus a network could be trained on the full dataset, the models created in this work are trained and tested scene-wise for the purpose of consistency.

Gemini (footnote 1). 2 indoor scenes in a university lab context. This dataset was created for the purpose of studying the effect of texture and colour on pose regression networks: both scenes survey the same environment, with one scene including decor (posters, screen-savers, paintings etc.) and the other deliberately excluding visually rich, textured and colourful decor. As such the two scenes are labelled Decor and Plain. A photogrammetry pipeline (COLMAP [14]) was used to generate the ground truth poses. Images were captured in 15 separate video sequences using a FujiFilm X-T20 with a 23 mm prime autofocus lens (in order to ensure a fixed calibration matrix between sequences). Visualizations of the Decor scene are provided in Fig. 7.

Fig. 7. (a)–(b) Varying views of the Gemini dataset.

4.2 Architecture and Training

As stated, we primarily experiment with the PoseNet architecture (using TensorFlow). For the purpose of brevity we redirect the reader to the original publication [4], as here we only describe crucial elements of the network’s design and operation.

The PoseNet architecture is itself based on the GoogLeNet architecture [16], a 22 layer deep network which performs classification and detection. PoseNet retains GoogLeNet’s early feature extracting layers, and replaces the final three softmax classifiers with affine regressors. The network is pretrained using large classification datasets such as Places [21].

Strictly, the default loss function used is not exactly as defined in Eq. (1). Instead, PoseNet uses the predictions from all three affine regressors (hence there are three predictions for each quantity). We label the \(i^{th}\) affine regressor’s hyperparameters and predictions using a subscript i, as per Eq. (4). All three affine regressors’ predictions are used in the loss function, but each has different hyperparameter weightings: \(\alpha _{1}=\alpha _{2}=0.3\), \(\alpha _{3}=1\), \(\beta _{1}=\beta _{2}=150\) and \(\beta _{3}=500\).

$$\begin{aligned} \mathcal {L}_{default}&= \sum _{i=1}^{3} \left( \alpha _{i} \cdot \Vert \hat{\vec {x}}_{i} - \vec {x}\Vert + \beta _{i} \cdot \Vert \hat{\vec {q}}_{i} - \vec {q}\Vert \right) \end{aligned} \tag{4}$$
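The three-regressor loss can be sketched as below, using the weightings quoted above. This assumes the three weighted contributions are summed into a single scalar. The function and constant names are hypothetical.

```python
import math

# Weightings quoted in the text for the three affine regressors.
ALPHAS = (0.3, 0.3, 1.0)
BETAS = (150.0, 150.0, 500.0)


def _dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def multi_regressor_loss(x_hats, q_hats, x, q, alphas=ALPHAS, betas=BETAS):
    """Sum of weighted position and rotation errors over all regressors.

    x_hats and q_hats hold one prediction per affine regressor, in order.
    """
    return sum(
        a * _dist(x_hat, x) + b * _dist(q_hat, q)
        for a, b, x_hat, q_hat in zip(alphas, betas, x_hats, q_hats)
    )
```

For example, if all three regressors predict a position 1 m from the ground truth with a perfect rotation, the loss is 0.3 + 0.3 + 1.0 = 1.6, so the final regressor dominates the total.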

In order to demonstrate the consistency and generalization of the proposed network, we train against all scenes in all datasets using the same experimental setup. For each scene we train PoseNet using the default loss (Eq. (4)) and the proposed loss (Eq. (3)) with the contribution from all three affine regressors. Each model is trained per-scene over 300,000 iterations with a batch size of 75 on a Tesla K40c, which takes \(\sim \)10 h to complete.

5 Results

We compare our proposed model to PoseNet and one of its variants—Bayesian PoseNet [18]—in Table 1. This is to show the proposed model’s performance when compared to other variants of PoseNet with modified loss functions. We then provide results specifically comparing the default PoseNet to our proposed model in Table 2. A discussion of our system’s performance regarding the criteria outlined in Sect. 4 follows.

5.1 Accuracy

It is observed that the proposed model outperforms the default version of PoseNet in approximately half of the 7Scenes scenes, particularly the Stairs scene. In the Stairs scene, repetitious structures, e.g. staircases, make localization harder, yet the proposed model is robust to such challenges. The network is outperformed in other scenes, namely outdoor datasets with large spatial extents, but in general, performance is improved for the indoor datasets 7Scenes, University and Gemini.

Table 1. The results of various pose regression networks for various image localization datasets. Median positional and rotational error is reported in the form: metres, degrees. The lowest errors are emboldened. Note that our proposed model is competitive in indoor datasets with respect to median positional error.

Table 2. A study on the direct effects of using our proposed loss function instead of the default loss function when training PoseNet. Median positional and rotational error is reported in the form: metres, degrees. The lowest errors of each group are emboldened. Note that our contribution substantially outperforms the default PoseNet in both median positional and median rotational error throughout the University dataset and the Gemini dataset. In the Gemini dataset, decreases of 26.7% and 24.0% in the median positional and rotational error are observed in the Decor scene, and an overall increase in accuracy demonstrates the proposed model’s robustness to textureless indoor environments (when compared to the default PoseNet).

A set of cumulative histograms for six of the evaluated scenes is provided in Table 3, where we compare the distribution of the positional errors and rotational errors. Median values (provided in Tables 1 and 2) are plotted for reference.

The proposed model’s errors are strictly less than the default PoseNet’s throughout the majority of the Chess and Coffee Room distributions. However, the default PoseNet outperforms our proposed model with respect to rotational accuracy in the \(10^\circ \) to \(30^\circ \) range in the Coffee Room scene.

Note the weaker performance of the proposed model on the King’s College scene, where the positional error distributions for the two networks are nearly aligned. Moreover, the default PoseNet more accurately regresses rotation in this outdoor scene. See Sects. 5.2 and 6 for further discussion.

Table 3. Cumulative histograms of positional and rotational errors, with median values plotted as a dotted line. Note that the proposed model’s positional error distribution is strictly less than (shifted to the left of) the default PoseNet’s positional error distribution for the indoor scenes (except Conference, where performance is comparable). Additionally, the maximum error of the proposed model is lower in the scenes Meeting, Coffee Room and Kitchen, meaning that our implementation is robust to some of the most difficult frames offered by the University dataset. Images best viewed in colour.

5.2 Robustness

The robustness of our system to challenging test frames—that is, images with motion blur, repeated structures or demonstrating perceptual aliasing [6]—can be determined via the cumulative histograms in Table 3. For the purpose of visualization, some difficult testing images from the 7Scenes dataset are displayed in Fig. 8.

The hardest frames in the test set by definition produce the greatest errors. Consider the positional error for the Meeting scene: our proposed model reaches a value of 1.0 on the y-axis before the default PoseNet does, meaning that the hardest frames in the test set have their position regressed more accurately. This analysis extends to each of the cumulative histograms in Table 3, thus confirming our proposed loss function’s robustness to difficult test scenarios, as the frames of greatest error consistently have less than or comparable errors when compared to the default PoseNet.
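The reading of the cumulative histograms described above can be illustrated with a short sketch. The error values in the usage example are hypothetical and serve only to show the mechanics; they are not results from this work. The function names are likewise illustrative.

```python
def cumulative_fraction(errors, threshold):
    """Fraction of test frames whose error is at most `threshold`.

    This is the y-value of the cumulative histogram at x = threshold.
    """
    return sum(1 for e in errors if e <= threshold) / len(errors)


def saturation_point(errors):
    """Smallest error value at which the cumulative curve reaches 1.0.

    This equals the maximum error, i.e. the error on the hardest frame.
    """
    return max(errors)


# Hypothetical per-frame positional errors (metres) for two models.
model_a = [0.1, 0.2, 0.4]
model_b = [0.1, 0.3, 0.9]

# The curve that reaches 1.0 first belongs to the model with the
# smaller worst-case error.
a_saturates_first = saturation_point(model_a) < saturation_point(model_b)
```

A curve that lies to the left of another at every height indicates smaller errors at every quantile, not only at the median.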

Fig. 8. (a)–(c) Images from the 7Scenes dataset where accurately regressing pose is challenging.

Moreover, the proposed model significantly exceeds the default PoseNet’s performance throughout the Gemini dataset. The performance gap in the Plain scene indicates that our model is more robust to textureless spaces than the default PoseNet.

5.3 Efficiency

Training Time. The training duration of our implementation is, by design, very similar to that of the default PoseNet, and highly competitive when compared to the other systems analyzed in Table 1. This is due to the relatively low computational cost of introducing a simple line-of-sight loss term into the network’s overall loss function. The average training time for default PoseNet and for our augmented PoseNet over the University dataset is 10:21:31 and 10:23:33 respectively (HH:MM:SS), where both tests were run on the same hardware.

Testing Time. Network operation at test time is naturally not affected by the loss function augmentation. The time performance when testing is similar to that of the default PoseNet and in general is competitive amongst camera localization pipelines (especially feature based matching techniques). We observe a total elapsed time of 16.04 s when evaluating the entire Coffee Room scene testing set, whereas it takes 16.03 s using the default PoseNet. In other words, both systems take \(\sim \)16.8 ms to complete a single inference on our hardware.

Memory Cost. Memory cost in general for CNNs is low, as only the weights for the trained layers and the input image need to be loaded into memory. When compared to feature matching techniques, which need to store feature vectors for all instances in the test set, or SIFT-based matching methods with large memory and computational overheads, CNN approaches are in general quite desirable, especially in resource constrained environments. Both the proposed model and the default PoseNet require 8015 MiB to train and 10947 MiB to test (as reported by nvidia-smi). For interest, the network weights for the proposed model’s TensorFlow implementation total only 200 MB.

6 Discussion and Future Work

Experimental results confirm that the proposed loss term has a positive impact on accuracy and robustness (to textureless spaces and so forth), whilst maintaining speed and memory usage.

The network is outperformed by the SIFT-based image localization algorithm ‘Active Search’ [12], indicating that there is still some work required until the gap between SIFT-based algorithms and CNNs is closed (in the context of RGB-only image localization). However, SIFT localization operates on a much longer timescale, and can be highly computationally expensive depending on the dataset and pipeline being used [19].

Ultimately, the loss function described in this work illustrates that intuitive loss terms, designed with respect to a specific task (in this case image localization) can positively impact the performance of deep networks.

Possible avenues for future work include extending this loss function design methodology to other CV tasks in order to achieve higher performance, or applying it to RGB-D pipelines. The effect that such loss terms have on the convergence rate and upper performance limit of NNs could also be investigated.

7 Conclusion

In summary, the effect of adding a line-of-sight loss term to an existing pose regression network is investigated. The performance of the proposed model is compared to other similar models across common image localization benchmarks and the newly introduced Gemini dataset. Improvements to performance in the image localization task are observed, without any drastic increase in evaluation time or training time. In particular, the median positional accuracy is, on average, increased for indoor datasets when compared to a version of the model without the suggested loss term.

This work suggests that mean squared error between the ground truth and the regressed predictions, although often used as a measure of loss for many Neural Networks, can be improved upon. Specifically, loss functions designed with the network’s task in mind may yield better performing models. For pose regression networks, the distinct and coupled nature of positional and rotational quantities needs to be considered when designing a network’s loss function.