1 Introduction

Estimating the six degrees-of-freedom (6-DoF) camera pose from a given RGB image is a key component in many computer vision systems such as augmented reality, autonomous driving, and robotics. Classical methods (Sattler et al., 2011, 2012, 2016a; Taira et al., 2018; Sarlin et al., 2019) establish 2D-2D(-3D) correspondences between query and database local descriptors, followed by PnP-based camera pose estimation. Although powerful, these methods are memory- and compute-inefficient, as they require storing an immense number of local image descriptors and performing hierarchical descriptor matching within a RANSAC loop to infer the camera pose.

Fig. 1

HSCNet architecture. The ground-truth scene 3D coordinates are hierarchically quantized into regions and sub-regions. Different branches of the network sequentially predict discrete regions and sub-regions, and continuous 3D coordinates, with the processing of each branch conditioned on the result of the previous one. Given an input image, HSCNet predicts 3D coordinates for 2D image pixels, which then form the input to PnP-RANSAC for 6-DoF pose estimation

On the other hand, end-to-end pose regression methods that directly regress the camera pose parameters are much faster and more scalable (Kendall et al., 2015; Balntas et al., 2018; Chen et al., 2021; Shavit & Keller, 2022). However, such methods are significantly less accurate than the ones based on local descriptors. A better trade-off between accuracy and computational efficiency is offered by structured localization approaches (Brachmann et al., 2017; Brachmann & Rother, 2018, 2021; Shotton et al., 2013; Li et al., 2020; Wang et al., 2021). Structured methods are trained to learn an implicit representation of the 3D environment by directly regressing the 3D scene coordinate corresponding to each 2D pixel location in a given input image. This directly provides 2D-3D correspondences and avoids storing and explicitly matching database local descriptors with the query. For small-scale scenes, scene-coordinate regression methods work on par with (Brachmann et al., 2021) or outperform (Brachmann & Rother, 2018, 2021) approaches based on local image descriptors. In addition, the storage and computational costs of structured methods are lower than those of their classical counterparts.

Existing scene-coordinate regression approaches (Brachmann et al., 2017; Brachmann & Rother, 2018, 2021) are designed to predict scene coordinates from a small local image patch, which provides robustness to viewpoint changes. However, such methods are limited in applicability to larger scenes, where ambiguity from visually similar local image patches cannot be resolved with a limited receptive field. Using larger receptive fields, up to the full image, to regress the coordinates can mitigate these ambiguities by encoding more context. This, however, has been shown to be prone to overfitting to the larger input patterns when training data is limited, even if data augmentation alleviates this problem to some extent (Li et al., 2018; Brachmann & Rother, 2021).

Increasing context by enlarging the receptive field while maintaining the distinctiveness of local descriptors and avoiding overfitting is a challenging problem. We address this using a special network architecture, called HSCNet (Li et al., 2020), which hierarchically encodes scene context using a series of classification layers before making the final coordinate prediction. The overall pipeline is illustrated in Fig. 1. Specifically, the network predicts scene coordinates progressively in a coarse-to-fine manner, where predictions correspond to a region of the scene at the coarse levels and to coordinate residuals at the finest level. The predictions at each level are conditioned on both the descriptors and the predictions from the preceding level, which, as we experimentally demonstrate in this work, is the key component in large scenes. This conditioning leverages FiLM (Perez et al., 2018) layers that allow the receptive field to be increased gradually. HSCNet uses CNNs to encode the descriptors and predictions. In this work, we extend this idea and propose a transformer-based (Vaswani et al., 2017) conditioning mechanism, named HSCNet++, which captures global context into local representations more efficiently through attention and does not require heavy convolutional layers to enlarge the receptive field. The architecture improves coordinate prediction at all levels, both coarse and fine. We integrate dynamic position information in the form of predicted coarse positional encoding, without the need to learn or explicitly construct position embeddings, and show promising results on several camera relocalization benchmarks.

We further extend HSCNet++ by removing the dependency on dense ground-truth scene coordinates, which limits the applicability of HSCNet in outdoor scenes. Similar to Brachmann and Rother (2018), HSCNet addressed the issue of sparse data on the Cambridge dataset (Kendall et al., 2015) by using MVS-based densification (Schönberger et al., 2016). However, such densification introduces additional noise and is costly to obtain, while directly training HSCNet with sparse supervision leads to a significant performance drop. In HSCNet++, we propose a simple yet effective pseudo-labelling method, where ground-truth labels at each pixel location are propagated to a fixed spatial neighbourhood, based on the assumption that nearby pixels share similar statistics. To provide robustness to pseudo-label noise, objective functions based on the symmetric cross-entropy and a re-projection loss are proposed. While the symmetric cross-entropy provides robustness to the classification layers of HSCNet, the re-projection loss rectifies the noise in the pseudo-labelled 3D scene coordinates.

This work is a summary and extension of HSCNet. We validate our approach on three datasets used in previous works: 7-Scenes (Shotton et al., 2013), 12-Scenes (Valentin et al., 2016), and Cambridge Landmarks (Kendall et al., 2015). Our approach demonstrates consistently better performance and achieves state-of-the-art results for single-image camera relocalization. In addition, by combining the 7-Scenes and 12-Scenes datasets into single large scenes, we show that our approach scales more robustly to larger environments. In summary, our contributions are as follows:

  1.

    Compared to HSCNet, we utilize an improved transformer-based conditioning mechanism that efficiently and effectively encodes global spatial information into the scene coordinate prediction pipeline, resulting in a significant performance improvement from 84.8% to 88.7% on indoor localization while requiring only 57% of the memory footprint;

  2.

    We extend HSCNet to optionally use only sparse ground truth during training by introducing pseudo ground-truth labels and an angle-based re-projection loss. When trained with sparse supervision, the proposed approach achieves better performance on the Cambridge outdoor camera relocalization dataset than when trained on MVS-densified data;

  3.

    We show that the classical pixel-based positional encoding in our conditioning mechanism leads to a significant performance drop, especially in scenes exhibiting substantial repetitive patterns. Our spatial positional encoding, inspired by the FiLM layer, eliminates this problem and achieves SoTA performance on several image-based localization benchmarks.

2 Related Work

We review existing visual localization methods according to the category they belong to.

Classical visual localization methods assume that a scene is represented by a 3D model, which is a result of processing a set of database images. Each 3D point of the model is associated with one or several database local descriptors. Given a query image, a sparse set of keypoints and their local descriptors are obtained using traditional (Calonder et al., 2010; Lowe, 2004; Rublee et al., 2011; Bay et al., 2006) or learned CNN-based (DeTone et al., 2018; Revaud et al., 2019; Dusmanu et al., 2019; Melekhov et al., 2021, 2020; Luo et al., 2019; Wang et al., 2020; Tian et al., 2017; Balntas et al., 2016; Zagoruyko & Komodakis, 2015; Han et al., 2015; Melekhov et al., 2017; Simo-Serra et al., 2015; Mishchuk et al., 2017) approaches. The query local descriptors are then matched with local descriptors extracted from database images to establish tentative 2D-3D matches. These tentative matches are then geometrically verified using RANSAC (Fischler & Bolles, 1981) and the camera pose is estimated via PnP. Although these methods produce a very accurate pose estimate, the computational cost of sparse keypoint matching becomes a limitation, especially for large-scale environments. The large computational cost is addressed by image retrieval-based methods (Arandjelović et al., 2016; Radenović et al., 2016) restricting matching query descriptors to local descriptors extracted from top-ranked database images only. Moreover, despite the recent advancements of learned keypoint detectors and descriptors (Wang et al., 2020; Dusmanu et al., 2019; Melekhov et al., 2020, 2021; Sun et al., 2021; Zhou et al., 2021; Revaud et al., 2019; Tyszkiewicz et al., 2020), extracting discriminative local descriptors which are robust to different viewpoint and illumination changes is still an open problem.

Absolute camera pose regression (APR) methods aim to alleviate the limitations of structure-based methods by using a neural network that directly regresses the camera pose of a query image given as input (Kendall et al., 2015; Brahmbhatt et al., 2018; Kendall & Cipolla, 2016, 2017; Melekhov et al., 2017; Walch et al., 2017; Chen et al., 2021, 2022). The network is trained on database images with ground-truth poses by optimizing a weighted combination of orientation and translation L2 losses (Kendall et al., 2015; Melekhov et al., 2017), leveraging uncertainty (Kendall et al., 2018), utilizing the temporal consistency of sequential images (Walch et al., 2017; Radwan et al., 2018; Valada et al., 2018; Xue et al., 2019), or using GNNs (Xue et al., 2020) and Transformers (Shavit et al., 2021). The APR methods are scalable, fast, and memory efficient since they do not require storing a 3D model. However, their accuracy is an order of magnitude lower than that of structure-based localization approaches and comparable with image retrieval methods (Sattler et al., 2019). Moreover, the APR approaches require a different network to be trained and evaluated per scene when the scenes are registered to different coordinate frames.

Relative camera pose regression (RPR) methods, in contrast to APR, train a network to predict relative pose between the query image and each of the top-ranked database images (Ding et al., 2019; Laskar et al., 2017; Balntas et al., 2018), obtained by image retrieval (Arandjelović et al., 2016; Radenović et al., 2016). The camera location is then obtained via triangulation from two relative translation estimations verified by RANSAC. This leads to better generalization performance without using scene-specific training. However, the RPR methods suffer from low localization accuracy similarly to APR.

Scene coordinate regression (SCR) methods learn the first stage of the pipeline in the structure-based approaches. Namely, either a random forest (Brachmann et al., 2016; Cavallari et al., 2020, 2017; Guzmán-Rivera et al., 2014; Massiceti et al., 2017; Meng et al., 2017, 2018; Shotton et al., 2013; Valentin et al., 2015) or a neural network (Brachmann et al., 2017; Brachmann & Rother, 2018, 2019a, c, 2021; Budvytis et al., 2019; Bui et al., 2018; Cavallari et al., 2019; Li et al., 2018; Massiceti et al., 2017) is trained to directly predict 3D scene coordinates for the pixels and thus the 2D-3D correspondences are established. These methods do not explicitly rely on feature detection, description, and matching, and are able to provide correspondences densely. They are more accurate than traditional feature-based methods at small and medium scales, but usually do not scale well to larger scenes (Brachmann & Rother, 2018, 2019a). In order to generalize well to novel viewpoints, these methods typically rely on only local image patches to produce the scene coordinate predictions. However, this may introduce ambiguities due to similar local appearances, especially when the scale of the scene is large. To resolve local appearance ambiguities, we introduce element-wise conditioning layers to modulate the intermediate feature maps of the network using coarse discrete location information. We show this leads to better localization performance, and we can robustly scale to larger environments.

Joint classification-regression frameworks have been proven effective in solving various vision tasks. For example, Rogez et al. (2017, 2019) proposed a classification-regression approach for human pose estimation from single images. In Brachmann et al. (2016), a joint classification-regression forest is trained to predict scene identifiers and scene coordinates. In Weinzaepfel et al. (2019), a CNN is used to detect and segment a predefined set of planar Objects-of-Interest (OOIs), and then, to regress dense matches to their reference images. In Budvytis et al. (2019), scene coordinate regression is formulated as two separate tasks of object instance recognition and local coordinate regression. In Brachmann and Rother (2019a), multiple scene coordinate regression networks are trained as a mixture of experts along with a gating network which assesses the relevance of each expert for a given input, and the final pose estimate is obtained using a novel RANSAC framework, i.e. Expert Sample Consensus (ESAC). In contrast to existing approaches, in our work, we use spatially dense discrete location labels defined for all pixels, and propose FiLM-like (Perez et al., 2018) conditioning layers to propagate information in the hierarchy. We show that our novel framework allows us to achieve high localization accuracy with one single compact model.

Transformers have already shown a positive impact on the problem of visual localization. Shavit et al. (2021) show that multi-headed transformer architectures can be used to improve end-to-end absolute camera pose localization in multiple scenes with a single trained model. Similarly, SuperGlue, LoFTR and COTR (Sarlin et al., 2020; Sun et al., 2021; Jiang et al., 2021) demonstrate the usefulness of transformer architectures in learning local descriptor models. Inspired by these successes, this paper extends the transformer architecture to structured localization.

3 Problem Formulation and Notation

The goal of camera pose estimation is to predict the 6-DoF pose \(p(x)\in \mathbb {R}^6\) for an RGB image x. We adopt a standard two-step approach. As a first step, 3D coordinates are predicted for each pixel, or some of the pixels, of an image. Those are the coordinates from a known 3D scene. Such predictions result in a set of 2D-3D correspondences. As a second and final step, these correspondences are fed into the PnP algorithm that estimates the camera pose. In this work, we focus on the 3D coordinate prediction task.

We rely on a function \(f: [0,1]^{W\times H \times 3} \rightarrow \mathbb {R}^{w\times h \times 3}\), with \(w = W/8\) and \(h = H/8\), that provides such coordinate predictions given an input image x of resolution \(W\times H\) pixels; the predicted coordinates for image x are given by \(\hat{y}(x) = f(x)\).

The known 3D environment is represented by a set of training images, with known ground-truth labels per pixel in the form of 3D coordinates. The training set comprises pairs of the form (x, y(x)) for image x and ground-truth 3D coordinates y(x). In case ground truth is available only sparsely, i.e. only for a small part of the image pixels, a corresponding binary mask \(m(x)\in \{0,1\}^{w\times h}\) denotes which pixels are valid. The value of the ground truth or prediction at a particular pixel is denoted by subscript i, e.g. \(y(x)_i\) for the ground-truth coordinate of pixel i.

4 HSCNet++: Hierarchical Scene Coordinate Prediction with Transformers

4.1 Overview

A baseline conventional approach for this task is to use a fully convolutional network (FCN) that maps input images to 3D coordinate predictions and is trained with a regression loss. The proposed architecture extends this scheme by constructing a hierarchy of labels, from coarse-level to fine-level, and by adding extra layers to predict those labels. Hierarchical discrete labels are defined by partitioning the ground-truth 3D points of the scene with hierarchical k-means. The number of levels in the hierarchy is fixed to 2 in this work. In this way, in addition to the ground-truth 3D scene coordinates, each pixel in a training image is also associated with two discrete labels, namely region and sub-region labels, obtained at different levels of the clustering hierarchy. Region and sub-region labels are denoted by one-hot encodings \(y_r(x)\in \{0, 1\}^{w\times h\times k_1}\) and \(y_s(x)\in \{0,1\}^{w\times h\times k_2}\), respectively. The fine-level information is given by the residual between the ground-truth 3D point and the corresponding sub-region center, which we denote by \(y_{3D}(x)\in \mathbb {R}^{w\times h\times 3}\). Ground-truth 3D pixel coordinates y(x) are replaced by \(y_r(x)\), \(y_s(x)\), and \(y_{3D}(x)\). Sub-region centers and residuals, when combined by addition, compose the pixel 3D coordinates, i.e. \(y(x) = c(y_r(x) \times k_2 + y_s(x))+y_{3D}(x)\), where c is a function providing the sub-region center.
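To make the label construction concrete, the following is a minimal sketch, assuming scikit-learn and NumPy, of the 2-level hierarchical k-means partitioning described above; the function and variable names are illustrative and not taken from the authors' code.

```python
# A minimal sketch of the 2-level hierarchical k-means label generation
# described above; names are illustrative, not from the authors' code.
import numpy as np
from sklearn.cluster import KMeans


def build_hierarchical_labels(points_3d, k1=25, k2=25, seed=0):
    """Partition ground-truth scene points into k1 regions and k1*k2 sub-regions.

    Returns per-point region labels, sub-region labels, residuals to the
    sub-region centers, and the table of sub-region centers (the function c
    in the text).
    """
    # Level 1: coarse regions.
    km1 = KMeans(n_clusters=k1, random_state=seed, n_init=10).fit(points_3d)
    region = km1.labels_                                  # shape (N,)

    sub_region = np.zeros_like(region)
    centers = np.zeros((k1 * k2, 3))                      # sub-region centers c(.)
    for r in range(k1):
        mask = region == r
        # Level 2: sub-regions within each region.
        km2 = KMeans(n_clusters=k2, random_state=seed, n_init=10).fit(points_3d[mask])
        sub_region[mask] = km2.labels_
        centers[r * k2: (r + 1) * k2] = km2.cluster_centers_

    # Fine level: residuals w.r.t. the assigned sub-region center, so that
    # point = centers[region * k2 + sub_region] + residual.
    residual = points_3d - centers[region * k2 + sub_region]
    return region, sub_region, residual, centers
```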

The proposed architecture includes two classification branches for regions and sub-regions, which provide the label predictions in the form of the k-dimensional probability distributions, and a regression branch for the residual prediction. Regions, sub-region and residual predictions are denoted by \(\hat{y}_r(x)\), \(\hat{y}_s(x)\), and \(\hat{y}_{3D}(x)\), respectively. A key ingredient is to propagate coarse region information to inform the predictions at finer levels, which is achieved by conditioning layers before the classification/regression layers.

Fig. 2

An overview of the proposed HSCNet++. The figure shows the network architecture of the proposed HSCNet++. The depicted losses correspond to the case of learning with dense ground truth. Note that the switch is activated at inference time, when the predicted labels are encoded instead of the ground-truth labels

4.2 Preliminaries

We describe FiLM layers and transformer blocks, which we use in the proposed architecture.

The FiLM (Perez et al., 2018) conditioning layer represents a block whose processing is conditioned on an auxiliary input. Conditioning relies on parameter generators \(\gamma \) and \(\beta \) that produce scaling and shifting parameters \(\gamma (l)\) and \(\beta (l)\) from the auxiliary input \(l \in \mathbb {R}^{w\times h \times d}\), which is the (sub-)region label encoding. The conditioning is computed as

$$\begin{aligned} \phi (F,l) = \gamma (l) \odot F + \beta (l), \end{aligned}$$
(1)

where \(\odot \) is the Hadamard product and \(F \in \mathbb {R}^{w\times h \times d}\) is the main input. The parameters of the FiLM layer are thus conditioned on the auxiliary input. FiLM-based processing is a way to jointly encode the main and the auxiliary inputs. In the following, it is used to encode the predicted (sub-)region information together with the image features.
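As an illustration, a minimal PyTorch sketch of the FiLM conditioning in Eq. (1) could look as follows; the use of \(1\times 1\) convolutions as the generators \(\gamma \) and \(\beta \) is an assumption made for this sketch, not a statement about the exact implementation.

```python
# A minimal PyTorch sketch of the FiLM conditioning in Eq. (1), assuming the
# auxiliary (sub-)region encoding l and the feature map F share the spatial
# size h x w; the generator layers are illustrative.
import torch.nn as nn


class FiLM(nn.Module):
    def __init__(self, label_dim, feat_dim):
        super().__init__()
        # gamma(l) and beta(l): per-pixel scale and shift generated from l.
        self.gamma = nn.Conv2d(label_dim, feat_dim, kernel_size=1)
        self.beta = nn.Conv2d(label_dim, feat_dim, kernel_size=1)

    def forward(self, F, l):
        # F: (B, feat_dim, h, w) main input; l: (B, label_dim, h, w) auxiliary input.
        return self.gamma(l) * F + self.beta(l)   # Hadamard product + shift
```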

Fig. 3

HSCNet++ detailed architecture. The figure shows the detailed network architecture of the main pipeline and the FiLM conditioning network. For experiments on the combined scenes we added two more layers in the first conditioning generator, \(g_s\) that are marked in (dotted) red. We also roughly doubled the channel counts that are highlighted in red, cyan and violet for i7-Scenes, i12-Scenes and i19-Scenes, respectively (Color figure online)

Transformer We view a 3D activation tensor of size \(w\times h\times d\) as a set of \(w\times h\) vectors/tokens and provide them as input to transformer blocks. The vanilla transformer has computational complexity quadratic in the cardinality \(n=w\times h\) of the input set, which is computationally unaffordable in our case. Inspired by prior work (Sun et al., 2021), we apply the linear transformer (Katharopoulos et al., 2020), which reduces the complexity from \(\mathcal {O}(n^2)\) to \(\mathcal {O}(n)\) by using the associativity property of matrix products and replacing the exponential similarity kernel with a linear dot-product kernel.

Consequently, the transformer modules that are part of our architecture do not have a significant impact on run time.
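For concreteness, a minimal single-head sketch of such linear attention, following the kernel formulation of Katharopoulos et al. (2020), is given below; the \(\mathrm{elu}(\cdot)+1\) feature map follows that work, while the tensor shapes and names are illustrative.

```python
# A minimal single-head sketch of linear attention (Katharopoulos et al., 2020):
# the softmax kernel is replaced by the feature map elu(.)+1, and matrix-product
# associativity gives O(n) complexity in the number of tokens n = w*h.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, n, d) token sets flattened from a (w, h, d) feature map.
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', k, v)                         # summarize keys/values once
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)   # per-query normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)                # (B, n, e)
```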

4.3 HSCNet++ Architecture

This section presents the model architecture for HSCNet++ and discusses the difference compared to the original HSCNet architecture.

Overview of the model architecture The overall architecture of HSCNet++ is summarized in Fig. 2. We first present the model as it operates during inference and then clarify the differences between training and inference. An FCN backbone is used for dense feature encoding and is denoted by \(\mathcal {F}(x) \in \mathbb {R}^{w\times h \times d}\). This is a mapping of the input image to a dense feature tensor which represents the appearance of the input image.

Prediction of region labels is performed first. A module \(g_r: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times d}\) is used that consists of convolutional layers and a transformer block. Its input is feature map \(\mathcal {F}(x)\) and the output is given by \(\textbf{x}_r = g_r(\mathcal {F}(x))\). Feature map processing is performed within the local context of the receptive field with convolutions and within a global context with the transformer. The region predictor \(h_r: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times k_1}\) comprises a \(1\times 1\) convolutional layer and is used to obtain the region prediction denoted by \(\hat{y}_r(x) = h_r(\textbf{x}_r)\).

Then, sub-region prediction is performed. A module \(g_s: \mathbb {R}^{w\times h \times d} \times \mathbb {R}^{w\times h \times k_2} \rightarrow \mathbb {R}^{w\times h \times d}\) is used, which consists of convolutional layers and transformer blocks, but also FiLM layers, hence the two inputs. The main input is the feature map \(\mathcal {F}(x)\), while the auxiliary input is the region prediction \(\hat{y}_r(x)\) from the earlier stage. In practice, \(\hat{y}_r(x)\) is passed through a series of convolutional layers before being input to the FiLM layer, as shown in Fig. 3 (c, middle block). Conditioning on region predictions is a way to jointly encode appearance and geometry, which comes in the form of the region prediction, and is therefore used to improve the sub-region predictions. Then, \(\textbf{x}_s = g_s(\mathcal {F}(x), \hat{y}_r(x))\) is fed into the sub-region predictor \(h_s: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times k_2}\), comprising a \(1\times 1\) convolution layer, whose output is denoted by \(\hat{y}_s(x) = h_s(\textbf{x}_s)\) and constitutes the sub-region prediction.

Now, residual prediction is performed. Similar to the earlier stage, feature map \(\mathcal {F}(x)\) is processed by conditioning on the concatenation of the region and sub-region predictions, i.e. \(\hat{y}_r(x)\) and \(\hat{y}_s(x)\). This is denoted by module \(g_{3D}: \mathbb {R}^{w\times h \times d} \times \mathbb {R}^{w\times h \times (k_1 + k_2)} \rightarrow \mathbb {R}^{w\times h \times d}\) and consists of convolutional and FiLM layers and transformer blocks. Similarly, as before, the concatenated region and sub-region predictions are passed through a series of convolutional layers before being input to the FiLM layer as an auxiliary input (cf. Fig. 3 (c, right block)). Then, \(\textbf{x}_{3D} = g_{3D}(\mathcal {F}(x), [\hat{y}_r(x), \hat{y}_s(x)])\) is fed into the residual predictor to obtain \(\hat{y}_{3D}(x) = h_{3D}(\textbf{x}_{3D})\), where \(h_{3D}: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times 3}\) consists of a \(1\times 1\) convolution. The detailed architecture for HSCNet++ and the different modules is shown in Fig. 3.
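The three-stage flow can be summarized by the following illustrative sketch; the modules g_r, g_s, g_3d and the predictors h_r, h_s, h_3d are assumed to be defined elsewhere, and only the chaining of predictions and conditioning inputs (here in inference mode, using one-hot argmax labels) is shown.

```python
# An illustrative sketch of the three-stage inference flow described above;
# the conditioning modules and predictors are assumed to exist as nn.Modules.
import torch
import torch.nn.functional as F


def hscnetpp_forward(feat, g_r, h_r, g_s, h_s, g_3d, h_3d):
    # feat: dense backbone features of shape (B, d, h, w).
    x_r = g_r(feat)                                   # conv + transformer block
    y_r = h_r(x_r)                                    # region logits, (B, k1, h, w)

    # At inference, conditioning uses one-hot argmax labels; at training time
    # the ground-truth one-hot labels would be used instead.
    r_onehot = F.one_hot(y_r.argmax(1), y_r.shape[1]).permute(0, 3, 1, 2).float()
    x_s = g_s(feat, r_onehot)                         # FiLM-conditioned on regions
    y_s = h_s(x_s)                                    # sub-region logits, (B, k2, h, w)

    s_onehot = F.one_hot(y_s.argmax(1), y_s.shape[1]).permute(0, 3, 1, 2).float()
    cond = torch.cat([r_onehot, s_onehot], dim=1)     # (B, k1 + k2, h, w)
    x_3d = g_3d(feat, cond)                           # FiLM-conditioned on both levels
    y_3d = h_3d(x_3d)                                 # residuals, (B, 3, h, w)
    return y_r, y_s, y_3d
```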

Synergy between FiLM and transformers Modules \(g_s\) and \(g_{3D}\) include the use of FiLM layers followed by transformer blocks. Transformers typically rely on 2D positional encodings (Vaswani et al., 2017) in order to take the position of activations into account. Discarding those positions is not an appropriate choice for our task. Nevertheless, our architecture design dispenses with the need for such classical positional encodings, because the FiLM layers jointly encode appearance with the 3D coordinate predictions instead of the 2D positions within the image. To the best of our knowledge, such a form of geometry encoding for transformers has not appeared in the computer vision or machine learning literature before. We experimentally show that this is an effective design choice.

Compared to HSCNet, the proposed HSCNet++ incorporates transformers. The design choice of placing them right after the FiLM layers supports their synergy due to the aforementioned encoding of position information.

4.4 Training

When training with dense supervision, the following losses are adopted. Classification loss \(\ell _c\) is applied to the output of the two classification branches,

$$\begin{aligned} \ell _c = \ell _{ce}(\hat{y}_r(x),y_r(x) ) + \ell _{ce}(\hat{y}_s(x),y_s(x) ) \end{aligned}$$
(2)

where \(\ell _{ce}\) is the cross-entropy loss. Additionally, a regression loss \(\ell _r\), in particular the mean squared error, is applied between \(\hat{y}_{3D}(x)\) and \(y_{3D}(x)\). The total loss \(\mathcal {L}\) is a weighted sum of the two classification losses and the regression loss.

$$\begin{aligned} \mathcal {L} = \lambda _1\ell _c + \lambda _2\ell _r \end{aligned}$$
(3)

where \(\lambda _1\) and \(\lambda _2\) are the weights for each term. We observe that the localization performance is more sensitive to the regression prediction; thus, a larger weight is assigned to \(\ell _r\).
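A minimal sketch of this dense-supervision objective is given below; the weights match the values reported later in Sect. 5.1 and are assumed here purely for illustration.

```python
# A minimal sketch of the dense-supervision objective in Eqs. (2)-(3):
# cross-entropy on region and sub-region logits plus MSE on residuals.
import torch.nn.functional as F


def dense_loss(y_r_pred, y_s_pred, y_3d_pred, y_r, y_s, y_3d, lam1=1.0, lam2=10.0):
    # y_r_pred: (B, k1, h, w) logits; y_r: (B, h, w) integer region labels (same for sub-regions).
    l_c = F.cross_entropy(y_r_pred, y_r) + F.cross_entropy(y_s_pred, y_s)
    l_r = F.mse_loss(y_3d_pred, y_3d)      # residual regression
    return lam1 * l_c + lam2 * l_r
```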

4.5 Inference

During inference, the predicted 3D coordinates \(\hat{y}(x)\) and their corresponding 2D pixels are fed into the PnP-RANSAC loop to estimate the 6-DoF camera pose. These predicted 3D coordinates are obtained by simply summing the centers of the predicted sub-regions, \(c(\hat{y}_r(x) \times k_2 + \hat{y}_s(x))\), and the predicted residuals \(\hat{y}_{3D}(x)\).

Conditioning is conducted differently during training and inference, as shown in Fig. 2. At training time, conditioning is performed using the ground-truth (sub-)region labels, i.e. \(y_r(x)\) and \(y_s(x)\) are the second inputs of the conditioning blocks. At test time, conditioning uses the predicted (sub-)region labels. Specifically, the one-hot encodings resulting from the \(\mathop {\textrm{argmax}}\limits \) of \(\hat{y}_r(x)\) and \(\hat{y}_s(x)\) are the second inputs of the conditioning blocks.
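The paper follows the DSAC++/DSAC* PnP-RANSAC pipeline with soft inlier counting and pose refinement (parameters in Sect. 5.1). Purely as a rough stand-in for that final step, the sketch below feeds the predicted 2D-3D correspondences into OpenCV's generic PnP-RANSAC; only the 10-pixel inlier threshold, the 4-point minimal sets and the 256 hypotheses mirror the reported settings, everything else is an assumption.

```python
# A rough stand-in for the final pose estimation step only: OpenCV's generic
# PnP-RANSAC, not the authors' DSAC-style soft-inlier pipeline.
import cv2
import numpy as np


def estimate_pose(coords_3d, pixels_2d, K):
    # coords_3d: (N, 3) predicted scene coordinates; pixels_2d: (N, 2) pixel
    # locations mapped back from the w x h prediction grid to the image.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        coords_3d.astype(np.float64), pixels_2d.astype(np.float64), K, None,
        reprojectionError=10.0, iterationsCount=256, flags=cv2.SOLVEPNP_P3P)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec                          # world-to-camera rotation and translation
```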

4.6 Training with Sparse Supervision

When only sparse ground truth of 3D coordinates, indicated by mask m(x) for image x, is available, the straightforward approach is to apply the loss only on pixels where the mask value is 1, which we refer to as valid pixels. Instead, we propose to propagate the available labels to nearby pixels and to use two additional losses that appropriately handle the scarcity of the labels. We refer to the HSCNet++ model trained with such sparse supervision as HSCNet++(S).

Label propagation (LP) We rely on a smoothness assumption: labels do not change much in a small pixel neighborhood. Consequently, we propagate the labels in a local neighborhood around each pixel. The neighborhood is defined by a square area of size \(z \times z\). All neighbors of a valid pixel are marked as valid too, and the ground-truth maps, namely \(y_r(x)\), \(y_s(x)\), and \(y_{3D}(x)\), are updated by replicating the label of the original pixel to the neighboring pixels. Then, the classification and regression losses are applied to the newly obtained valid pixels after propagation. This can be seen as a form of pseudo-labeling that increases the density of the available labels.
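A minimal NumPy sketch of this propagation is shown below; the names are illustrative, and the last-writer-wins rule for overlapping windows is an assumption, not necessarily the authors' choice.

```python
# A minimal NumPy sketch of label propagation: each valid pixel copies its
# labels (region, sub-region, 3D coordinate) to a z x z window and marks the
# window as valid. Overlaps are resolved arbitrarily (last writer wins).
import numpy as np


def propagate_labels(mask, labels, z=11):
    # mask: (h, w) binary validity map; labels: (h, w, C) stacked ground-truth
    # maps; z: odd neighborhood window size.
    h, w, _ = labels.shape
    r = z // 2
    new_mask, new_labels = mask.copy(), labels.copy()
    for i, j in zip(*np.nonzero(mask)):
        i0, i1 = max(0, i - r), min(h, i + r + 1)
        j0, j1 = max(0, j - r), min(w, j + r + 1)
        new_mask[i0:i1, j0:j1] = 1
        new_labels[i0:i1, j0:j1] = labels[i, j]
    return new_mask, new_labels
```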

Symmetric cross-entropy loss (SCE) Pseudo-labels are expected to contain noise, which will typically be larger if propagation reaches background pixels starting from a valid foreground-object pixel. We quantitatively analyze the percentage of noisy labels as the neighborhood radius increases in Sect. 5.4. We thus face the challenging task of learning correct classification with noisy labels. The traditional cross-entropy loss is not reliable in such a scenario, as it overfits to noisy labels on some "easy" classes and suffers from under-learning on some "hard" classes (Wang et al., 2019).

Following Wang et al. (2019), we increase the robustness of the classification at minimal cost by introducing the symmetric cross-entropy loss. The additional reverse cross-entropy term in SCE is noise-tolerant: overestimating and underestimating the target value result in the same loss, which makes it more adaptive to noisy labels and allows the model to cope better with label noise. The SCE loss is defined as a weighted sum of the following terms:

$$\begin{aligned} \ell _{sce} =\lambda _{ce} \ell _{ce} + \lambda _{rce} \ell _{rce} \end{aligned}$$
(4)

where \(\ell _{rce}\) is the reverse cross-entropy loss. For a valid pixel i, \(\ell _{rce}\) is defined as:

$$\begin{aligned} \ell _{rce}(x,i) = -\hat{y}_r(x)_i \log y_r(x)_i, \end{aligned}$$
(5)

compared to the conventional one defined as follows:

$$\begin{aligned} \ell _{ce}(x,i) = -y_r(x)_i \log \hat{y}_r(x)_i. \end{aligned}$$
(6)
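As an illustration, the SCE objective applied to the (sub-)region logits could be sketched as follows; clamping \(\log 0\) to a constant in the reverse term follows Wang et al. (2019), and the weights are the values reported in Sect. 5.1, used here as assumptions.

```python
# A minimal sketch of the symmetric cross-entropy of Eq. (4) on region logits.
# log(0) in the reverse term is clamped to a constant, following Wang et al. (2019).
import torch
import torch.nn.functional as F


def sce_loss(logits, target, lam_ce=0.1, lam_rce=1.0, clamp=-4.0):
    # logits: (B, K, h, w); target: (B, h, w) integer labels of valid pixels.
    ce = F.cross_entropy(logits, target)

    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    log_true = torch.clamp(torch.log(one_hot), min=clamp)    # log(0) -> clamp
    rce = -(pred * log_true).sum(dim=1).mean()                # reverse cross-entropy

    return lam_ce * ce + lam_rce * rce
```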

Re-projection error loss (Rep) Besides the SCE loss, which handles label noise, we also adopt a re-projection loss as a semi-supervised term to further improve both the label and residual predictions. This term is especially effective in scenes with a large amount of textureless or repetitive patterns. However, the vanilla re-projection loss requires careful initialization to avoid unstable gradients from degenerate 3D predictions (e.g. points too far away or behind the camera), as well as extra geometric constraints and a long convergence time (Brachmann & Rother, 2021). Inspired by Li et al. (2018), we employ an angle-based re-projection loss which minimizes the angle \(\theta \) between two rays that share the camera center. This strategy forces predictions to lie in front of the camera, ensuring smoother gradients during training. Consequently, it eliminates the need for a time-consuming initialization step and mitigates the burden of related geometric constraints.

Given ground-truth camera pose P, the loss for pixel i of image x, whose 2D coordinates in the image are denoted by \(p_i\), is given by

$$\begin{aligned} \ell _{rep}(x,i) = ||\gamma _{i} P^{-1}\hat{y}(x)_i - fC^{-1}p_i||, \end{aligned}$$
(7)

where \(\gamma _i = ||fC^{-1}p_i||/||P^{-1}\hat{y}(x)_i||\), f is the focal length, and C is the intrinsic matrix. The angle-based re-projection loss is computed in the camera coordinate system between two points on a 3D sphere centered at the camera center and touching the image plane at the ground-truth pixel location, i.e. the radius of the sphere is \(||fC^{-1}p_i||\). The two points correspond to the locations where the ray from the camera center to the predicted 3D point and the ray to the ground-truth pixel location (both in the camera coordinate system) intersect the sphere, represented by the first and second terms in Eq. 7, respectively.
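A minimal sketch of this angle-based loss is given below; treating P as the ground-truth camera-to-world pose and C (denoted K in the code) as the intrinsic matrix are assumptions consistent with the notation above, not a reproduction of the authors' exact implementation.

```python
# A minimal sketch of the angle-based re-projection loss of Eq. (7): both
# terms are points on a sphere of radius ||f K^-1 p_i|| around the camera center.
import torch


def angle_reprojection_loss(pred_xyz, pixels, P, K, f):
    # pred_xyz: (N, 3) predicted scene coordinates; pixels: (N, 2) pixel locations;
    # P: (4, 4) ground-truth camera-to-world pose; K: (3, 3) intrinsics; f: focal length.
    N = pred_xyz.shape[0]
    ones = torch.ones(N, 1, dtype=pred_xyz.dtype)

    # Predicted points in camera coordinates: P^-1 * y_hat.
    pred_h = torch.cat([pred_xyz, ones], dim=1)             # (N, 4) homogeneous
    cam_xyz = (torch.inverse(P) @ pred_h.T).T[:, :3]        # (N, 3)

    # Ground-truth pixel rays scaled by the focal length: f * K^-1 * p.
    pix_h = torch.cat([pixels, ones], dim=1)                # (N, 3) homogeneous
    rays = f * (torch.inverse(K) @ pix_h.T).T               # (N, 3)

    # Rescale predictions onto the sphere of radius ||rays|| and compare.
    gamma = rays.norm(dim=1) / cam_xyz.norm(dim=1).clamp(min=1e-8)
    return (gamma.unsqueeze(1) * cam_xyz - rays).norm(dim=1).mean()
```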

Note that the re-projection loss is not added to the total loss during the initial epochs, which speeds up training convergence. Similar to the dense setting, the total loss for sparse supervision is a weighted sum of the regression loss, the symmetric classification loss, and the re-projection loss: \(\ell _{sparse} = \ell _{sce} + \lambda _2 \ell _r + \lambda _3 \ell _{rep}\).

5 Experiments

In this section, we discuss the experimental setup and employed datasets, present our results, and compare our approach to state-of-the-art localization methods.

Table 1 Indoor localization: individual scene setting (7-Scenes)
Table 2 Indoor localization: individual scene setting (12-Scenes)

5.1 Experimental Setup

Datasets We use three standard benchmarks for evaluation, namely 7-Scenes (Shotton et al., 2013), 12-Scenes (Valentin et al., 2016), and Cambridge Landmarks (Kendall et al., 2015). The 7-Scenes dataset covers a volume of \(\sim 6\,m^3\) per scene; 3D models and ground-truth poses are included in the dataset. 12-Scenes is another indoor RGB-D dataset containing 4 large scenes with a total of 12 rooms, whose volumes range from 14 to 79 \(m^3\). The union of these two datasets forms the 19-Scenes dataset. Cambridge Landmarks is a standard benchmark for evaluating scene coordinate methods in outdoor scenes; it is a small-scale outdoor dataset consisting of 6 individual scenes, with ground-truth poses provided by structure-from-motion.

Following prior work (Brachmann & Rother, 2019a), we conduct experiments per scene, i.e. the individual scenes setting, but also by training a single model on all scenes of a corresponding dataset, i.e. the combined scenes setting. The combined settings of the given indoor localization benchmarks are denoted by i7-Scenes, i12-Scenes, and i19-Scenes, respectively.

Competing methods In this work, we compare the proposed approach with the following methods: (1) pose regression methods that directly regress absolute or relative camera pose parameters: MapNet (Brahmbhatt et al., 2018), Geometric PoseNet (Kendall & Cipolla, 2017), AttTxf (Shavit et al., 2021), LSTM-Pose (Walch et al., 2017), AnchorNet (Saha et al., 2018) and LENS (Moreau et al., 2021); (2) local feature based pipelines based on SIFT such as Active Search (AS) (Sattler et al., 2016a) and HLoc (Sarlin et al., 2019) based on CNN descriptors; (3) DSAC\(^\star \)(3D) (Brachmann & Rother, 2021): the latest scene coordinate regression approach with 3D model; (4) VS-Net (Huang et al., 2021): scene-specific segmentation and voting; (5) PixLoc (Sarlin et al., 2021): scene-agnostic network; (6) SFT-CR (Guan et al., 2021): scene coordinate regression with global context-guidance. In addition, we also compare with (7) ESAC (Brachmann & Rother, 2019a) on the combined scenes. We also consider a baseline called Reg-only without the hierarchical classification layers.

Table 3 Indoor localization: combined scene setting

Evaluation metrics We report the median translation and orientation errors (cm, \(^\circ \)) as well as the accuracy, i.e. the fraction of test images localized within (\(5\,cm, 5^\circ \)), on indoor scenes. On the outdoor Cambridge Landmarks (Kendall et al., 2015), we report only the median pose error, as in previous methods (Brachmann & Rother, 2021; Brachmann et al., 2017; Li et al., 2020).

Training details We generate the region labels by hierarchical k-means. For 7-Scenes, 12-Scenes, and Cambridge Landmarks, we adopt 2-level ground-truth labels with a branching factor of 25 at all levels. For the combined scenes, i7-Scenes, i12-Scenes, and i19-Scenes, the first-level branching factor is set to \(7\times 25\), \(12\times 25\), and \(19\times 25\), respectively. For the individual scene setting, training is performed for 300K iterations with the Adam optimizer; for the combined scenes, the number of iterations is set to 900K. Throughout all experiments, we use a batch size of 1 with an initial learning rate of \(10^{-4}\).

The classification loss weight \(\lambda _1\) is set to 1 for all datasets, while the regression loss weight \(\lambda _2\) is 10 for single scenes and \(10^5\) for combined scenes. In the sparse supervision setting, \(\lambda _{ce}\) and \(\lambda _{rce}\) are set to 0.1 and 1, respectively, while \(\lambda _2\) follows the dense setting and \(\lambda _3\) is increased from 0 to 0.1 after the first 10 epochs. We initialize the network by training with \(\ell _r\) on pseudo-label coordinates and add \(\ell _{rep}\) after 10 epochs. When training with sparse supervision, we select a neighborhood size of \(z = 11\) to propagate labels, and use the cluster centers obtained from dense scene coordinates for a direct comparison.

Data augmentation is also effective in increasing the final accuracy. Thus, similar to HSCNet (Li et al., 2020), we randomly augment training images using translation, rotation, scaling, and shearing, uniformly sampled from [\(-20\)%, 20%], [\(-30^\circ \), \(30^\circ \)], [0.7, 1.5], and [\(-10^\circ \), \(10^\circ \)], respectively. In addition, images are augmented with additive brightness uniformly sampled from [\(-20\), 20].
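For illustration, these ranges could be realized with torchvision as sketched below; in practice the ground-truth label and coordinate maps must be warped with the same affine transform, which is omitted here for brevity, and the function name is hypothetical.

```python
# An illustrative sketch of the augmentation ranges listed above using torchvision.
import random
import torch
from torchvision.transforms import functional as TF


def augment(image):
    # image: (3, H, W) float tensor with values in [0, 255].
    angle = random.uniform(-30, 30)                         # rotation (deg)
    translate = [int(random.uniform(-0.2, 0.2) * image.shape[-1]),
                 int(random.uniform(-0.2, 0.2) * image.shape[-2])]
    scale = random.uniform(0.7, 1.5)
    shear = random.uniform(-10, 10)
    image = TF.affine(image, angle=angle, translate=translate,
                      scale=scale, shear=[shear])
    return image + random.uniform(-20, 20)                  # additive brightness
```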

Pose estimation We follow the same PnP-RANSAC pipeline and parameter settings as Brachmann and Rother (2018). The inlier threshold and the softness factor are set to \(\tau = 10\) and \(\beta = 0.5\), respectively. We randomly select 4 correspondences to form a minimal set for the PnP algorithm to generate a camera pose hypothesis, and 256 initial hypotheses are sampled. Similar to Brachmann and Rother (2018, 2021), pose refinement is performed until convergence, for a maximum of 100 iterations.

Architecture details The detailed architecture of HSCNet++ is shown in Fig. 3; we also visualize the block details of the FiLM conditioning network and the transformer modules. By removing the transformer layers, we obtain the architecture of HSCNet. Additionally, the number of channels in the last branch \(g_{3D}\) is 4096 for HSCNet, while it is 2048 for HSCNet++, which reduces the memory cost (cf. Sect. 5.6). For experiments on the combined scenes, we added two more layers to the first conditioning generator \(g_s\), marked in dotted red in Fig. 3. We also roughly doubled the channel counts highlighted in red, cyan and violet for i7-Scenes, i12-Scenes and i19-Scenes, respectively. For individual scenes, we add 2 multi-head attention (MHA) layers to both the classification and regression conditioning blocks, while in the combined setting, the number of MHA layers is set to 5.

Table 4 Outdoor localization: individual scene setting (Cambridge)

5.2 Results for HSCNet and HSCNet++

Individual scenes setting. We present results on 7-Scenes and 12-Scenes in Tables 1 and 2, respectively. All models are trained and evaluated individually on each scene of the corresponding dataset. Results show that HSCNet is still competitive with methods published later. With the addition of transformers, HSCNet++ further boosts the average performance by \(\sim \)4% on 7-Scenes and obtains the best accuracy on 7-Scenes among the competitors.

Combined scenes setting To test the scalability of scene-coordinate regression methods, we go beyond small-scale environments such as individual scenes in 7-Scenes and 12-Scenes and use the combined scenes, i.e. i7-Scenes, i12-Scenes, and i19-Scenes by combining the former datasets.

Results on the combined scenes setting are presented in Table 3, including a comparison with the regression-only baseline and ESAC. Results show that our method scales well with an increasing number of scenes compared to the Reg-only baseline. Note that ESAC requires training and storing multiple networks specializing in local parts of the environment, whereas our approach requires only a single model. Our approach outperforms ESAC on i7-Scenes and i12-Scenes, while performing comparably on i19-Scenes (87.9% vs. 88.1%). ESAC and our approach could be combined for very large-scale scenes, but we do not explore this option in this work. HSCNet++ advances the state of the art on all datasets, demonstrating the utility of transformers for this task.

Cambridge Landmarks Table 4 reports the results of three types of visual localization methods on Cambridge Landmarks. AS (Sattler et al., 2016a) and HLoc (Sarlin et al., 2019) estimate the camera poses with sparse SfM ground truth. DSAC++, DSAC* and our approaches train a scene-coordinate regression model with MVS-densified depth maps, while VS-Net leverages a hybrid of the two. Both HSCNet and HSCNet++ perform better than the other scene coordinate methods DSAC++ and DSAC*, and their performance is comparable to more recent approaches. However, we observe that the models trained with MVS-densified pseudo ground truth perform slightly worse than the approaches that use the sparse SfM 3D map, and HSCNet++ performs even worse after adding the transformer modules. These results motivated us to extend HSCNet++ to train with sparse supervision, under the hypothesis that MVS densification introduces additional noise to the dense supervision. The performance of HSCNet++(S) trained with sparse supervision on Cambridge Landmarks in Sect. 5.5 verifies this hypothesis.

5.3 Ablations: HSCNet

Data augmentation Using geometric and color data augmentation provides robustness to lighting and viewpoint changes (DeTone et al., 2018; Melekhov et al., 2021). We investigate the impact of data augmentation and summarize the obtained results in Table 5a. Applying data augmentation leads to better localization accuracy. Note that even without data augmentation, the proposed approach still provides results comparable to state-of-the-art methods (cf. ESAC (Brachmann & Rother, 2019a) in Table 3 vs. row 3 of Table 5a).

Conditioning mechanism The two key components of HSCNet are the coarse-to-fine joint classification-regression module and its combination with the conditioning mechanism. Their impact is evaluated and results are shown in Table 5a. We train a variant of our network without the conditioning mechanism, i.e. we remove all the conditioning generators and layers. The network still estimates scene coordinates in a coarse-to-fine manner by using the predicted location labels, but there is no coarse location information that is fed to influence the network activations at the finer levels. Results indicate the importance of the conditioning mechanism for accurate scene coordinate prediction. Compared to single scene setting in Tables 1 and 3, the performance of regression only baseline drops significantly in the combined scene setting as shown in Table 5a.

Table 5 Ablation for HSCNet
Table 6 Ablations for HSCNet++. We analyze the influence of different design choices of the proposed approach on i7-Scenes

Hierarchy and partition granularity The robustness of HSCNet to the label hierarchy hyperparameters, varying depth and width, is reported in Table 5. The results show that the performance of our approach is robust w.r.t. the choice of these hyperparameters, with a significant drop observed only for the smallest 2-level label hierarchy. Increasing the number of classification layers beyond 2 is not always beneficial and only brings marginal improvement on 7-Scenes, while increasing the computational cost. We observe the best trade-off for a partition of \(25 \times 25\) for 7-Scenes and \(175 \times 25\) for i7-Scenes (\(175=7\times 25\) due to the 7 combined scenes).

Fig. 4

Scene coordinate visualization on i7-Scenes. We visualize the scene coordinate predictions for three test images with HSCNet, HSCNet++, and HSCNet++(S) on i7-Scenes. The XYZ coordinates are mapped to a heatmap, and the ground-truth scene coordinates are computed from the depth maps. For each image, the left column shows the correctly predicted labels and the right column the predicted scene coordinates

Fig. 5

Median Error for HSCNet++(S). We show the frames with median pose estimation error in each scene and visualize the accuracy by overlaying the query image (right) with a rendered image (left, grayscale) using the estimated pose and the ground truth 3D model

5.4 Ablations: HSCNet++

Impact of internal transformer encoder layers. In this ablation, we remove the transformer encoders \(t_r\) and \(t_s\), keeping only \(t_{3D}\). This variant is denoted by HSCNet++\(^\dagger \), and Table 6a shows a small to noticeable drop in all cases.

To factor out the impact of multi-headed attention (MHA) layers, we report results in Table 6a, which show that increasing the number of MHA layers in HSCNet++\(^\dagger \) does not lead to performance improvement. It is worth mentioning that HSCNet++\(^\dagger \) with 8 MHA layers has 2 million more parameters than HSCNet++. Our intuition is that the gain of HSCNet++ comes from improved predictions at the coarse levels of the network. To test this hypothesis, we compute the accuracy of the sub-region predictions: for each valid pixel in a query image, this metric evaluates whether the pixel is correctly classified. Results in Table 6c show that adding transformers to the classification branches helps to improve the label classification accuracy. However, the sub-region prediction accuracy does not always correlate with the localization performance. This can be attributed to the RANSAC-based filtering of the final 3D scene coordinates for camera pose estimation: incorrect 3D scene predictions due to erroneous sub-region predictions can be detected as outliers by RANSAC.

Impact of positional encoding. We compare the proposed way of providing region (position) information to the transformer blocks with the classical positional encoding used in transformers. As label encoding is an inherent part of HSCNet, for a direct comparison we additionally add the classical positional encoding right before the transformer block and perform experiments on i7-Scenes. Results presented in Table 6b show that with the additional positional encoding the results noticeably drop.

5.5 Results for HSCNet++(S)

Table 7 HSCNet++(S) results
Table 8 Ablations for HSCNet++(S)
Fig. 6

Impact of neighborhood size z. The percentage of accurate labels and of valid pixels as the neighborhood window size z increases

We now present results for HSCNet++(S) with sparse supervision and study the pseudo-labeling and loss functions in detail. For indoor scenes, we synthetically sparsify dense coordinates using sparse SIFT-based SfM reconstruction. That is, we select the subset of dense 3D coordinates whose 2D re-projections (pixel locations) are also registered in the SfM reconstruction. For the outdoor Cambridge dataset, we directly obtain the keypoints of training images from the provided SfM models.

Table 9 Impact of z on pose estimation

The localization performance on 7-Scenes, i7-Scenes, and the Cambridge dataset is provided in Fig. 5 and Table 7. Results show that even with sparse coordinate supervision, HSCNet++(S) achieves competitive results on 7-Scenes with respect to its dense counterpart, even outperforming HSCNet. On the more challenging combined scene setup of i7-Scenes, HSCNet++(S) falls behind by 10%, indicating the need for further research in this direction. However, on the outdoor Cambridge Landmarks dataset, where in most cases only sparse coordinate data is available, HSCNet++(S) outperforms HSCNet and HSCNet++, which are trained on MVS-densified data (Brachmann & Rother, 2018; Schönberger et al., 2016; Li et al., 2020), by a large margin. This demonstrates the effectiveness of our label propagation and supports our hypothesis that noisy dense ground truth from MVS harms the training process. The largest improvements are observed on Kings College, Great Court and Old Hospital, with median pose errors (cm/\(^\circ \)) of 15/0.24, 18/0.11 and 15/0.30, respectively (cf. Table 4). In terms of average median pose error, HSCNet++(S) outperforms PixLoc (15/0.25), VS-Net (13.6/0.24) and DSAC* (20.6/0.34).

Table 10 Comparison of the model capacity and runtime

Component ablations We perform ablations on 7-Scenes to examine the components of the proposed HSCNet++(S). We first train the model without the proposed label propagation, i.e. only with the sparse keypoint pixels, as the baseline. Then, for HSCNet++(S), we present three variants by removing each component, namely the transformers, the symmetric cross-entropy and the re-projection loss, as shown in Table 8. The baseline achieves only 62.0% average accuracy, which is significantly worse than our result (85.2%). The variants without the transformer layers (w/o Txf), SCE and Rep show worse performance than HSCNet++(S) on average. These results demonstrate that the synergy of the individual components leads to the superior results.

Impact of LP neighborhood size. We analyze the impact of the LP neighborhood window size z. As an ablation, we vary z from \(0 \rightarrow 19\) on RedKitchen, and the results are reported in Fig. 6 and Table 9. Figure 6 shows that increasing z also increases pseudo-label noise, reflected by a decrease in the percentage of accurate labels; e.g. for \(z = 11\) the fraction of noisy labels is 15%. Results in Table 9 show that there is a trade-off between increasing z and camera localization accuracy. This effect is more pronounced in the outdoor scene Great Court from the Cambridge dataset, where increasing z from \(0 \rightarrow 11\) reduces the median pose error (t/r) from \(32/0.28 \rightarrow 18/0.11\), but increasing z further from \(11 \rightarrow 19\) increases it from \(18/0.11 \rightarrow 35/0.2\). Limiting the spatial proximity of pseudo-labels to the initial sparse labels thus seems a suitable choice.

5.6 Model Capacity and Efficiency

Model capacity As mentioned in Sect. 4.3, we prune some heavy convolution layers compared to HSCNet. To demonstrate the efficiency of this choice, Table 10 reports the model sizes of HSCNet and HSCNet++ on 7-Scenes and i7-Scenes. Our method has a 43% smaller memory footprint than HSCNet for individual scene training and a 30% reduction on the combined scenes.

Runtime For a fair comparison of running time, we run all experiments on an NVIDIA GeForce RTX 2080 Ti GPU and an AMD Ryzen Threadripper 2950X CPU. Training for 300K iterations on an individual scene takes \(\sim \)7.4 h for HSCNet++ and \(\sim \)10.4 h for HSCNet with the same setting. We show the approximate training time for one iteration in Table 10. HSCNet++ clearly has a smaller memory footprint and faster training time while offering higher accuracy. We also note that the training time grows as the number of multi-head attention layers increases.

We have not observed a clear difference between the two methods in inference time. The running time to localize one image varies from around 85 ms to 130 ms, mainly depending on the accuracy of the predicted 2D-3D correspondences fed into the PnP-RANSAC loop.

6 Conclusion

We have proposed a novel hierarchical coarse-to-fine approach for scene coordinate prediction. The network benefits from FiLM-like conditioning on coarse region predictions for better scene coordinate prediction. We experimentally demonstrate that both the hierarchy and the prediction conditioning are required for the improvement. The method is extended to handle sparse labels using the proposed pseudo-labeling approach, and the adaptation of symmetric cross-entropy and re-projection losses provides robustness to pseudo-label noise. We also show that the synergy of all components proposed in this work is needed for the best performance.

Fig. 7

An overview of the proposed HSCNet++(A) on Aachen. The figure shows the network architecture of the modified HSCNet++ for large-scale Aachen dataset. Here, \(y_0\), \(y_1\), \(y_2\), \(y_3\) are coarse-to-fine label predictions

Results show that the proposed hierarchical scene coordinate network is more accurate than previous regression-only approaches for single-image RGB localization. The proposed method is also more scalable, as shown by the results on three indoor datasets. In addition, it is extended to handle sparse labels using less costly means than existing methods, obtaining better results on outdoor scenes.