1 Introduction

Estimating the six degrees-of-freedom (6-DoF) camera pose from a given RGB image is a key component in many computer vision systems such as augmented reality, autonomous driving, and robotics. Classical methods (Sattler et al., 2011, 2012, 2016a; Taira et al., 2018; Sarlin et al., 2019) establish 2D-2D(-3D) correspondences between query and database local descriptors, followed by PnP-based camera pose estimation. Although powerful, these methods are memory- and compute-inefficient, as they require storing an immense number of local image descriptors and performing hierarchical descriptor matching within a RANSAC loop to infer the camera pose.

Fig. 1

HSCNet architecture. The ground-truth scene 3D coordinates are hierarchically quantized into regions and sub-regions. Different branches of the network sequentially predict discrete regions and sub-regions, and continuous 3D coordinates, with the processing of each branch conditioned on the result of the previous one. Given an input image, HSCNet predicts 3D coordinates for 2D image pixels, which then form the input to PnP-RANSAC for 6-DoF pose estimation

On the other hand, end-to-end pose regression methods that directly regress the camera pose parameters are much faster and more scalable (Kendall et al., 2015; Balntas et al., 2018; Chen et al., 2021; Shavit & Keller, 2022). However, such methods are significantly less accurate than the ones based on local descriptors. A better trade-off between accuracy and computational efficiency is offered by structured localization approaches (Brachmann et al., 2017; Brachmann & Rother, 2018, 2021; Shotton et al., 2013; Li et al., 2020; Wang et al., 2021). Structured methods are trained to learn an implicit representation of the 3D environment by directly regressing the 3D scene coordinate corresponding to each 2D pixel location in a given input image. This directly provides 2D-3D correspondences and avoids storing and explicitly matching database local descriptors with the query. For small-scale scenes, scene-coordinate regression methods work on par with (Brachmann et al., 2021) or outperform (Brachmann & Rother, 2018, 2021) approaches based on local image descriptors. In addition, the storage and computational costs of structured methods are lower than those of their classical counterparts.

Existing scene-coordinate regression approaches (Brachmann et al., 2017; Brachmann & Rother, 2018, 2021) are designed to predict scene coordinates from a small local image patch, which provides robustness to viewpoint changes. However, such methods are limited in applicability to larger scenes, where ambiguity from visually similar local image patches cannot be resolved with a limited receptive field. Using larger receptive fields, up to the full image, to regress the coordinates can mitigate these ambiguities by encoding more context. This, however, has been shown to be prone to overfitting to the larger input patterns when training data is limited, even if data augmentation alleviates this problem to some extent (Li et al., 2018; Brachmann & Rother, 2021).

Increasing context by enlarging the receptive field while maintaining the distinctiveness of local descriptors and avoiding overfitting is a challenging problem. We address this using a special network architecture, called HSCNet (Li et al., 2020), which hierarchically encodes scene context using a series of classification layers before making the final coordinate prediction. The overall pipeline is illustrated in Fig. 1. Specifically, the network predicts scene coordinates progressively in a coarse-to-fine manner, where predictions correspond to a region of the scene at the coarse levels and to coordinate residuals at the finest level. The predictions at each level are conditioned on both the descriptors and the predictions from the preceding level, which, as we experimentally demonstrate in this work, is the key component in large scenes. This conditioning leverages FiLM (Perez et al., 2018) layers that allow the receptive field to be increased gradually. HSCNet uses CNNs to encode the descriptors and predictions. In this work, we extend this idea and propose a transformer-based (Vaswani et al., 2017) conditioning mechanism, named HSCNet++, which captures global context into local representations more efficiently through attention and does not require heavy convolutional layers to enlarge the receptive field. The architecture improves coordinate prediction at all levels, both coarse and fine. We integrate dynamic position information in the form of predicted coarse positional encoding, without the need to learn or explicitly construct position embeddings, and show promising results on several camera relocalization benchmarks.

We further extend HSCNet++ by removing the dependency on dense ground-truth scene coordinates, which limits the applicability of HSCNet in outdoor scenes. Similar to Brachmann and Rother (2018), HSCNet addressed the issue of sparse data on the Cambridge dataset (Kendall et al., 2015) by using MVS-based densification (Schönberger et al., 2016). However, such densification introduces additional noise and is costly to obtain, while directly training HSCNet with sparse supervision leads to a significant performance drop. In HSCNet++, we propose a simple yet effective pseudo-labelling method, where ground-truth labels at each pixel location are propagated to a fixed spatial neighbourhood, based on the assumption that nearby pixels share similar statistics. To provide robustness to pseudo-label noise, objective functions based on the symmetric cross-entropy and a re-projection loss are proposed. While the symmetric cross-entropy provides robustness to the classification layers of HSCNet, the re-projection loss rectifies the noise in the pseudo-labelled 3D scene coordinates.

This work is a summary and extension of HSCNet. We validate our approach on three datasets used in previous works: 7-Scenes (Shotton et al., 2013), 12-Scenes (Valentin et al., 2016), and Cambridge Landmarks (Kendall et al., 2015). Our approach demonstrates consistently better performance and achieves state-of-the-art results for single-image camera relocalization. In addition, by combining the 7-Scenes and 12-Scenes datasets into single large scenes, we show that our approach scales more robustly to larger environments. In summary, our contributions are as follows:

  1.

    Compared to HSCNet, we utilize an improved transformer-based conditioning mechanism that efficiently and effectively encodes global spatial information into the scene coordinate prediction pipeline, resulting in a significant performance improvement from 84.8% to 88.7% on indoor localization while requiring only 57% of the memory footprint;

  2.

    We extend HSCNet to optionally use only sparse ground truth during training by introducing pseudo ground-truth labels and an angle-based re-projection loss. When trained with sparse supervision, the proposed approach achieves better performance on the Cambridge outdoor camera relocalization dataset than when trained on MVS-densified data;

  3.

    We show that the classical pixel-based positional encoding in our conditioning mechanism leads to a significant performance drop, especially in scenes exhibiting substantial repetitive patterns. Our spatial positional encoding, inspired by the FiLM layer, eliminates this problem and achieves SoTA performance on several image-based localization benchmarks.

2 Related Work

We review existing visual localization methods according to the category they belong to.

Classical visual localization methods assume that a scene is represented by a 3D model, which is a result of processing a set of database images. Each 3D point of the model is associated with one or several database local descriptors. Given a query image, a sparse set of keypoints and their local descriptors are obtained using traditional (Calonder et al., 2010; Lowe, 2004; Rublee et al., 2011; Bay et al., 2006) or learned CNN-based (DeTone et al., 2018; Revaud et al., 2019; Dusmanu et al., 2019; Melekhov et al., 2021, 2020; Luo et al., 2019; Wang et al., 2020; Tian et al., 2017; Balntas et al., 2016; Zagoruyko & Komodakis, 2015; Han et al., 2015; Melekhov et al., 2017; Simo-Serra et al., 2015; Mishchuk et al., 2017) approaches. The query local descriptors are then matched with local descriptors extracted from database images to establish tentative 2D-3D matches. These tentative matches are then geometrically verified using RANSAC (Fischler & Bolles, 1981) and the camera pose is estimated via PnP. Although these methods produce a very accurate pose estimate, the computational cost of sparse keypoint matching becomes a limitation, especially for large-scale environments. The large computational cost is addressed by image retrieval-based methods (Arandjelović et al., 2016; Radenović et al., 2016) restricting matching query descriptors to local descriptors extracted from top-ranked database images only. Moreover, despite the recent advancements of learned keypoint detectors and descriptors (Wang et al., 2020; Dusmanu et al., 2019; Melekhov et al., 2020, 2021; Sun et al., 2021; Zhou et al., 2021; Revaud et al., 2019; Tyszkiewicz et al., 2020), extracting discriminative local descriptors which are robust to different viewpoint and illumination changes is still an open problem.

Absolute camera pose regression (APR) methods aim to alleviate the limitations of structure-based methods by using a neural network that directly regresses the camera pose of a query image given as input (Kendall et al., 2015; Brahmbhatt et al., 2018; Kendall & Cipolla, 2016, 2017; Melekhov et al., 2017; Walch et al., 2017; Chen et al., 2021, 2022). The network is trained on database images with ground-truth poses by optimizing a weighted combination of orientation and translation L2 losses (Kendall et al., 2015; Melekhov et al., 2017), leveraging uncertainty (Kendall et al., 2018), utilizing the temporal consistency of sequential images (Walch et al., 2017; Radwan et al., 2018; Valada et al., 2018; Xue et al., 2019), or using GNNs (Xue et al., 2020) and Transformers (Shavit et al., 2021). The APR methods are scalable, fast, and memory efficient since they do not require storing a 3D model. However, their accuracy is an order of magnitude lower than that of structure-based localization approaches and comparable with image retrieval methods (Sattler et al., 2019). Moreover, the APR approaches require a different network to be trained and evaluated per scene when the scenes are registered to different coordinate frames.

Relative camera pose regression (RPR) methods, in contrast to APR, train a network to predict relative pose between the query image and each of the top-ranked database images (Ding et al., 2019; Laskar et al., 2017; Balntas et al., 2018), obtained by image retrieval (Arandjelović et al., 2016; Radenović et al., 2016). The camera location is then obtained via triangulation from two relative translation estimations verified by RANSAC. This leads to better generalization performance without using scene-specific training. However, the RPR methods suffer from low localization accuracy similarly to APR.

Scene coordinate regression (SCR) methods learn the first stage of the pipeline in the structure-based approaches. Namely, either a random forest (Brachmann et al., 2016; Cavallari et al., 2020, 2017; Guzmán-Rivera et al., 2014; Massiceti et al., 2017; Meng et al., 2017, 2018; Shotton et al., 2013; Valentin et al., 2015) or a neural network (Brachmann et al., 2017; Brachmann & Rother, 2018, 2019a, c, 2021; Budvytis et al., 2019; Bui et al., 2018; Cavallari et al., 2019; Li et al., 2018; Massiceti et al., 2017) is trained to directly predict 3D scene coordinates for the pixels and thus the 2D-3D correspondences are established. These methods do not explicitly rely on feature detection, description, and matching, and are able to provide correspondences densely. They are more accurate than traditional feature-based methods at small and medium scales, but usually do not scale well to larger scenes (Brachmann & Rother, 2018, 2019a). In order to generalize well to novel viewpoints, these methods typically rely on only local image patches to produce the scene coordinate predictions. However, this may introduce ambiguities due to similar local appearances, especially when the scale of the scene is large. To resolve local appearance ambiguities, we introduce element-wise conditioning layers to modulate the intermediate feature maps of the network using coarse discrete location information. We show this leads to better localization performance, and we can robustly scale to larger environments.

Joint classification-regression frameworks have been proven effective in solving various vision tasks. For example, Rogez et al. (2017, 2019) proposed a classification-regression approach for human pose estimation from single images. In Brachmann et al. (2016), a joint classification-regression forest is trained to predict scene identifiers and scene coordinates. In Weinzaepfel et al. (2019), a CNN is used to detect and segment a predefined set of planar Objects-of-Interest (OOIs), and then, to regress dense matches to their reference images. In Budvytis et al. (2019), scene coordinate regression is formulated as two separate tasks of object instance recognition and local coordinate regression. In Brachmann and Rother (2019a), multiple scene coordinate regression networks are trained as a mixture of experts along with a gating network which assesses the relevance of each expert for a given input, and the final pose estimate is obtained using a novel RANSAC framework, i.e. Expert Sample Consensus (ESAC). In contrast to existing approaches, in our work, we use spatially dense discrete location labels defined for all pixels, and propose FiLM-like (Perez et al., 2018) conditioning layers to propagate information in the hierarchy. We show that our novel framework allows us to achieve high localization accuracy with one single compact model.

Transformers have already shown a positive impact on the problem of visual localization. Shavit et al. (2021) show that multi-headed transformer architectures can be used to improve end-to-end absolute camera pose localization in multiple scenes with a single trained model. Similarly, SuperGlue, LoFTR and COTR (Sarlin et al., 2020; Sun et al., 2021; Jiang et al., 2021) demonstrate the usefulness of transformer architectures in learning local descriptor models. Inspired by these successes, this paper extends the transformer architecture to structured localization.

3 Problem Formulation and Notation

The goal of camera pose estimation is to predict the 6-DoF pose \(p(x)\in \mathbb {R}^6\) for an RGB image x. We adopt a standard two-step approach. As a first step, 3D coordinates are predicted for each pixel, or some of the pixels, of an image. Those are the coordinates from a known 3D scene. Such predictions result in a set of 2D-3D correspondences. As a second and final step, these correspondences are fed into the PnP algorithm that estimates the camera pose. In this work, we focus on the 3D coordinate prediction task.

We rely on a function \(f: [0,1]^{W\times H \times 3} \rightarrow \mathbb {R}^{w\times h \times 3}\), with \(w = W/8\) and \(h = H/8\), that provides such coordinate predictions given an input image x of resolution \(W\times H\) pixels; the predicted coordinates for image x are given by \(\hat{y}(x) = f(x)\).

The known 3D environment is represented by a set of training images, with known ground-truth labels per pixel in the form of 3D coordinates. The training set comprises pairs of the form (x, y(x)) for image x and ground-truth 3D coordinates y(x). In case ground truth is available only sparsely, i.e. only for a small part of the image pixels, a corresponding binary mask \(m(x)\in \{0,1\}^{w\times h}\) denotes which pixels are valid. The value of the ground truth or prediction at a particular pixel is denoted by subscript i, e.g. \(y(x)_i\) for the ground-truth coordinate of pixel i.

4 HSCNet++: Hierarchical Scene Coordinate Prediction with Transformers

4.1 Overview

A baseline conventional approach for this task is to use a fully convolutional network (FCN) that maps input images to 3D coordinate predictions and is trained with a regression loss. The proposed architecture extends this scheme by constructing a hierarchy of labels, from coarse-level to fine-level, and by adding extra layers to predict those labels. Hierarchical discrete labels are defined by partitioning the ground-truth 3D points of the scene with hierarchical k-means. The number of levels in the hierarchy is fixed to 2 in this work. In this way, in addition to the ground-truth 3D scene coordinates, each pixel in a training image is also associated with two discrete labels, namely region and sub-region labels, obtained at different levels of the clustering hierarchy. Region and sub-region labels are denoted by one-hot encodings \(y_r(x)\in \{0, 1\}^{w\times h\times k_1}\) and \(y_s(x)\in \{0,1\}^{w\times h\times k_2}\), respectively. The fine-level information is given by the residual between the ground-truth 3D point and the corresponding sub-region center, which we denote by \(y_{3D}(x)\in \mathbb {R}^{w\times h\times 3}\). Ground-truth 3D pixel coordinates y(x) are replaced by \(y_r(x)\), \(y_s(x)\), and \(y_{3D}(x)\). Sub-region centers and residuals, when combined by addition, compose the pixel 3D coordinates, i.e. \(y(x) = c(y_r(x) \times k_2 + y_s(x))+y_{3D}(x)\), where c is a function providing the sub-region center.
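To make the label construction concrete, the following is a minimal sketch, assuming scikit-learn and NumPy, of the 2-level hierarchical k-means partitioning described above; the function and variable names are illustrative and not taken from the authors' code.

```python
# A minimal sketch of the 2-level hierarchical k-means label generation
# described above; names are illustrative, not from the authors' code.
import numpy as np
from sklearn.cluster import KMeans


def build_hierarchical_labels(points_3d, k1=25, k2=25, seed=0):
    """Partition ground-truth scene points into k1 regions and k1*k2 sub-regions.

    Returns per-point region labels, sub-region labels, residuals to the
    sub-region centers, and the table of sub-region centers (the function c
    in the text).
    """
    # Level 1: coarse regions.
    km1 = KMeans(n_clusters=k1, random_state=seed, n_init=10).fit(points_3d)
    region = km1.labels_                                  # shape (N,)

    sub_region = np.zeros_like(region)
    centers = np.zeros((k1 * k2, 3))                      # sub-region centers c(.)
    for r in range(k1):
        mask = region == r
        # Level 2: sub-regions within each region.
        km2 = KMeans(n_clusters=k2, random_state=seed, n_init=10).fit(points_3d[mask])
        sub_region[mask] = km2.labels_
        centers[r * k2: (r + 1) * k2] = km2.cluster_centers_

    # Fine level: residuals w.r.t. the assigned sub-region center, so that
    # point = centers[region * k2 + sub_region] + residual.
    residual = points_3d - centers[region * k2 + sub_region]
    return region, sub_region, residual, centers
```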

The proposed architecture includes two classification branches for regions and sub-regions, which provide the label predictions in the form of the k-dimensional probability distributions, and a regression branch for the residual prediction. Regions, sub-region and residual predictions are denoted by \(\hat{y}_r(x)\), \(\hat{y}_s(x)\), and \(\hat{y}_{3D}(x)\), respectively. A key ingredient is to propagate coarse region information to inform the predictions at finer levels, which is achieved by conditioning layers before the classification/regression layers.

Fig. 2

An overview of the proposed HSCNet++. The figure shows the network architecture of the proposed HSCNet++. The depicted losses correspond to the case of learning with dense ground truth. Note that the switch is activated at inference time, when the predicted labels are encoded instead of the ground-truth labels

4.2 Preliminaries

We describe FiLM layers and transformer blocks, which we use in the proposed architecture.

The FiLM (Perez et al., 2018) conditioning layer represents a block whose processing is conditioned on an auxiliary input. Conditioning relies on parameter generators \(\gamma \) and \(\beta \) that produce scaling and shifting parameters \(\gamma (l)\) and \(\beta (l)\) from the auxiliary input \(l \in \mathbb {R}^{w\times h \times d}\), which is the (sub-)region label encoding. The conditioning is computed as

$$\begin{aligned} \phi (F,l) = \gamma (l) \odot F + \beta (l), \end{aligned}$$
(1)

where \(\odot \) is the Hadamard product and \(F \in \mathbb {R}^{w\times h \times d}\) is the main input. The parameters of the FiLM layer are thus conditioned on the auxiliary input. FiLM-based processing is a way to jointly encode the main and the auxiliary inputs. In the following, it is used to encode the predicted (sub-)region information together with the image features.
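As an illustration, a minimal PyTorch sketch of the FiLM conditioning in Eq. (1) could look as follows; the use of \(1\times 1\) convolutions as the generators \(\gamma \) and \(\beta \) is an assumption made for this sketch, not a statement about the exact implementation.

```python
# A minimal PyTorch sketch of the FiLM conditioning in Eq. (1), assuming the
# auxiliary (sub-)region encoding l and the feature map F share the spatial
# size h x w; the generator layers are illustrative.
import torch.nn as nn


class FiLM(nn.Module):
    def __init__(self, label_dim, feat_dim):
        super().__init__()
        # gamma(l) and beta(l): per-pixel scale and shift generated from l.
        self.gamma = nn.Conv2d(label_dim, feat_dim, kernel_size=1)
        self.beta = nn.Conv2d(label_dim, feat_dim, kernel_size=1)

    def forward(self, F, l):
        # F: (B, feat_dim, h, w) main input; l: (B, label_dim, h, w) auxiliary input.
        return self.gamma(l) * F + self.beta(l)   # Hadamard product + shift
```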

Fig. 3

HSCNet++ detailed architecture. The figure shows the detailed network architecture of the main pipeline and the FiLM conditioning network. For experiments on the combined scenes we added two more layers in the first conditioning generator, \(g_s\) that are marked in (dotted) red. We also roughly doubled the channel counts that are highlighted in red, cyan and violet for i7-Scenes, i12-Scenes and i19-Scenes, respectively (Color figure online)

Transformer We view a 3D activation tensor of size \(w\times h\times d\) as a set of \(w\times h\) vectors/tokens and provide them as input to transformer blocks. The vanilla transformer has computational complexity quadratic in the cardinality \(n=w\times h\) of the input set, which is computationally unaffordable in our case. Inspired by prior work (Sun et al., 2021), we apply the linear transformer (Katharopoulos et al., 2020), which reduces the complexity from \(\mathcal {O}(n^2)\) to \(\mathcal {O}(n)\) by using the associativity property of matrix products and replacing the exponential similarity kernel with a linear dot-product kernel.

Consequently, the transformer modules that are part of our architecture do not have a significant impact on run time.
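For concreteness, a minimal single-head sketch of such linear attention, following the kernel formulation of Katharopoulos et al. (2020), is given below; the \(\mathrm{elu}(\cdot)+1\) feature map follows that work, while the tensor shapes and names are illustrative.

```python
# A minimal single-head sketch of linear attention (Katharopoulos et al., 2020):
# the softmax kernel is replaced by the feature map elu(.)+1, and matrix-product
# associativity gives O(n) complexity in the number of tokens n = w*h.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, n, d) token sets flattened from a (w, h, d) feature map.
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', k, v)                         # summarize keys/values once
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)   # per-query normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)                # (B, n, e)
```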

4.3 HSCNet++ Architecture

This section presents the model architecture for HSCNet++ and discusses the difference compared to the original HSCNet architecture.

Overview of the model architecture The overall architecture of HSCNet++ is summarized in Fig. 2. We first present the model as it operates during inference and then clarify the differences between training and inference. An FCN backbone is used for dense feature encoding and is denoted by \(\mathcal {F}(x) \in \mathbb {R}^{w\times h \times d}\). This is a mapping of the input image to a dense feature tensor which represents the appearance of the input image.

Prediction of region labels is performed first. A module \(g_r: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times d}\) is used that consists of convolutional layers and a transformer block. Its input is feature map \(\mathcal {F}(x)\) and the output is given by \(\textbf{x}_r = g_r(\mathcal {F}(x))\). Feature map processing is performed within the local context of the receptive field with convolutions and within a global context with the transformer. The region predictor \(h_r: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times k_1}\) comprises a \(1\times 1\) convolutional layer and is used to obtain the region prediction denoted by \(\hat{y}_r(x) = h_r(\textbf{x}_r)\).

Then, sub-region prediction is performed. A module \(g_s: \mathbb {R}^{w\times h \times d} \times \mathbb {R}^{w\times h \times k_2} \rightarrow \mathbb {R}^{w\times h \times d}\) is used, which consists of convolutional layers and transformer blocks, but also FiLM layers, hence the two inputs. The main input is the feature map \(\mathcal {F}(x)\), while the auxiliary input is the region prediction \(\hat{y}_r(x)\) from the earlier stage. In practice, \(\hat{y}_r(x)\) is passed through a series of convolutional layers before being input to the FiLM layer, as shown in Fig. 3 (c, middle block). Conditioning on region predictions is a way to jointly encode appearance and geometry, which comes in the form of the region prediction, and is therefore used to improve the sub-region predictions. Then, \(\textbf{x}_s = g_s(\mathcal {F}(x), \hat{y}_r(x))\) is fed into the sub-region predictor \(h_s: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times k_2}\), comprising a \(1\times 1\) convolution layer, whose output is denoted by \(\hat{y}_s(x) = h_s(\textbf{x}_s)\) and constitutes the sub-region prediction.

Now, residual prediction is performed. Similar to the earlier stage, feature map \(\mathcal {F}(x)\) is processed by conditioning on the concatenation of the region and sub-region predictions, i.e. \(\hat{y}_r(x)\) and \(\hat{y}_s(x)\). This is denoted by module \(g_{3D}: \mathbb {R}^{w\times h \times d} \times \mathbb {R}^{w\times h \times (k_1 + k_2)} \rightarrow \mathbb {R}^{w\times h \times d}\) and consists of convolutional and FiLM layers and transformer blocks. Similarly, as before, the concatenated region and sub-region predictions are passed through a series of convolutional layers before being input to the FiLM layer as an auxiliary input (cf. Fig. 3 (c, right block)). Then, \(\textbf{x}_{3D} = g_{3D}(\mathcal {F}(x), [\hat{y}_r(x), \hat{y}_s(x)])\) is fed into the residual predictor to obtain \(\hat{y}_{3D}(x) = h_{3D}(\textbf{x}_{3D})\), where \(h_{3D}: \mathbb {R}^{w\times h \times d} \rightarrow \mathbb {R}^{w\times h \times 3}\) consists of a \(1\times 1\) convolution. The detailed architecture for HSCNet++ and the different modules is shown in Fig. 3.
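The three-stage flow can be summarized by the following illustrative sketch; the modules g_r, g_s, g_3d and the predictors h_r, h_s, h_3d are assumed to be defined elsewhere, and only the chaining of predictions and conditioning inputs (here in inference mode, using one-hot argmax labels) is shown.

```python
# An illustrative sketch of the three-stage inference flow described above;
# the conditioning modules and predictors are assumed to exist as nn.Modules.
import torch
import torch.nn.functional as F


def hscnetpp_forward(feat, g_r, h_r, g_s, h_s, g_3d, h_3d):
    # feat: dense backbone features of shape (B, d, h, w).
    x_r = g_r(feat)                                   # conv + transformer block
    y_r = h_r(x_r)                                    # region logits, (B, k1, h, w)

    # At inference, conditioning uses one-hot argmax labels; at training time
    # the ground-truth one-hot labels would be used instead.
    r_onehot = F.one_hot(y_r.argmax(1), y_r.shape[1]).permute(0, 3, 1, 2).float()
    x_s = g_s(feat, r_onehot)                         # FiLM-conditioned on regions
    y_s = h_s(x_s)                                    # sub-region logits, (B, k2, h, w)

    s_onehot = F.one_hot(y_s.argmax(1), y_s.shape[1]).permute(0, 3, 1, 2).float()
    cond = torch.cat([r_onehot, s_onehot], dim=1)     # (B, k1 + k2, h, w)
    x_3d = g_3d(feat, cond)                           # FiLM-conditioned on both levels
    y_3d = h_3d(x_3d)                                 # residuals, (B, 3, h, w)
    return y_r, y_s, y_3d
```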

Synergy between FiLM and transformers Modules \(g_s\) and \(g_{3D}\) include the use of FiLM layers followed by transformer blocks. Transformers typically rely on 2D positional encodings (Vaswani et al., 2017) in order to take the position of activations into account. Discarding those positions is not an appropriate choice for our task. Nevertheless, our architecture design dispenses with the need for such classical positional encodings, because the FiLM layers jointly encode appearance with the 3D coordinate predictions instead of the 2D positions within the image. To the best of our knowledge, such a form of geometry encoding for transformers has not appeared in the computer vision or machine learning literature before. We experimentally show that this is an effective design choice.

Compared to HSCNet, the proposed HSCNet++ incorporates transformers. The design choice of placing them right after the FiLM layers supports their synergy due to the aforementioned encoding of position information.

4.4 Training

When training with dense supervision, the following losses are adopted. Classification loss \(\ell _c\) is applied to the output of the two classification branches,

$$\begin{aligned} \ell _c = \ell _{ce}(\hat{y}_r(x),y_r(x) ) + \ell _{ce}(\hat{y}_s(x),y_s(x) ) \end{aligned}$$
(2)

where \(\ell _{ce}\) is the cross-entropy loss. Additionally, a regression loss \(\ell _r\), in particular the mean squared error, is applied between \(\hat{y}_{3D}(x)\) and \(y_{3D}(x)\). The total loss \(\mathcal {L}\) is a weighted sum of the two classification losses and the regression loss.

$$\begin{aligned} \mathcal {L} = \lambda _1\ell _c + \lambda _2\ell _r \end{aligned}$$
(3)

where \(\lambda _1\) and \(\lambda _2\) are the weights for each term. We observe that the localization performance is more sensitive to the regression prediction; thus, a larger weight is assigned to \(\ell _r\).
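A minimal sketch of this dense-supervision objective is given below; the weights match the values reported later in Sect. 5.1 and are assumed here purely for illustration.

```python
# A minimal sketch of the dense-supervision objective in Eqs. (2)-(3):
# cross-entropy on region and sub-region logits plus MSE on residuals.
import torch.nn.functional as F


def dense_loss(y_r_pred, y_s_pred, y_3d_pred, y_r, y_s, y_3d, lam1=1.0, lam2=10.0):
    # y_r_pred: (B, k1, h, w) logits; y_r: (B, h, w) integer region labels (same for sub-regions).
    l_c = F.cross_entropy(y_r_pred, y_r) + F.cross_entropy(y_s_pred, y_s)
    l_r = F.mse_loss(y_3d_pred, y_3d)      # residual regression
    return lam1 * l_c + lam2 * l_r
```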

4.5 Inference

During inference, the predicted 3D coordinates \(\hat{y}(x)\) and their corresponding 2D pixels are fed into the PnP-RANSAC loop to estimate the 6-DoF camera pose. These predicted 3D coordinates are obtained by simply summing the centers of the predicted sub-regions, \(c(\hat{y}_r(x) \times k_2 + \hat{y}_s(x))\), and the predicted residuals \(\hat{y}_{3D}(x)\).

Conditioning is conducted differently during training and inference, as shown in Fig. 2. At training time, conditioning is performed using the ground-truth (sub-)region labels, i.e. \(y_r(x)\) and \(y_s(x)\) are the second inputs of the conditioning blocks. At test time, conditioning uses the predicted (sub-)region labels. Specifically, the one-hot encodings resulting from the \(\mathop {\textrm{argmax}}\limits \) of \(\hat{y}_r(x)\) and \(\hat{y}_s(x)\) are the second inputs of the conditioning blocks.
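The paper follows the DSAC++/DSAC* PnP-RANSAC pipeline with soft inlier counting and pose refinement (parameters in Sect. 5.1). Purely as a rough stand-in for that final step, the sketch below feeds the predicted 2D-3D correspondences into OpenCV's generic PnP-RANSAC; only the 10-pixel inlier threshold, the 4-point minimal sets and the 256 hypotheses mirror the reported settings, everything else is an assumption.

```python
# A rough stand-in for the final pose estimation step only: OpenCV's generic
# PnP-RANSAC, not the authors' DSAC-style soft-inlier pipeline.
import cv2
import numpy as np


def estimate_pose(coords_3d, pixels_2d, K):
    # coords_3d: (N, 3) predicted scene coordinates; pixels_2d: (N, 2) pixel
    # locations mapped back from the w x h prediction grid to the image.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        coords_3d.astype(np.float64), pixels_2d.astype(np.float64), K, None,
        reprojectionError=10.0, iterationsCount=256, flags=cv2.SOLVEPNP_P3P)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec                          # world-to-camera rotation and translation
```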

4.6 Training with Sparse Supervision

When only sparse ground truth of 3D coordinates, indicated by mask m(x) for image x, is available, the straightforward approach is to apply the loss only on pixels where the mask value is 1, which we refer to as valid pixels. Instead, we propose to propagate the available labels to nearby pixels and to use two additional losses that appropriately handle the scarcity of the labels. We refer to the HSCNet++ model trained with such sparse supervision as HSCNet++(S).

Label propagation (LP) We rely on a smoothness assumption: labels do not change much in a small pixel neighborhood. Consequently, we propagate the labels in a local neighborhood around each pixel. The neighborhood is defined by a square area of size \(z \times z\). All neighbors of a valid pixel are marked as valid too, and the ground-truth maps, namely \(y_r(x)\), \(y_s(x)\), and \(y_{3D}(x)\), are updated by replicating the label of the original pixel to the neighboring pixels. Then, the classification and regression losses are applied to the newly obtained valid pixels after propagation. This can be seen as a form of pseudo-labeling that increases the density of the available labels.
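A minimal NumPy sketch of this propagation is shown below; the names are illustrative, and the last-writer-wins rule for overlapping windows is an assumption, not necessarily the authors' choice.

```python
# A minimal NumPy sketch of label propagation: each valid pixel copies its
# labels (region, sub-region, 3D coordinate) to a z x z window and marks the
# window as valid. Overlaps are resolved arbitrarily (last writer wins).
import numpy as np


def propagate_labels(mask, labels, z=11):
    # mask: (h, w) binary validity map; labels: (h, w, C) stacked ground-truth
    # maps; z: odd neighborhood window size.
    h, w, _ = labels.shape
    r = z // 2
    new_mask, new_labels = mask.copy(), labels.copy()
    for i, j in zip(*np.nonzero(mask)):
        i0, i1 = max(0, i - r), min(h, i + r + 1)
        j0, j1 = max(0, j - r), min(w, j + r + 1)
        new_mask[i0:i1, j0:j1] = 1
        new_labels[i0:i1, j0:j1] = labels[i, j]
    return new_mask, new_labels
```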

Symmetric cross-entropy loss (SCE) Pseudo-labels are expected to contain noise, which will typically be larger if propagation reaches background pixels starting from a valid foreground-object pixel. We quantitatively analyze the percentage of noisy labels as the neighborhood radius increases in Sect. 5.4. We thus face the challenging task of learning correct classification with noisy labels. The traditional cross-entropy loss is not reliable in such a scenario, as it overfits to noisy labels on some "easy" classes and suffers from under-learning on some "hard" classes (Wang et al., 2019).

Following Wang et al. (2019), we increase the robustness of the classification at minimal cost by introducing the symmetric cross-entropy loss. The additional reverse cross-entropy term in SCE is noise-tolerant: overestimating and underestimating the target value result in the same loss, which makes it more adaptive to noisy labels and allows the model to cope better with label noise. The SCE loss is defined as a weighted sum of the following terms:

$$\begin{aligned} \ell _{sce} =\lambda _{ce} \ell _{ce} + \lambda _{rce} \ell _{rce} \end{aligned}$$
(4)

where \(\ell _{rce}\) is the reverse cross-entropy loss. For a valid pixel i, \(\ell _{rce}\) is defined as:

$$\begin{aligned} \ell _{rce}(x,i) = -\hat{y}_r(x)_i \log y_r(x)_i, \end{aligned}$$
(5)

compared to the conventional one defined as follows:

$$\begin{aligned} \ell _{ce}(x,i) = -y_r(x)_i \log \hat{y}_r(x)_i. \end{aligned}$$
(6)
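As an illustration, the SCE objective applied to the (sub-)region logits could be sketched as follows; clamping \(\log 0\) to a constant in the reverse term follows Wang et al. (2019), and the weights are the values reported in Sect. 5.1, used here as assumptions.

```python
# A minimal sketch of the symmetric cross-entropy of Eq. (4) on region logits.
# log(0) in the reverse term is clamped to a constant, following Wang et al. (2019).
import torch
import torch.nn.functional as F


def sce_loss(logits, target, lam_ce=0.1, lam_rce=1.0, clamp=-4.0):
    # logits: (B, K, h, w); target: (B, h, w) integer labels of valid pixels.
    ce = F.cross_entropy(logits, target)

    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    log_true = torch.clamp(torch.log(one_hot), min=clamp)    # log(0) -> clamp
    rce = -(pred * log_true).sum(dim=1).mean()                # reverse cross-entropy

    return lam_ce * ce + lam_rce * rce
```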

Re-projection error loss (Rep) Besides the SCE loss, which handles label noise, we also adopt a re-projection loss as a semi-supervised term to further improve both the label and residual predictions. This term is especially effective in scenes with a large amount of textureless or repetitive patterns. However, the vanilla re-projection loss requires careful initialization to avoid unstable gradients from degenerate 3D predictions (e.g. points too far away or behind the camera), as well as extra geometric constraints and a long convergence time (Brachmann & Rother, 2021). Inspired by Li et al. (2018), we employ an angle-based re-projection loss which minimizes the angle \(\theta \) between two rays that share the camera center. This strategy forces predictions to lie in front of the camera, ensuring smoother gradients during training. Consequently, it eliminates the need for a time-consuming initialization step and mitigates the burden of related geometric constraints.

Given ground-truth camera pose P, the loss for pixel i of image x, whose 2D coordinates in the image are denoted by \(p_i\), is given by

$$\begin{aligned} \ell _{rep}(x,i) = ||\gamma _{i} P^{-1}\hat{y}(x)_i - fC^{-1}p_i||, \end{aligned}$$
(7)

where \(\gamma _i = ||fC^{-1}p_i||/||P^{-1}\hat{y}(x)_i||\), f is the focal length, and C is the intrinsic matrix. The angle-based re-projection loss is computed in the camera coordinate system between two points on a 3D sphere centered at the camera center and touching the image plane at the ground-truth pixel location, i.e. the radius of the sphere is \(||fC^{-1}p_i||\). The two points correspond to the locations where the ray from the camera center to the predicted 3D point and the ray to the ground-truth pixel location (both in the camera coordinate system) intersect the sphere, represented by the first and second terms in Eq. 7, respectively.
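A minimal sketch of this angle-based loss is given below; treating P as the ground-truth camera-to-world pose and C (denoted K in the code) as the intrinsic matrix are assumptions consistent with the notation above, not a reproduction of the authors' exact implementation.

```python
# A minimal sketch of the angle-based re-projection loss of Eq. (7): both
# terms are points on a sphere of radius ||f K^-1 p_i|| around the camera center.
import torch


def angle_reprojection_loss(pred_xyz, pixels, P, K, f):
    # pred_xyz: (N, 3) predicted scene coordinates; pixels: (N, 2) pixel locations;
    # P: (4, 4) ground-truth camera-to-world pose; K: (3, 3) intrinsics; f: focal length.
    N = pred_xyz.shape[0]
    ones = torch.ones(N, 1, dtype=pred_xyz.dtype)

    # Predicted points in camera coordinates: P^-1 * y_hat.
    pred_h = torch.cat([pred_xyz, ones], dim=1)             # (N, 4) homogeneous
    cam_xyz = (torch.inverse(P) @ pred_h.T).T[:, :3]        # (N, 3)

    # Ground-truth pixel rays scaled by the focal length: f * K^-1 * p.
    pix_h = torch.cat([pixels, ones], dim=1)                # (N, 3) homogeneous
    rays = f * (torch.inverse(K) @ pix_h.T).T               # (N, 3)

    # Rescale predictions onto the sphere of radius ||rays|| and compare.
    gamma = rays.norm(dim=1) / cam_xyz.norm(dim=1).clamp(min=1e-8)
    return (gamma.unsqueeze(1) * cam_xyz - rays).norm(dim=1).mean()
```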

Note that the re-projection loss is not added to the total loss during the initial epochs, which speeds up training convergence. Similar to the dense setting, the total loss for sparse supervision is a weighted sum of the regression loss, the symmetric classification loss, and the re-projection loss: \(\ell _{sparse} = \ell _{sce} + \lambda _2 \ell _r + \lambda _3 \ell _{rep}\).

5 Experiments

In this section, we discuss the experimental setup and employed datasets, present our results, and compare our approach to state-of-the-art localization methods.

Table 1 Indoor localization: individual scene setting (7-Scenes)
Table 2 Indoor localization: individual scene setting (12-Scenes)

5.1 Experimental Setup

Datasets We use three standard benchmarks for evaluation, namely 7-Scenes (Shotton et al., 2013), 12-Scenes (Valentin et al., 2016), and Cambridge Landmarks (Kendall et al., 2015). The 7-Scenes dataset covers a volume of \(\sim 6\,m^3\) per scene; 3D models and ground-truth poses are included in the dataset. 12-Scenes is another indoor RGB-D dataset containing 4 large scenes with a total of 12 rooms, whose volumes range from 14 to 79 \(m^3\). The union of these two datasets forms the 19-Scenes dataset. Cambridge Landmarks is a standard benchmark for evaluating scene coordinate methods in outdoor scenes; it is a small-scale outdoor dataset consisting of 6 individual scenes, with ground-truth poses provided by structure-from-motion.

Following prior work (Brachmann & Rother, 2019a), we conduct experiments per scene, i.e. the individual scenes setting, but also by training a single model on all scenes of a corresponding dataset, i.e. the combined scenes setting. The combined settings of the given indoor localization benchmarks are denoted by i7-Scenes, i12-Scenes, and i19-Scenes, respectively.

Competing methods In this work, we compare the proposed approach with the following methods: (1) pose regression methods that directly regress absolute or relative camera pose parameters: MapNet (Brahmbhatt et al., 2018), Geometric PoseNet (Kendall & Cipolla, 2017), AttTxf (Shavit et al., 2021), LSTM-Pose (Walch et al., 2017), AnchorNet (Saha et al., 2018) and LENS (Moreau et al., 2021); (2) local feature based pipelines based on SIFT such as Active Search (AS) (Sattler et al., 2016a) and HLoc (Sarlin et al., 2019) based on CNN descriptors; (3) DSAC\(^\star \)(3D) (Brachmann & Rother, 2021): the latest scene coordinate regression approach with 3D model; (4) VS-Net (Huang et al., 2021): scene-specific segmentation and voting; (5) PixLoc (Sarlin et al., 2021): scene-agnostic network; (6) SFT-CR (Guan et al., 2021): scene coordinate regression with global context-guidance. In addition, we also compare with (7) ESAC (Brachmann & Rother, 2019a) on the combined scenes. We also consider a baseline called Reg-only without the hierarchical classification layers.

Table 3 Indoor localization: combined scene setting

Evaluation metrics We report the median translation and orientation errors (cm, \(^\circ \)) as well as the accuracy, i.e. the fraction of test images localized within (\(5\,cm, 5^\circ \)), on indoor scenes. On the outdoor Cambridge Landmarks (Kendall et al., 2015), we report only the median pose error, as in previous methods (Brachmann & Rother, 2021; Brachmann et al., 2017; Li et al., 2020).

Training details We generate the region labels by hierarchical k-means. For 7-Scenes, 12-Scenes, and Cambridge Landmarks, we adopt 2-level ground-truth labels with a branching factor of 25 at all levels. For the combined scenes, i7-Scenes, i12-Scenes, and i19-Scenes, the first-level branching factor is set to \(7\times 25\), \(12\times 25\), and \(19\times 25\), respectively. For the individual scene setting, training is performed for 300K iterations with the Adam optimizer; for the combined scenes, the number of iterations is set to 900K. Throughout all experiments, we use a batch size of 1 with an initial learning rate of \(10^{-4}\).

The classification loss weight \(\lambda _1\) is set to 1 for all datasets, while the regression loss weight \(\lambda _2\) is 10 for single scenes and \(10^5\) for combined scenes. In the sparse supervision setting, \(\lambda _{ce}\) and \(\lambda _{rce}\) are set to 0.1 and 1, respectively, while \(\lambda _2\) follows the dense setting and \(\lambda _3\) is increased from 0 to 0.1 after the first 10 epochs. We initialize the network by training with \(\ell _r\) on pseudo-label coordinates and add \(\ell _{rep}\) after 10 epochs. When training with sparse supervision, we select a neighborhood size of \(z = 11\) to propagate labels, and use the cluster centers obtained from dense scene coordinates for a direct comparison.

Data augmentation is also effective in increasing the final accuracy. Thus, similar to HSCNet (Li et al., 2020), we randomly augment training images using translation, rotation, scaling, and shearing, uniformly sampled from [\(-20\)%, 20%], [\(-30^\circ \), \(30^\circ \)], [0.7, 1.5], and [\(-10^\circ \), \(10^\circ \)], respectively. In addition, images are augmented with additive brightness uniformly sampled from [\(-20\), 20].
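For illustration, these ranges could be realized with torchvision as sketched below; in practice the ground-truth label and coordinate maps must be warped with the same affine transform, which is omitted here for brevity, and the function name is hypothetical.

```python
# An illustrative sketch of the augmentation ranges listed above using torchvision.
import random
import torch
from torchvision.transforms import functional as TF


def augment(image):
    # image: (3, H, W) float tensor with values in [0, 255].
    angle = random.uniform(-30, 30)                         # rotation (deg)
    translate = [int(random.uniform(-0.2, 0.2) * image.shape[-1]),
                 int(random.uniform(-0.2, 0.2) * image.shape[-2])]
    scale = random.uniform(0.7, 1.5)
    shear = random.uniform(-10, 10)
    image = TF.affine(image, angle=angle, translate=translate,
                      scale=scale, shear=[shear])
    return image + random.uniform(-20, 20)                  # additive brightness
```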

Pose estimation We follow the same PnP-RANSAC pipeline and parameter settings as Brachmann and Rother (2018). The inlier threshold and the softness factor are set to \(\tau = 10\) and \(\beta = 0.5\), respectively. We randomly select 4 correspondences to form a minimal set for the PnP algorithm to generate a camera pose hypothesis, and 256 initial hypotheses are sampled. Similar to Brachmann and Rother (2018, 2021), pose refinement is performed until convergence, for a maximum of 100 iterations.

Architecture details The detailed architecture of HSCNet++ is shown in Fig. 3; we also visualize the block details of the FiLM conditioning network and the transformer modules. By removing the transformer layers, we obtain the architecture of HSCNet. Additionally, the number of channels in the last branch \(g_{3D}\) is 4096 for HSCNet, while it is 2048 for HSCNet++, which reduces the memory cost (cf. Sect. 5.6). For experiments on the combined scenes, we added two more layers to the first conditioning generator \(g_s\), marked in dotted red in Fig. 3. We also roughly doubled the channel counts highlighted in red, cyan and violet for i7-Scenes, i12-Scenes and i19-Scenes, respectively. For individual scenes, we add 2 multi-head attention (MHA) layers to both the classification and regression conditioning blocks, while in the combined setting, the number of MHA layers is set to 5.

Table 4 Outdoor localization: individual scene setting (Cambridge)

5.2 Results for HSCNet and HSCNet++

Individual scenes setting. We present results on 7-Scenes and 12-Scenes in Tables 1 and 2, respectively. All models are trained and evaluated individually on each scene of the corresponding dataset. Results show that HSCNet is still competitive with methods published later. With the addition of transformers, HSCNet++ further boosts the average performance by \(\sim \)4% on 7-Scenes and obtains the best accuracy on 7-Scenes among the competitors.

Combined scenes setting To test the scalability of scene-coordinate regression methods, we go beyond small-scale environments such as individual scenes in 7-Scenes and 12-Scenes and use the combined scenes, i.e. i7-Scenes, i12-Scenes, and i19-Scenes by combining the former datasets.

Results on the combined scenes setting are presented in Table 3, including a comparison with the regression-only baseline and ESAC. Results show that our method scales well with an increasing number of scenes compared to the Reg-only baseline. Note that ESAC requires training and storing multiple networks specializing in local parts of the environment, whereas our approach requires only a single model. Our approach outperforms ESAC on i7-Scenes and i12-Scenes, while performing comparably on i19-Scenes (87.9% vs. 88.1%). ESAC and our approach could be combined for very large-scale scenes, but we do not explore this option in this work. HSCNet++ advances the state of the art on all datasets, demonstrating the utility of transformers for this task.

Cambridge Landmarks Table 4 reports the results of three types of visual localization methods on Cambridge Landmarks. AS (Sattler et al., 2016a) and HLoc (Sarlin et al., 2019) estimate the camera poses with sparse SfM ground truth. DSAC++, DSAC* and our approaches train a scene-coordinate regression model with MVS-densified depth maps, while VS-Net leverages a hybrid of the two. Both HSCNet and HSCNet++ perform better than the other scene coordinate methods DSAC++ and DSAC*, and their performance is comparable to more recent approaches. However, we observe that the models trained with MVS-densified pseudo ground truth perform slightly worse than the approaches that use the sparse SfM 3D map, and HSCNet++ performs even worse after adding the transformer modules. These results motivated us to extend HSCNet++ to train with sparse supervision, under the hypothesis that MVS densification introduces additional noise to the dense supervision. The performance of HSCNet++(S) trained with sparse supervision on Cambridge Landmarks in Sect. 5.5 verifies this hypothesis.

5.3 Ablations: HSCNet

Data augmentation Using geometric and color data augmentation provides robustness to lighting and viewpoint changes (DeTone et al., 2018; Melekhov et al., 2021). We investigate the impact of data augmentation and summarize the obtained results in Table 5a. Applying data augmentation leads to better localization accuracy. Note that even without data augmentation, the proposed approach still provides results comparable to state-of-the-art methods (cf. ESAC (Brachmann & Rother, 2019a) in Table 3 vs. row 3 of Table 5a).

Conditioning mechanism The two key components of HSCNet are the coarse-to-fine joint classification-regression module and its combination with the conditioning mechanism. Their impact is evaluated and results are shown in Table 5a. We train a variant of our network without the conditioning mechanism, i.e. we remove all the conditioning generators and layers. The network still estimates scene coordinates in a coarse-to-fine manner by using the predicted location labels, but there is no coarse location information that is fed to influence the network activations at the finer levels. Results indicate the importance of the conditioning mechanism for accurate scene coordinate prediction. Compared to single scene setting in Tables 1 and 3, the performance of regression only baseline drops significantly in the combined scene setting as shown in Table 5a.

Table 5 Ablation for HSCNet
Table 6 Ablations for HSCNet++. We analyze the influence of different design choices of the proposed approach on i7-Scenes

Hierarchy and partition granularity The robustness of HSCNet to the label hierarchy hyperparameters, varying depth and width, is reported in Table 5. The results show that the performance of our approach is robust w.r.t. the choice of these hyperparameters, with a significant drop observed only for the smallest 2-level label hierarchy. Increasing the number of classification layers beyond 2 is not always beneficial and only brings marginal improvement on 7-Scenes, while increasing the computational cost. We observe the best trade-off for a partition of \(25 \times 25\) for 7-Scenes and \(175 \times 25\) for i7-Scenes (\(175=7\times 25\) due to the 7 combined scenes).

Fig. 4

Scene coordinate visualization on i7-Scenes. We visualize the scene coordinate predictions for three test images with HSCNet, HSCNet++, and HSCNet++(S) on i7-Scenes. The XYZ coordinates are mapped to a heatmap, and the ground-truth scene coordinates are computed from the depth maps. For each image, the left column shows the correctly predicted labels and the right column the predicted scene coordinates

Fig. 5

Median Error for HSCNet++(S). We show the frames with median pose estimation error in each scene and visualize the accuracy by overlaying the query image (right) with a rendered image (left, grayscale) using the estimated pose and the ground truth 3D model

5.4 Ablations: HSCNet++

Impact of internal transformer encoder layers. In this ablation, we remove the transformer encoders \(t_r\) and \(t_s\), keeping only \(t_{3D}\). This variant is denoted by HSCNet++\(^\dagger \), and Table 6a shows a small to noticeable drop in all cases.

To factor out the impact of multi-headed attention (MHA) layers, we report results in Table 6a, which show that increasing the number of MHA layers in HSCNet++\(^\dagger \) does not lead to performance improvement. It is worth mentioning that HSCNet++\(^\dagger \) with 8 MHA layers has 2 million more parameters than HSCNet++. Our intuition is that the gain of HSCNet++ comes from improved predictions at the coarse levels of the network. To test this hypothesis, we compute the accuracy of the sub-region predictions: for each valid pixel in a query image, this metric evaluates whether the pixel is correctly classified. Results in Table 6c show that adding transformers to the classification branches helps to improve the label classification accuracy. However, the sub-region prediction accuracy does not always correlate with the localization performance. This can be attributed to the RANSAC-based filtering of the final 3D scene coordinates for camera pose estimation: incorrect 3D scene predictions due to erroneous sub-region predictions can be detected as outliers by RANSAC.

Impact of positional encoding. We compare the proposed way of providing region (position) information to the transformer blocks with the classical positional encoding used in transformers. As label encoding is an inherent part of HSCNet, for a direct comparison we additionally add the classical positional encoding right before the transformer block and perform experiments on i7-Scenes. Results presented in Table 6b show that with the additional positional encoding the results noticeably drop.

5.5 Results for HSCNet++(S)

Table 7 HSCNet++(S) results
Table 8 Ablations for HSCNet++(S)
Fig. 6

Impact of neighborhood size z. The percentage of accurate labels and of valid pixels as the neighborhood window size z increases

We now present results for HSCNet++(S) with sparse supervision and study the pseudo-labeling and loss functions in detail. For indoor scenes, we synthetically sparsify dense coordinates using sparse SIFT-based SfM reconstruction. That is, we select the subset of dense 3D coordinates whose 2D re-projections (pixel locations) are also registered in the SfM reconstruction. For the outdoor Cambridge dataset, we directly obtain the keypoints of training images from the provided SfM models.

Table 9 Impact of z on pose estimation

The localization performance on 7-Scenes, i7-Scenes, and the Cambridge dataset is provided in Fig. 5 and Table 7. Results show that even with sparse coordinate supervision, HSCNet++(S) achieves competitive results on 7-Scenes with respect to its dense counterpart, even outperforming HSCNet. On the more challenging combined scene setup of i7-Scenes, HSCNet++(S) falls behind by 10%, indicating the need for further research in this direction. However, on the outdoor Cambridge Landmarks dataset, where in most cases only sparse coordinate data is available, HSCNet++(S) outperforms HSCNet and HSCNet++, which are trained on MVS-densified data (Brachmann & Rother, 2018; Schönberger et al., 2016; Li et al., 2020), by a large margin. This demonstrates the effectiveness of our label propagation and supports our hypothesis that noisy dense ground truth from MVS harms the training process. The largest improvements are observed on Kings College, Great Court and Old Hospital, with median pose errors (cm/\(^\circ \)) of 15/0.24, 18/0.11 and 15/0.30, respectively (cf. Table 4). In terms of average median pose error, HSCNet++(S) outperforms PixLoc (15/0.25), VS-Net (13.6/0.24) and DSAC* (20.6/0.34).

Table 10 Comparison of the model capacity and runtime

Component ablations We perform ablations on 7-Scenes to examine the components of the proposed HSCNet++(S). We first train the model without the proposed label propagation, i.e. only with the sparse keypoint pixels, as the baseline. Then, for HSCNet++(S), we present three variants by removing each component, namely the transformers, the symmetric cross-entropy and the re-projection loss, as shown in Table 8. The baseline achieves only 62.0% average accuracy, which is significantly worse than our result (85.2%). The variants without the transformer layers (w/o Txf), SCE and Rep show worse performance than HSCNet++(S) on average. These results demonstrate that the synergy of the individual components leads to the superior results.

Impact of LP neighborhood size. We analyze the impact of the LP neighborhood window size z. As an ablation, we vary z from \(0 \rightarrow 19\) on RedKitchen, and the results are reported in Fig. 6 and Table 9. Figure 6 shows that increasing z also increases pseudo-label noise, reflected by a decrease in the percentage of accurate labels; e.g. for \(z = 11\) the fraction of noisy labels is 15%. Results in Table 9 show that there is a trade-off between increasing z and camera localization accuracy. This effect is more pronounced in the outdoor scene Great Court from the Cambridge dataset, where increasing z from \(0 \rightarrow 11\) reduces the median pose error (t/r) from \(32/0.28 \rightarrow 18/0.11\), but increasing z further from \(11 \rightarrow 19\) increases it from \(18/0.11 \rightarrow 35/0.2\). Limiting the spatial proximity of pseudo-labels to the initial sparse labels thus seems a suitable choice.

5.6 Model Capacity and Efficiency

Model capacity As mentioned in Sect. 4.3, we prune some heavy convolution layers compared to HSCNet. To demonstrate the efficiency of this choice, Table 10 reports the model sizes of HSCNet and HSCNet++ on 7-Scenes and i7-Scenes. Our method has a 43% smaller memory footprint than HSCNet for individual scene training and a 30% reduction on the combined scenes.

Runtime For a fair comparison of running time, we run all experiments on an NVIDIA GeForce RTX 2080 Ti GPU and an AMD Ryzen Threadripper 2950X CPU. Training for 300K iterations on an individual scene takes \(\sim \)7.4 h for HSCNet++ and \(\sim \)10.4 h for HSCNet with the same setting. We show the approximate training time for one iteration in Table 10. HSCNet++ clearly has a smaller memory footprint and faster training time while offering higher accuracy. We also note that the training time grows as the number of multi-head attention layers increases.

We have not observed a clear difference between the two methods in inference time. The running time to localize one image varies from around 85 ms to 130 ms, mainly depending on the accuracy of the predicted 2D-3D correspondences fed into the PnP-RANSAC loop.

6 Conclusion

We have proposed a novel hierarchical coarse-to-fine approach for scene coordinate prediction. The network benefits from FiLM-like conditioning on coarse region predictions for better scene coordinate prediction. We experimentally demonstrate that both the hierarchy and the prediction conditioning are required for the improvement. The method is extended to handle sparse labels using the proposed pseudo-labeling approach, and the adaptation of symmetric cross-entropy and re-projection losses provides robustness to pseudo-label noise. We also show that the synergy of all components proposed in this work is needed for the best performance.

Fig. 7

An overview of the proposed HSCNet++(A) on Aachen. The figure shows the network architecture of the modified HSCNet++ for large-scale Aachen dataset. Here, \(y_0\), \(y_1\), \(y_2\), \(y_3\) are coarse-to-fine label predictions

Results show that the proposed hierarchical scene coordinate network is more accurate than previous regression-only approaches for single-image RGB localization. The proposed method is also more scalable, as shown by the results on three indoor datasets. In addition, it is extended to handle sparse labels using less costly means than existing methods, obtaining better results on outdoor scenes.