HSCNet++: Hierarchical Scene Coordinate Classification and Regression for Visual Localization with Transformer

Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, so that the matching is implicitly performed by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network that predicts pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The proposed method, an extension of HSCNet, allows us to train compact models which scale robustly to large environments. It sets a new state-of-the-art for single-image localization on the 7-Scenes, 12-Scenes, and Cambridge Landmarks datasets, as well as on the combined indoor scenes.


Introduction
Estimating the six degrees-of-freedom (6-DoF) camera pose from a given RGB image is a key component in many computer vision systems such as augmented reality, autonomous driving, and robotics. Classical methods [58,[61][62][63]71] establish 2D-2D(-3D) correspondences between query and database local descriptors, followed by PnP-based camera pose estimation. This incurs both storage and computational costs by necessitating the storage of millions of database local descriptors and hierarchical descriptor matching in a RANSAC loop.
On the other hand, end-to-end pose regression methods that directly regress the camera pose parameters are much faster and more memory efficient [2,20,34,67]. However, such methods are significantly less accurate than local descriptor ones. A better trade-off between accuracy and computational efficiency is offered by structured localization approaches [6,8,11,36,68,80]. Structured methods are trained to learn an implicit representation of the 3D environment by directly regressing the 3D scene coordinates corresponding to each 2D pixel location in a given input image. This directly provides 2D-3D correspondences and avoids storage and explicit matching of database local descriptors with the query. For small-scale scenes, it was shown that scene-coordinate methods [8,11] outperform classical local descriptor-based methods, although [5] later showed that the performance is in fact comparable. Nevertheless, the storage and computational benefits of structure-based methods remain superior to those of classical local descriptor matching methods.
Existing scene-coordinate regression methods [6,8,11] are designed to predict scene coordinates from a small local image patch, which provides robustness to viewpoint changes. On the other hand, such methods are limited in their applicability to larger scenes, where ambiguity from visually similar local image patches cannot be resolved with a limited receptive field. Using larger receptive field sizes, up to the full image, to regress the coordinates can mitigate the issues from ambiguities by encoding larger context. This, however, is shown to be prone to overfitting the larger input patterns in the case of limited training data, even if data augmentation alleviates this problem to some extent [11,37].

Fig. 1: HSCNet architecture. The ground-truth 3D scene coordinates are hierarchically quantized into regions and sub-regions. Different branches of the network sequentially predict discrete regions and sub-regions, and continuous 3D coordinates, with the processing of each branch being conditioned on the result of the previous one. Given an input image, HSCNet predicts 3D coordinates for 2D image pixels, which then form the input to PnP-RANSAC for 6-DoF pose estimation.
Increasing context by enlarging the receptive field, while maintaining local distinctiveness of descriptors and avoiding overfitting, is a challenging problem. We address this using a special network architecture, called HSCNet [36], which hierarchically encodes scene context using a series of classification layers before making the final coordinate prediction. The overall pipeline is shown in Fig. 1. In particular, the network predicts scene coordinates progressively in a coarse-to-fine manner, where predictions correspond to a region in the scene at the coarse level and to coordinate residuals at the finest level. The predictions at each level are conditioned on both descriptors and predictions from the preceding level, which we experimentally show is the key component in large scenes. This conditioning leverages FiLM [50] layers that allow for a gradual increase in the receptive field. Instead of leveraging simple CNNs as in HSCNet to encode the descriptors and predictions, this work extends it with a transformer-based [77] conditioning mechanism, named HSCNet++, which is more efficient in capturing global context into local representations through attention and does not require heavy convolutional layers to enlarge the receptive field. The architecture improves coordinate prediction at all levels, both coarse and fine. We integrate dynamic position information in the form of predicted coarse positional encoding, without the need to learn or explicitly construct position embeddings, and show promising results on several benchmarks. We further extend HSCNet++ by removing the dependency on dense ground-truth scene coordinates. Dense coordinates limit the applicability of HSCNet to outdoor scenes. Similar to [8], HSCNet addressed the issue of sparse data on the Cambridge dataset [34] by using MVS-based densification [65]. However, such densified labels introduce additional noise and are costly to obtain. Directly training HSCNet with sparse supervision leads to a significant performance drop. In HSCNet++, we propose a simple yet effective pseudo-labelling method, where ground-truth labels at each pixel location are propagated to a fixed spatial neighbourhood. This is based on the assumption that nearby pixels share similar statistics. To provide robustness to pseudo-label noise, symmetric loss functions based on cross-entropy and reprojection loss are proposed. While the symmetric cross-entropy loss provides robustness to the classification layers of HSCNet, the reprojection loss rectifies the noise in the pseudo-labelled 3D scene coordinates.
This work is a summary and extension of HSCNet. We validate our approach on three datasets used in previous works: 7-Scenes [68], 12-Scenes [75], and Cambridge Landmarks [34]. Our approach shows consistently better performance and achieves state-of-the-art results for single-image RGB localization. In addition, by compiling the 7-Scenes and 12-Scenes datasets into single large scenes, we show that our approach scales more robustly to larger environments. In summary, our contributions are as follows:

1. Compared to HSCNet, we utilize an improved transformer-based conditioning mechanism that efficiently and effectively encodes global spatial information into the scene coordinate prediction pipeline, resulting in a significant performance improvement from 84.8% to 88.7% on indoor localization while requiring only 57% of the memory footprint.

2. We extend HSCNet to optionally leverage only the sparse ground truth in the training procedure by introducing pseudo ground-truth labels and angle-based reprojection errors. When using sparse supervision for training, HSCNet++(S) achieves better accuracy on the Cambridge dataset than HSCNet++ trained on MVS-densified data.

3. We show that the classical pixel-based positional encoding in our conditioning mechanism suffers from a significant performance drop, especially in scenes with massive repetitive patterns. Our spatial positional encoding by the FiLM layer eliminates this problem and achieves SoTA performance on several benchmarks.

Related Work
Existing methods for visual localization are reviewed below according to the category they belong to.
Classical visual localization methods assume that a scene is represented by a 3D model, which is a result of processing a set of database images. Each 3D point of the model is associated with one or several database local descriptors. Given a query image, a sparse set of keypoints and their local descriptors are obtained using traditional [4,15,39,56] or learned CNN-based [3,21,23,27,40,[42][43][44]48,53,69,72,79,85] approaches. The query local descriptors are then matched with local descriptors extracted from database images to establish tentative 2D-3D matches. These tentative matches are then geometrically verified using RANSAC [24] and the camera pose is estimated via PnP. Although these methods produce very accurate pose estimates, the computational cost of sparse keypoint matching becomes a limitation, especially for large-scale environments. The large computational cost is addressed by image retrieval-based methods [1,51], which restrict matching of query descriptors to local descriptors extracted from top-ranked database images only. Moreover, despite recent advancements in learned keypoint detectors and descriptors [23,42,44,53,70,73,79,86], extracting discriminative local descriptors that are robust to different viewpoint and illumination changes is still an open problem.
Absolute camera pose regression (APR) methods aim to alleviate the limitations of structure-based methods by using a neural network that directly regresses the camera pose of a query image [12,19,20,31,32,34,45,78] given as input to the network. The network is trained on database images with ground-truth poses by optimizing a weighted combination of orientation and translation L2 losses [34,45], leveraging uncertainty [33], utilizing the temporal consistency of sequential images [52,74,78,83], or using GNNs [84] and Transformers [66]. The APR methods are scalable, fast, and memory efficient, since they do not require storing a 3D model. However, their accuracy is an order of magnitude lower than that obtained by structure-based localization approaches and comparable to image retrieval methods [64]. Moreover, the APR approaches require a different network to be trained and evaluated per scene when the scenes are registered to different coordinate frames.
Relative camera pose regression (RPR) methods, in contrast to APR, train a network to predict the relative pose between the query image and each of the top-ranked database images [2,22,35], obtained by image retrieval [1,51]. The camera location is then obtained via triangulation from two relative translation estimates verified by RANSAC. This leads to better generalization performance without scene-specific training. However, the RPR methods suffer from low localization accuracy, similarly to APR.
Scene coordinate regression methods learn the first stage of the pipeline in the structure-based approaches. Namely, either a random forest [7,17,18,26,41,46,47,68,76] or a neural network [6,[8][9][10][11]13,14,16,37,38,41] is trained to directly predict 3D scene coordinates for the pixels, and thus the 2D-3D correspondences are established. These methods do not explicitly rely on feature detection, description, and matching, and are able to provide correspondences densely. They are more accurate than traditional feature-based methods at small and medium scales, but usually do not scale well to larger scenes [8,9]. In order to generalize well to novel viewpoints, these methods typically rely on only local image patches to produce the scene coordinate predictions. However, this may introduce ambiguities due to similar local appearances, especially when the scale of the scene is large. To resolve local appearance ambiguities, we introduce element-wise conditioning layers to modulate the intermediate feature maps of the network using coarse discrete location information. We show that this leads to better localization performance, and that we can robustly scale to larger environments.
Joint classification-regression frameworks have proven effective in solving various vision tasks. For example, [54,55] proposed a classification-regression approach for human pose estimation from single images. In [7], a joint classification-regression forest is trained to predict scene identifiers and scene coordinates. In [82], a CNN is used to detect and segment a predefined set of planar Objects-of-Interest (OOIs), and then to regress dense matches to their reference images. In [13], scene coordinate regression is formulated as two separate tasks of object instance recognition and local coordinate regression. In [9], multiple scene coordinate regression networks are trained as a mixture of experts along with a gating network which assesses the relevance of each expert for a given input, and the final pose estimate is obtained using a novel RANSAC framework, i.e., Expert Sample Consensus (ESAC). In contrast to existing approaches, in our work, we use spatially dense discrete location labels defined for all pixels, and propose FiLM-like [50] conditioning layers to propagate information in the hierarchy. We show that our novel framework allows us to achieve high localization accuracy with a single compact model.
Transformers have already been shown to have a positive impact on the problem of visual localization. Shavit et al. [66] show that multi-headed transformer architectures can be used to improve end-to-end absolute camera pose localization in multiple scenes with a single trained model. Similarly, SuperGlue, LoFTR, and COTR [29,59,70] demonstrate the usefulness of transformer architectures in learning local descriptor models. Inspired by these successes, this paper extends the transformer architecture to structured localization.

Problem Formulation and Notation
The goal of camera pose estimation is to predict the 6-DoF pose p(x) ∈ R 6 for an RGB image x. Handling camera pose estimation as dense 3D scene coordinate regression is performed by first predicting, for each pixel of an image, the corresponding 3D coordinates in a known 3D environment, given by ŷ(x). As a second and final step, these 2D-3D correspondences are fed into a PnP algorithm that estimates the camera pose. In this work, we focus on the function f : [0, 1] W ×H×3 → R w×h×3 , with w = W/8 and h = H/8, that provides such 3D coordinate predictions given an input image x, i.e. ŷ(x) = f (x). The known 3D environment is represented by a set of training images with known ground-truth labels per pixel in the form of 3D coordinates. The training set comprises pairs of the form (x, y(x)) for image x and ground-truth 3D coordinates y(x). In case ground truth is available only sparsely, i.e. on a small part of the image pixels, a corresponding binary mask m(x) ∈ {0, 1} w×h denotes which are the valid pixels. The value of the ground truth or prediction at a particular pixel i is denoted by a subscript, e.g. y(x) i for the ground-truth 3D coordinate of pixel i.
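As a concrete illustration of the notation, the following minimal sketch (our own, with a placeholder in place of the trained network) shows the interface of f: a W × H RGB image maps to a grid of 3D scene coordinates downsampled by a factor of 8.

```python
import numpy as np

# Minimal sketch of the interface of f; the network here is a placeholder
# returning zeros, so only the input/output shapes match the formulation.
def predict_scene_coordinates(image):
    H, W, _ = image.shape          # image in [0, 1]^{W x H x 3}
    h, w = H // 8, W // 8          # backbone downsamples by a factor of 8
    return np.zeros((h, w, 3))     # one 3D scene coordinate per output cell

image = np.random.rand(480, 640, 3)   # a 640 x 480 RGB image
coords = predict_scene_coordinates(image)
```

Each output cell, paired with its 2D location, yields one 2D-3D correspondence for the subsequent PnP step.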

Hierarchical Scene Coordinate Prediction
HSCNet. A baseline conventional approach for this task is to use a fully convolutional network (FCN) that maps input images to 3D coordinate predictions and is trained with a regression loss. The proposed HSCNet extends this scheme by constructing a hierarchy of labels, from coarse level to fine level, and by adding extra layers to predict those labels. Hierarchical discrete labels are defined by partitioning the ground-truth 3D points of the scene with hierarchical k-means. The number of levels in the hierarchy is fixed to 2 in this work. In this way, in addition to the ground-truth 3D scene coordinates, each pixel in a training image is also associated with two labels, namely region and sub-region labels, obtained at different levels of the clustering hierarchy.
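The 2-level label construction described above can be sketched as follows. This is our own minimal implementation (a plain Lloyd's-iteration k-means rather than a library call), and `hierarchical_labels` is a hypothetical name, not the paper's code.

```python
import numpy as np

# Minimal Lloyd's-iteration k-means used only for this sketch.
def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(0)
    return labels, centers

# Sketch of 2-level hierarchical k-means: coarse regions, then sub-regions
# within each region; residuals are the fine-level regression targets.
def hierarchical_labels(points, k=25):
    region, _ = kmeans(points, k)
    sub_region = np.zeros(len(points), dtype=int)
    sub_centers = np.zeros_like(points)
    for r in range(k):
        idx = np.nonzero(region == r)[0]
        if len(idx) == 0:
            continue
        labels, centers = kmeans(points[idx], min(k, len(idx)))
        sub_region[idx] = labels
        sub_centers[idx] = centers[labels]
    residual = points - sub_centers   # residual w.r.t. the sub-region center
    return region, sub_region, residual
```

By construction, adding each residual back to its sub-region center recovers the original ground-truth coordinate, mirroring how the network's branches are combined at inference.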
Region and sub-region labels are denoted by one-hot encodings y r (x) ∈ {0, 1} w×h×k and y s (x) ∈ {0, 1} w×h×k , respectively. The fine-level information is given by the residual between the ground-truth 3D point and the corresponding sub-region center, which we denote by y 3D (x) ∈ R w×h×3 . Ground-truth 3D pixel coordinates y(x) are thus replaced by y r (x), y s (x), and y 3D (x). Sub-region centers and residuals, when combined by addition, compose the pixel 3D coordinates. We add two classification branches, for regions and sub-regions, which provide the label predictions in the form of k-dimensional probability distributions, and a regression branch for residual prediction. Region, sub-region, and residual predictions are denoted by ŷr (x), ŷs (x), and ŷ3D (x), respectively. A key ingredient is to propagate coarse region information to inform the predictions at finer levels, which is achieved by conditioning layers before the classification/regression layer(s).
FiLM Conditioning. The FiLM-based [50] conditioning layers are used to encode the predicted (sub-)region information into follow-up branches. These layers rely on parameter generators γ, β : R d → R d , and perform modulation of input features z by

z′ = γ(w) ⊙ z + β(w),     (1)

where ⊙ is the Hadamard product, and the functions γ and β consist of 1 × 1 convolutions and are conditional parameter generators, i.e. the parameters of the FiLM layer depend on one of the inputs. Unlike the original FiLM layers, which perform the same channel-wise modulation across the entire feature map, our conditioning layers perform a linear modulation per spatial position, i.e., element-wise multiplication and addition. Therefore, instead of vectors, the output parameters γ(w) and β(w) are feature maps of the same dimensions as the input feature map.

Fig. 2: The depicted losses correspond to the case of learning with dense ground truth. Feature maps from different parts of the dense feature encoder are input to g r , g s , g 3D . This is represented by red and magenta arrows in an abstract way, while the detailed architecture is presented in Figure 3.
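A minimal sketch of such a per-position FiLM layer, assuming PyTorch and (sub-)region label maps as the conditioning input; the module and parameter names are ours, and the real generators stack several 1 × 1 convolutions.

```python
import torch
import torch.nn as nn

# Sketch of spatial (per-position) FiLM: gamma and beta are full feature
# maps produced from the conditioning input, not per-channel vectors.
class SpatialFiLM(nn.Module):
    def __init__(self, cond_channels, feat_channels):
        super().__init__()
        # 1x1 convs map the conditioning input (e.g. one-hot region labels)
        # to per-position scale (gamma) and shift (beta) maps.
        self.gamma = nn.Conv2d(cond_channels, feat_channels, kernel_size=1)
        self.beta = nn.Conv2d(cond_channels, feat_channels, kernel_size=1)

    def forward(self, z, w):
        # z: (B, C, h, w) features; w: (B, K, h, w) predicted labels.
        return self.gamma(w) * z + self.beta(w)   # element-wise modulation

film = SpatialFiLM(cond_channels=25, feat_channels=64)
out = film(torch.randn(1, 64, 30, 40), torch.randn(1, 25, 30, 40))
```

Because gamma and beta vary per spatial position, different image regions receive different modulations, which is what carries the coarse location information to the finer branches.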

HSCNet++
HSCNet is extended by adding transformer blocks to each branch of the pipeline. The resulting architecture is referred to as HSCNet++. It integrates transformer encoders that exploit the inherent, implicit region and sub-region information and do not require the conventional position encodings typically used with transformers.
Model architecture. The overall architecture of HSCNet++ is summarized in Fig. 2. We present the model as it operates during inference and then clarify the differences between training and inference. An FCN backbone is used for dense feature encoding and is denoted by F(x) ∈ R w×h×d , mapping the input image to a dense feature tensor which represents the appearance of the input image. Prediction of region labels is performed first. A mixed module, consisting of an FCN with a transformer and denoted by g r , processes feature map F(x). The result is given by x r = g r (F(x)), which is then fed to the region predictor h r : R w×h×d → R w×h×k , comprising a 1×1 convolutional layer. The region prediction is given by ŷr (x) = h r (x r ). After the region prediction, sub-region prediction is performed. Feature map F(x) is now fed to a conditioning block. Processing is conditioned on the region predictions ŷr (x), and the features are enhanced by the transformer, which captures global information into local features. This is denoted by the function g s : R w×h×d × R w×h×k → R w×h×d , i.e. the output feature map depends both on the input feature map and on the predicted regions. Then, x s = g s (F(x), ŷr (x)) is fed into the sub-region predictor (similar to the region predictor), ŷs (x) = h s (x s ). Conditioning on region predictions is a way to jointly encode appearance and region predictions, and is therefore used to improve the sub-region predictions. Finally, the continuous residual prediction is performed. Similar to the earlier stage, feature map F(x) is processed by conditioning on the sub-region predictions ŷs (x). This is denoted by the function g 3D : R w×h×d × R w×h×k → R w×h×d . Then, x 3D = g 3D (F(x), ŷs (x)) is fed into the 3D residual predictor to obtain ŷ3D (x) = h 3D (x 3D ), where h 3D : R w×h×d → R w×h×3 consists of a 1 × 1 convolution.
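The coarse-to-fine forward pass above can be sketched as follows. This is a deliberately simplified stand-in: the module names are ours, the real g_r, g_s, g_3D mix convolutions, FiLM layers, and transformer encoders, and here conditioning is replaced by plain channel concatenation for brevity.

```python
import torch
import torch.nn as nn

# Simplified coarse-to-fine head: region logits -> sub-region logits ->
# 3D residuals, each stage conditioned on the previous prediction
# (concatenation here stands in for the paper's FiLM conditioning).
class HierarchicalHead(nn.Module):
    def __init__(self, d=64, k=25):
        super().__init__()
        self.g_r = nn.Conv2d(d, d, 1)
        self.h_r = nn.Conv2d(d, k, 1)        # region logits
        self.g_s = nn.Conv2d(d + k, d, 1)    # conditioned on region prediction
        self.h_s = nn.Conv2d(d, k, 1)        # sub-region logits
        self.g_3d = nn.Conv2d(d + k, d, 1)   # conditioned on sub-region prediction
        self.h_3d = nn.Conv2d(d, 3, 1)       # residual regression

    def forward(self, feats):
        y_r = self.h_r(self.g_r(feats))
        x_s = self.g_s(torch.cat([feats, y_r.softmax(1)], dim=1))
        y_s = self.h_s(x_s)
        x_3d = self.g_3d(torch.cat([feats, y_s.softmax(1)], dim=1))
        y_3d = self.h_3d(x_3d)
        return y_r, y_s, y_3d

head = HierarchicalHead(d=32, k=8)
y_r, y_s, y_3d = head(torch.randn(2, 32, 15, 20))
```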
Conditioning with transformer (w/ Txf). Conditioning blocks g s and g 3D consist of FiLM layers, parameter generators, and transformers. The parameter generators consist of several 1×1 convolutional layers and take the predicted (sub-)regions ŷ from previous layers as input. The FiLM layers then condition the input features F(x) with the parameter generator outputs using Eq. 1. Therefore, appearance information, as indicated by the input feature map, and position information are jointly encoded. We add transformer encoders right after each FiLM layer in the conditioning blocks. As FiLM layers encode both appearance and position, conventional 2D positional encoding [77] is not needed. It is to be noted that the spatial information encoded by FiLM depends on the network predictions. To the best of our knowledge, such a form of spatial encoding for transformers has not appeared in the computer vision or machine learning literature before.
The vanilla transformer has a computation cost quadratic, O(n 2 ), in the length of the input features, which is computationally unaffordable in our case, as we adopt a semi-dense feature map (F(x) ∈ R (w×h)×d ) as input for the scene coordinate prediction. Inspired by [30,70], we apply the linear transformer [30] to speed up this process, also in the sparse ground-truth label setting. The linear transformer expresses self-attention as a linear dot-product of kernel feature maps and leverages the associativity property of matrix products to reduce the computational complexity to O(n). Consequently, the additional transformer modules have a negligible impact on our running time.
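The associativity trick of the linear transformer [30] can be sketched as follows: with a kernel feature map φ(x) = elu(x) + 1, computing φ(K)ᵀV first makes the cost linear in the sequence length n. This is our own minimal sketch, not the paper's implementation.

```python
import torch

# Linear attention: softmax(QK^T)V is replaced by
# phi(Q) (phi(K)^T V) / normalizer, with phi(x) = elu(x) + 1.
def linear_attention(q, k, v, eps=1e-6):
    # q, k: (B, n, d), v: (B, n, d_v)
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', phi_k, v)            # (B, d, d_v), O(n)
    z = 1.0 / (torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', phi_q, kv, z)   # (B, n, d_v)

out = linear_attention(torch.randn(2, 100, 32),
                       torch.randn(2, 100, 32),
                       torch.randn(2, 100, 64))
```

Because the n × n attention matrix is never materialized, memory and time grow linearly with the number of feature-map positions.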
Training. When training with dense supervision, the following losses are adopted. The classification loss ℓ c is applied to the output of the two classification branches,

ℓ c = ℓ ce (ŷr (x), y r (x)) + ℓ ce (ŷs (x), y s (x)),

where ℓ ce is the cross-entropy loss. Additionally, a regression loss ℓ r , in particular the mean squared error, is applied on ŷ3D (x) and y 3D (x). The total loss L is a weighted summation of the classification and regression losses,

L = λ 1 ℓ c + λ 2 ℓ r ,

where λ 1 and λ 2 are the weights for each term. We observe that the localization performance is more sensitive to the regression prediction; thus, a larger weight is assigned to ℓ r .
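The dense-supervision training loss described above can be sketched as follows, assuming PyTorch; the default weights follow the paper's single-scene setting (λ1 = 1, λ2 = 10), and the function name is ours.

```python
import torch
import torch.nn.functional as F

# Cross-entropy on region and sub-region logits plus MSE on the residuals,
# combined with the paper's weights (lambda_1 for classification,
# lambda_2 for regression).
def total_loss(y_r_hat, y_s_hat, y_3d_hat, y_r, y_s, y_3d, lam1=1.0, lam2=10.0):
    # logits: (B, K, h, w); labels: (B, h, w) class indices; residuals: (B, 3, h, w)
    l_c = F.cross_entropy(y_r_hat, y_r) + F.cross_entropy(y_s_hat, y_s)
    l_r = F.mse_loss(y_3d_hat, y_3d)
    return lam1 * l_c + lam2 * l_r

loss = total_loss(torch.randn(2, 8, 10, 10), torch.randn(2, 8, 10, 10),
                  torch.randn(2, 3, 10, 10),
                  torch.randint(0, 8, (2, 10, 10)),
                  torch.randint(0, 8, (2, 10, 10)),
                  torch.randn(2, 3, 10, 10))
```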
Inference. During inference, the predicted 3D coordinates ŷ(x) and their corresponding 2D pixels are fed into the PnP-RANSAC loop to estimate the 6-DoF camera pose. These predicted 3D coordinates are obtained by simply adding the centers of the predicted sub-regions ŷs (x) and the predicted residuals ŷ3D (x). We differentiate how conditioning is performed during training and inference, as shown in Fig. 2. At training time, conditioning is performed using the ground-truth (sub-)region labels, i.e. y r (x) and y s (x) are the second inputs of the conditioning blocks. At test time, conditioning is performed using the predicted (sub-)region labels. In particular, the one-hot encodings obtained by the argmax operation on ŷr (x) and ŷs (x) are the second inputs of the conditioning blocks.
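The coordinate assembly at inference can be sketched as follows (our own minimal version): take the argmax sub-region label per pixel, look up its cluster center, and add the predicted residual. The resulting coordinate map, paired with the 2D pixel grid, is what enters the PnP-RANSAC loop.

```python
import numpy as np

# Assemble final scene coordinates: sub-region center + predicted residual.
def assemble_coordinates(subregion_probs, centers, residuals):
    # subregion_probs: (h, w, K) predicted distributions
    # centers: (K, 3) sub-region cluster centers; residuals: (h, w, 3)
    labels = subregion_probs.argmax(-1)   # hard assignment per pixel
    return centers[labels] + residuals    # (h, w, 3) scene coordinates

probs = np.zeros((2, 2, 3)); probs[..., 1] = 1   # every pixel picks label 1
centers = np.array([[0., 0, 0], [1, 1, 1], [2, 2, 2]])
coords = assemble_coordinates(probs, centers, np.full((2, 2, 3), 0.5))
```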

HSCNet++ with Sparse Supervision
When only sparse ground truth of the 3D coordinates is available, indicated by the mask m(x) for image x, the straightforward approach is to apply the loss only on pixels where the mask value is 1, which we refer to as valid pixels. Instead, we propose to propagate the available labels to nearby pixels and to use two additional losses that appropriately handle the scarcity of the labels.

Label propagation (LP).
We rely on a smoothness assumption: labels do not change much in a small pixel neighborhood. Consequently, we propagate the labels in a local neighborhood around each pixel. The neighborhood is defined by a square area of size (2z + 1) × (2z + 1). All neighbors of a valid pixel are marked as valid too, and the ground-truth maps, namely y r (x), y s (x), and y 3D (x), are updated by replicating the label of the original pixel to the neighboring pixels. Then, the classification and regression losses are applied on the newly obtained valid pixels after propagation. This can be seen as a form of pseudo-labeling that increases the density of the available labels.
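The propagation step can be sketched as follows (our own minimal version; where neighborhoods of two valid pixels overlap, this sketch simply lets the later pixel overwrite, leaving tie-breaking unspecified).

```python
import numpy as np

# Replicate each valid pixel's label to its (2z+1) x (2z+1) neighborhood
# (the paper uses z = 5) and mark the neighbors as valid.
def propagate_labels(labels, mask, z=5):
    h, w = mask.shape
    new_labels, new_mask = labels.copy(), mask.copy()
    for i, j in zip(*np.nonzero(mask)):
        i0, i1 = max(0, i - z), min(h, i + z + 1)
        j0, j1 = max(0, j - z), min(w, j + z + 1)
        new_labels[i0:i1, j0:j1] = labels[i, j]
        new_mask[i0:i1, j0:j1] = 1
    return new_labels, new_mask

labels = np.zeros((7, 7), dtype=int); mask = np.zeros((7, 7), dtype=int)
labels[3, 3] = 9; mask[3, 3] = 1                  # one valid pixel
new_labels, new_mask = propagate_labels(labels, mask, z=1)
```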

Symmetric cross-entropy loss (SCE).
Pseudo-labels are expected to include noise. This noise will typically be larger if propagation reaches background pixels starting from a foreground-object valid pixel. The conventional cross-entropy loss that we use as classification loss is shown to be not very robust to label noise in the work of Wang et al. [81]. Inspired by their work, for pixel i of the region prediction we complement the conventional cross-entropy loss

ℓ ce = − Σ k y r (x) i,k log ŷr (x) i,k

with the reverse cross-entropy loss

ℓ rce = − Σ k ŷr (x) i,k log y r (x) i,k ,

and similarly for the sub-region prediction. Computational problems are simply solved by setting log 0 equal to a constant value. The total classification loss is the weighted summation ℓ sce = λ ce ℓ ce + λ rce ℓ rce .

Re-projection error loss (Rep).

We additionally use a re-projection error loss that does not require any labels and, therefore, is not influenced by noise in the pseudo-labels. Nevertheless, we do not apply it to all pixels, to avoid background pixels; rather, we apply it only to the valid pixels after label propagation, therefore staying near the original valid pixels. We use the angle-based re-projection error as a loss. Given the ground-truth camera pose F, the loss for pixel i of image x, whose 2D coordinates in the image are denoted by p i , is the angle between the viewing ray of pixel i, given by C −1 applied to the homogeneous coordinates of p i , and the predicted 3D coordinate ŷ(x) i transformed into the camera frame by F, where C is the intrinsic matrix. Note that the re-projection loss is not added to the total loss in the beginning epochs, for fast training convergence. Similar to our dense setting, the total loss for sparse supervision is the weighted summation of the regression loss, the symmetric classification loss, and the re-projection loss, ℓ sparse = ℓ sce + λ 2 ℓ r + λ 3 ℓ rep .
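The symmetric cross-entropy can be sketched as follows, assuming PyTorch; the default weights match the paper's sparse setting (λce = 0.1, λrce = 1), while the clamp constant A is our assumption for the "log 0 set to a constant" trick.

```python
import torch
import torch.nn.functional as F

# Symmetric cross-entropy (after Wang et al.): standard CE plus reverse CE,
# with log 0 on the one-hot target clamped to a constant A.
def symmetric_ce(logits, target, lam_ce=0.1, lam_rce=1.0, A=-4.0):
    # logits: (N, K); target: (N,) class indices
    ce = F.cross_entropy(logits, target)
    pred = logits.softmax(-1)
    one_hot = F.one_hot(target, logits.shape[-1]).float()
    log_t = torch.clamp(torch.log(one_hot), min=A)   # log 0 -> A
    rce = -(pred * log_t).sum(-1).mean()
    return lam_ce * ce + lam_rce * rce

loss = symmetric_ce(torch.randn(10, 5), torch.randint(0, 5, (10,)))
```

The reverse term penalizes probability mass the model puts on classes the (possibly noisy) label rules out, which dampens the influence of wrong pseudo-labels.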

Experiments
In this section, we discuss the experimental setup and employed datasets, present our results, and compare our approach to state-of-the-art localization methods.

Experimental setup
Datasets. We use three standard benchmarks for the evaluation; namely, 7-Scenes [68], 12-Scenes [75], and Cambridge Landmarks [34]. The 7-Scenes dataset covers a volume of ∼ 6m 3 for each individual scene. The 3D models and ground-truth poses are included in the dataset. 12-Scenes is another indoor RGB-D dataset that contains 4 large scenes with a total of 12 rooms, with volumes ranging from 14 to 79m 3 per room. The union of these two datasets forms the 19-Scenes dataset. The Cambridge Landmarks dataset is a standard benchmark for evaluating scene coordinate methods in outdoor scenes.
It is a small-scale outdoor dataset consisting of 6 individual scenes, and the ground-truth poses are provided by structure-from-motion. Following prior work [9], we conduct experiments per scene, i.e. the individual scenes setting, but also by training a single model on all scenes of a corresponding dataset, i.e. the combined scenes setting. The combined settings of the given indoor localization benchmarks are denoted by i7-Scenes, i12-Scenes, and i19-Scenes, respectively.

Competing methods. In this work, we compare the proposed approach with the following methods: (1) pose regression methods that directly regress absolute or relative camera pose parameters: MapNet [12], Geometric PoseNet [32], AttTxf [66], LSTM-Pose [78], Anchor-Net [57], and LENS [49]; (2) local feature based pipelines, such as Active Search (AS) [63] based on SIFT and Hloc [58] based on CNN descriptors; (3) DSAC* (3D) [11]: the latest scene coordinate regression approach with a 3D model; (4) VS-Net [28]: scene-specific segmentation and voting; (5) PixLoc [60]: a scene-agnostic network; (6) SFT-CR [25]: scene coordinate regression with global context guidance. In addition, we also compare with (7) ESAC [9] on the combined scenes. We also consider a baseline, called Reg-only, without the hierarchical classification layers.

Evaluation metrics. We report the median translation and orientation error (cm, • ) as well as the accuracy of test images under the threshold of (5cm, 5 • ) on indoor scenes. On the outdoor Cambridge Landmarks [34], we report only the median pose error, as in previous methods [6,11,36].

Training details. We generate the region labels by hierarchical k-means. For 7-Scenes, 12-Scenes, and Cambridge Landmarks, we adopt 2-level ground-truth labels with a branching factor of 25 for all levels. Furthermore, for the combined scenes, i7-Scenes, i12-Scenes, and i19-Scenes, the first-level branching factor is set to 7×25, 12×25, and 19×25, respectively. For the individual scenes setting, training is performed for 300K iterations with the Adam optimizer. For the combined scenes, the number of iterations is set to 900K. Throughout all experiments, we use a batch size of 1 with an initial learning rate of 10 −4 .
The classification loss weight λ 1 is set to 1 for all datasets, while the regression loss weight λ 2 is 10 for single scenes and 100000 for combined scenes. In the sparse supervision setting, λ ce and λ rce are set to 0.1 and 1, respectively, while λ 2 follows the dense setting, and λ 3 is increased from 0 to 0.1 after the first 10 epochs. We initialize the network by training with ℓ r using pseudo-label coordinates and later also add ℓ rep after 10 epochs. When training with sparse supervision, we select the neighborhood size z = 5 to propagate labels, and use the cluster centers obtained from dense scene coordinates for a direct comparison.

Table 1: Indoor localization: individual scene setting (7-Scenes). For each scene of the 7-Scenes dataset we report the median translation (t, cm) and orientation (r, • ) error. The best and second best results are in bold and underlined. Note that, except for VS-Net [28] and SFT-CR [25], the results are reported in centimeter precision for the translation error.
Data augmentation is also effective in increasing the prediction accuracy. Thus, similar to HSCNet [36], we randomly augment training images using translation, rotation, scaling, and shearing, by uniform sampling from [-20%, 20%] and [-30°, 30°], respectively. In addition, images are augmented with additive brightness uniformly sampled from [-20, 20].

Pose estimation. We follow the same PnP-RANSAC pipeline and parameter settings as in [8]. The inlier threshold and the softness factor are set to τ = 10 and β = 0.5, respectively. We randomly select 4 correspondences to form a minimal set for a PnP algorithm to generate a camera pose hypothesis, and a set of 256 initial hypotheses is sampled. Similar to [8,11], a pose refinement process is performed until convergence, for a maximum of 100 iterations.

Architecture details. The detailed architecture of HSCNet++ is shown in Fig. 3; we also visualize the block details of the FiLM conditioning network and the transformer modules. By removing the transformer layers, we derive the architecture of HSCNet. Additionally, the number of channels in the last branch g 3D of HSCNet is 4096, while it is 2048 for HSCNet++, which reduces the memory cost (c.f. Sec. 6.6). For experiments on the combined scenes, we added two more layers in the first conditioning generator g s , marked in (dotted) red. We also roughly doubled the channel counts, highlighted in red, cyan, and violet for i7-Scenes, i12-Scenes, and i19-Scenes, respectively. For individual scenes, we add 2 multi-head attention (MHA) layers to both the classification and regression conditioning blocks, while in the combined setting, the number of MHA layers is set to 5.

Results for HSCNet and HSCNet++
Individual scenes setting. We present the results on 7-Scenes and 12-Scenes in Table 1.

Combined scenes setting. To test the scalability of scene-coordinate regression methods, we go beyond small-scale environments, such as the individual scenes of 7-Scenes and 12-Scenes, and use the combined scenes, i.e. i7-Scenes, i12-Scenes, and i19-Scenes, obtained by combining the former datasets.
Results on the combined scenes setting are presented in Table 3, including a comparison with the regression-only baseline and ESAC. The results show that our method scales well with the number of scenes compared to the Reg-only baseline. Note that ESAC requires training and storing multiple networks specializing in local parts of the environment, whereas our approach requires only a single model. Our approach outperforms ESAC on i7-Scenes and i12-Scenes, while performing comparably on i19-Scenes (87.9% vs. 88.1%). ESAC and our approach could be combined for very large-scale scenes, but we do not explore this option in this work. HSCNet++ advances the state-of-the-art on all datasets, demonstrating the utility of transformers for this task.
Cambridge Landmarks. Table 4 reports the results of three types of visual localization methods on Cambridge Landmarks. AS [63] and Hloc [58] estimate camera poses with sparse SfM ground truth; DSAC++, DSAC*, and our approach train a scene-coordinate regression model with MVS-densified depth maps; VS-Net leverages a hybrid of the two. Both HSCNet and HSCNet++ perform better than the other scene coordinate methods DSAC++ and DSAC*, and their performance is comparable to more recent approaches. However, we observe that models trained with MVS-densified pseudo ground truth perform slightly worse than approaches that use the sparse SfM 3D map, and adding the transformer modules degrades HSCNet++ even further. These results motivated us to extend HSCNet++ to train with sparse supervision, under the hypothesis that MVS densification introduces noise into the dense supervision. The HSCNet++(S) performance on Cambridge Landmarks in Sec. 6.5 verifies this hypothesis.

Ablations: HSCNet
Data augmentation. Using geometric and color data augmentation provides robustness to lighting and viewpoint changes [21,44]. We investigate the impact of data augmentation and summarize the obtained results in Table 5a. Applying data augmentation leads to better localization accuracy. Note that even without data augmentation, the proposed approach still provides results comparable to state-of-the-art methods (cf. ESAC [9] in Table 3 vs. row 3 of Table 5a). Conditioning mechanism. The two key components of HSCNet are the coarse-to-fine joint classification-regression module and its combination with the conditioning mechanism. Their impact is evaluated in Table 5a. We train a variant of our network without the conditioning mechanism, i.e., we remove all conditioning generators and layers. The network still estimates scene coordinates in a coarse-to-fine manner using the predicted location labels, but no coarse location information is fed in to influence the network activations at the finer levels. The results indicate the importance of the conditioning mechanism for accurate scene coordinate prediction. The regression-only baseline fails to achieve good performance, which underlines the benefit of the proposed hierarchical scheme. Hierarchy and partition granularity. The robustness of HSCNet to the label hierarchy hyperparameters, varying depth and width, is reported in Table 5. The results show that the performance of our approach is robust w.r.t. the choice of these hyperparameters, with a significant drop observed only for the smallest 2-level label hierarchy. Increasing the number of classification layers beyond 2 is not always beneficial and brings only marginal improvement on 7-Scenes, while increasing the computational cost. We observe the best trade-off with a 25 × 25 partition for 7-Scenes and 175 × 25 for i7-Scenes (175 = 7 × 25 due to the 7 combined scenes).
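The FiLM-style conditioning mechanism evaluated above can be summarized in a few lines: a conditioning generator maps the predicted coarse label to per-channel scale and shift parameters that modulate the finer-level feature maps. The following numpy sketch is illustrative only; the names `film_condition`, `W_gamma`, and `W_beta` are hypothetical, and the real generator is a small convolutional network rather than a single linear map.

```python
import numpy as np

def film_condition(features, gamma, beta):
    """FiLM modulation: scale and shift each feature channel by parameters
    generated from the predicted coarse region label."""
    # features: (C, H, W); gamma, beta: (C,) from a conditioning generator
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy example: a one-hot coarse label drives a linear conditioning generator.
rng = np.random.default_rng(0)
C, H, W, num_labels = 4, 2, 2, 3
features = rng.standard_normal((C, H, W))
label = np.eye(num_labels)[1]                # predicted coarse label, one-hot
W_gamma = rng.standard_normal((C, num_labels))
W_beta = rng.standard_normal((C, num_labels))
gamma, beta = W_gamma @ label, W_beta @ label
out = film_condition(features, gamma, beta)
```

Removing the conditioning, as in the ablation, corresponds to fixing gamma = 1 and beta = 0, so the finer levels see no coarse location information.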

Ablations: HSCNet++
Impact of internal transformer encoder layers. In this ablation, we remove the transformer encoders t r and t s, keeping only t 3D. This variant is denoted HSCNet++ †, and Table 6a shows a small to noticeable drop in all cases.
To factor out the impact of multi-head attention (MHA) layers, we report results in Table 6a, which shows that increasing the number of MHA layers in HSCNet++ † does not lead to performance improvement. It is worth mentioning that HSCNet++ † with 8 MHA layers has 2 million more parameters than HSCNet++. Our intuition is that the gain of HSCNet++ stems from improved predictions at the coarse levels of the network. To test this hypothesis, we compute the accuracy of the sub-region predictions: for each valid pixel in a query image, this metric evaluates whether the pixel is correctly classified. Results in Table 6c show that adding transformers at the classification branches improves the label classification accuracy. However, the sub-region prediction accuracy does not always correlate with the localization performance. This can be attributed to the RANSAC-based filtering of the final 3D scene coordinates during camera pose estimation; incorrect 3D scene predictions caused by erroneous sub-region predictions can be rejected as outliers by RANSAC. Impact of positional encoding. We compare the proposed way of providing region (position) information to the transformer blocks with the classical positional encoding used in transformers. As label encoding is an inherent part of HSCNet, for a direct comparison we additionally add the positional encoding right before the transformer block and perform experiments on i7-Scenes. Results in Table 6b show that with the additional positional encoding the results drop noticeably.
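For reference, the "classical positional encoding" used as the baseline in this ablation is the standard sinusoidal encoding of the original transformer. A minimal numpy sketch (the function name `sinusoidal_pe` is ours):

```python
import numpy as np

def sinusoidal_pe(num_positions, dim):
    """Classical transformer positional encoding: interleaved sine/cosine
    waves of geometrically increasing wavelength per feature dimension."""
    pos = np.arange(num_positions)[:, None]            # (P, 1)
    i = np.arange(dim // 2)[None, :]                   # (1, dim/2)
    angles = pos / (10000.0 ** (2 * i / dim))          # (P, dim/2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

pe = sinusoidal_pe(16, 8)
```

Unlike this fixed per-pixel encoding, the hierarchical label encoding already carries scene-specific region identity, which may explain why stacking the two encodings degrades the results.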

Results for HSCNet++(S)
We now present results for HSCNet++ trained with sparse supervision, denoted HSCNet++(S), and study the pseudo-labeling and loss functions in detail. For indoor scenes, we synthetically sparsify the dense coordinates using a sparse SIFT-based SfM reconstruction; that is, we select the subset of dense 3D coordinates whose 2D reprojections (pixel locations) are also registered in the SfM reconstruction. For the outdoor Cambridge dataset, we directly obtain the keypoints of the training images from the provided SfM models.
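The synthetic sparsification for indoor scenes can be sketched as a simple masking step: keep the dense coordinate only at pixels registered in the SfM reconstruction, and mark all others as invalid. This is a minimal illustration; `sparsify_coordinates` is a hypothetical helper, and in practice the invalid pixels are simply excluded from the supervision.

```python
import numpy as np

def sparsify_coordinates(dense_coords, sfm_pixels):
    """Keep only the dense 3D coordinates whose 2D pixel locations are
    registered in the (SIFT-based) SfM reconstruction; all other pixels
    are marked invalid (NaN)."""
    # dense_coords: (H, W, 3) scene coordinates; sfm_pixels: (N, 2) as (row, col)
    sparse = np.full_like(dense_coords, np.nan)
    rows, cols = sfm_pixels[:, 0], sfm_pixels[:, 1]
    sparse[rows, cols] = dense_coords[rows, cols]
    return sparse

dense = np.random.default_rng(0).standard_normal((4, 4, 3))
kpts = np.array([[0, 1], [2, 3]])       # pixels registered in the SfM model
sparse = sparsify_coordinates(dense, kpts)
```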
The localization performance on 7-Scenes, i7-Scenes, and the Cambridge dataset is provided in Fig. 5 and Table 7. The results show that even with sparse coordinate supervision, HSCNet++(S) achieves results on 7-Scenes that are competitive with its dense counterpart, even outperforming HSCNet. On the more challenging combined setup of i7-Scenes, HSCNet++(S) trails by 10%, indicating the need for further research in this direction. However, on the outdoor Cambridge Landmarks dataset, where in most cases only sparse coordinate data is available, HSCNet++(S) outperforms HSCNet and HSCNet++, which are trained on MVS-densified [8,36,65] data, by a large margin. This demonstrates the effectiveness of our label propagation and supports our hypothesis that noisy dense ground truth from MVS harms the training process. The largest improvements are observed on Kings College, Great Court, and Old Hospital, with median pose errors (cm/°) of 15/0.24, 18/0.11, and 15/0.30, respectively (cf. Table 4). In average median pose error, HSCNet++(S) outperforms PixLoc (15/0.25), VS-Net (13.6/0.24), and DSAC* (20.6/0.34). Component ablations. We perform ablations on 7-Scenes to examine the components of the proposed HSCNet++(S). We first train the model without the … The effect of the neighborhood size is more pronounced on the outdoor scene Great Court from the Cambridge dataset, where increasing z from 0 → 5 reduces the median pose error (t/r) from 32/0.28 → 18/0.11, but increasing z further from 10 → 18 raises it from 18/0.11 → 35/0.2. Limiting the spatial proximity of pseudo-labels to the initial sparse labels thus seems a suitable choice.
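The role of the neighborhood size z in label propagation can be illustrated with a toy sketch: each known sparse label is propagated to unlabeled pixels within z pixels of it. This is a simplified stand-in for the paper's propagation scheme (`propagate_labels` is our hypothetical helper, and we use a square Chebyshev neighborhood for brevity).

```python
import numpy as np

def propagate_labels(sparse_labels, z):
    """Propagate each known label to unlabeled pixels within a z-pixel
    square neighborhood; -1 marks pixels that stay unlabeled."""
    H, W = sparse_labels.shape
    out = sparse_labels.copy()
    for r, c in zip(*np.nonzero(sparse_labels >= 0)):
        r0, r1 = max(0, r - z), min(H, r + z + 1)
        c0, c1 = max(0, c - z), min(W, c + z + 1)
        patch = out[r0:r1, c0:c1]
        patch[patch < 0] = sparse_labels[r, c]   # fill only unlabeled pixels
    return out

labels = -np.ones((5, 5), dtype=int)
labels[2, 2] = 7                                 # a single sparse seed label
dense = propagate_labels(labels, 1)              # 3x3 block around the seed
```

Larger z yields more supervised pixels but also more mislabeled ones far from the seeds, which matches the trade-off observed in the ablation.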

Model Capacity and Efficiency
Model Capacity. As mentioned in Sec. 4.2, we prune some heavy convolution layers compared to HSCNet. We have not observed a clear difference between the two methods in inference time: localizing one image takes around 85 ms to 130 ms, depending mainly on the accuracy of the predicted 2D-3D correspondences fed into the RANSAC-PnP loop.

Conclusion
We have proposed a novel hierarchical coarse-to-fine approach for scene coordinate prediction. The network benefits from FiLM-like conditioning on coarse region predictions for better scene coordinate prediction. Experimentally, we demonstrate that both the hierarchical scheme and the prediction conditioning are required for the improvement. The method is extended to handle sparse labels using the proposed pseudo-labeling approach, and the adaptation of symmetric cross-entropy and reprojection losses provides robustness to pseudo-label noise. We also show that the synergy of all components proposed in this work is needed for the best performance.
The results show that the proposed hierarchical scene coordinate network is more accurate than previous regression-only approaches for single-image RGB localization. The proposed method is also more scalable, as shown by the results on the three combined indoor datasets. In addition, the method is extended to handle sparse labels at a lower cost than existing methods, while obtaining better results on outdoor scenes.

Fig. 2 :
Fig. 2: An overview of the proposed HSCNet++. The figure shows the network architecture of the proposed HSCNet++. The depicted losses correspond to the case of learning with dense ground truth. Feature maps from different parts of the dense feature encoder are fed to g r, g s, and g 3D; this is represented by the red and magenta arrows in an abstract way, while the detailed architecture is presented in Figure 3.

Fig. 3 :
Fig. 3: HSCNet++ detailed architecture. The figure shows the detailed network architecture of the main pipeline and the FiLM conditioning network. For experiments on the combined scenes, we add two more layers in the first conditioning generator g s, marked in (dotted) red. We also roughly double the channel counts highlighted in red, cyan, and violet for i7-Scenes, i12-Scenes, and i19-Scenes, respectively.

Fig. 4 :
Fig. 4: Scene coordinates visualization on i7-Scenes. We visualize the scene coordinate predictions of HSCNet, HSCNet++, and HSCNet++(S) for three test images on i7-Scenes. The XYZ coordinates are mapped to a heatmap, and the ground-truth scene coordinates are computed from the depth maps. For each image, the left column shows the correctly predicted labels and the right column the predicted scene coordinates.

Fig. 6 :
Fig. 6: Impact of neighborhood size z. The percentage of accurate labels and of valid pixels as the neighborhood size z increases.

Table 2 :
Indoor localization: individual scene setting (12-Scenes). As in the 7-Scenes localization benchmark, we report the median translation (t, cm) and orientation (r, °) errors, and the accuracy under the error threshold of 5 cm and 5°. The best accuracy results are in bold.

Table 4 :
Outdoor localization: individual scene setting (Cambridge). For each scene of the dataset, we report the median translation (t, cm) and orientation (r, °) errors. The best results are in bold. All models are trained and evaluated individually on each scene of the corresponding dataset. The results show that HSCNet is still competitive with methods published later. With the addition of transformers, HSCNet++ further boosts the average performance on 7-Scenes by 4% and obtains the best accuracy among the competitors.

Table 5 :
Ablation for HSCNet. Average pose accuracy obtained with different hierarchy settings. The models with the 4-level label hierarchy are classification-only, i.e., the final regression layer is omitted.
The impact of increasing the number of MHA layers (#MHA). Without intermediate transformers at the classification branches (only t 3D is used), adding additional MHA layers to HSCNet++ † does not improve performance. Positional encoding. PE denotes the conventional positional encoding with sine and cosine functions.
(c) Sub-region prediction accuracy (%).Results show that HSCNet++ improves classification accuracy at the sub-region level.

Table 6 :
Ablations for HSCNet++.We analyze the influence of different design choices of the proposed approach on i7-Scenes.

Table 7 :
HSCNet++. In this section, we analyze the impact of the label propagation (LP) neighborhood size z. As an ablation, we vary z in the range 0 → 9 on RedKitchen and report the results in Fig. 6 and Table 9. Fig. 6 shows that increasing z also increases the pseudo-label noise, reflected in a decreasing percentage of accurate labels; for example, when z = 5 the fraction of noisy labels is 15%. Table 9 shows that there is a trade-off between increasing z and camera localization accuracy. This effect is more pronounced on the outdoor scene Great Court from the Cambridge dataset.

Table 8 :
Ablations for HSCNet++(S). The results of HSCNet++(S) and its variants are presented; the table shows the median translation and rotation errors (Error) and the localization accuracy (Accuracy) under 5 cm/5°.

Table 9 :
Impact of z on pose estimation. We report the pose estimation results (median errors and accuracy) on RedKitchen and Great Court with different neighborhood sizes.

Table 10 : Comparison of the model capacity and runtime.
We compare the model statistics of HSCNet and HSCNet++ under the same software and hardware setting. Table 10 reports the model size of HSCNet and HSCNet++ on 7-Scenes and i7-Scenes. Our method reduces the memory footprint by 43% compared to HSCNet for individual-scene training and by 30% for the combined scenes. Runtime. For a fair comparison of the running time, we run all experiments on an NVIDIA GeForce RTX 2080 Ti GPU and an AMD Ryzen Threadripper 2950x CPU. Training for 300k iterations on an individual scene takes ∼7.4 h for HSCNet++ and ∼10.4 h for HSCNet with the same setting. We show the approximate training time for one iteration in Table 10. HSCNet++ thus has a smaller memory footprint and faster training while offering higher accuracy. We also note that the training time grows as the number of multi-head attention layers increases.