Relative Camera Pose Estimation using Synthetic Data with Domain Adaptation via Cycle-Consistent Adversarial Networks

Learning-based visual localization has become a promising direction in recent years. Since ground-truth pose labels are difficult to obtain, recent methods try to learn pose estimation networks from pixel-perfect synthetic data. However, this also introduces the problem of domain bias. In this paper, we first build the Tuebingen Buildings dataset of RGB images collected by a drone in urban scenes and create a 3D model for each scene. A large number of synthetic images are generated from these 3D models. We take advantage of image style transfer and cycle-consistent adversarial training to predict the relative camera poses of image pairs based on training over synthetic environment data. We propose a relative camera pose estimation approach to solve the continuous localization problem for autonomous navigation of unmanned systems. Unlike existing learning-based camera pose estimation methods that train and test in a single scene, our approach successfully estimates the relative camera poses of multiple city locations with a single trained model. We use the Tuebingen Buildings and the Cambridge Landmarks datasets to evaluate the performance of our approach in single-scene and across-scenes settings. For each dataset, we compare the performance of models trained on real images and on synthetic images. We also test our model on the indoor 7Scenes dataset to demonstrate its generalization ability.


Introduction
Simultaneous localization and mapping (SLAM) has evolved rapidly in recent years and is widely used in augmented reality and robot navigation. Visual SLAM infers camera movement from pixels, either densely as in dense SLAM systems [1,2] or by extracting sparse keypoints (e.g., SIFT [3], ORB [4]). In many cases, 3D geometry has been used to solve the localization problem.
Visual SLAM has been maturely applied to ground mobile robots and self-driving vehicles. However, for the localization of unmanned aerial or ground systems, traditional visual SLAM meets some challenges. First, the high-speed movement of Unmanned Aerial Vehicles (UAVs) causes massive changes in viewpoint between frames. In addition, absolute camera pose regression methods encode the spatial information of a single scene in the network weights, so the system is not very versatile [11][12][13]. In contrast, estimating an image pair's relative pose is a more general problem. An ideal relative camera pose estimation model can not only be trained and tested in a fixed scene, but also be tested in multiple seen locations or even new places.
Generally, a CNN-based camera relocalization system can be employed in three ways of increasing complexity and usefulness: the first is to train and test the model on a fixed scene. The second is to train a model on multiple scenes and test it on each scene, which corresponds to the across-scenes training in this work. The third is to train a model on multiple locations and test it in new environments. The last two ways are easier to realize with relative camera pose estimation than with absolute camera pose regression. Across-scenes absolute camera pose regression is more difficult since different locations have skewed camera pose distributions due to scale inconsistency. Meanwhile, relative camera pose estimation is versatile in many robot applications. It can be used as a neighbor-frame pose estimator in visual odometry systems. It can also predict the relative pose between robots in a multi-robot cooperation system. Relative camera pose estimation can even be combined with a global place recognition method such as NetVLAD [9] in a two-step localization pipeline to obtain a more precise absolute pose prediction in large environments.
Although SfM methods can generate pose labels for co-visible images, the collection of images and the SfM process are extremely time-consuming. Compared to real-world data, synthetic images and the related poses are much easier to obtain. As a result, synthetic data is used for many visual tasks [14][15][16]. However, the domain shift from synthetic to real has always been the most significant challenge in this area: a system trained on synthetic data is often not directly applicable to real data. In order to reduce the discrepancy between domains, style transfer [17,18] and domain adaptation methods [19,20] are utilized.
Motivated by the above analysis, we present an across-scenes relative camera pose estimation network (RCPNet) for urban outdoor camera relocalization. We collect over 10,000 images with drones at eight city locations to construct a dataset (hereafter called the Tuebingen Buildings dataset) and obtain each image's absolute pose by the SfM method. We generate over 300,000 image pairs for relative camera pose estimation. The previous work on RCPNet and the original Tuebingen Buildings dataset has been published in [65]. To further expand the training dataset, twice the amount of synthetic images are rendered from the 3D models of the Tuebingen Buildings and Cambridge Landmarks [11] datasets. Inspired by CycleGAN [21], we further employ real-to-synthetic and synthetic-to-real translations to realize cycle consistency. We train RCPNet under three schemes to demonstrate the effect of the cycle-consistent adversarial network [21]: train on mixed images and test on real images; train on synthetic-to-real images and test on real images; train on synthetic images and test on real-to-synthetic images.
RCPNet is first compared with other learning-based pose estimation approaches, PoseNet [11,55] and RPNet [7], using the real images from two datasets, namely Tuebingen Buildings and Cambridge Landmarks. We then compare the accuracy of RCPNet models trained on real images and on synthetic images on the two datasets. We also test our model on the indoor 7Scenes dataset [22] to demonstrate its generalization ability.
Our main contributions can be summarized as follows:
- First, we develop RCPNet to estimate relative camera poses in multiple urban outdoor environments;
- Second, we build the Tuebingen Buildings dataset with drone-collected images. We further expand it by rendering synthetic images from 3D models produced by the SfM method;
- Third, we take advantage of synthetic images to reduce the human labor in dataset generation and improve the performance of RCPNet further with domain adaptation via image-to-image translation.
The rest of the paper is organized as follows: Section 2 reviews the related work. Section 3 presents the method for relative camera pose estimation and domain adaptation via cycle-consistent adversarial networks. Section 4 introduces the data collection and preparation. Section 5 presents the experimental results of RCPNet in a two-stage analysis. Finally, Section 6 concludes the paper.

Visual Localization
Visual localization usually includes three tasks [12,23]: i) relative camera pose estimation between consecutive keyframes, as in visual odometry; ii) relative camera pose estimation between the query and reference images to eliminate localization drift in back-end optimization; iii) image matching to recognize previously viewed places in loop closure. We classify the first two as metric localization and the last one as topological localization.
Topological Localization

Given a set of images with known locations and a query image, different feature matching methods are used to retrieve the images with the closest distance or the most similar appearance. These methods have successfully relocalized the camera roughly to a known location from Google Street View [24], aerial views [25], or satellite imagery [26].
Suenderhauf et al. [27] use CNN features as robust landmark descriptors, which can recognize camera locations under severe changes in viewpoint and other conditions. To achieve cross-view geo-localization, Workman et al. [25] and Vo et al. [28] collected two aerial-ground image datasets (CVUSA and Vo and Hays). CVM-Net [29] uses a Siamese network and NetVLAD [9] to achieve robust cross-view image matching on these two datasets.
Majdik et al. [23] collected a dataset in the center of Zurich, Switzerland, with a drone that flew a trajectory of two kilometers, to achieve UAV localization from street view images in GPS-denied urban environments. The dataset includes 113 discrete street view locations and 405 matching aerial images. The above topological localization methods cannot provide continuous and accurate pose estimation, but only limited and discretized position estimates for the query image.

Metric Localization
The ultimate goal of automated robot applications is to continuously and accurately locate new images in a known environment or map. Using point-based features to create sparse or dense environment maps is the approach of classic visual SLAM methods [30,31]. By combining data from cameras and LIDAR to construct a map, Pascoe et al. [32] realized real-time localization of cameras.
Kendall et al. [11] built the Cambridge Landmarks dataset and replaced the three softmax classifiers of GoogLeNet [33] with affine regressors to output poses. They used the SfM method to obtain the images' poses and trained an end-to-end CNN to regress the absolute camera poses, setting a precedent in 6DoF camera relocalization. Based on nearest-neighbor matching and continuously learned feature descriptors, RelocNet [34] introduced a CNN representation for camera pose retrieval. To learn a more discriminative regression function, Naseer and Burgard [12] generated synthetic viewpoints and corresponding depth maps to augment the dataset. Walch et al. [13] collected the TMU-LSI dataset and demonstrated that classic methods fail in textureless environments. They also presented a CNN + long short-term memory (LSTM) architecture, which performs camera pose regression by modeling the context of the image.
Melekhov et al. [35] proposed a system to solve the relative camera pose estimation problem. The system used a hybrid CNN with fully-connected layers (FCs) as the pose estimator. However, comparisons are difficult since their predicted translations are not full but scaled vectors. As far as we know, all existing camera pose regression methods (absolute and relative) that are trained and tested in a certain location need to improve their generalization ability. Sattler et al. [36] regard absolute camera pose regression as topological localization rather than metric localization. Their results show that absolute pose regression within a scene differs from accurate pose estimation based on 3D structure and is closer to pose approximation via image retrieval. Combining local SfM reconstruction with 2D model-based methods, [37] addressed metric localization, arguing that building broad-scale 3D models for real-time localization is still a major challenge. On the contrary, our opinion is that one trained model can estimate the relative camera poses in multiple locations, and it is sufficient to perform 3D reconstruction before the training phase.
In addition to visual SLAM, RCPNet can also estimate pose changes between consecutive keyframes, which can be used in visual odometry systems such as DeepVO [38], VINet [39] and VINS [40]. RCPNet can also be used by a group of aerial and/or ground robots to directly obtain the relative pose between robots in a centralized or decentralized manner.

Domain Adaptation
In this work, the relative camera pose estimation model is first trained on real images collected by a hand-held smartphone or a drone. Although the SfM method reduces the labor of labeling images and offers an accurate 6DoF pose for each image, the data collection and SfM procedure are very time-consuming and expensive. Since we can easily download 3D models of some famous landmarks, or build one from several images captured of the target location, it is natural to consider rendering synthetic images with related poses from 3D models for camera pose estimation network training. However, due to domain bias [41,42], a system adapted to one dataset usually cannot be generalized to another. As a result, even if the camera pose estimation network can predict the relative pose of synthetic image pairs, it cannot directly predict the relative pose of real image pairs, which makes the model impractical.
Domain adaptation aims to solve the dataset/domain bias problem. One strategy is to learn domain-invariant features [43,44], while others learn a feature- or pixel-level mapping from the source to the target domain [45,46]. Minimizing the Maximum Mean Discrepancy (MMD) is a popular way to align feature distributions across the target and source domains [18,47]. Adversarial losses force the generated (or translated) images to be indistinguishable from photos in the target domain [43,48].
Image-to-image translation is a promising way to realize domain adaptation. Isola et al. [49] presented pix2pix, which uses a generative adversarial network [50] to translate source photos into target photos. Similarly, Sangkloy et al. [51] generate photos from sketches, while Karacan et al. [52] generate them from attributes and semantic layouts.
Paired training examples are necessary for the above prior works, while CycleGAN [21] realizes unpaired image-to-image mapping via adversarial networks with a cycle-consistency loss [53]. Our approach builds on CycleGAN while keeping the training images paired to retain the geometric alignment between synthetic and real images taken from the same viewpoint.

Relative Camera Pose Estimation Model
In this part, we discuss the relative camera pose model and the architecture of RCPNet. RCPNet outputs the relative pose vector p of the two input images, which consists of a 3D relative camera translation t and a quaternion rotation q:

p = [t, q].

We choose quaternions to represent the rotation because, by normalizing them to unit length, it is easy to map a 4D value to a valid rotation.

Learning Relative Translation and Rotation Simultaneously
We generate the two cameras' relative pose for training and testing. (R_1, t_1) and (R_2, t_2) are the rotation matrices and translation vectors that project a point from the world coordinate system into camera 1's and camera 2's coordinate systems, respectively. From camera coordinate system 1 to 2, we set P_12 as the transformation matrix, R_12 as the rotation matrix, and t_12 as the translation vector:

P_12 = P_2 P_1^{-1},   R_12 = R_2 R_1^T,   t_12 = t_2 - R_12 t_1.

Take (q_1, q_2, q_12) as the quaternion representations of (R_1, R_2, R_12). A quaternion represents a 3D rotation and is defined by four real numbers: x, y, and z represent a vector, and w is a scalar that stores the rotation around that vector. Because the unit quaternions q and -q represent the identical rotation, we numerically invert all q_12 with negative w to make the prediction targets of the network more consistent.
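For concreteness, the following is a minimal sketch of how these relative-pose labels can be computed, assuming the world-to-camera convention x_i = R_i X + t_i used above and SciPy's [x, y, z, w] quaternion ordering; the function name is illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_pose_label(R1, t1, R2, t2):
    """Relative pose from camera 1 to camera 2 under x_i = R_i X + t_i."""
    R12 = R2 @ R1.T                              # relative rotation matrix
    t12 = t2 - R12 @ t1                          # relative translation vector
    q12 = Rotation.from_matrix(R12).as_quat()    # quaternion as [x, y, z, w]
    # q and -q encode the same rotation: flip the sign so the scalar part w
    # is non-negative, giving the network a consistent regression target
    if q12[3] < 0:
        q12 = -q12
    return t12, q12
```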
Training the translation and rotation regressors separately harms each other's performance [11]. Therefore, the original framework uses stochastic gradient descent to optimize the following loss function, which minimizes the Euclidean distance between the pose predictions (t̂ and q̂) and the ground truth (t and q):

L(I) = ||t̂ - t||_2 + β ||q̂ - q||_2.

To learn rotation and translation at the same time, they used grid search to fine-tune the weighting factor β, which balances the translation and rotation errors. Their results show that β varies between 250 and 2000 for outdoor scenes. Using cross-validation, RPNet [7] found the most suitable β value for each location, which requires a lot of time for clustering the original dataset and testing the trained model during evaluation. For RCPNet, we instead weight the loss terms automatically based on homoscedastic uncertainty (as in [55]) across all locations, which is numerically more stable than β. In this loss function, the weighting between translation and rotation errors is not static but adapts during training: if the translation estimate is more accurate than the rotation estimate, rotation errors receive a larger penalty in the next epoch, and vice versa:

L = L_t exp(-ŝ_t) + ŝ_t + L_q exp(-ŝ_q) + ŝ_q,   (4)

where L_t is the translation loss and L_q is the rotation loss. The learned factors ŝ_t and ŝ_q keep the penalties for translation and rotation balanced so that neither error dominates. Since exp(-ŝ_i) is always positive, the regression can use unconstrained scalar values for ŝ_i. We initialize ŝ_t = 0.0 and ŝ_q = -3.5 for all datasets (corresponding approximately to β starting around 30, then tuned automatically during training). The adaptive loss makes the model more general and able to adapt to different locations.
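The adaptive weighting of (4) can be implemented as a small PyTorch module. The sketch below uses the stated initialization; the exact per-pair loss norms are our assumption.

```python
import torch
import torch.nn as nn

class AdaptivePoseLoss(nn.Module):
    """Homoscedastic-uncertainty weighting between translation and rotation
    losses, following (4); s_t and s_q are learned jointly with the network."""
    def __init__(self, s_t_init=0.0, s_q_init=-3.5):
        super().__init__()
        self.s_t = nn.Parameter(torch.tensor(s_t_init))
        self.s_q = nn.Parameter(torch.tensor(s_q_init))

    def forward(self, t_pred, t_gt, q_pred, q_gt):
        L_t = (t_pred - t_gt).norm(dim=1).mean()   # translation loss
        L_q = (q_pred - q_gt).norm(dim=1).mean()   # rotation loss
        # exp(-s) keeps both effective weights positive for unconstrained s
        return (L_t * torch.exp(-self.s_t) + self.s_t
                + L_q * torch.exp(-self.s_q) + self.s_q)
```

With ŝ_q = -3.5, exp(-ŝ_q) ≈ 33, consistent with the stated equivalent β of roughly 30 at the start of training.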

Architecture of RCPNet
Different from RPNet [7] and PoseNet [11], which are based on GoogLeNet, we use two branches of pre-trained ResNet34 networks [57] to construct a weight-sharing Siamese network [56]. The 6DoF relative camera pose is estimated end-to-end.
The relative camera pose estimator of RCPNet is based on FCs and ReLU [58] activations, as shown in Fig. 1. For every ResNet34 branch, we empirically use the output of the second-to-last layer as a 512-dimensional (512D) global feature of the input image; the last pooling layer and the 1000-unit FC layer of the original ResNet34 are removed. A 512D vector represents an image's geometric features, and the distance between two 512D vectors denotes the spatial relationship between the two corresponding images. Following [56], the compatibility between images I_1 and I_2 is measured as

E_w(I_1, I_2) = ||G_w(I_1) - G_w(I_2)||,

where G_w(I_i) denotes each image's global feature. In [56], the compatibility between images I_1 and I_2 represents a 'semantic' distance for image-similarity metric learning in face verification. In this paper, however, compatibility denotes the spatial distance between two overlapping outdoor images. More precisely, E_w(I_1, I_2) is small if I_1 and I_2 are close in both position and rotation, and large if they are far apart and differ in appearance. Afterward, we insert two 2048D FC layers as regressors to output the relative translation (3D) and rotation (4D), respectively. It is worth noting that they are learned together with the objective function of (4). We normalize the quaternion rotation vector to unit length.
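A rough PyTorch sketch of this Siamese design is given below. How the two 512D features are combined before the regressors (here, simple concatenation) and the pooling used to obtain the 512D vector are our assumptions, not details specified above.

```python
import torch
import torch.nn as nn
from torchvision import models

class RCPNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(pretrained=True)
        # keep the convolutional trunk; drop the original pooling and FC head
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)            # 512 x 1 x 1 per image
        self.fc_t = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(),
                                  nn.Linear(2048, 3))  # relative translation
        self.fc_q = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(),
                                  nn.Linear(2048, 4))  # relative rotation

    def embed(self, x):
        return self.pool(self.trunk(x)).flatten(1)     # 512D global feature

    def forward(self, img1, img2):
        f = torch.cat([self.embed(img1), self.embed(img2)], dim=1)
        q = self.fc_q(f)
        return self.fc_t(f), q / q.norm(dim=1, keepdim=True)  # unit quaternion
```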

Bidirectional Adversarial Loss
In this paper, a real image denotes an RGB image captured in real environments by a hand-held camera or a camera-mounted drone, which is limited in quantity and difficult to collect and label. A synthetic image denotes an image rendered from a 3D model at a specific viewpoint, requiring much less time and human labor. Since synthetic images are easier to obtain for training and real images are the targets of relocalization testing, we also refer to the synthetic images as the source domain and the real images as the target domain. We aim to learn bidirectional mapping functions G_s2t and G_t2s to bridge the gap between the synthetic (source) domain X_s and the real (target) domain X_t. D_s and D_t are two adversarial discriminators following [50]: D_s discriminates translated photos {G_t2s(x_t)} from photos X_s, while D_t aims to distinguish {G_s2t(x_s)} from X_t. The adversarial losses are expressed as

L_GAN(G_s2t, D_t, X_s, X_t) = E_{x_t ~ X_t}[log D_t(x_t)] + E_{x_s ~ X_s}[log(1 - D_t(G_s2t(x_s)))],
L_GAN(G_t2s, D_s, X_t, X_s) = E_{x_s ~ X_s}[log D_s(x_s)] + E_{x_t ~ X_t}[log(1 - D_s(G_t2s(x_t)))].

Cycle Consistency Loss
To constrain the two adversarial generators G_s2t and G_t2s to produce geometrically consistent images rather than arbitrary photos in the target-domain style, we use a cycle-consistency loss. As illustrated in Fig. 2(a), for each photo x_s from domain X_s, the translation cycle should bring x_s back to an image indistinguishable from the original, i.e., G_t2s(G_s2t(x_s)) ≈ x_s and, symmetrically, G_s2t(G_t2s(x_t)) ≈ x_t. The cycle-consistency loss is therefore

L_cyc(G_s2t, G_t2s) = E_{x_s ~ X_s}[||G_t2s(G_s2t(x_s)) - x_s||_1] + E_{x_t ~ X_t}[||G_s2t(G_t2s(x_t)) - x_t||_1].

We utilize the open-source software Blender [62] to render synthetic images from different poses on each 3D model. For the training of the cycle-consistent adversarial network, we render images at the exact pose of each real image to generate real-synthetic image pairs.

Full Objective
The joint loss function is

L(G_s2t, G_t2s, D_s, D_t) = L_GAN(G_s2t, D_t, X_s, X_t) + L_GAN(G_t2s, D_s, X_t, X_s) + λ L_cyc(G_s2t, G_t2s),

with λ selected empirically as λ = 10, which balances the two objectives. We then need to solve

G*_s2t, G*_t2s = arg min_{G_s2t, G_t2s} max_{D_s, D_t} L(G_s2t, G_t2s, D_s, D_t).
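A condensed sketch of one generator-side evaluation of this objective is shown below, assuming discriminators that output logits; the original CycleGAN implementation uses a least-squares GAN loss in place of the cross-entropy form written here.

```python
import torch
import torch.nn as nn

adv = nn.BCEWithLogitsLoss()   # adversarial criterion (logit outputs assumed)
l1 = nn.L1Loss()               # cycle-consistency criterion
lam = 10.0                     # balancing factor lambda

def generator_objective(G_s2t, G_t2s, D_s, D_t, x_s, x_t):
    """Generator part of L(G_s2t, G_t2s, D_s, D_t) for one batch."""
    fake_t, fake_s = G_s2t(x_s), G_t2s(x_t)
    pred_t, pred_s = D_t(fake_t), D_s(fake_s)
    # adversarial terms: each translator tries to fool its discriminator
    L_gan = adv(pred_t, torch.ones_like(pred_t)) + adv(pred_s, torch.ones_like(pred_s))
    # cycle terms: s -> t -> s and t -> s -> t should reconstruct the input
    L_cyc = l1(G_t2s(fake_t), x_s) + l1(G_s2t(fake_s), x_t)
    return L_gan + lam * L_cyc
```

The discriminators D_s and D_t are updated in a separate step with the complementary real/fake terms of L_GAN.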

Architecture of Adversarial Network
Fig. 2 (a) Our adversarial network architecture contains two image style translators, G_s2t: X_s → X_t and G_t2s: X_t → X_s, and two discriminators, D_s and D_t. D_t forces G_s2t to translate X_s into outputs indistinguishable from X_t, and vice versa for G_t2s and X_t. The two L_GAN terms are the bidirectional objectives for adversarial training. The two L_cyc terms are the cycle-consistency losses that further regularize the translators, encouraging G_t2s(G_s2t(x_s)) ≈ x_s and G_s2t(G_t2s(x_t)) ≈ x_t. When utilizing the translated results of the cycle-consistent adversarial networks, we adopt three implementation options: (b) real-synthetic mixed image pairs are used as input to train RCPNet, which implicitly learns the mapping between the two domains; (c) synthetic (domain X_s) images are first translated into synthetic-to-real images before the pairwise training of RCPNet, so query images from the real domain can be fed directly into the trained model; (d) synthetic image pairs are fed directly into RCPNet for training, and real query images must be translated into real-to-synthetic images before testing.

Based on the work in [54], the adversarial generator network consists of two convolutional layers followed by nine residual blocks [57]. Two up-convolutional layers are inserted afterward to up-sample the image to the input size. The discriminator networks are adopted from PatchGANs [49].
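As a reference, a rough sketch of such a generator is given below; the channel widths, normalization layers, and the extra 7×7 input/output convolutions follow the public CycleGAN implementation and are assumptions rather than details stated above.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)              # residual connection

def make_generator(base=64, n_blocks=9):
    layers = [nn.Conv2d(3, base, 7, padding=3), nn.ReLU(True),
              # two down-sampling convolutional layers
              nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(True),
              nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(True)]
    layers += [ResBlock(base * 4) for _ in range(n_blocks)]   # nine residual blocks
    layers += [  # two up-convolutional layers back to the input resolution
        nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2, padding=1, output_padding=1),
        nn.ReLU(True),
        nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
        nn.ReLU(True),
        nn.Conv2d(base, 3, 7, padding=3), nn.Tanh()]
    return nn.Sequential(*layers)
```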

Data Collection and Preparation
In this section, we introduce how we build a dataset for camera relocalization in urban outdoor environments. Using drones to capture images, we extend the database to a 3D space with vertical viewpoint changes. We use SfM to generate ground truth poses for more than 10,000 real-world photos. By providing more image matches and a wider range of rotations and translations, the dataset makes the training of the pose estimator more efficient.

Data Collection
We build the Tuebingen Buildings dataset from eight outdoor urban places. The dataset provides data to train and test absolute and relative pose estimators in different urban environments. We use a DJI Mavic Pro drone to collect the images at multiple locations near Tuebingen, Germany. The drone was manually piloted at each location. By keeping the camera always facing the building, we collect images of the entire environment at flying heights ranging from 2 to 35 m. At each location, we carried out at least four flights to capture images under varying weather and lighting conditions. Although there is some clutter such as vehicles and pedestrians, it has little effect on most images captured from heights above 5 m.
We use Pix4D Mapper [59] to generate image poses as ground truth measurements and training labels. Table 1 shows the output uncertainty of the absolute camera positions and orientations. The vertical variation of the viewpoints brings new 3D constraints, which leads to better localization: the position error is about 10 to 40 cm, and the orientation error is below 1°. Rather than recording videos and sub-sampling frames, we programmed the drone to capture a photo every two meters of movement (measured by GPS) in any direction. To obtain a better 3D reconstruction, images are captured at high resolution (4000×3000) from varying distances (see Fig. 3).
We use the 7Scenes indoor dataset [22] to demonstrate the generalization ability of RCPNet. This dataset was recorded with a Kinect RGB-D camera in 7 separate scenes, each consisting of a single room. Ground truth poses were generated with KinectFusion [22]. The dataset contains many repetitive or texture-less regions, which is exceptionally challenging for visual relocalization using traditional features.

Real Image Pairs Preparation
An effective method is needed to produce image pairs for relative camera pose estimation. For the Cambridge Landmarks dataset, En et al. [7] randomly paired every image with eight images from the same sequence, with the training and testing sequences separated beforehand. When using this dataset for relative camera pose estimation, we follow their settings. In contrast, for the Tuebingen Buildings dataset, we first separate the images of each scene into training and testing subsets. Then we use SIFT feature matching to generate real image pairs within each subset, matching every image against the rest of its subset. To reject outliers, SIFT-matched image pairs with large differences in translation (> 30 m) or rotation (> 75°) are flagged: we check their co-visibility manually and delete the wrong matches. The thresholds are measured from the ground truth. The "co-visibility" information indicates which landmark objects are visible to both cameras. In the end, we obtain around 300,000 valid pairs over all eight scenes, an average of 30 pairs per image. This augmentation is expected because our dataset covers a wider 3D space, and each image has many co-visible neighbors from many directions. Figure 5 demonstrates that the relative camera pose samples (translation and rotation ranges) in the Tuebingen Buildings dataset are more widely distributed.
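The outlier flagging can be summarized by a small helper like the following (a sketch; the [x, y, z, w] quaternion ordering and the function name are our assumptions).

```python
import numpy as np

MAX_TRANS_M, MAX_ROT_DEG = 30.0, 75.0   # thresholds measured on ground truth

def needs_manual_check(t12, q12):
    """True if a SIFT-matched pair exceeds the relative-motion thresholds
    and should therefore be inspected for co-visibility by hand."""
    trans = np.linalg.norm(t12)
    # rotation angle of the unit quaternion q12 = [x, y, z, w]
    angle = 2.0 * np.degrees(np.arccos(np.clip(abs(q12[3]), 0.0, 1.0)))
    return trans > MAX_TRANS_M or angle > MAX_ROT_DEG
```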
In previous work [7,11], the images are rescaled to 256 × n or n × 256 pixels (n ≥ 256) and then cropped into 224 × 224 patches as the input of the CNN. The model is trained with random cropping and tested with central cropping as data augmentation. However, cropping (like shifting, flipping, rotating, and zooming) may affect the spatial information implicit in the image. For relative camera pose estimation, we keep the random crop coherent between the two input photos (see the crop sketch below), and then test multiple scenes with rescale sizes from 256 down to 236. The results indicate that for scenes with shorter object distances, such as Shop Facade, a smaller rescale size enables a wider field of view and a larger overlap, and thus slightly improves localization. For most scenes, where the objects are farther away, the data augmentation effect of a larger rescale size dominates. This indicates that other data augmentation methods that better preserve spatial information should be considered. Following PoseNet [11], the tuning of brightness, contrast, saturation, and hue, combined with the crop operation, is also adopted in our framework. The same data augmentations are applied in all experiments.

For individual training, we randomly separate the data in each scene into training and test sets at a ratio of 4:1. For across-scenes training, we keep the test sets unchanged and extract 20,000 image pairs from the training set of each scene to obtain a 120k-pair training set. Some spatially close photos may end up separately in the test and training sets, but these images are quite different, and the similarity among image pairs is very low because they are matched from multiple directions.
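The coherent paired crop mentioned above can be realized with torchvision's standard crop utilities, as in this minimal sketch; it assumes both images of a pair have already been rescaled to the same resolution.

```python
import torchvision.transforms as T
import torchvision.transforms.functional as F

def paired_random_crop(img1, img2, size=(224, 224)):
    """Sample one crop window and apply it to both images of a pair, so the
    spatial relationship implicit in the pair is not perturbed differently."""
    i, j, h, w = T.RandomCrop.get_params(img1, output_size=size)
    return F.crop(img1, i, j, h, w), F.crop(img2, i, j, h, w)
```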
For the 7Scenes dataset, each target image is paired with two randomly chosen images from nearby frames in the same sequence. We exclude the nearest frames (within +/- 10 frames) to ensure a sufficient pose shift and cap the offset at 25 frames to ensure co-visibility. We use 5 of the scenes for across-scenes training and train individual models for the other 2 scenes, obtaining very similar performance.
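For instance, the frame-offset constraint can be expressed with a helper like the following (illustrative; the names are hypothetical).

```python
import random

def sample_partner_frame(idx, seq_len, min_off=10, max_off=25):
    """Pick a partner frame for frame `idx` in the same 7Scenes sequence:
    at least 10 frames away (enough pose change), at most 25 frames away
    (to keep co-visibility)."""
    offsets = [o for o in range(-max_off, max_off + 1)
               if abs(o) >= min_off and 0 <= idx + o < seq_len]
    return idx + random.choice(offsets)
```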

Synthetic Images Generation
We use ContextCapture [60] and Blender [62] to build the 3D models and render synthetic images. First, we render images at the exact pose of each real image; this step produces the geometry-aligned real-synthetic image pairs used to train the cycle-consistent adversarial networks. Second, we randomly generate twice the amount of initial training poses within the spatial area of each scene (see the fourth column in Table 2). The camera intrinsics are calibrated during the 3D model building. For the Cambridge Landmarks dataset, the height of the generated poses is restricted to z < 2 m to simulate a human's viewpoint, while for Tuebingen Buildings, the threshold is the highest viewpoint in the training set. To ensure the generated poses face the 3D model, we require the Euler angle difference between a generated pose and its nearest real pose to be below 10°. Each image rendered at a new pose is then paired with around 30 images via a co-visibility check.
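An illustrative acceptance test for a randomly sampled rendering pose is sketched below; using the nearest real pose by Euclidean distance and the 'xyz' Euler convention are assumptions on our part.

```python
import numpy as np
from scipy.spatial.transform import Rotation

MAX_EULER_DIFF_DEG = 10.0

def accept_render_pose(cand_R, cand_t, real_Rs, real_ts, z_max):
    """Keep a candidate pose only if it stays below the height limit and its
    orientation is within 10 degrees (per axis) of the nearest real pose."""
    if cand_t[2] > z_max:
        return False
    nearest = int(np.argmin(np.linalg.norm(real_ts - cand_t, axis=1)))
    dR = Rotation.from_matrix(cand_R @ real_Rs[nearest].T)   # relative rotation
    return np.max(np.abs(dR.as_euler('xyz', degrees=True))) < MAX_EULER_DIFF_DEG
```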
As a result, we use an average of 300 images to build the 3D model of an urban scene and generate an average of 3,000 synthetic images for training. The later results show that this scenario is a promising way to generate training labels for visual localization and other visual tasks while requiring minimal human labor.

Training of Cycle-Consistent Adversarial Networks for Domain Adaptation
We train two cycle-consistent adversarial networks separately on the Tuebingen Buildings and the Cambridge Landmarks datasets. To maintain the geometric alignment between images of the two domains, each real-synthetic image pair is extracted at the same pose. For each dataset, we randomly select 500 real-synthetic image pairs as the new training set and 150 pairs as the validation set from the initial training set, and 150 real-synthetic pairs are extracted from the initial test set as the new test set. Every sub-scene has an equal proportion in each set. Figures 6 and 7 show qualitative results of the cycle-consistent adversarial networks trained on Tuebingen Buildings and Cambridge Landmarks, respectively. In each figure, T (target), S2T, S (source), and T2S denote the real image, synthetic-to-real image, synthetic image, and real-to-synthetic image, respectively. The last row is a typical challenging failure case. We focus on the comparisons of T vs. S2T and S vs. T2S.
First of all, thanks to the fine 3D models and accurate pose labels, the cycle-consistent adversarial network preserves the geometric details of the generated and translated images to a large extent, e.g., the second and fifth rows in Fig. 6, and the second and third rows (note the dog at the bottom!) in Fig. 7. The buildings' skyline is well preserved by the synthetic-to-real mapping in most cases, except for the most complex buildings. Some small moving objects, like pedestrians, can be eliminated by the translation in both directions; see the bottom-right corner of the images in the first row of Fig. 7. This is an excellent property for visual localization tasks. However, if the occluding objects are too large or too close to the camera, like the tree in the last row of Fig. 6 or the bus in the last row of Fig. 7, eliminating them is beyond the capability of our image style translation method.
The illumination in the real domain is always varying. If the distribution of the query image matches that of the training set, the result is consistent, e.g., the fifth and sixth rows in Fig. 7; if not, the brightness of the real image and the related synthetic-to-real image is inconsistent, e.g., the fourth row in Fig. 7. Naturally, seasonal changes will have a similar effect. However, synthetic images and real-to-synthetic images do not suffer from this problem, as can be observed in the last two columns of Figs. 6 and 7. This difference gives us an interesting hint. In Fig. 2, three options for utilizing the translated images are discussed. The S2T Training scenario can directly use real query images and thus saves time during the inference stage. However, the T2S Test scenario is promising in terms of generalization ability due to the consistent feature distribution between synthetic and real-to-synthetic images. Further details are discussed in the next section.

Experimental Setup
We obtain a large number of synthetic images in Section 4.3. We utilize these synthetic images in three scenarios, as shown in Fig. 2(b) to (d): (1) mixed real-synthetic pairs, each consisting of a synthetic image and a real image, are fed into RCPNet for training, forcing it to implicitly learn the mapping between the two domains; (2) synthetic (domain X_s) images are first translated into synthetic-to-real images before the pairwise training of RCPNet, so that in the inference stage real query image pairs can be fed directly into the trained model; (3) synthetic image pairs are fed directly into RCPNet for training, and query images from the real domain need to be translated into real-to-synthetic images before testing. With each of the three scenarios, the number of training samples is doubled compared with the initial real-image dataset.

We now present the quantitative evaluation of different methods on multiple datasets. As benchmarks, we also test PoseNet [11] with absolute camera pose regression and RPNet [7] with relative camera pose estimation on the eight scenes of the Tuebingen Buildings dataset. For the Cambridge Landmarks dataset, we quote the results from their papers. We also cite the results of PoseNet [11,55] and RelocNet [34] on the indoor 7Scenes dataset as baselines. Translation errors are measured in meters and rotation errors in degrees, as in the cited works. Since the Tuebingen Buildings and Cambridge Landmarks datasets only provide discrete images with ground truth poses for testing, we evaluate RCPNet on discrete image relocalization in this paper rather than on path tracking of robot-captured motion videos. However, relative camera pose estimation, combined with a global matching method, has great potential to realize long-term camera relocalization in large environments.

Fig. 6 Examples of image style transfer results compared with the original synthetic and real images in the Tuebingen Buildings dataset. From left to right, T, S2T, S, and T2S denote the real, synthetic-to-real, synthetic, and real-to-synthetic image. The last row is a typical challenging failure example.

Fig. 7 Examples of image style transfer results compared with the original synthetic and real images in the Cambridge Landmarks dataset. From left to right, T, S2T, S, and T2S denote the real, synthetic-to-real, synthetic, and real-to-synthetic image. The last row is a typical challenging failure example.
We normalize the pixel intensities of the input images to the range [-1, 1]. We implement RCPNet and the adversarial networks in PyTorch [63]. We use ADAM [64] for optimization with a learning rate of 1 × 10^-4. For individually trained RCPNet, we use a batch size of 32 on an NVIDIA 1080Ti GPU; training takes 20k to 100k iterations, i.e., 10 hours to 2 days. For across-scenes trained RCPNet, we use a batch size of 128 on two NVIDIA 1080Ti GPUs, and training takes 2 days. The learning rate of the adversarial networks is kept constant for 100 epochs and then linearly decreased to zero over the following 100 epochs. Training on 500 real-synthetic image pairs for 200 epochs takes 6 hours on an NVIDIA 1080Ti GPU.

Results of Real Images Training
In Table 2, we compare different approaches on each scene, trained individually or across scenes, seen or unseen. The baselines are PoseNet trained on single images and RPNet trained on real image pairs, both within each scene. Generally, the accuracy of absolute pose regression cannot be compared directly with that of relative pose estimation. However, since PoseNet [11] is the basis of RPNet [7] and RCPNet, we consider it a valid reference. When relative camera pose estimation is deployed in a real environment, it is common to estimate the relative pose between an unknown query image and a known reference image with a ground truth pose. In this case, the relative pose estimation accuracy equals the absolute pose regression accuracy for the query image. Comparing the right four columns in Table 2, the individually trained RCPNet outperforms the two baselines in most scenes. The results of the across-scenes trained RCPNet in the last column of Table 2 show an average 5% decline compared with the individually trained RCPNet on both datasets, but they are still comparable to PoseNet and RPNet in most scenes. In King's College and Shopping Mall, the across-scenes trained model even outperforms the individually trained model. The comparison between individual and across-scenes training shows that different scenes may share general features, and a fine-tuned model can adapt to them simultaneously.
For unseen scenes (Industrial building, Sand South, and Shop Facade), the across-scenes model fails, which indicates a limitation of the PoseNet-based architecture. Recent findings [36,37] on image-based localization show that the PoseNet design can degrade generalization under challenging scene variation due to scale inconsistency. Although PoseNet-based methods have limited generalization ability, it is still meaningful to expand from individual-scene training to across-scenes training. For example, when a vehicle works in a factory that consists of several workshops, or a drone delivers products between multiple GPS-denied places in a city, a robot with a single across-scenes trained model can be sufficient.
We also compare the cumulative distribution functions (CDF) of the localization errors. In Table 3, we compare our results with PoseNet and RelocNet on the indoor 7Scenes dataset. RelocNet uses multiple candidates for the relative pose regressor in four scenes. We train one across-scenes model on five scenes (Chess, Fire, Heads, Pumpkin, and Stairs) and individual models for Office and Red Kitchen. The performance of RCPNet is comparable to the three baselines: an average 53.5% and 35.7% increase in translation and rotation accuracy over PoseNet (β weight) [11]; an average 9.3% and 21.3% increase in translation and rotation accuracy over PoseNet (Geometric) [55]; and a 0.7% decrease in translation accuracy with a 3.6% increase in rotation accuracy compared with RelocNet (7Scenes) [34]. On this small indoor dataset, the translation accuracy may have reached the upper limit for PoseNet-like methods, but there is room for improvement in rotation accuracy.

Results of Synthetic Images Training

Table 4 shows the quantitative results of RCPNet with across-scenes training using synthetic images in the three scenarios discussed in Section 5.1 and shown in Fig. 2. Mixed Input denotes that real-synthetic image pairs are fed into RCPNet during training. S2T Training denotes that synthetic-to-real image pairs are fed into RCPNet during training, so real query images can be tested directly. T2S Test means that real-to-synthetic images are tested with a model trained on synthetic image pairs.

Compared with the results of the across-scenes real-image trained RCPNet in the last column of Table 2, the performance of all three scenarios is improved thanks to the expansion of the training data. The Mixed Input scenario yields an average 14.5% and 16.2% improvement in translation and rotation accuracy, respectively, showing that the network implicitly learns the bidirectional mapping between the two domains from mixed real-synthetic image pairs. The S2T Training scenario performs slightly worse than Mixed Input, which reflects that the feature distribution of real-world images is always varying due to illumination shifts and seasonal changes; it is therefore challenging to find an average mapping from the consistent synthetic distribution to the inconsistent real distribution. Nevertheless, S2T Training still achieves an average 8.3% and 9.5% increase in translation and rotation accuracy compared with the real-image trained model.

The T2S Test scenario achieves an average 32.6% and 32.3% increase in translation and rotation accuracy over the real-image trained RCPNet. It also outperforms the other two synthetic-image trained variants by a large margin, proving the robustness of the translation from the real distribution to the synthetic distribution. In the T2S Test scenario, every input query image must be translated from real to synthetic at test time, while the predicted poses are still evaluated against the real query images' poses. This incurs extra computation time but remains a meaningful example of how synthetic images can be applied. Besides, synthetic image training via domain adaptation in the other two scenarios also improves the original RCPNet. For visual localization tasks, a promising bridge has been built between real and synthetic data. As portable camera equipment becomes more popular and 3D modeling becomes more convenient, this method will be more widely applicable and will reduce the heavy data labeling work of humans.

Conclusion
We present RCPNet, a learning-based method for relative camera pose estimation across multiple scenes. RCPNet can be used for continuous localization in autonomous navigation of unmanned systems, in multi-robot cooperation systems, in visual odometry applications, or combined with global image retrieval methods like NetVLAD to obtain more accurate absolute pose estimates. A camera relocalization dataset for both absolute and relative camera pose estimation is built with a drone and the SfM method. We demonstrate that our method outperforms two baseline approaches on two outdoor datasets. The comparable performance of across-scenes and individual training shows that general features of image pairs exist across different locations. For unseen scenes, the results of the across-scenes model are not satisfactory, and the PoseNet-based architecture needs to be modified further. We also test our model on an indoor dataset to demonstrate its generalization ability.
In addition, we further expand our dataset with 3D modeling and synthetic image rendering. We utilize cycle-consistent image style transfer and adversarial training to estimate real image pairs' relative camera poses based on training over synthetic environment data under different schemes. The results demonstrate the effectiveness of cycle-consistent adversarial networks for domain adaptation between real and synthetic images.
In future work, we aim to improve our network architecture to implement image style transfer and pose estimation in one across-scenes model with fewer hyperparameters. To improve the practicality for long-term camera relocalization, we will further test our method on larger and connected 3D datasets.

Declarations
Competing interests The authors have no conflicts of interest to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.