1 INTRODUCTION

Today, deep neural networks are the most popular tool in autonomous systems development, serving as effective prediction models for a variety of input sensor data [1]. We are especially interested in neural depth and ego-pose estimation for an onboard monocular camera. This is essential both for detecting and tracking three-dimensional objects [2] and for high-quality mapping of the area in which the camera is moving [3]. Moreover, monocular depth estimation is much harder than reconstructing depth maps from a stereo pair of images [4, 5], where scale and pixel disparity can be estimated accurately both analytically and with neural networks [6].

Existing results [7, 8] show that neural networks can cope with this problem on par with feature-based methods. However, they are still significantly affected by noise in the reconstructed depth maps.

Each specific task also requires a large data set. Compiling such a set for ego-motion estimation requires accurate equipment for collecting ground truth (GT) values, which is not always available, especially if you plan to test the algorithm on your own data. Therefore, we chose the joint self-supervised learning approach of [7] as the base algorithm: it requires no pre-labeling for training and instead uses single-frame depth and multi-frame pose predictions to minimize the photometric error.

The process of choosing neural network architectures is empirical, since there are no mathematically proven rules for their design. Different models can solve the same problem with a large difference in the final metrics. Our approach builds upon the insight that the receptive field is an important hyperparameter that greatly affects the ability of neural networks to perform a task. Experiments on the KITTI dataset [9] confirm this assumption for both pose and depth prediction.

In summary, our work makes the following main contributions:

• a novel convolutional neural approach called ERF-SfMLearner for monocular depth and ego-motion estimation with extended receptive field;

• analysis of the influence of the neural network receptive field on monocular depth and ego-motion estimation on the KITTI dataset with different resolutions of the input image.

2 RELATED WORK

Most self-supervised methods use single-frame depth and multi-frame pose predictions to minimize the photometric error between a source and a target frame from a sequence of images. This idea was first introduced by Zhou et al. [7], and many works build on this principle. Mahjourian et al. [10] combine the photometric loss with geometric constraints using a 3D-based loss. Godard et al. [11] in Monodepth2 enhance the reprojection loss and design it to robustly handle occlusions. Subsequent methods propose improvements based on various techniques, such as supervision from optical flow [12–16], semantic segmentation [17, 18], or a combination of these techniques [19, 20]. Wang et al. [21] propose to synthesize new views from raw images, thereby enriching the training data and improving the performance of the pose network. Tak-Wai Hui [22] rethinks the utilization of the image sequence in an RNN architecture. Suri et al. [23] introduce pose constraints to reduce depth inconsistencies and scale ambiguity. Lee et al. [24] suggest integrating an IMU sensor to disambiguate depth scale.

However, most of these methods inherit the learning setup and neural architectures from [7]. Our analysis shows that they could be further improved by leveraging our findings on the importance of the neural network receptive field for self-supervised monocular depth and ego-motion estimation.

3 METHOD

3.1 ERF-SfMLearner Architecture

In a deep learning context, the receptive field (RF) is defined as the size of the region in the input that produces a given feature; essentially, it measures how large an input patch is associated with an output feature of any layer. For self-supervised pose and depth estimation, we would like each output feature of the encoder to have a large receptive field so that no crucial information is left out [25]. We establish a strong baseline for our algorithm by following the practices of [7], where the networks are implemented as convolutional networks (Fig. 1, top). For a fair comparison, we choose two types of convolution operators that effectively increase the receptive field of the neural network without a global architecture redesign: dilated convolution and deformable convolution (Figs. 2, 3).
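
To make the notion concrete, the following sketch computes the theoretical RF of a stack of convolutions with the standard recurrence. The layer configurations are illustrative and do not reproduce the exact encoders of Fig. 1.

```python
# Analytic receptive-field (RF) growth for a stack of convolutions.
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples."""
    rf, jump = 1, 1  # RF of the input itself and the cumulative stride
    for k, s, d in layers:
        rf += (k - 1) * d * jump   # each layer widens the RF by (k-1)*d input pixels
        jump *= s                  # later layers see strided (coarser) positions
    return rf

# Plain 3x3/stride-2 blocks vs. the same blocks with dilation rate 2:
plain   = [(3, 2, 1)] * 4
dilated = [(3, 2, 2)] * 4
print(receptive_field(plain), receptive_field(dilated))  # 31 vs. 61
```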

Fig. 1. Baseline network architecture (SfMLearner) and ERF-SfMLearner (extended receptive field SfMLearner). Each rectangular block indicates the output channels after a convolution operation. DepthNet has a U-Net-like architecture with multi-scale side predictions; the kernel size is 3 for all layers except the first four conv layers, which use 7, 7, 5, and 5, respectively. PoseNet predicts the 6-DoF relative pose; the kernel size is 3 for all layers except the first two conv layers, which use 7 and 5, respectively. The ERFDepthNet encoder shares the baseline architecture except for the first 4 blocks, which use deformable convolutions (DFC ERFDepthNet, Fig. 3a). In ERFPoseNet, 4 blocks of deformable convolution are placed at the end of the encoder (DFC ERFPoseNet, Fig. 2c).

Fig. 2. Network architectures for ERFPoseNet: (a) PoseNet from the baseline paper; (b–d) different ERFPoseNets with extended RF.

Fig. 3. Different ERFDepthNet encoder architectures.

3.2 Learning Approach

An overview of the baseline approach is shown in Fig. 1. It learns depth and camera motion from unlabeled data. The method consists of two parts, a depth prediction network and a pose estimation network, which are trained jointly.

For two adjacent frames \(I_t\) and \(I_s\), if the depth map of \(I_t\) and the relative pose between the two views are given, a view \(I_{s'}\) can be reconstructed. Taking \(I_t\) as input, the depth prediction network generates a depth map, denoted \(D_t\). The relative camera pose between the two views is estimated by the pose estimation network and denoted \(T_{t \to s}\). Let \(p_t\) be the homogeneous coordinates of a pixel in \(I_t\) and \(p_{s'}\) the corresponding pixel in \(I_{s'}\). The projected coordinates can then be expressed as

$$p_{s'} \sim K\,T_{t \to s}\,D_t(p_t)\,K^{-1}p_t,$$
(1)

where \(K\) is the camera intrinsic matrix, \(T_{t \to s}\) is the camera coordinate transformation from the \(I_t\) frame to the \(I_s\) frame, \(D_t(p_t)\) is the depth value of the pixel \(p_t\) in the \(I_t\) frame, and the coordinates are homogeneous.
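
The PyTorch sketch below illustrates the warping step behind Eq. (1): back-project the target pixels with \(D_t\), transform them with \(T_{t \to s}\), project them with \(K\), and differentiably sample the source image. Function and argument names are our own; this is a simplified illustration, not the exact implementation of [7].

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_t, T_t2s, K):
    """img_src: (B,3,H,W) source frame; depth_t: (B,1,H,W) target depth D_t;
    T_t2s: (B,4,4) relative pose T_{t->s}; K: (B,3,3) camera intrinsics."""
    B, _, H, W = depth_t.shape
    # Homogeneous pixel grid p_t of the target frame, shape (B, 3, H*W)
    y, x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(x)
    p_t = torch.stack((x, y, ones), dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project: D_t * K^{-1} * p_t -> 3D points in the target camera frame
    cam_pts = depth_t.view(B, 1, -1) * (torch.inverse(K) @ p_t)
    cam_pts = torch.cat((cam_pts, torch.ones(B, 1, H * W)), dim=1)  # homogeneous

    # Transform into the source frame and project with the intrinsics
    proj = K @ (T_t2s @ cam_pts)[:, :3]
    px = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    py = proj[:, 1] / proj[:, 2].clamp(min=1e-6)

    # Normalize to [-1, 1] and sample the source image differentiably
    grid = torch.stack(((px / (W - 1)) * 2 - 1, (py / (H - 1)) * 2 - 1), dim=-1)
    return F.grid_sample(img_src, grid.view(B, H, W, 2), align_corners=True)
```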

The loss function combines photometric, smoothness, and regularization terms [7].
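
A simplified sketch of the first two terms is given below; the exact baseline formulation in [7] additionally uses an explainability-mask regularizer and multi-scale predictions, which are omitted here for brevity.

```python
def photometric_loss(img_t, img_t_reconstructed):
    # L1 difference between the target frame and the view synthesized from the source
    return (img_t - img_t_reconstructed).abs().mean()

def smoothness_loss(depth):
    # Penalize large spatial gradients of the predicted depth (smoothness prior)
    dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs().mean()
    dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs().mean()
    return dx + dy

# total_loss = photo_loss_weight * photometric + smooth_loss_weight * smoothness + regularization
```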

3.3 Receptive Field Extension with Dilated Convolution

Dilated convolution is very similar to the basic convolution operator. In essence, dilated convolutions introduce an additional parameter, the dilation rate r, which incorporates "holes" into the convolutional kernel [26]: the holes define a spacing between the values of the kernel. Thus, while the number of weights in the kernel is unchanged, the weights are no longer applied to spatially adjacent samples.
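
As an illustration, the two PyTorch layers below have the same number of weights, but the dilated one covers a 5 × 5 input window instead of 3 × 3 (an effective extent of k + (k − 1)(r − 1)); the channel counts are placeholders, not the ERF-SfMLearner configuration.

```python
import torch.nn as nn

# Standard 3x3 convolution: each output feature sees a 3x3 input patch.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution with rate r = 2: the same 9 weights are spread over
# a 5x5 window, so the receptive field grows with no extra parameters.
dilated_conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
```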

3.4 Receptive Field Extension with Deformable Convolution

Deformable convolution adds 2D offsets to the regular grid sampling locations of the standard convolution, enabling free-form deformation of the sampling grid. The offsets are learned from the preceding feature maps via additional convolutional layers, so the deformation is conditioned on the input features in a local, dense, and adaptive manner [27]. In our work we also examine the second version of deformable convolution, in which each sample not only undergoes a learned offset but is also modulated by a learned feature amplitude; the network module can thus vary both the spatial distribution and the relative influence of its samples [28].
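
A minimal modulated (v2) deformable block can be sketched with torchvision as follows; the module name and channel handling are ours and only illustrate the operator, not the exact ERF-SfMLearner blocks of Figs. 2 and 3.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # Offsets (2 values per sampling location) and modulation masks
        # (1 value per location) are predicted from the input feature map.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.mask = nn.Conv2d(in_ch, k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offset = self.offset(x)             # learned 2D sampling offsets
        mask = torch.sigmoid(self.mask(x))  # learned feature amplitudes in (0, 1)
        return self.dcn(x, offset, mask)    # omit the mask for v1 behavior
```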

4 EXPERIMENTS

We evaluate the performance of our methods and compare them with the baseline approach on multi-frame ego-motion estimation as well as single-view depth. We use the KITTI dataset [9] for benchmarking.

Dataset. We use monocular image sequences for training and testing. The original image size is 375 × 1242, and images are downsampled during training. In order to compare fairly with the baseline, we use two different splits of the KITTI dataset: the KITTI Odometry split to train the networks and evaluate ERFPoseNet, and the KITTI Eigen split to test ERFDepthNet. We train the models on KITTI Odometry sequences 00–08 and evaluate the pose error on sequences 09 and 10.

Training details. For all experiments we set epoch-size = 1000, sequence-length = 5, photo-loss-weight = 1, mask-loss-weight = 0, and smooth-loss-weight = 0.2. During training, we use batch normalization for all layers except the output layers, and the Adam optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), a learning rate of 0.0002, and a mini-batch size of 4. More details can be found in our repository.
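
The optimizer configuration above corresponds to the following sketch; the tiny placeholder modules merely stand in for DepthNet and PoseNet, and the full training loop is in the repository.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for DepthNet and PoseNet (not the real architectures).
depth_net = nn.Conv2d(3, 1, kernel_size=3, padding=1)
pose_net = nn.Conv2d(6, 6, kernel_size=3, padding=1)

# Joint optimization of both networks with the hyperparameters listed above.
optimizer = torch.optim.Adam(
    list(depth_net.parameters()) + list(pose_net.parameters()),
    lr=2e-4, betas=(0.9, 0.999))

# Loss weights used in all experiments:
# photo-loss-weight = 1, mask-loss-weight = 0, smooth-loss-weight = 0.2
```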

4.1 Pose Estimation

To evaluate the impact of RF on ego-motion prediction, we use the different ERFPoseNet architectures visualized in Fig. 2 and jointly train them with the DepthNet from the baseline. To resolve scale ambiguity during evaluation, we first optimize the scaling factor for the predictions of each method to best align with the ground truth, and then measure the Absolute Trajectory Error (ATE) and Rotation Error (RE) as metrics (Table 1). The RE between \(R_1\) and \(R_2\) is defined as the angle of \(R_1 R_2^{-1}\) when converted to axis/angle form:

$$RE = \arccos\left(\frac{\operatorname{trace}(R_1 R_2^{-1}) - 1}{2}\right).$$
(2)
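
A sketch of the two metrics is given below; the scale-alignment formula is a standard least-squares fit and may differ in detail from the exact evaluation script.

```python
import numpy as np

def rotation_error(R1, R2):
    # Eq. (2): the angle of R1 R2^{-1} in axis/angle form (R2^{-1} = R2^T for rotations)
    cos_angle = (np.trace(R1 @ R2.T) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def ate(pred_xyz, gt_xyz):
    # Absolute Trajectory Error after optimizing a single scale factor
    # that best aligns the predicted snippet with the ground truth.
    scale = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz ** 2)
    return np.sqrt(np.mean(np.sum((gt_xyz - scale * pred_xyz) ** 2, axis=1)))
```
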
Table 1. Absolute Trajectory Error and Rotation Error on the KITTI Odometry split, averaged over all 5-frame snippets (lower is better)

As shown in Table 1, increasing the RF yields lower errors and better metrics in ego-motion estimation. Dilated convolution improves the metrics slightly, but using deformable convolution for the last layers of the ERFPoseNet is considerably more beneficial. As shown in Fig. 4, the RF with deformable convolution covers the entire input, which could explain this result. On the other hand, applying deformable convolution to the first four layers of the ERFPoseNet (Dfcv2) leads to results worse than ERFPoseNet (Dilated) and similar to ERFPoseNet (Dfc).

Fig. 4. The receptive field (RF) of ERFPoseNet. Top, left to right: original image from the KITTI dataset, RF for the baseline PoseNet. Bottom, left to right: RF for ERFPoseNet (Dilated), RF for ERFPoseNet (Dfc), and RF for ERFPoseNet (Dfcv2). We use backpropagation to compute the RF and exploit the fact that the values of the network weights are not relevant for computing it: we set the weight of every layer to 0.05 and the bias to 0. To create a situation in which the gradient at the output of the model depends only on pixel location, a white image is passed through the network. For visualization, we compute the RF of a single pixel in the first channel of the penultimate conv layer: we set the corresponding gradient value to 1 and all others to 0, backpropagate this gradient to the input layer, and highlight the RF as a red mask.
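
The procedure described in the caption can be sketched as follows; `model` is assumed to be a plain convolutional encoder that returns a feature map, and the function returns a Boolean RF mask rather than the red overlay used in the figure.

```python
import torch

def receptive_field_mask(model, height, width):
    # Weights are irrelevant for the RF: set every weight to 0.05 and every bias to 0.
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.fill_(0.05 if name.endswith("weight") else 0.0)

    x = torch.ones(1, 3, height, width, requires_grad=True)  # white input image
    out = model(x)                                            # feature map (1, C, h, w)

    # Gradient of 1 for a single pixel of the first channel, 0 everywhere else.
    grad = torch.zeros_like(out)
    grad[0, 0, out.shape[2] // 2, out.shape[3] // 2] = 1.0
    out.backward(grad)

    # Input pixels with non-zero gradient form the receptive field of that feature.
    return x.grad.abs().sum(dim=1)[0] > 0
```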

4.2 Depth Estimation

To evaluate the impact of RF on depth prediction, we use the ERFDepthNet architectures shown in Fig. 3. We study changing the convolution operations only in the encoder block, replacing them with deformable convolutions (Table 2). Since the depth predicted by the method is defined only up to a scale factor, for evaluation we multiply the predicted depth maps by a scalar s that matches their median with that of the ground truth, i.e., s = median(\(D_{gt}\))/median(\(D_{pred}\)); we refer to this as GT supervision (a sketch of this evaluation step is given after the tables below). As a result, the lowest errors and best accuracy are obtained with deformable convolution applied to the first four DepthNet layers in ERFDepthNet (Dfc). We also compare the baseline DepthNet with the different ERFPoseNets (Table 3). With joint training, increasing the RF of the original PoseNet has a positive effect on the DepthNet predictions, producing depth metrics comparable with ERFDepthNet (Dfc).

Table 2. Results for depth estimation on Eigen KITTI split: ERFDepthNets + PoseNet architectures. The errors are only computed where the depth is less than 80 m
Table 3. Results for depth estimation on Eigen KITTI split: DepthNets + ERFPoseNets architectures. The errors are only computed where the depth is less than 80 m
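
The median-scaling step and the 80 m depth cap mentioned above can be sketched as follows; only a few of the reported error and accuracy metrics are shown, and the variable names are ours.

```python
import numpy as np

def evaluate_depth(depth_pred, depth_gt, max_depth=80.0):
    # Evaluate only pixels with valid ground truth closer than max_depth.
    valid = (depth_gt > 0) & (depth_gt < max_depth)
    pred, gt = depth_pred[valid], depth_gt[valid]

    # GT-supervised scale: s = median(D_gt) / median(D_pred)
    pred = pred * (np.median(gt) / np.median(pred))

    abs_rel = np.mean(np.abs(gt - pred) / gt)              # absolute relative error
    rmse = np.sqrt(np.mean((gt - pred) ** 2))              # root mean squared error
    a1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)  # accuracy, delta < 1.25
    return abs_rel, rmse, a1
```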

5 CONCLUSIONS

We present ERF-SfMLearner, the result of an analysis of the importance of the receptive field in self-supervised deep learning for monocular depth prediction and camera ego-motion estimation. The experimental evaluation on the KITTI dataset shows that a larger receptive field may be one key to a successful solution of this task. The best results were achieved by increasing the receptive field through deformable convolution in both the ERFPoseNet and ERFDepthNet models. Our work also highlights an important fact: in joint training with a self-supervised loss, changing the architecture of one neural module can affect the results of another module. With this knowledge, more advanced neural architectures can be proposed to better cope with monocular depth and ego-motion estimation and, as a consequence, with high-quality mapping and better localization.