Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module

Self-supervised monocular depth estimation has been widely investigated and applied in previous works. However, existing methods suffer from texture-copy, depth drift, and incomplete structure. It is difficult for standard CNNs to fully capture the relationship between an object and its surrounding environment. Moreover, it is hard to design a depth smoothness loss that balances depth smoothness against sharpness. To address these issues, we propose a coarse-to-fine method with a normalized convolutional block attention module (NCBAM). In the coarse estimation stage, we incorporate the NCBAM into the depth and pose networks to overcome the texture-copy and depth drift problems. In the refinement stage, we use a new network to refine the coarse depth, guided by the color image, producing a structure-preserving depth result. Our method produces results competitive with state-of-the-art methods. Comprehensive experiments demonstrate the effectiveness of our two-stage method using the NCBAM.


Introduction
Depth information in a 2D image has a wide range of applications, including 3D reconstruction [1][2][3][4][5], simultaneous localization and mapping (SLAM) [6], shadow removal [7], and so on. Range-finding sensors, such as LiDAR, time-of-flight (TOF) cameras, and stereo cameras, are often used to extract depth information. However, it is unrealistic to rely on such expensive or complex sensors in many cases. This has advanced the development of learning-based methods using large datasets [8,9]. Supervised monocular depth estimation methods have made great progress [10]. However, collecting extensive, high-quality ground-truth depth is challenging due to sensor noise and unpredictable, complex environmental conditions. Supervised monocular depth estimation thus has limited generalization ability.
Recently, self-supervised monocular depth estimation approaches have been introduced, trained with stereo image pairs [11] or monocular video sequences [12][13][14], and supervised with geometric information. Compared to stereo-based supervision, monocular video is more attractive, as more sequenced frames are available for use as supervision signals. To enhance the performance of depth estimation, many works focus on masking strategies [12,13,15], loss functions [12], and multi-task learning [16,17]. However, existing self-supervised monocular depth methods still suffer from texture-copy, depth drift, and incomplete structure.
Texture-copy in a depth map is a situation in which details of the color image are transferred to the depth map. Monodepth2 [12] upsamples the generated multi-scale depths to the input image resolution before computing all losses, which partially alleviates the texture-copy phenomenon. Depth drift occurs when an object's depth incorrectly deviates from that of its surrounding environment; it is caused by the depth network incompletely understanding the spatial correlation between the object and its surroundings. Incomplete structure means that an object's depth is only partially predicted, especially for sharp objects in the scene, as the smoothness loss mistakenly eliminates the depth differences of the sharp object. In Fig. 1, we illustrate typical examples of these problems in predicted depths; our predicted depth maps are better than those of the comparator methods.
We propose a coarse-to-fine method with a normalized convolutional block attention module (NCBAM). Our pipeline includes coarse depth estimation and depth refinement, as shown in Fig. 2. Specifically, we improve the lightweight CBAM attention module [19] to provide the NCBAM, and then incorporate it into our networks to tackle the problems of texture-copy and depth drift. Furthermore, we design a network that uses the corresponding color image as a guide to refine the coarse depth, which deals with the incomplete structure problem. The coarse depth network and the depth refinement network are trained in separate stages.
To summarize, this paper presents the following two main contributions:
• We tackle the texture-copy and depth drift problems by improving the CBAM and incorporating it into the depth and pose networks.
• We tackle the incomplete structure problem by designing a new network that uses the color image as a guide to refine the coarse depth.

Background
Inferring depth from a single image is an ill-posed problem. However, deep learning has shown its ability to provide acceptable estimation results based on large-scale datasets. In this section, we mainly review related work on self-supervised monocular depth estimation.

Fig. 1
Existing methods suffer from texture-copy, depth drift, and incomplete structure. (a) The depth of the person exhibits an incomplete structure problem in the Monodepth2 [12] output. (b) The depths of the person and the car suffer from a texture-copy problem in the result of Guizilini et al. [18]. (c) The car depth has a drift problem in the output of Klingner et al. [15].

Fig. 2
Overview of our method. We use a coarse-to-fine method comprising coarse depth estimation and depth refinement. We train the depth and pose estimation models in the coarse depth estimation stage. In the depth refinement stage, we only train the depth refinement model.
Traditionally, structure-from-motion [20] and binocular stereo algorithms [21] have been used to estimate depth from a series of images or from stereo image pairs, respectively. Recently, learning-based algorithms have made great progress in monocular depth estimation [22,23]. Supervised methods train a network model on sparse depth labels provided by RGB-D sensors. However, it is not easy to collect high-quality ground-truth depths. As an alternative, self-supervised depth estimation has attracted attention, using stereo image pairs [11] or monocular video sequences [12,24] as training data. Self-supervised depth estimation trains the depth estimation model by warping nearby views to the target view based on the predicted depth and minimizing the photometric re-projection loss between the warped image and the target image.

Self-supervised stereo training
Deep3D [25] uses a deep neural network to generate 3D stereo image pairs from 2D images or video frames, and uses the photometric re-projection loss to train the depth network on stereo image pair datasets. The network predicts a probabilistic disparity map for the input image, and a depth-based image rendering layer produces the right image of the binocular pair. Garg et al. [26] proposed a deep neural network to directly estimate depth, trained with loss terms including a photometric re-projection loss and a depth smoothness loss. Monodepth [11] inputs a left image into a depth network and predicts left-right disparities to enforce mutual consistency; it uses a photometric re-projection loss and introduces a left-right disparity consistency loss. The methods in Refs. [27] and [28] both use generative adversarial networks to train the depth network.

Self-supervised monocular training
Monocular video is more attractive than stereo-based supervision, as more frame sequences are available for use as supervision signals. Self-supervised monocular training needs to estimate the parameters of the depth and pose estimation models. The pose estimation network takes a finite series of frames as input and outputs the relative camera pose. The source frame is warped to the target frame based on the predicted depth and relative camera pose, and then the photometric error between the warped frame and the target frame is used to supervise the model during training [13].
The method in Ref. [13] was the first to use monocular video to train end-to-end depth and camera pose estimation networks. Mahjourian et al. [29] used a 3D geometric consistency loss to train the model. Godard et al. [12] made three innovations. First, they proposed a minimum photometric re-projection loss to address the problem of occluded pixels. Second, they designed an auto-masking loss to ignore training pixels that violate relative camera motion assumptions. Finally, they upsampled the predicted depth maps to the input resolution before computing all losses, to reduce texture-copy artifacts.
Multi-task training strategies are also available to improve the accuracy of depth estimation. Yang et al. [17] constrained the depth to be consistent with the surface normal and image edges. Ying and Shi [24] learned depth, optical flow, and pose together and used the predicted depth and optical flow to mask moving objects during training. Zhu et al. [30] used edge consistency between the semantic segmentation and depth map as a supervision signal. Klingner et al. [15] used the learned semantic information to eliminate the influence of moving objects when computing photometric re-projection loss.
Self-attention (Transformer) [31] has improved the performance of natural language processing systems by better handling of long-range dependencies between words. In addition, self-attention has been applied in computer vision tasks such as semantic segmentation [32] and depth estimation [33][34][35]. Johnston and Carneiro [36] used the ResNet-101 network to encode the input image and then passed it through a self-attention module [31] to explore contextual information, allowing the inference of similar depth values in discontiguous regions of the input image. However, as the self-attention module requires much memory, they only incorporated it into the encoder output layer.
Attention mechanisms have achieved great success in many visual tasks, such as image classification, object detection, and semantic segmentation [37]. We improve the lightweight attention module CBAM [19] to give a normalized convolutional block attention module (NCBAM). We then apply the NCBAM in multiple places, including the depth estimation network, the relative pose estimation network, and the depth refinement network, to improve the accuracy of the depth and pose estimation models.

Method
In this section, we give a detailed description of our method (see Fig. 3). First, we introduce the improvements to CBAM to give NCBAM. Then, we describe the coarse depth and pose estimation methods. Finally, we present the depth refinement approach. We use the Monodepth2 [12] network as a baseline.

NCBAM attention module
The convolutional block attention module (CBAM) [19] is a lightweight module that can aggregate deep features. It sequentially infers attention maps using two separate sub-modules, channel and spatial, as shown in Fig. 4. The attention maps are multiplied by the input feature map for adaptive feature refinement. The CBAM attention module can learn correlations between an object and its surrounding environment. However, incorporating CBAM directly into depth estimation networks does not completely solve the problems of texture-copy and depth drift (as shown in Fig. 12 later).
We improve upon the CBAM module in the following ways to give the normalized convolutional block attention module (NCBAM). To reduce the differences between global average pooling and global max pooling in the channel and spatial attention modules, we convert the input feature to the range (−1, 1) using the tanh function. We use the activation function softplus(x) = log(1 + e^x) to replace relu(x) = max(0, x) in the shared network of the channel sub-module and the convolution layer of the spatial sub-module. The softplus activation can be seen as a smooth approximation of relu, avoiding excessive neuron death during training. Experimental comparisons show that the NCBAM module produces better results than the CBAM module.
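To make the improvements concrete, the following is a minimal PyTorch sketch of an NCBAM block, assuming a CBAM-style layout with a reduction ratio of 16 and a 7 × 7 spatial kernel (both illustrative choices, not taken from the paper); the placement of tanh before pooling and of softplus in the two sub-modules follows our reading of the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCBAM(nn.Module):
    """Normalized CBAM: tanh-normalized inputs + softplus activations (sketch)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.Softplus(),                      # replaces ReLU used in CBAM
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: conv over pooled channel maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        n = torch.tanh(x)                       # normalize to (-1, 1) before pooling
        b, c, _, _ = n.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(n.mean(dim=(2, 3)))
        mx = self.mlp(n.amax(dim=(2, 3)))
        ca = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention from channel-wise average and max maps.
        n = torch.tanh(x)
        s = torch.cat([n.mean(dim=1, keepdim=True),
                       n.amax(dim=1, keepdim=True)], dim=1)
        sa = torch.sigmoid(F.softplus(self.spatial(s)))
        return x * sa
```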

Coarse depth estimation
We need to train the depth and pose estimation networks simultaneously when training on monocular video. We incorporate the NCBAM module into the depth estimation network [12], a U-Net with a residual-block encoder and a decoder with skip connections.

Fig. 3
Our networks. (a) Depth network: a U-Net with a residual-block encoder and a decoder with skip connections, with the NCBAM module incorporated. (b) Pose network: the standard pose network [12] with the NCBAM module incorporated. (c) NCBAM module. (d) Depth refinement network, which uses the color image as guidance to refine the coarse depth.

Fig. 4
The CBAM attention module [19] has two sequential sub-modules: channel and spatial.
The depth estimation network $f_D : I \to D$ predicts the depth for every pixel in the target image $I_t$. The pose estimation network $f_T : (I_t, I_{t'}) \to T_{t \to t'}$ predicts the camera transformation relating the target image $I_t$ to the source image $I_{t'}$. Based on the learned depth and pose, we warp the adjacent frames $\{t-1, t+1\}$ to the target frame, using the photometric re-projection loss as the optimization objective. We follow the multi-scale depth output strategy proposed in Ref. [11]. First, we input encoder features into deep convolutions and output a low-resolution depth map. Then, we apply three additional stages of upsample-convolution, receiving skip connections from the ResNet encoder, to generate depth maps at the corresponding resolutions.
The NCBAM attention module can aggregate deep features and extract correlations between an object and its surrounding environment. Therefore, we incorporate the NCBAM into the ResNet multi-scale features and into the skip connections associated with the decoder layers of the U-Net architecture (the SK feature), as shown in Fig. 3(a). Skip connections between the encoder and the associated decoder layers keep high-level information in the final depth output. With the NCBAM module incorporated, the depth estimation model can overcome the texture-copy and depth drift problems.
The output of our depth network is a pixel-wise disparity probability over multiple disparity layers, giving a discrete disparity volume (DDV) [38]. We input the SK feature into a 2D convolutional layer with filters of size 3 × 3 and output a K-channel disparity volume $P = \{P_1, \ldots, P_K\}$ with K disparity layers, where layer k corresponds to the disparity value
$$d_k = d_{\min} + (k-1)\,\Delta_d$$
where $d_{\min}$ and $\Delta_d$ are the minimum disparity value and the disparity interval, respectively. A depth-wise softmax operation processes P to produce an actual probability map for each disparity plane:
$$P_d^{(k)} = \frac{\exp(P_k)}{\sum_{j=1}^{K}\exp(P_j)}$$
We extract the final disparity as the weighted sum of the disparity probabilities $P_d$ (a soft-argmax; see the sketch below):
$$d = \sum_{k=1}^{K} d_k\, P_d^{(k)}$$

For pose, we use a widely used backbone [12], which takes two images at different time steps as input and learns the relative camera pose transformation $T_{t \to t'} \in SE(3)$ between the images recorded at time steps t and t'. The special Euclidean group SE(3) defines the set of all possible rotations and translations; such transformations are usually represented by 4 × 4 matrices. Following Ref. [12], we predict the six degrees of freedom of the pose.
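Returning to the disparity volume, the soft-argmax above can be sketched as follows; the default values of d_min, Δd, and K match those reported later in our implementation details, while the tensor layout (B, K, H, W) is an assumption.

```python
import torch

def disparity_from_volume(volume, d_min=1e-5, delta_d=0.01):
    """Soft-argmax over a discrete disparity volume (B, K, H, W) -> (B, 1, H, W)."""
    K = volume.shape[1]
    # Disparity value assigned to each of the K layers.
    d_k = d_min + delta_d * torch.arange(K, dtype=volume.dtype,
                                         device=volume.device)
    prob = torch.softmax(volume, dim=1)          # per-pixel distribution over layers
    disp = (prob * d_k.view(1, K, 1, 1)).sum(dim=1, keepdim=True)
    return disp
```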
We also incorporate the NCBAM into the pose network, as shown in Fig. 3(b). The NCBAM can learn related features across the two input images, enhancing the accuracy of the pose model and thereby increasing the accuracy of the depth estimation model. First, we input a pair of color images into the ResNet-18 network to extract the corresponding deep features. Then, we input those deep features into the NCBAM module to learn their correlation (CF features). Finally, we concatenate the CF features and pass them through a series of 2D convolution layers to output a single six-degrees-of-freedom relative pose.
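A minimal sketch of the pose regression head follows, under the assumption that the NCBAM-refined CF features of the two frames are concatenated and reduced to a six-degrees-of-freedom vector; the channel widths and the 0.01 output scaling are borrowed from Monodepth2-style implementations, not from the paper.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Sketch: concatenated CF features -> 6-DoF relative pose (rotation + translation)."""
    def __init__(self, feat_channels=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(2 * feat_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, 1),
        )

    def forward(self, cf_t, cf_s):
        # cf_t, cf_s: NCBAM-refined ResNet-18 features of the target and source frames.
        out = self.convs(torch.cat([cf_t, cf_s], dim=1))
        return 0.01 * out.mean(dim=(2, 3))  # (B, 6): 3 rotation + 3 translation
```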

Training the coarse depth network
Following Ref. [12], training our coarse depth estimation model is mainly based on minimizing the per-pixel photometric re-projection loss between the source image $I_{t'}$ and the target image $I_t$, using the learned relative pose $T_{t \to t'}$ and depth $D_t$. The photometric re-projection loss is defined as
$$L_p = \mu \min_{t'} pe(I_t, I_{t' \to t}) \qquad (5)$$
where $pe(\cdot)$ is the photometric reconstruction error and $t' \in \{t-1, t+1\}$: we use the two frames temporally adjacent to $I_t$ as the source frames [12]. $\mu$ is a binary mask that filters out stationary points:
$$\mu = \Big[\min_{t'} pe(I_t, I_{t' \to t}) < \min_{t'} pe(I_t, I_{t'})\Big]$$
where $[\cdot]$ is the Iverson bracket. The binary mask $\mu$ keeps pixels where the re-projection error of $I_{t' \to t}$ is lower than that of the un-warped image $I_{t'}$, indicating that the pixel is not stationary relative to the camera. Minimizing the re-projection loss significantly reduces artifacts along object boundaries in the image, leading to better accuracy. The re-projected image is defined as
$$I_{t' \to t} = I_{t'}\big\langle proj(D_t, T_{t \to t'}, K)\big\rangle$$
where $\langle\cdot\rangle$ is the sampling operator and $K \in \mathbb{R}^{3\times 3}$ is the camera intrinsic parameter matrix, identical for all images. We apply differentiable bilinear sampling [39] to sample the source images. $proj(\cdot)$ returns the 2D coordinates of the projected depths $D_t$ in $I_{t'}$ [40]:
$$p_{t'} \sim K\, T_{t \to t'}\, D_t(p_t)\, K^{-1} p_t$$
where $p_t$ denotes a pixel.
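A compact sketch of the masked minimum re-projection loss defined above, assuming the per-source error maps pe(I_t, I_{t'→t}) and pe(I_t, I_{t'}) have already been computed and stacked along a source dimension; the shapes are illustrative.

```python
import torch

def masked_reprojection_loss(pe_warped, pe_identity):
    """Minimum re-projection loss with the Iverson-bracket auto-mask (sketch).

    pe_warped:   (B, 2, H, W) errors pe(I_t, I_{t'->t}) for t' in {t-1, t+1}
    pe_identity: (B, 2, H, W) errors pe(I_t, I_{t'}) for the un-warped sources
    """
    min_warped, _ = pe_warped.min(dim=1)
    min_identity, _ = pe_identity.min(dim=1)
    mu = (min_warped < min_identity).float()  # binary mask filtering stationary points
    return (mu * min_warped).mean()
```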
Monodepth2 [12] encourages neighbouring pixels to have similar depths, and uses an edge-aware depth smoothness loss $L_s$, weighted by image gradients, to improve predictions around object boundaries:
$$L_s = |\partial_x \bar{D}_t|\, e^{-|\partial_x I_t|} + |\partial_y \bar{D}_t|\, e^{-|\partial_y I_t|}$$
where $\partial_x$, $\partial_y$ are gradient operators in x and y, respectively, and $\bar{D}_t = D_t / \overline{D}_t$ is the mean-normalized inverse depth from Ref. [42], used to discourage shrinking of the estimated depth. The final loss is the weighted sum of the masked photometric re-projection loss $L_p$ and the smoothness loss $L_s$:
$$L = L_p + \lambda L_s$$
where $\lambda$ weights the smoothness term; in our experiments, we set it to 0.01.
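The edge-aware smoothness term translates directly into a few tensor operations; this sketch assumes disp is the inverse depth (disparity) of shape (B, 1, H, W) and img the color image of shape (B, 3, H, W).

```python
import torch

def smoothness_loss(disp, img):
    """Edge-aware smoothness on mean-normalized disparity (sketch, cf. [12, 42])."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    # Down-weight disparity gradients that coincide with strong image edges.
    wx = torch.exp(-(img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True))
    return (dx * wx).mean() + (dy * wy).mean()
```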

Depth refinement
In Section 3.2, we incorporate the NCBAM module into the depth estimation network to tackle the problems of texture-copy and depth drift, which can improve the accuracy of the initial depth estimates. However, these estimates are still imperfect. In particular, the depths may exhibit incomplete structure in sharp object regions, as shown in Fig. 7.
We design a refinement network that uses the color image as guidance to refine the coarse depth, to deal with the incomplete structure problem. First, we input the color image and the corresponding coarse depth into two series of 2D convolution layers to extract their features. Then, we concatenate the output features and pass them through a 2D convolution to generate a color-depth feature. Meanwhile, we input the color feature into the NCBAM module, allowing the network to exploit the color feature more fully. Finally, we concatenate the color-depth feature and the refined color feature and pass them through a series of 2D convolution layers to output a depth residual. The depth residual is added to the coarse depth to give the refined depth. Table 1 presents a detailed specification of the depth refinement network.
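The data flow of the refinement network can be sketched as below, reusing the NCBAM sketch given earlier; the layer counts and channel widths here are illustrative stand-ins for the actual specification in Table 1.

```python
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    """Sketch of the depth refinement network; widths are illustrative."""
    def __init__(self, ch=32):
        super().__init__()
        self.color_enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(True),
                                       nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True))
        self.depth_enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(True),
                                       nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True))
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)   # color-depth feature
        self.ncbam = NCBAM(ch)                            # refine the color feature
        self.head = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(True),
                                  nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, img, coarse_depth):
        fc = self.color_enc(img)
        fd = self.depth_enc(coarse_depth)
        cd = self.fuse(torch.cat([fc, fd], dim=1))
        residual = self.head(torch.cat([cd, self.ncbam(fc)], dim=1))
        return coarse_depth + residual    # refined depth = coarse + residual
```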

Training the depth refinement model
Depth and surface normals are two highly correlated entities. Inspired by Ref. [43], we design a normal consistency loss between the coarse depth $D$ and the refined depth $\hat{D}$:
$$L_n = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{u_i \cdot \hat{u}_i}{\|u_i\|\,\|\hat{u}_i\|}\right)$$
where $N$ denotes the number of pixels, $i$ indexes pixels, $u_i = (\partial_x D_i, \partial_y D_i)$, and $\hat{u}_i$ is defined analogously for $\hat{D}$. Angle minimization is performed by maximizing the dot product. We also use the photometric re-projection loss in Eq. (5), with the camera pose model trained in the coarse depth estimation stage. Here, we use multi-scale structural similarity, MS-SSIM [44], with $s = 4$ scales, so the photometric reconstruction error function becomes
$$pe(I_t, I_{t' \to t}) = \frac{\alpha}{2}\big(1 - \text{MS-SSIM}(I_t, I_{t' \to t})\big) + (1-\alpha)\,\|I_t - I_{t' \to t}\|_1$$
where $\alpha$ balances the structural and L1 terms, and the photometric re-projection loss is
$$L_p = \mu \min_{t'} pe(I_t, I_{t' \to t})$$
The final learning objective of our depth refinement network is
$$L_{\text{refine}} = L_p + \gamma_n L_n + \gamma_s L_s \qquad (15)$$
where $\gamma_n$ and $\gamma_s$ are hyperparameters controlling the significance of the normal term $L_n$ and the patch smoothness term $L_s$, respectively. In our experiments, $\gamma_n$ is set to $10^{-4}$ and $\gamma_s$ is set to $10^{-5}$.
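A sketch of the normal consistency term, assuming the gradient vectors u_i are built from forward differences of the two depth maps; the cropping merely aligns the x and y gradients to a common size.

```python
import torch

def normal_consistency_loss(coarse, refined):
    """1 - cosine similarity between depth-gradient vectors (sketch)."""
    def grad_vec(d):
        gx = d[:, :, :, 1:] - d[:, :, :, :-1]
        gy = d[:, :, 1:, :] - d[:, :, :-1, :]
        # Crop to a common size and stack into per-pixel gradient vectors.
        return torch.stack([gx[:, :, :-1, :], gy[:, :, :, :-1]], dim=-1)

    u, v = grad_vec(coarse), grad_vec(refined)
    cos = torch.cosine_similarity(u, v, dim=-1, eps=1e-6)
    return (1.0 - cos).mean()  # minimize the angle by maximizing the dot product
```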

Results and discussion
This section presents experimental results to verify the effectiveness of our approach. Firstly, we describe the experimental datasets and implementation details. Secondly, we present evaluations of our method on various testing configurations. Thirdly, we perform an ablation study to demonstrate that the NCBAM module can improve the accuracy of the predicted depths. Finally, we apply our predicted depths to novel view synthesis and describe limitations of our depth estimation model.

Datasets
We trained our overall network model on the standard KITTI benchmark [58]. The KITTI dataset contains rectified stereo pairs from 61 scenes (about 42,382 stereo frames), mainly driving scenarios; the image size is 1242 × 375 pixels. We follow the Eigen split [22], which provides 39,810 monocular training sequences of three frames each, 4424 validation sequences, and 697 images for evaluation. Following previous work [13], we remove static frames before training, and we evaluate depths only up to a fixed range of 80 m, per standard practice [12]. We use the same intrinsic parameters for all images, setting the camera principal point to the image center and the focal length to the average of all focal lengths in KITTI.

Training dataset augmentation
For the training dataset, we resized all images to the standard resolution (640 × 192) or high resolution (1024 × 320). We augmented the training dataset with horizontal flips, and applied random color adjustments to 50% of samples: contrast ±0.2, saturation ±0.2, hue ±0.1, and brightness ±0.2. The color-adjusted images were used only as input to the depth and pose networks; the original color images were used to compute the training loss.
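One way to implement this with torchvision is sketched below; the pairing of a loss image and a network-input image follows the description above, while the helper name augment is ours.

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                      saturation=0.2, hue=0.1)

def augment(img):
    """Returns (loss image, network-input image) for one training sample (sketch)."""
    if random.random() < 0.5:
        img = TF.hflip(img)  # geometric flip applies to both copies
    aug = color_jitter(img) if random.random() < 0.5 else img
    return img, aug          # only the jittered copy feeds the networks
```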

Depth evaluation metrics
To evaluate the depth estimation model, we used four error metrics and three accuracy metrics, as in Ref. [23]. The four error metrics measure the difference between predicted and ground-truth depth: absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMSE), and logarithmic root mean square error (RMSE log). The three accuracy metrics give the fraction δ of predicted depths in an image whose ratio and inverse ratio to the ground truth are within the thresholds 1.25, 1.25², and 1.25³.
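For reference, the seven metrics can be computed as follows, where gt and pred are 1D arrays of valid ground-truth and predicted depths for one image; this is the standard formulation, not code from the paper.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Four error metrics and three accuracy metrics, as in Ref. [23]."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```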

Implementation details
In our experiments, we set the number of disparity layers K = 98, the minimum disparity value $d_{\min} = 10^{-5}$, and the disparity interval $\Delta_d = 0.01$. We implemented our work in the PyTorch framework [59] and trained on a single Nvidia 2080Ti. We used ResNet-18, ResNet-50, and ResNet-101 as encoders for the depth estimation network. For coarse depth estimation, we used the Adam optimizer [60] with $\alpha = 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$, training for 25 epochs with a batch size of 8. For depth refinement, $\alpha = 10^{-5}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$, training for 13 epochs with a batch size of 6. Because the ResNet-101 network needs more memory, its batch size was set to 4 for coarse depth estimation, and we did not train it in the depth refinement stage.

Depth evaluation on KITTI dataset
We evaluated our depth estimation model on the Eigen split test dataset [22]. Table 2 presents the results, which demonstrate that our method outperforms existing methods on all seven evaluation metrics. A qualitative evaluation of our coarse depth estimation model is provided in Fig. 5, comparing it to results generated by the methods in Refs. [12,18,36,56]. Unlike those methods, our estimated depth maps have complete structures for objects such as the human body. The estimated depth maps from the comparator methods exhibit texture-copy phenomena in areas such as cars, but our approach overcomes these problems. The depth evaluation results verify that the NCBAM module is effective for monocular depth estimation. We also qualitatively evaluated our coarse depth estimation model on the Cityscapes test dataset [57]: see Fig. 6.

Fig. 5
Comparisons to state-of-the-art self-supervised monocular depth estimation methods: Monodepth2 [12], Guizilini et al. [18], Johnston and Carneiro [36], and WaveletMonodepth [56], using examples from the Eigen split test dataset [22].

The Cityscapes and KITTI datasets have some differences; our model, trained only on KITTI, was used to predict depths on Cityscapes. Our predicted depth maps are better than those from the methods in Refs. [12,18,36,56], whose predicted depths exhibit texture-copy, depth drift, and incomplete structure problems in areas such as cars, people, and landmarks. The benefits of the depth refinement model are shown in Table 2 and Fig. 7. Compared to the coarse depth estimation model, the results of the depth refinement model are further improved. In Fig. 7, we show qualitative results of the coarse and refined depth estimation models; refined depths are better on thin structures such as poles. Table 2 and Fig. 7 show that our depth refinement network is effective and can refine the coarse depth.
We compared our method to Monodepth2 [12] and Johnston and Carneiro [36] using a variety of encoder networks, including ResNet-18, ResNet-50, and ResNet-101. Table 3 shows the quantitative results: ours are better than those of both methods. Figure 8 shows a qualitative comparison between our method and Monodepth2 [12] with the ResNet-18 encoder and an input image size of 1024 × 320. Unlike Monodepth2 [12], our depth estimation model can deal with the texture-copy problem and produce clear depth for sharp objects in the image. Figure 9 shows a comparison between our method and the results of Johnston and Carneiro [36] using the ResNet-101 encoder network; our approach can predict depths of delicate structures.

Depth evaluation on Make3D dataset
Fig. 7
Comparisons between the coarse depth and refined depth. Depth maps of sharp objects are improved by the depth refinement network.

In Table 4, we provide quantitative evaluation results on the Make3D dataset [61] using our models trained on the KITTI dataset. We used the same testing protocol as Monodepth2 [12] and the evaluation criteria from Monodepth [11]. For the Make3D dataset, we evaluated on a center crop of 2 × 1 ratio and used median scaling [12], because the ground-truth depth and corresponding input image were not well aligned. Table 4 shows that our method produces better results than previous self-supervised methods. Figure 10 shows our depth prediction results on the Make3D test dataset. These results demonstrate that the estimated depth is credible even though the depth estimation model was not trained on Make3D.
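Median scaling, as used here and in Monodepth2 [12], aligns the unknown scale of a monocular prediction to the ground truth per image; a one-line sketch:

```python
import numpy as np

def median_scale(pred, gt):
    """Align prediction scale to ground truth via the ratio of medians (sketch)."""
    return pred * (np.median(gt) / np.median(pred))
```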

Odometry evaluation
We evaluated the pose model to validate the effectiveness of the NCBAM on pose estimation. Following Monodepth2 [12], our pose network was trained on sequences 0-8 of the KITTI odometry dataset [58] and tested on sequences 9 and 10. Generally, a pose network takes five frames as input [13,16] and predicts transformations. Our pose network baseline is Monodepth2 [12]: the input comprises two frames, and the output is the relative pose transformation between that pair of frames. To evaluate a two-frame model on the five-frame test sequences, Monodepth2 [12] makes separate predictions for each of the four pairs of frames in each set of five and combines them to form local trajectories; we follow the same testing protocol. Here, our pose model was trained for 11 epochs. Evaluation results are shown in Table 5: the accuracy of our pose model is better than that of Monodepth2 [12], showing that the NCBAM can enhance the pose model's results.
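Chaining the four pairwise predictions into a five-frame local trajectory amounts to composing 4 × 4 transforms; this sketch assumes each predicted matrix maps frame i + 1 into frame i (the opposite convention would use the matrix inverses).

```python
import numpy as np

def local_trajectory(pairwise_T):
    """Compose four pairwise 4x4 poses into a 5-frame trajectory (sketch)."""
    poses = [np.eye(4)]
    for T in pairwise_T:
        poses.append(poses[-1] @ T)  # accumulate relative transforms
    return poses
```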

Ablation study
To better validate the effectiveness of NCBAM in coarse depth estimation, we performed an ablation study; Table 6 shows the results. We start from the baseline, Monodepth2 [12] + DDV with the ResNet-18 encoder network (1st row). Then, we incorporate the NCBAM module into the depth estimation network (2nd row), into the pose estimation network, and into both networks. Figure 11 shows qualitative results of the ablation study; the best results are obtained when NCBAM is incorporated into both the depth and pose networks. In Fig. 12, we compare depth estimation models with the NCBAM and CBAM modules; the NCBAM version provides better results. Table 6, Fig. 11, and Fig. 12 show that the proposed NCBAM module is effective.
Fig. 11
Ablation study. Incorporating NCBAM into both the depth and pose networks produces the best results.

In Table 7, we illustrate the effectiveness of the improvements made to CBAM. The first converts the input feature to the range (−1, 1) using the tanh function in the channel and spatial attention modules (CBAM Tanh). The second uses the activation function softplus instead of relu in the shared network of the channel sub-module and the convolution layer of the spatial sub-module (CBAM softplus). Both changes to CBAM improve depth estimation, and the best results are obtained when both improvements are used simultaneously, i.e., NCBAM.
The sigmoid activation sigmoid(x) = 1/(1 + e^{−x}) maps the input to the range (0, 1). Table 7 compares the use of the tanh and sigmoid functions for normalization, and shows that the tanh function produces better results.

Limitations
Although our method can overcome the three targeted problems above, it shares some limitations with other methods. One is that it cannot effectively predict the depths of moving objects. Figure 13 presents an example generated by our approach and other state-of-the-art self-supervised monocular depth estimation methods. Unfortunately, all methods fail to predict the person's depth, since the KITTI training set is collected in scenes almost entirely lacking humans.

Conclusions
In this paper, we have presented a new self-supervised monocular depth estimation method. Previous methods typically produce predicted depth maps with incomplete structures, texture-copy artifacts, and depth drift problems. We improved the CBAM attention module to provide NCBAM, incorporated it into our networks, and proposed a coarse-to-fine approach to address these problems. Extensive experiments comparing our method to state-of-the-art methods validate its effectiveness. In future, we will further investigate depth estimation in complex scenes containing moving objects.

Author contributions
Yuanzhen Li conceived and designed the study, and collected the data. All authors analyzed the data and were involved in writing the manuscript.