1 Introduction

Depth estimation is an important task in scene perception, with a wide range of applications such as autonomous driving, intelligent transportation, 3D reconstruction, and virtual reality. However, traditional methods for acquiring depth information, such as Lidar or Kinect sensors [1], have limitations in certain situations. For example, Lidar is not suitable for medical applications, like gastroscopy, due to its large size and high cost [2], and Kinect cannot be used in bright sunlight [3]. In contrast, visible-light cameras are commonly used in depth estimation tasks [2, 3] because they are cost-effective and compact. Two main approaches for depth estimation using camera sensors are monocular and binocular solutions [4]. Although binocular depth estimation is a possible solution, it usually suffers from occlusion, and its computational load and hardware cost are higher than those of a monocular camera [5]. Therefore, in recent years, monocular depth estimation methods have gained popularity as a promising and feasible solution [4, 5].

1.1 Traditional Machine Learning Methods

Recovering depth from camera sensors using traditional machine learning methods has long been a subject of research. There are two main branches of traditional machine learning methods in monocular depth estimation, i.e., parametric learning methods and non-parametric learning methods.

Parametric learning methods obtain the parameters of a model through training and have been widely adopted for depth estimation from monocular camera sensors [6,7,8]. For example, Saxena et al. [6] modeled the mapping relationship between the input image characteristics and the output depth by using a Markov random field (MRF). Liu et al. [7] optimized the depth map by constructing a two-layer MRF model that used semantic labels as auxiliary information and took pixels and super-pixels as nodes. Wang et al. [8] described the correlation between RGB images and the corresponding depth maps by adopting a kernel function in a nonlinear space, and then learned parameters from image patches for depth estimation. However, these methods all require that the relationship between sensor-collected RGB images and the inferred depths can be established by a parametric model, which is difficult to formulate reliably for the real-world mapping relationship. Therefore, the prediction accuracies of parametric learning methods are usually limited.

The methods based on non-parametric learning are another widely adopted solution for depth estimation using camera sensors [9,10,11]. These methods infer depth by using existing datasets for similarity retrieval. For example, Karsch et al. [9] used depth transfer to search for the image sequence that most closely resembles the input image. Liu et al. [10] obtained the depth map using a discrete–continuous optimizer, where the continuous part encoded the super-pixels of the input features to generate depth and the discrete part described the relationships between adjacent super-pixels. Konrad et al. [11] performed median filtering on the retrieved similar images to generate an initial depth map and then used a bilateral cross-filtering method to smooth the initial depth map. However, these methods rely heavily on pixel-level image retrieval, which can be computationally expensive and may pose challenges in practical applications.

1.2 Deep Learning Methods

With the rapid development of convolutional neural networks (CNNs) in recent years, various deep learning approaches have been developed to recover depth information from RGB images captured by monocular camera sensors [12,13,14,15,16,17,18]. These methods can be generally classified into supervised learning methods and self-supervised learning methods.

Supervised learning methods for depth estimation from RGB images mainly involve constructing a loss function that measures the difference between the predicted depth and the ground-truth depth. The loss is then back-propagated through the neural network to update the weights. These methods typically achieve higher accuracy than unsupervised approaches. For example, in Ref. [19], a transformer-based module was proposed in which the depth range was divided into bins, and the center values of these bins were estimated adaptively per image. In Ref. [20], the Laplacian pyramid was incorporated into a decoder architecture and weight standardization was applied to the pre-activation convolution blocks of the decoder. Ranftl et al. [21] proposed a transformer-based method to replace the convolutional structure in the backbone for depth prediction tasks. However, these supervised learning methods are highly dependent on high-quality datasets with annotated labels, which limits their adaptability to other scenarios.

Alternatively, self-supervised learning methods can be used to overcome the limitations of supervised learning methods. There are two main branches of self-supervised learning methods in the literature, i.e., approaches based on stereo matching and approaches based on synthetic stereo pairs or monocular video. The methods based on stereo matching aim to minimize the cost volume calculated from the matched features. For example, Zbontar et al. [22] trained a deep neural network by computing the matching cost of two different patches. Wang et al. [23] combined a pyramid voting module (PVM) and a deep convolutional neural network (DCNN) in a new structure for depth estimation. These methods can deliver accurate results in real time, but they are prone to problems such as occlusion and texture-copy artifacts [23].

Recent studies have proposed methods to obtain depth information by training models on synthetic stereo pairs [13, 24] and monocular videos [4, 5] from camera sensors. Methods based on synthetic stereo pairs have shown promising results in monocular depth estimation; they differ from monocular-video-based methods in that the model is trained on stereo image pairs. For instance, in Ref. [13], the left image of the stereo pair was used to generate the depth map of the left view, and a warping method was then used to obtain the disparity map of the right image. Based on the generated depth map, a synthesized right image was obtained, and a loss function was designed by comparing it with the real right image. In Ref. [24], a CNN took the left image of a stereo pair to generate the corresponding left disparity map, which was then combined with the real right image to synthesize the left image. However, these methods are less attractive than those based on monocular videos because monocular camera sensors can acquire datasets more easily and conveniently.

Given the increasing availability of public datasets, methods based on monocular camera sensors are receiving increased attention from researchers. Recently, self-supervised methods have demonstrated the ability to synthesize the target RGB image from the depth map estimated by a CNN [4, 15, 25]. For instance, Zhou et al. [15] trained a depth estimation model along with an ego-motion network using a self-supervised method based on video datasets from camera sensors. However, this method may make the model fall into a local minimum because it is challenging to simultaneously estimate depth and predict ego-motion. To address this issue, various approaches have been proposed. Vijayanarasimhan et al. [5] estimated depth by using segmentation and object motion to construct a motion field, reducing the influence of ego-motion and relative motion. Klingner et al. [26] proposed a self-supervised semantic guidance method for depth estimation in dynamic scenes. Godard et al. [4] proposed an auto-mask to handle non-rigid motion and a per-pixel minimum re-projection loss to handle occlusions in depth estimation.

The most recent approaches have primarily focused on complex structures to improve estimation performance. For example, Fu et al. [27] proposed a regression method for depth estimation to obtain a continuous high-precision depth map. Hu et al. [28] proposed fusing features extracted at different scales and used a complex model to improve estimation accuracy. Chen et al. [29] built a depth estimation model by combining a residual pyramid decoder and four residual refinement modules. However, these methods did not consider that stacking too many pooling and CNN layers may cause information redundancy.

The merits and demerits of the above-mentioned methods are summarized in Table 1.

Table 1 A brief summary of the related methods based on camera sensors

1.3 Attention Mechanism and Feature Pyramid Network

Previous research has shown that incorporating learning mechanisms, such as attention, can significantly improve network performance without the need for additional supervision [30]. One such mechanism is the squeeze-and-excitation block (Seblock), proposed by Hu et al. [30], which increases the weights of informative features and suppresses less useful ones. Another example is the use of sequential channel and spatial attention maps for adaptive feature refinement in Woo et al. [31]. Additionally, self-attention, originally used in natural language processing, has been utilized in recent camera sensor-related tasks [32]. This study leverages the Seblock module to effectively extract image features.

In deep learning, increasing the receptive field is a significant challenge. While this can be achieved by adding more CNN layers, doing so also leads to the problem of vanishing gradients [33]. Previous work has primarily focused on integrating features from the backbone network [34,35,36]. As one of the classical methods, Lin et al. [37] built high-level semantic feature maps at each scale using a top-down framework with lateral connections. Liu et al. [38] proposed a bottom-up augmentation method to reduce the distance between lower and higher layers. Amirul Islam et al. [39] introduced gate units to control the flow of valid information and avoid ambiguity. More recently, Ghiasi et al. [40] utilized a neural architecture search (NAS) strategy to achieve a more effective yet complex feature fusion structure. To effectively use features from different layers, this study develops an improved bidirectional feature pyramid network (BiFPN) that connects features from different layers by learning per-layer weights rather than simply concatenating the features.

1.4 Contributions

In this study, a novel self-supervised monocular depth estimation method is proposed, inspired by ResNet [41]. The method integrates a channel attention module and an improved BiFPN for enhanced performance. The channel attention module extracts more useful information than the baseline by learning weights for different features, while the improved BiFPN is used as the decoding network, preserving fine-grained features and incorporating global information based on high-level features from multiple layers. The integration of the channel attention module and BiFPN improves the depth estimation accuracy of the developed method while reducing the number of parameters, which addresses the high network complexity commonly caused by stacked pooling and strided convolutions.

The main contributions of this study are twofold. Firstly, a fusion version of ResNet is proposed as the encoder, which effectively extracts features from input images by incorporating the channel attention mechanism in different layers of ResNet, thereby combining information from different channels and improving model performance. Secondly, an improved BiFPN, with a unique structure, is proposed as the decoder, which effectively generates high-precision depth maps of input images while preserving rich and effective details.

The proposed method is demonstrated to be effective and superior to state-of-the-art methods on two large-scale datasets, KITTI and Make3D. To the best of our knowledge, this technology has not been previously reported in studies on depth estimation based on camera sensors.

1.5 Paper Organization

The remainder of this paper is structured as follows: Sect. 2 introduces the proposed approach for depth estimation. Section 3 details the experimental setup, including the datasets and evaluation metrics used. Section 4 presents both quantitative and qualitative experimental results to demonstrate the superiority of the proposed method. Finally, Sect. 5 concludes this study.

2 Proposed Method

2.1 General Solution

A feasible solution for self-supervised training is to synthesize a new image, compare it to the original image, and use this comparison to construct an L1 loss for training the network. This approach does not require ground-truth labels; instead, it utilizes a supervisory signal derived from view synthesis to guide the convergence of the loss function. In this way, the depth \({{\varvec{D}}}_{t}\) of the target image \({{\varvec{I}}}_{t}\) and the ego-motion \({{\varvec{T}}}_{t\to s}\) between \({{\varvec{I}}}_{t}\) and the source image \({{\varvec{I}}}_{s}\) (\(s \in \{t-1, t+1\}\)) can be estimated from camera sensor data. The homogeneous coordinates of a pixel in \({{\varvec{I}}}_{t}\) are denoted as \({{\varvec{p}}}_{t}\). The projection \({{\varvec{p}}}_{s}\) of \({{\varvec{p}}}_{t}\) onto \({{\varvec{I}}}_{s}\) can be obtained by

$$ {\varvec{p}}_{s} = {\varvec{K}}\,{\varvec{T}}_{t \to s}\,{\varvec{D}}_{t}\left({\varvec{p}}_{t}\right){\varvec{K}}^{-1}{\varvec{p}}_{t} $$
(1)

where \({\varvec{K}}\) is the camera intrinsic matrix. Because the projected coordinates \({{\varvec{p}}}_{s}\) are generally non-integer, a differentiable bilinear sampling mechanism is employed to sample \({{\varvec{I}}}_{s}\) at these locations:

$$ {\varvec{I}}_{t}^{\prime}\left({\varvec{p}}_{t}\right) = \sum\limits_{i \in \{t,b\},\, j \in \{l,r\}} \omega^{ij}\,{\varvec{I}}_{s}\left({\varvec{p}}_{s}^{ij}\right) $$
(2)

where \(\{t, b, l, r\}\) denotes the top, bottom, left, and right pixel neighbors of the projected point, and \({\omega}^{ij}\) is the bilinear interpolation weight determined by the distance to each neighboring pixel, with \(\sum {\omega}^{ij} = 1\).
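To make the view-synthesis step concrete, the following is a minimal PyTorch sketch of Eqs. (1) and (2), assuming 4 × 4 intrinsic and pose matrices and a single source view; the function name and tensor layout are illustrative rather than taken from any released implementation. Here, `torch.nn.functional.grid_sample` carries out the differentiable bilinear weighting of Eq. (2).

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, D_t, T_t_to_s, K, K_inv):
    """Synthesize the target view from a source image following Eqs. (1)-(2).

    I_s:       source image, shape (B, 3, H, W)
    D_t:       predicted target depth, shape (B, 1, H, W)
    T_t_to_s:  relative camera pose, shape (B, 4, 4)
    K, K_inv:  camera intrinsics and their inverse, shape (B, 4, 4)
    """
    B, _, H, W = D_t.shape
    device = D_t.device

    # Homogeneous pixel coordinates p_t, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    p_t = torch.stack((xs, ys, torch.ones_like(xs)), dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project: D_t(p_t) * K^{-1} * p_t -> 3D points in the target camera frame
    cam_points = D_t.view(B, 1, -1) * (K_inv[:, :3, :3] @ p_t)
    cam_points = torch.cat((cam_points, torch.ones(B, 1, H * W, device=device)), dim=1)

    # Project into the source view: p_s = K * T_{t->s} * (...)
    p_s = (K[:, :3, :] @ T_t_to_s) @ cam_points        # (B, 3, H*W)
    p_s = p_s[:, :2] / (p_s[:, 2:3] + 1e-7)            # divide by depth

    # Normalize to [-1, 1]; grid_sample then performs the bilinear weighting of Eq. (2)
    p_s = p_s.view(B, 2, H, W)
    grid = torch.stack((p_s[:, 0] / (W - 1) * 2 - 1,
                        p_s[:, 1] / (H - 1) * 2 - 1), dim=-1)
    return F.grid_sample(I_s, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)
```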

The synthesized target image \({{\varvec{I}}}_{t}^{\mathrm{^{\prime}}}\) is acquired from the above calculation. The L1 loss between \({{\varvec{I}}}_{t}\) and \({{\varvec{I}}}_{t}^{\mathrm{^{\prime}}}\) is then computed to obtain the photometric loss:

$$ L_{\text{ph}} = \min \sum\limits_{t} \left| {\varvec{I}}_{t} - {\varvec{I}}_{t}^{\prime} \right| $$
(3)

where the per-pixel minimum re-projection loss is used to address occlusion [4]. Then, the structural similarity (SSIM) loss is calculated to measure the similarity between \({{\varvec{I}}}_{t}\) and the synthesized image \({{\varvec{I}}}_{t}^{\mathrm{^{\prime}}}\):

$$ L_{\text{ssim}} = \frac{1 - \text{SSIM}\left({\varvec{I}}_{t}, {\varvec{I}}_{t}^{\prime}\right)}{2} $$
(4)

To make the estimated depth maps smoother while preserving edges, the following edge-aware smoothness loss is used:

$$ L_{\text{smooth}} = \left| \partial_{x} d_{t}^{*} \right| e^{-\left| \partial_{x} {\varvec{I}}_{t} \right|} + \left| \partial_{y} d_{t}^{*} \right| e^{-\left| \partial_{y} {\varvec{I}}_{t} \right|} $$
(5)

where \({d}_{t}^{*}= d/ \overline{d }\) is the mean-normalized depth, with \(d\) the predicted depth value and \(\overline{d }\) the mean of the predicted depth. Employing \({d}_{t}^{*}\) efficiently prevents the estimated depth from shrinking [42]. The final loss is then designed as:

$$ L = \mu \left( \tau L_{\text{ph}} + \left( 1 - \tau \right) L_{\text{ssim}} \right) + \gamma L_{\text{smooth}} $$
(6)

where the smoothness weight \(\gamma \) is set to 0.001 and the photometric weight \(\tau \) to 0.15, and \(\mu \) denotes the auto-mask used in Ref. [4] to mask out stationary pixels and object motion.
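A compact sketch of how the terms in Eqs. (3)–(6) can be combined is given below. It assumes a single synthesized source view, an externally provided per-pixel SSIM function, and an auto-mask \(\mu\) computed as in Ref. [4]; in Ref. [4] the photometric term is additionally taken as a per-pixel minimum over several source views, which is omitted here for brevity.

```python
import torch

def edge_aware_smoothness(depth, image):
    """Eq. (5): mean-normalized, edge-aware smoothness term L_smooth."""
    d = depth / (depth.mean(dim=(2, 3), keepdim=True) + 1e-7)           # d* = d / d_bar
    dx_d = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy_d = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    dx_i = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(I_t, I_t_synth, depth, mu, ssim_fn, tau=0.15, gamma=1e-3):
    """Eq. (6): L = mu * (tau * L_ph + (1 - tau) * L_ssim) + gamma * L_smooth.
    `mu` is the auto-mask of Ref. [4]; `ssim_fn` is an external per-pixel SSIM module."""
    l_ph = (I_t - I_t_synth).abs().mean(1, keepdim=True)                 # Eq. (3), per pixel
    l_ssim = ((1 - ssim_fn(I_t, I_t_synth)) / 2).mean(1, keepdim=True)   # Eq. (4)
    photometric = tau * l_ph + (1 - tau) * l_ssim
    return (mu * photometric).mean() + gamma * edge_aware_smoothness(depth, I_t)
```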

2.2 Architecture of the Proposed Method

The proposed network structure, as shown in Fig. 1, comprises two main branches: the upper branch estimates depth information (i.e., the upper part in Fig. 1), while the lower branch estimates pose information (i.e., the lower part in Fig. 1).

Fig. 1

The overall structure of the proposed method

In Fig. 1, the frames labeled −1, 0, and 1 represent three consecutive images in time. The frame labeled 0 is the target frame, while the frames labeled −1 and 1 are the frames immediately preceding and following it, respectively. The depth map of the target image is obtained through the depth network in the upper part of the figure, and the camera rotation and translation are obtained through the pose network in the lower part. The depth map is then back-projected into 3D space using the inverse of the camera intrinsic matrix to generate a point cloud, and the camera rotation and translation are used to align this point cloud with the corresponding input image. Finally, the point cloud of the target image is projected onto the 2D plane according to the camera intrinsics, and the final synthesized image is obtained through bilinear interpolation.

Both the depth network and the pose network have an encoding–decoding structure. The depth network incorporates two innovations: Seblocks in the encoder to extract features from different layers, and an improved BiFPN in the decoder to fuse multilayer features by learning feature weights. The encoder of the pose network has the same structure as the depth encoder (i.e., Seblocks are inserted in the encoder), but it receives two images as input to infer ego-motion, whereas the depth network needs only one image to estimate depth, as sketched below.
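The two branches can be summarized by the following sketch; the class and module names are illustrative placeholders for the encoder and decoder modules described in Sects. 2.3 and 2.4, and the 0.01 output scaling of the pose branch follows the implementation details given in Sect. 3.3.

```python
import torch
import torch.nn as nn

class DepthPoseModel(nn.Module):
    """High-level sketch of the two branches in Fig. 1 (module names are placeholders)."""

    def __init__(self, depth_encoder, bifpn_decoder, pose_encoder, pose_decoder):
        super().__init__()
        self.depth_encoder = depth_encoder   # ResNet50 with inserted Seblocks, single RGB frame
        self.bifpn_decoder = bifpn_decoder   # improved BiFPN, outputs a one-channel depth map
        self.pose_encoder = pose_encoder     # same encoder structure, but six-channel input
        self.pose_decoder = pose_decoder     # regresses axis-angle rotation and translation

    def forward(self, frame_t, frame_s):
        depth = self.bifpn_decoder(self.depth_encoder(frame_t))              # D_t from one image
        pose_feat = self.pose_encoder(torch.cat([frame_t, frame_s], dim=1))  # two frames -> 6 channels
        axisangle, translation = self.pose_decoder(pose_feat)
        return depth, 0.01 * axisangle, 0.01 * translation                   # small-pose scaling [42]
```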

2.3 Channel Attention Network

The Seblock [30] is applied to address the problem of information redundancy: the channel weights learned by the Seblock emphasize useful information and suppress less useful information. The diagram of the Seblock module is shown in Fig. 2. The Seblock is built upon a transformation \({{\varvec{F}}}_{\mathbf{t}\mathbf{r}}\): \({\varvec{X}}\to {\varvec{U}}\), with the input \({\varvec{X}}\in {{\varvec{R}}}^{{H}^{\mathrm{^{\prime}}}\times {W}^{\mathrm{^{\prime}}}\times {C}^{\mathrm{^{\prime}}}}\) and the output \({\varvec{U}}\in {{\varvec{R}}}^{H\times W\times C}\). Let \({\varvec{V}}=\left[{{\varvec{v}}}_{1},{{\varvec{v}}}_{2},\cdots ,{{\varvec{v}}}_{{\varvec{c}}}\right]\) denote the learned filter kernels, where \({v}_{c}\) is the parameter of the c-th filter. The outputs \({\varvec{U}}=\left[{{\varvec{u}}}_{1},{{\varvec{u}}}_{2},\dots ,{{\varvec{u}}}_{{\varvec{c}}}\right]\) of \({{\varvec{F}}}_{\mathbf{t}\mathbf{r}}\) are then obtained as

$$ {\varvec{u}}_{c} = {\varvec{v}}_{c} * {\varvec{X}} = \sum\limits_{s=1}^{c^{\prime}} {\varvec{v}}_{c}^{s} * {\varvec{x}}^{s} $$
(7)

where * denotes convolution, \({{\varvec{v}}}_{c}=[{{\varvec{v}}}_{{\varvec{c}}}^{1},{{\varvec{v}}}_{{\varvec{c}}}^{2},\dots ,{{\varvec{v}}}_{{\varvec{c}}}^{{\varvec{c}}\boldsymbol{^{\prime}}}]\), \({\varvec{X}}=[{{\varvec{x}}}^{1},{{\varvec{x}}}^{2},\dots ,{{\varvec{x}}}^{{\varvec{c}}\boldsymbol{^{\prime}}}]\), and \({{\varvec{v}}}_{{\varvec{c}}}^{{\varvec{s}}}\) is a 2D spatial kernel. For simplicity, bias terms are omitted. Because the transformation outputs cannot exploit global contextual information, a global average pooling operation is applied to expand their receptive field, as shown in Eq. (8).

$$ z_{c} = {\varvec{F}}_{\mathbf{sq}}\left({\varvec{u}}_{c}\right) = \frac{1}{H \times W} \sum\limits_{i=1}^{H} \sum\limits_{j=1}^{W} u_{c}(i,j) $$
(8)

where \(z\in {R}^{c}\), and \({{\varvec{F}}}_{\mathbf{s}\mathbf{q}}\) is the squeeze function that generates the statistic \({z}_{c}\) by average pooling over \({{\varvec{u}}}_{{\varvec{c}}}\). To fully capture the channel-wise dependencies, a simple but effective gating mechanism with a sigmoid function is employed:

$$ s = {\varvec{F}}_{\mathbf{ex}}\left({\varvec{z}},{\varvec{W}}\right) = \sigma\left(g\left({\varvec{z}},{\varvec{W}}\right)\right) = \sigma\left({\varvec{W}}_{2}\,\delta\left({\varvec{W}}_{1}{\varvec{z}}\right)\right) $$
(9)

where \(\delta \) and \(\sigma \) are the ReLU function and the sigmoid function, respectively, \({{\varvec{W}}}_{1}\in {R}^{\frac{c}{r}\times c}\), and \({{\varvec{W}}}_{2}\in {R}^{c\times \frac{c}{r}}\). To keep the module lightweight, the reduction ratio r is set to 16 [30]. Finally, the outputs are obtained by rescaling:

$$ \overline{{\varvec{x}}}^{c} = {\varvec{F}}_{\mathbf{scale}}\left({\varvec{u}}_{c}, s_{c}\right) = s_{c} \cdot {\varvec{u}}_{c} $$
(10)

where \(\overline{{\varvec{X}} }=\left[\overline{{{\varvec{x}} }^{1}},\overline{ {{\varvec{x}} }^{2}},\cdots ,\overline{{{\varvec{x}} }^{{\varvec{c}}}}\right]\), and \({{\varvec{F}}}_{\mathbf{s}\mathbf{c}\mathbf{a}\mathbf{l}\mathbf{e}}\left({{\varvec{u}}}_{{\varvec{c}}},{s}_{c}\right)\) denotes channel-wise multiplication between \({{\varvec{u}}}_{{\varvec{c}}}\epsilon {R}^{H\times W}\) and the scalar \({s}_{c}\).

Fig. 2

The diagram of the Seblock module [30]

Different from SENet [30], which uses the Seblock in the backbone to train the model, this study inserts the Seblock into the encoders of the depth network and the pose network. As illustrated in Fig. 1, the channel attention mechanism thus operates within the encoding–decoding structure.
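For reference, a compact PyTorch sketch of the Seblock defined by Eqs. (8)–(10) is given below; this follows the standard formulation of Ref. [30] with reduction ratio r = 16, while the exact insertion points within the encoders follow Fig. 1.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block (Eqs. (8)-(10)) with reduction ratio r = 16 [30]."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)              # Eq. (8): global average pooling
        self.excitation = nn.Sequential(                    # Eq. (9): sigma(W2 * ReLU(W1 * z))
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                      # channel statistics z_c
        s = self.excitation(z).view(b, c, 1, 1)             # channel weights s_c
        return u * s                                        # Eq. (10): channel-wise rescaling
```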

2.4 The Improved Bidirectional Feature Pyramid Network (BiFPN)

In Ref. [33], BiFPN is proposed as a method for efficiently improving network performance through multi-scale feature fusion. Compared to other methods [37,38,39,40], BiFPN has several unique features. Firstly, it simplifies the structure by removing nodes with only one input edge. Secondly, it adds an extra edge from input to output for more feature fusion. Thirdly, it utilizes a bidirectional path to achieve high-level feature fusion. Lastly, it addresses the issue of uneven input feature contributions by introducing additional weights for each input, allowing the network to learn the importance of each input feature. Figure 3 shows the specifics of the original BiFPN.

Fig. 3

The diagram of the original BiFPN module [33]

In Fig. 3, P3–P7 represent the feature levels with a resolution of \(1/{2}^{(i-2)}\) of the input image, where \(i=3, 4, \dots , 7\). For example, \({P}_{3}\) is the feature level with a resolution of \(1/{2}^{(3-2)}\) of the input image; if the input resolution is 192 \(\times \) 640, the P3 feature level has a resolution of 96 \(\times \) 320, because \(192/{2}^{(3-2) }=96\) and \(640/{2}^{(3-2) }=320\).

In this study, the BiFPN is used as the decoder to efficiently fuse features from multiple layers. To use the BiFPN efficiently, channel reduction is applied to resize the feature channels to fit the BiFPN inputs and to merge features from different layers. A further channel reduction (64 → 1) is then applied to obtain the final depth value. Figure 4 shows the improved BiFPN module, which is novel in the literature.

Fig. 4

The diagram of the improved BiFPN module used in the proposed method

In Fig. 4, the numbers in the circles represent the numbers of feature channels, and the different colors indicate features from different layers. The blue rectangles indicate dimension reduction from the original feature channels to 64, and the light-blue rectangles denote reduction of the channels to a single channel.
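The precise layer wiring of the improved BiFPN is given by Fig. 4. The sketch below, with illustrative channel counts and module names, only shows its two key ingredients: the learned, normalized fusion weights of Ref. [33], and the 1 × 1 channel reductions to 64 channels at the inputs and to a single channel at the output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion [33]: each input feature map receives a learned,
    non-negative weight instead of being plainly concatenated or summed."""

    def __init__(self, num_inputs, channels=64):
        super().__init__()
        self.fusion_weights = nn.Parameter(torch.ones(num_inputs))
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, features):
        # All inputs are assumed to be reduced to 64 channels and resized to a common resolution.
        w = F.relu(self.fusion_weights)
        w = w / (w.sum() + 1e-4)
        fused = sum(w[i] * f for i, f in enumerate(features))
        return self.conv(fused)

# Channel handling around the improved BiFPN (input channel count is illustrative):
reduce_to_64 = nn.Conv2d(256, 64, kernel_size=1)   # blue rectangles in Fig. 4
to_depth = nn.Conv2d(64, 1, kernel_size=1)         # final 64 -> 1 reduction to the depth map
```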

3 Experiments

3.1 Training and Test Datasets

3.1.1 KITTI

The KITTI dataset is one of the most widely used datasets in autonomous driving and computer vision tasks (e.g., visual odometry and SLAM). The training and testing data split used in this study, as well as in Refs. [15] and [43], is the same as that of Ref. [44]. As suggested by Zhou et al. [15], 39,810 monocular triplets without static images were used for model training, and 4424 images from camera sensors were used to evaluate the examined methods. Additionally, the same camera intrinsic matrix was used for all images, and the predicted depth was capped at 80 m, following the guideline of the KITTI dataset [44].

3.1.2 Make3D

The proposed method was further evaluated for its generalizability on the Make3D dataset [45]. Make3D, which is designed specifically for depth estimation tasks, consists of monocular RGB images and ground-truth depth data from camera sensors. However, it lacks stereo images and image sequences, making it a common test dataset for unsupervised methods [4]. Although it is not suitable for training unsupervised or stereo depth estimation methods due to its small size (only 534 images), it was used here to evaluate the proposed method. Image preprocessing involved central cropping because of the varying aspect ratios of the images in the Make3D dataset.

3.2 Evaluation Metrics

To quantitatively evaluate the performance of the proposed method against other state-of-the-art methods, five commonly used evaluation metrics are utilized [46]: absolute relative error (Abs Rel), square relative error (Sq Rel), root-mean-square error (RMSE), root-mean-square logarithmic error (\({\mathrm{RMSE}}_{\mathrm{log}}\)), and accuracy with threshold (\(\delta <{1.25}^{i}, i=1, 2, 3\)). These metrics are widely used in monocular depth estimation [4, 13, 24, 26]. They are defined as follows:

$$ \text{Abs Rel} = \frac{1}{\left|{\varvec{T}}\right|} \sum\limits_{y \in {\varvec{T}}} \frac{\left| y - y^{*} \right|}{y^{*}} $$
(11)
$$ \text{Sq Rel} = \frac{1}{\left|{\varvec{T}}\right|} \sum\limits_{y \in {\varvec{T}}} \frac{\left| y - y^{*} \right|^{2}}{y^{*}} $$
(12)
$$ \text{RMSE} = \sqrt{\frac{1}{\left|{\varvec{T}}\right|} \sum\limits_{y \in {\varvec{T}}} \left| y - y^{*} \right|^{2}} $$
(13)
$$ \text{RMSE}_{\log} = \sqrt{\frac{1}{\left|{\varvec{T}}\right|} \sum\limits_{y \in {\varvec{T}}} \left| \log y - \log y^{*} \right|^{2}} $$
(14)
$$ \text{Accuracy} = \%\ \text{of}\ y\ \text{s.t.}\ \max\left(\frac{y}{y^{*}}, \frac{y^{*}}{y}\right) = \delta < thr $$
(15)

where \(y\) is the predicted depth, \({y}^{*}\) is the ground-truth depth, \({\varvec{T}}\) is the collection of all evaluated pixels, \(\left|{\varvec{T}}\right|\) denotes the number of pixels, and \(thr\) denotes the threshold (i.e., \(thr={1.25}^{i}, i=1, 2, 3\)). The predicted and ground-truth depths are expressed in metres, while the evaluation metrics are dimensionless.
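For reference, the five metrics can be computed over the valid ground-truth pixels as in the sketch below, a straightforward NumPy implementation of Eqs. (11)–(15).

```python
import numpy as np

def depth_metrics(pred, gt):
    """Eqs. (11)-(15) on flattened arrays of valid ground-truth pixels (depths in metres)."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)                          # Eq. (11)
    sq_rel = np.mean((pred - gt) ** 2 / gt)                            # Eq. (12)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                          # Eq. (13)
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))      # Eq. (14)
    ratio = np.maximum(pred / gt, gt / pred)                           # Eq. (15)
    a1, a2, a3 = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```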

3.3 Implementation Details

The proposed method involves determining three parameters: the smoothness weight \(\gamma \), the photometric weight \(\tau \), and the learning rate, which were specified according to Ref. [4]. The model was trained for 20 epochs with a batch size of 16 using the Adam optimization algorithm [47]. The values of γ and τ were set to 0.001 and 0.15, respectively. The learning rate was set to \({10}^{-4}\) at the beginning and to \({10}^{-5}\) for the final five epochs. The input resolution was 192 × 640 for the KITTI dataset and 240 × 319 for the Make3D dataset. Following the settings in Godard et al. [4] and Chen et al. [48], the depth range was limited to 0–80 m for evaluation. As shown in Fig. 1, each layer in the encoding network downsamples the input features once, with each downsampling step halving the resolution. Each layer in the decoding network upsamples the input features, finally outputting a depth map with the same resolution as the input image. Following other depth estimation approaches [4, 49], the weights were pre-trained on ImageNet [50].
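A minimal sketch of this training configuration is given below; the model, dataset, and loss helper are assumed to be defined elsewhere and are named only for illustration.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=20):
    """Training loop sketch: Adam, batch size 16, lr 1e-4 dropping to 1e-5 for the final five epochs."""
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # MultiStepLR multiplies the lr by 0.1 after epoch 15, i.e. 1e-5 for the last five of 20 epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[epochs - 5], gamma=0.1)

    for epoch in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model.compute_losses(batch)   # hypothetical helper returning the loss of Eq. (6)
            loss.backward()
            optimizer.step()
        scheduler.step()
```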

The depth estimation network comprises an encoding network, which uses the ResNet50 architecture with inserted Seblock modules, and a decoding network featuring an improved BiFPN in a U-Net-style architecture that effectively extracts useful features from the inputs to produce a depth map.

The pose estimation network was also structured with a ResNet50 architecture and incorporated the Seblock module for feature extraction. To estimate the 6-DoF pose, which includes rotation and translation, the outputs were scaled by 0.01, following the approach in Wang et al. [42]. In order to take two images as input for estimating the 6-DoF pose, the pose network is modified to accept six-channel input [4]. Furthermore, to prevent overfitting, online data augmentation techniques, such as random changes of brightness, contrast, and saturation, were applied.

All the experiments were implemented in PyTorch [51] on a 3.50 GHz Intel(R) Core(TM) i5-7300HQ CPU with 64.00 GB RAM and one NVIDIA GeForce Titan Xp GPU. The change of the training loss with the number of training steps is illustrated in Fig. 5, which shows that the proposed method converges effectively to a stable level.

Fig. 5

The change of training loss with the number of training steps. The tick marks on the horizontal axis are not equally spaced

4 Results and Discussion

4.1 Comparison with the State-of-the-Art (SOTA) Methods

Seventeen SOTA methods for depth estimation were compared to demonstrate the advantages of the proposed method. Among them, six are supervised and eleven are self-supervised. The supervised methods include those found in Bhat et al. [19], Song et al. [20], Ranftl et al. [21], Eigen et al. [46], Liu et al. [52] and Kundu et al. [53]. The self-supervised methods include those proposed by Monodepth2 [4], Mahjourian et al. [12], Monodepth [13], Zhou et al. [15], SGDdepth [26], DDVO [42], Struct2depth [54], DualNet [55], GeoNet [56], Schellevis et al. [57] and Zhou et al. [58].

The estimation results when using different methods on the KITTI dataset are shown in Table 2. The presented results reveal that the Abs Rel, Sq Rel, RMSE, and \({\mathrm{RMSE}}_{\mathrm{log}}\) of the proposed method are 0.113, 0.763, 4.645, and 0.187, respectively. These numbers are improved by 1.74%, 15.50%, 4.48%, and 3.11%, respectively, when compared to Monodepth2 [4]. Additionally, the accuracies with thresholds 1.25, \({1.25}^{2}\), and \({1.25}^{3}\) are 0.874, 0.960, and 0.983, respectively, when using the proposed method. The slightly weaker performance of the proposed method on \(\delta <1.25\) and \(\delta <{1.25}^{2}\) is probably because of the simpler decoder design which only contains 8 M parameters. The proposed method demonstrates the best performance across all other evaluation metrics when compared to the other self-supervised methods.

Table 2 Quantitative comparison of the examined supervised and self-supervised methods

Many of the compared methods (e.g., [27,28,29, 33] and [58]) in Table 2 use stacked pooling or stride convolution to extract high-level features for depth estimation. Stacking too many pooling or stride convolution layers can lead to information redundancy [33]. For example, the VGG encoding network used in Zhou et al. [58] has 500 M parameters, which is five times more than the number of parameters in the proposed method. Due to the high complexity of stacked pooling and stride convolution, the performance of these compared methods is not satisfactory (as shown in Table 2). To address this issue, the proposed method utilizes a more efficient decoding network based on BiFPN and incorporates a channel attention mechanism to enhance its performance. The results in Table 2 show that the proposed method’s depth estimation performance surpasses that of the methods with stacked pooling or stride convolution.

The proposed method has the best performance among the self-supervised methods, as shown in Table 2. However, the supervised learning methods Lapdepth, DPT-Hybrid, and AdaBins achieve better results than the proposed method because they use labelled data for training, which can address the challenges of occlusion and ego-motion. Nonetheless, the proposed method still outperforms the other three supervised learning methods, namely those of Eigen et al. [46], Liu et al. [52], and Kundu et al. [53], demonstrating that the proposed self-supervised method can achieve performance comparable to supervised methods. The qualitative results shown in Fig. 6 also indicate that the proposed method has better performance, producing sharper thin objects such as poles compared with the estimation from Monodepth2. This could be attributed to the use of the Seblock together with the improved BiFPN module for depth estimation.

Fig. 6

Qualitative results for comparisons with the examined supervised and self-supervised methods

4.2 Ablation Study

In order to evaluate the impact of each component of the proposed method on depth estimation performance, ablation experiments were conducted. Both ResNet18 and ResNet50 were tested as the baseline encoder. As shown in Table 3, using ResNet50 as the encoder achieves better performance than using ResNet18. Then, the Seblock and the improved BiFPN module were incorporated into the ResNet50 baseline, and their impact on network performance was evaluated. As displayed in Table 3, the Sq Rel and RMSE of the ResNet50 baseline are 0.831 and 4.705, respectively. These two metrics were improved by 6.26% and 1.08%, respectively, when the Seblock was added to ResNet50, and by 6.38% and 0.32%, respectively, when the improved BiFPN was added. Furthermore, when the Seblock and the improved BiFPN were used together, Sq Rel and RMSE were further reduced to 0.763 and 4.645, respectively; compared with the ResNet50 baseline, these are improvements of 8.18% and 1.28% for the proposed ResNet50 + Seblock + BiFPN. The performances on Abs Rel, \({\mathrm{RMSE}}_{\mathrm{log}}\) and the accuracies with different thresholds when incorporating different modules are generally at the same level. The results on ResNet18 show similar trends to those on ResNet50. These results indicate that both the Seblock and the improved BiFPN contribute to the improved depth estimation performance.

Table 3 Ablation experiment results on KITTI

4.3 Robustness of the Proposed Method

The robustness of the proposed method was further evaluated by testing it on another popular depth estimation dataset, Make3D [45]. The central crop method, as suggested in Godard et al. [4], was used to process the sensor-collected images with different aspect ratios in the dataset. To ensure fairness of comparison, the model trained on KITTI was directly used for testing on Make3D without any fine-tuning. Eight state-of-the-art supervised and self-supervised methods were used for comparison to demonstrate the robustness of the proposed method. Three of the eight methods are supervised, which can be found in Refs. [9, 16] and [53], and the other five are self-supervised, including Monodepth2 [4], Monodepth [13], SharinGAN [59], Atapour et al. [60], and GASDA [61].

The quantitative comparison results are presented in Table 4. As indicated by the numbers in bold, the proposed self-supervised method obtains better depth estimation performance than the other self-supervised methods on Make3D. When compared with the supervised learning methods, the proposed method shows competitive performance, similar to the results in Table 2; only the supervised method in Ref. [16] performs better than the proposed method. Given that supervised methods can learn from accurately annotated labels, while unsupervised methods avoid the heavy reliance on ground-truth labels at the cost of some estimation accuracy [4], it is promising that the performance of the proposed method is close to, or even better than, that of the supervised methods.

Table 4 Quantitative comparison results on Make3D

The performance of the proposed method is also compared qualitatively with Monodepth2 [4], one of the most advanced self-supervised methods. The results in Fig. 7 show that the depth maps obtained using the proposed method capture more details from the input images and contain more accurate depth information, indicating superior performance compared with Monodepth2.

Fig. 7

Qualitative illustration results on Make3D

4.4 Limitations and Future Work

One limitation of the proposed method is that it may produce artifacts when synthesizing images. As shown in Fig. 8, blurry boundaries can occur when the target frame is obtained by interpolating from the first frame. Another limitation is that the proposed method may introduce errors when constructing the photometric loss from images synthesized from the previous and next frames. In future research, a new loss function may be designed to address this problem; for example, the target frame could be synthesized using only the previous frame in the continuous image sequence instead of the next frame, which may reduce the occurrence of artifacts.

Fig. 8

Notable distortions in the synthesized images, labeled with red rectangles

5 Conclusions

In this paper, an innovative approach for self-supervised monocular depth estimation is proposed, which combines the use of Seblock and an improved BiFPN module to process images based on ResNet50. The Seblock module improves depth map accuracy by strengthening the weights of useful features, and the improved BiFPN module effectively utilizes different levels of features from the encoder. Results on the KITTI dataset show that this proposed method outperforms current state-of-the-art self-supervised methods and even some supervised methods in terms of depth information estimation. The robustness of the proposed method is further demonstrated on the Make3D dataset, where it achieved competitive performance with examined supervised methods. The proposed method, being self-supervised, overcomes the limitation of heavy reliance on annotated labels for training, making it useful for the development of smart environment perception systems in autonomous vehicles for safe driving in intelligent transportation systems.