1 Introduction

Monocular depth estimation (MDE) refers to the task of extracting distance information between objects in a scene and a camera using only a single image. Accurate depth estimation is a critical step for enabling three-dimensional reconstruction and a range of downstream tasks. In particular, a high-quality depth map prediction can serve as a useful prior in RGB image processing, making it applicable in various academic and industrial applications, including Simultaneous Localization and Mapping (SLAM) [1], autonomous driving [2], scene reconstruction [3], object detection [4], semantic segmentation [5], and other domains.

In general, there are two primary approaches to obtaining depth information: depth sensor-based and image-based depth estimation [6,7,8,9,10,11]. The former relies on depth sensors, such as the Kinect, Velodyne LiDAR, and ZED binocular cameras. Although depth sensors have been widely used in different scenarios, they suffer from high cost, high energy consumption, and limited structural information. Moreover, most images are captured with ordinary cameras, which only provide color information about the scene. As a result, image-based depth estimation, which uses single or multiple visible-light images of the same scene to estimate depth, has become the focus of research [12,13,14,15,16]. Image-based depth estimation can be classified into monocular depth estimation (MDE), binocular depth estimation (BDE), and multi-view depth estimation (MVDE). However, BDE and MVDE pose significant challenges, such as high computational requirements, large memory usage, and reliance on specific camera parameters. In contrast, MDE, which estimates depth from a single visible-light image, is a low-cost and accessible technique. Therefore, MDE has become an essential technique for depth estimation in various applications.

MDE is an ill-posed problem because multiple 3D scenes can project to the same 2D image. Early works [6, 17,18,19] tried to solve this problem by estimating per-pixel depth from hand-crafted features, such as texture, shadow, and geometric constraints. In addition, some methods [6, 11, 20] employed Markov Random Fields (MRFs) or Conditional Random Fields (CRFs) to cast depth estimation as an energy optimization problem, seeking the depth configuration that best matches the actual scene. Although traditional algorithms can generate promising predictions in simple scenarios, they struggle with complex scenes due to occlusion, lighting changes, and texture loss.

To address these challenges, deep neural networks (such as VGG, ResNet, PVT, and Transformer), building on the great success of Convolutional Neural Networks (CNNs) [7], have been widely adopted in MDE [13, 21, 22] and have made significant progress. Deep learning-based methods generally treat MDE as continuous regression, recovering a depth map by minimizing the error between the ground-truth depth and the predicted depth. By fusing features describing object appearance, geometry, semantics, spatial relations, and other factors, deep learning-based methods can mitigate the ill-posedness of MDE more effectively. For instance, Song et al. [23] used Laplacian pyramids in the decoder stage to fully exploit the underlying properties of well-encoded features. Typically, an encoder-decoder structure is used in MDE, where the encoder gradually extracts multi-scale features and the decoder restores object details through multi-level up-sampling and residual connections for high-resolution prediction. However, the repeated down-sampling in CNN-based depth estimation may cause the loss of essential feature information, which is unrecoverable in the decoder stage. To tackle this, many methods have focused on improving the decoder [12, 13, 23,24,25] to obtain denser feature maps and prevent the loss of essential information during successive up-sampling. Although these methods improve model performance, CNN-based approaches still suffer from limited receptive fields and weak global representation. Therefore, obtaining dense long-range dependencies in the encoder stage is crucial for accurate depth estimation.

In response to these problems, researchers have incorporated the Transformer into MDE [21, 26], as the Transformer has demonstrated success in handling global dependencies in other tasks. However, a pure Transformer model lacks the ability to capture local information due to the absence of spatial inductive bias. To achieve more satisfactory results, some methods combine the Transformer with CNNs [13, 22, 26,27,28] to leverage the strengths of both, modeling local and global information jointly. Most existing fusion designs adopt a serial structure [13, 22, 26], as illustrated in Fig. 1b, which further processes the features obtained from the CNN or Transformer, enhancing some of the features and achieving modest performance gains. However, if the features acquired in an earlier stage are not accurate enough, the error propagates to subsequent features. Alternatively, a few methods adopt a parallel strategy [27, 28], as shown in Fig. 1a, fusing only the last-layer features of two parallel backbone networks, which may lead to insufficient fusion. Therefore, effectively fusing the two kinds of features is a major challenge when applying the combination of Transformer and CNN to depth estimation. Furthermore, although combining the Transformer and CNN can yield a significant performance improvement, it also introduces a considerable number of parameters, whereas the downstream tasks of depth estimation (e.g., autonomous driving, SLAM) require compact models with strong performance. This implies that a better balance between parameters and performance is needed when designing the network.

Fig. 1

a and b are combinations of CNNs and Transformer used in existing methods. c is the combination mode proposed in this paper

In this paper, we propose a novel layer-parallel network, PCTDepth, as presented in Fig. 1c, which fuses the feature maps of different layers of two backbone networks to increase feature diversity. Different from most existing MDE methods that use a single-backbone feature encoder to extract depth cues, PCTDepth adopts a two-stream structure that extracts local and global cues in parallel via CNN and Transformer networks. The advantages are twofold: (1) it significantly enlarges the feature space of depth information, and (2) it exploits the complementary effects of local and global information. Specifically, we first use parallel ResNet and Swin Transformer encoders to extract local neighborhood features and long-distance dependencies, respectively. Through the parallel fusion of ResNet and Transformer features, dense local and global information can be obtained at the encoder stage, avoiding the loss of important features caused by repeated down-sampling. Secondly, we construct an efficient hierarchical fusion module (HFM) to promote effective fusion of the hierarchical outputs of ResNet and Transformer. Finally, we incorporate a dual attention module (DAM) in the decoder stage, which splits the fused features into two parts and adjusts the channel weights and the weights at different spatial locations, gradually reconstructing the depth map from coarse to fine scales.

The main contributions of this work are summarized as follows:

  • We propose PCTDepth, a parallel architecture that combines CNNs and Transformer for MDE, which allows the model to acquire dense features in the encoder stage and provides the building blocks for the subsequent fusion step.

  • We develop an efficient hierarchical fusion module (HFM) that facilitates the seamless integration of long-range dependencies and local detail information, thus complementing each other.

  • We design a dual attention module (DAM) that splits the fused features into two parts and uses attention to dynamically adjust the importance of the channel and spatial dimensions of feature maps at different levels, improving the accuracy of the model.

2 Related Work

2.1 Supervised Monocular Depth Estimation

Supervised monocular depth estimation trains the model using the ground truth depth map and applies it to estimate the corresponding depth map from a single image. Liu et al. [9] used continuous CRFs to optimize the depth map based on the regional similarity of the images. Saxena et al. [20] were the first to propose a supervised learning method to predict depth from local features, which was then refined through MRFs by incorporating global context information to improve the local prediction. On this basis, Saxena et al. [6] designed multi-scale MRFs, assumed that all scenes are horizontally aligned with the ground, and used the predicted depth map to reconstruct the scene structure. Liu et al. [8] formulated MDE as a discrete-continuous optimization problem and obtained its solution using particle belief propagation. It can be seen that early works were mainly based on geometric models, which are usually only useful in specific scenarios.

Later, with the great success of CNNs in various tasks, CNN-based methods [12, 13, 23, 29,30,31,32] came to dominate MDE. Eigen et al. [7] were the first to introduce deep learning to depth estimation, utilizing a two-branch strategy that first predicts global information for the entire image and then adjusts the predicted local information. Based on this work, the same team [29] proposed a unified multi-scale framework that used a deeper VGG-based network as the basic network, where a third fine-scale network was employed to further add details and improve resolution. Yin et al. [14] introduced the virtual normal to address the generalization problem of MDE while preserving as much geometric information as possible, using three randomly selected points in the reconstructed 3D space as geometric constraints. In addition, some approaches [25, 33] treated the task as an ordinal regression problem, where a multi-scale network was used to obtain multi-level information. Considering the far-to-near nature of scenes, Cao et al. [10] were the first to treat MDE as a pixel-level classification problem. Lee et al. [12] proposed multi-scale local planar guidance to guide the densely encoded features. Wang et al. [24] proposed a semantic divide-and-conquer strategy that combines deep networks with shallow models to reduce MDE to individual semantic fragments. Although many methods have attempted to improve the decoder and have shown promising results, current depth estimation methods that rely on convolution in the encoder stage have limited ability to capture long-distance relationships between objects due to their restricted receptive field. To overcome this limitation, we propose a strategy that fuses Transformer and CNN layers in parallel, which enhances the encoder's ability to acquire dense feature information.

2.2 Transformer

Because of the Transformer’s excellent performance in natural language processing (NLP), many researchers have introduced the Transformer into computer vision, overcoming the limitations of RNNs and significantly improving model performance. Vaswani et al. [34] designed a self-attention mechanism with a multilayer perceptron (MLP) to overcome the limitations (inability to parallelize, low training efficiency, and short memory length) of previous RNNs in NLP. Dosovitskiy et al. [35] were the first to use the Vision Transformer (ViT) for image classification, and its success accelerated the introduction of the Transformer to other tasks [36, 37]. The standard Transformer still faces challenges with large resolutions and varying scales when used directly in computer vision. In response, Liu et al. [38] proposed the Swin Transformer, whose representation is computed with shifted windows, achieving global modeling capability.

However, due to their lack of spatial inductive bias, pure Transformers cannot effectively recover detailed features. Therefore, only a few researchers have attempted to use Transformers for monocular depth estimation [13, 21, 22, 26]. Bhat et al. [13] first introduced the Transformer to MDE by proposing the AdaBins network, which used a baseline encoder-decoder architecture followed by a mini-ViT that divides the depth range into bins. Ranftl et al. [21] proposed the DPT network, which uses ViT as an encoder to obtain a global receptive field at different stages, with an additional convolutional decoder for dense prediction. Yang et al. [22] embedded ViT in the middle of the network, first exploiting the inductive bias of ResNet in modeling spatial correlations and later exploiting the power of the Transformer in modeling global relationships. These methods [13, 22, 26] use a serial strategy to obtain higher-level feature representations and further improve performance. However, useful feature information may be lost when features are passed between the networks, degrading the quality of the fused features. A few methods [27, 28] therefore adopt a parallel structure; for example, [28] employs three encoders operating at different spatial resolutions and integrates their outputs using a multi-scale fusion block. However, this approach may suffer from insufficient feature fusion and semantic information. In contrast to previous methods, we propose an efficient hierarchical fusion module (HFM) that facilitates the seamless integration of long-range dependencies and local detail information, so that the two complement each other.

Fig. 2

Overview of our proposed network architecture. DAM: dual attention module, HFM: hierarchical fusion module. First, we use the Swin Transformer blocks and ResNet blocks of the encoder to obtain the features, respectively. Then, the proposed HFM is used to combine the features of different levels and resolutions from the Swin Transformer and ResNet. Finally, the original resolution is restored for dense prediction by up-sampling and convolution operations with the help of the DAM module

2.3 Attention Mechanism

Attention methods have been used with great success in many computer vision tasks, including image classification, object detection, and semantic segmentation. For pixel-level prediction, Chen et al. [39] first described an attention model to incorporate multi-scale features learned by an FCN for semantic segmentation. Later, Hu et al. [40] introduced the Squeeze-and-Excitation (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. Based on the SE block, Zhang et al. [41] improved the Squeeze block and designed EncNet, a network equipped with a channel attention mechanism to model the global context. Wang et al. [42] improved the Excitation block and designed ECANet, which introduces a one-dimensional convolution along the channel dimension to enhance representation capability while maintaining computational efficiency. Woo et al. [43] extended this line of work with the CBAM module, which introduces an intermediate feature map and multiplies the attention map with the input feature map for adaptive feature refinement. Wang et al. [44] proposed non-local operations as a generic family of building blocks for capturing long-range dependencies. Fu et al. [45] proposed a dual-attention network for scene segmentation that uses two independent attention modules to model the semantic dependencies along the spatial and channel dimensions. Recently, Huynh et al. [46] designed a novel attention mechanism that incorporates a non-local coplanarity constraint into the network for MDE. As attention mechanisms have been demonstrated to improve model performance, an increasing number of methods [31, 47,48,49] have adopted them. We adopt a dual attention module (DAM) that splits the fused features obtained from the HFM module into two parts, further improving accuracy by dynamically adjusting the importance of the channel and spatial dimensions of the feature maps with different attentions.

3 Methodology

3.1 Overall Architecture

Our overall architecture is shown in Fig. 2, which adopts the popular dense-prediction encoder-decoder design, with an input RGB image \(I \in \textit{R}^{H\times W \times 3}\) and an output depth map \(D \in \textit{R}^{H\times W \times 1}\). We adopt a dual-branch network architecture that includes a Swin Transformer branch and a ResNet branch, which obtain feature maps at different levels and resolutions. Then, the proposed hierarchical fusion module (HFM) combines the features from both branches to capture global and local information simultaneously. Finally, our dual attention module (DAM) recovers the original resolution for dense prediction by performing up-sampling and convolution operations. This hybrid architecture allows us to exploit the Transformer's ability to capture global contextual information and the CNNs' ability to efficiently acquire local information, providing a promising strategy for achieving satisfactory results.

3.2 Encoder

Dense prediction tasks, such as monocular depth estimation [50,51,52,53,54,55], commonly utilize encoder-decoder architectures. The backbone, also known as the encoder, is responsible for extracting rich image features, while the decoder combines these features to generate the final dense prediction. Over the years, several backbone networks have been proposed in computer vision, including ResNet and the widely used Swin Transformer [38, 50], which is known for its hierarchical design, computational efficiency, and strong performance. To enhance the diversity and expressiveness of the features and better depict the image content, we adopt two backbone networks to extract local features and global dependencies simultaneously in parallel. The overall architecture of the model is depicted in Fig. 2.
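To make the parallel two-branch encoder concrete, the following is a minimal PyTorch sketch of how the two feature pyramids could be extracted side by side. It uses timm with features_only=True; the specific model names, stage indices, and the assumption that a recent timm version exposes multi-scale Swin features this way are ours for illustration, not the paper's exact configuration.

```python
import torch
import timm


class ParallelEncoder(torch.nn.Module):
    """Sketch of the two-stream encoder: a ResNet branch for local detail
    and a Swin Transformer branch for long-range dependencies."""

    def __init__(self):
        super().__init__()
        # ResNet branch: four feature maps at decreasing resolution.
        self.resnet = timm.create_model(
            "resnet34", pretrained=False, features_only=True,
            out_indices=(1, 2, 3, 4))
        # Swin branch: four stages at matching scales (assumes a timm
        # version that supports features_only for Swin models).
        self.swin = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False,
            features_only=True, out_indices=(0, 1, 2, 3))

    def forward(self, x):
        x_res = self.resnet(x)  # list of 4 CNN feature maps
        x_st = self.swin(x)     # list of 4 Transformer feature maps
        # Note: depending on the timm version, Swin features may come back
        # in NHWC layout and need .permute(0, 3, 1, 2) before fusion.
        return x_st, x_res


if __name__ == "__main__":
    enc = ParallelEncoder()
    feats_st, feats_res = enc(torch.randn(1, 3, 224, 224))
    for f_st, f_res in zip(feats_st, feats_res):
        print(tuple(f_st.shape), tuple(f_res.shape))
```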

Fig. 3

Details of the HFM fusion module. The HFM module is used to calibrate the features of the two branches. The symbols \(x_{st}^{i}\), \(x_{res}^{i}\), BRC, and GAP denote the features of the Swin Transformer branch, the features of the ResNet branch, the pre-activation block, and global average pooling, respectively

3.3 HFM Module

We introduce an efficient parallel hierarchical interaction fusion module (HFM), depicted in Fig. 3. Unlike Hwang et al. [27], who utilize residual blocks to improve local features, our objective in designing HFM is to comprehensively integrate the local detailed features from the ResNet branch and the global features from the Transformer branch through adaptive feature alignment. The module generates four fused features \(\{F_{i}\}_{i=1}^{4}\) with 64 channels each, which reduces model complexity, improves computational efficiency, and helps prevent overfitting. We first obtain the hierarchical features \(\{x_{st}^{i}\}_{i=1}^{4}\) and \(\{x_{res}^{i}\}_{i=1}^{4}\) from the Transformer and ResNet branches, respectively. After passing through Block1, the feature \(\{x_{st}^{i}\}_{i=1}^{4}\) is processed by a 3\(\times \)3 convolution to extract more distinctive features. Then, an up-sampling operation is performed to increase the feature resolution, expand the receptive field, and prepare for the subsequent feature fusion. Since convolution and up-sampling can cause information loss, we utilize an adaptive feature alignment structure to obtain \(F_{t}\), which reduces this loss to a certain extent and enhances the feature representation ability. The specific formula is as follows.

$$\begin{aligned} F_{t} = multi(x_{st}^{i},Conv(up(\sigma (x_{st}^{i})))), \end{aligned}$$
(1)

where \(\sigma \) denotes the Sigmoid activation function. In this step, we process the Transformer branch to help the model understand more clearly the relationships between different regions in the image.

The feature information of the ResNet branch is processed in a similar way. After passing through Block2, the feature \(\{x_{res}^{i}\}_{i=1}^{4}\) undergoes global average pooling (GAP) to reduce the dimensionality of the feature maps. This step compresses multiple feature maps into a single feature vector, reducing the risk of overfitting. In this step, we process the ResNet branch to enhance the model's perception of different features and improve its performance.

$$\begin{aligned} F_{r} = multi(x_{res}^{i},\sigma (Conv(GAP(Conv(x_{res}^{i}))))), \end{aligned}$$
(2)

where \(\sigma \) in Eq. (2) denotes the Softmax activation function.

Then, we concatenate the processed features \(F_{t}\) and \(F_{r}\) from the two branches to achieve the fusion of local and global information.

$$\begin{aligned} F_{mid} = BRC(Cat(F_{t},up(F_{r}))), \end{aligned}$$
(3)

The resulting feature is further optimized using a pre-activation block consisting of BN, ReLU, and Conv (BRC). A typical convolution block orders these operations as Conv, BN, and ReLU, so the ReLU in the last step discards most of the negative responses of the non-linearity. We mitigate this problem by using BRC to obtain the intermediate fused feature \(F_{mid}\). Then, the optimized feature \(F_{mid}\) is concatenated with the initial branch features to preserve detailed information and prevent information loss.

$$\begin{aligned} F_{i} = BRC(Cat(up(Conv(F_{t})), F_{mid},up(Conv(F_{r})))), \end{aligned}$$
(4)

where Cat denotes the concatenation operation. Finally, the concatenated feature is optimized again, resulting in the fused features \(\{F_{i}\}_{i=1}^{4}\).
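For reference, below is a rough PyTorch sketch of the fusion pattern described by Eqs. (1)-(4). It assumes the two branch features are already spatially aligned, so the explicit up-sampling steps are omitted, and the channel counts are placeholders; it illustrates the structure of HFM under these assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BRC(nn.Module):
    """Pre-activation block: BN -> ReLU -> Conv, as used in HFM."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(F.relu(self.bn(x)))


class HFM(nn.Module):
    """Sketch of the hierarchical fusion module (Eqs. 1-4).

    Assumption: x_st (Swin branch) and x_res (ResNet branch) are spatially
    aligned, so the paper's up-sampling steps are left out; the output has
    64 channels as stated in Sect. 3.3.
    """
    def __init__(self, st_ch, res_ch, out_ch=64):
        super().__init__()
        self.conv_t = nn.Conv2d(st_ch, st_ch, 3, padding=1)    # Eq. (1)
        self.squeeze = nn.Conv2d(res_ch, res_ch, 1)            # Eq. (2)
        self.excite = nn.Conv2d(res_ch, res_ch, 1)
        self.fuse_mid = BRC(st_ch + res_ch, out_ch)            # Eq. (3)
        self.proj_t = nn.Conv2d(st_ch, out_ch, 3, padding=1)   # Eq. (4)
        self.proj_r = nn.Conv2d(res_ch, out_ch, 3, padding=1)
        self.fuse_out = BRC(3 * out_ch, out_ch)

    def forward(self, x_st, x_res):
        # Eq. (1): gate the Transformer feature with its own sigmoid response
        f_t = x_st * self.conv_t(torch.sigmoid(x_st))
        # Eq. (2): channel re-weighting of the ResNet feature via GAP + softmax
        w = F.adaptive_avg_pool2d(self.squeeze(x_res), 1)
        w = torch.softmax(self.excite(w), dim=1)
        f_r = x_res * w
        # Eq. (3): first fusion of local and global information
        f_mid = self.fuse_mid(torch.cat([f_t, f_r], dim=1))
        # Eq. (4): re-inject both branch features and fuse again
        return self.fuse_out(
            torch.cat([self.proj_t(f_t), f_mid, self.proj_r(f_r)], dim=1))
```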

Fig. 4

DAM attention module. a Channel attention block (CA), b Spatial attention block (SA). The symbol \(\textcircled {c}\) indicates the concatenation operation

3.4 Decoder with the DAM Attention Module

In the decoder stage, we obtain the fused features \(\{F_{i}\}_{i=1}^{4}\) of the Swin Transformer and ResNet branches from the HFM module at resolutions of [H/2, H/4, H/8, H/16]. To improve the accuracy of the model, we use the dual attention module to handle high-level semantic features and low-level features separately, as shown in Fig. 4. Using channel attention (CA) on the high-level features allows the model to better capture global semantic information, while using spatial attention (SA) on the low-level features improves the model's sensitivity to local details.

To leverage the high-level semantic features \(F_{3}\) and \(F_{4}\), we use the CA block to dynamically adjust the channel weights. Firstly, we use GAP and GMP (global max pooling) to focus on the important components of the high-level representations. Secondly, we use concatenation and convolution operations to exploit correlations between channels and enhance the expressive power of the relevant feature channels.

$$\begin{aligned} F_{k}^{'} = Conv(Cat(\sigma (Conv(GAP(F_{k}))))), \ \ \ k \in \{3,4\} \end{aligned}$$
(5)

where Conv indicates a convolution operation with a 3\(\times \)3 kernel and \(k\in \{3, 4\}\). Here \(\sigma \) indicates the Sigmoid activation function.

$$\begin{aligned} \left\{ \begin{array}{l} \alpha = Cat(GAP(F_{q}),GMP(F_{q})) \\ F_{q}^{'} = Conv(Cat(\sigma (Conv(\alpha )),F_{q})), \quad q \in \{1,2\} \end{array} \right. \end{aligned}$$
(6)

Conversely, for the shallow-level features \(F_{1}\) and \(F_{2}\) (\(q\in \{1, 2\}\)), we use the SA block to extract global features from the input feature map and learn weight coefficients for different spatial locations, emphasizing the features at those locations. This enhances the model's ability to attend to various spatial locations and improves its accuracy.

Afterward, the feature \(F_{4}^{'}\) undergoes an up-sampling operation to restore the resolution to H/8. Next, we concatenate it with \(F_{3}^{'}\) and perform another up-sampling operation. This process is repeated until the multi-level features are fused into \(F^{'}\), as shown in equation (7) below.

$$\begin{aligned} F_{i-1}^{'} = Conv(Cat(up(F_{i}^{'}),F_{i-1}^{'})), \ \ \ i \in \{2,3,4\} \end{aligned}$$
(7)

where \(F_{i}^{'}\) denotes the \(i^{th}\) level of fused features. After obtaining the fused feature \(F_{1}^{'}\) from equation (7), we use it as input to equation (8).

$$\begin{aligned} D^{pre} = \sigma (Last\_layer(up(F_{1}^{'}))), \end{aligned}$$
(8)

where \(Last\_layer\) indicates two convolution operations with 3\(\times \)3 kernels and a ReLU activation function. Finally, the predicted depth \(D^{pre}\) is obtained by applying the Sigmoid activation function.
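A hedged PyTorch sketch of the DAM-based decoder is given below. The exact arrangement of the concatenations in Eqs. (5) and (6) is ambiguous on its own, so the CA and SA blocks here follow the common channel/spatial attention pattern the text describes; the layer widths, the 7\(\times \)7 spatial-attention kernel, and the bilinear up-sampling are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """CA block for the high-level features F3, F4 (rough sketch of Eq. 5)."""
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Conv2d(ch, ch, 1)
        self.out = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, x):
        w = torch.sigmoid(self.fc(F.adaptive_avg_pool2d(x, 1)))  # channel weights
        return self.out(torch.cat([x * w, x], dim=1))


class SpatialAttention(nn.Module):
    """SA block for the low-level features F1, F2 (rough sketch of Eq. 6);
    GAP/GMP are taken along the channel axis to form spatial weight maps."""
    def __init__(self, ch):
        super().__init__()
        self.conv_w = nn.Conv2d(2, 1, 7, padding=3)
        self.out = nn.Conv2d(ch + 1, ch, 3, padding=1)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)        # GAP over channels
        mx = x.max(dim=1, keepdim=True).values   # GMP over channels
        w = torch.sigmoid(self.conv_w(torch.cat([avg, mx], dim=1)))
        return self.out(torch.cat([w, x], dim=1))


class DAMDecoder(nn.Module):
    """Coarse-to-fine decoder: apply CA/SA, then up-sample and fuse (Eqs. 7-8)."""
    def __init__(self, ch=64):
        super().__init__()
        self.sa = nn.ModuleList([SpatialAttention(ch) for _ in range(2)])  # F1, F2
        self.ca = nn.ModuleList([ChannelAttention(ch) for _ in range(2)])  # F3, F4
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(3)])
        self.last = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, feats):                    # feats = [F1, F2, F3, F4]
        f1, f2 = self.sa[0](feats[0]), self.sa[1](feats[1])
        f3, f4 = self.ca[0](feats[2]), self.ca[1](feats[3])
        x = f4
        for f, fuse in zip([f3, f2, f1], self.fuse):   # Eq. (7): up-sample and fuse
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = fuse(torch.cat([x, f], dim=1))
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.last(x))             # Eq. (8): final depth
```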

3.5 Loss Function

Following previous work [12, 23], we also use the scale-invariant loss (SI) proposed by Eigen et al. [7] to supervise the training, which calculates the distance between the predicted output depth \({\hat{d}}_{i}\) and the ground truth depth map \(d_{i}\). The equation of SI loss is as follows:

$$\begin{aligned} L_{pixel} =\sqrt{\frac{1}{T} \sum _{i}g_{i}^{2}-\frac{\lambda }{T^2}\left( \sum _{i}g_{i}\right) ^{2}}, \end{aligned}$$
(9)

where \(g_{i}=\log {\hat{d}}_{i}-\log {d}_{i}\) and \(\lambda =0.5\). T denotes the number of pixels with valid ground truth values.
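As a concrete reference, the scale-invariant loss of Eq. (9) can be written in a few lines of PyTorch; the small epsilon for numerical stability and the boolean mask used to select pixels with valid ground truth are our assumptions.

```python
import torch


def si_loss(pred, target, valid_mask, lam=0.5, eps=1e-6):
    """Scale-invariant loss of Eq. (9).

    Assumes `pred` and `target` are positive depth maps and `valid_mask`
    is a boolean tensor marking pixels with valid ground truth (T pixels).
    """
    g = torch.log(pred[valid_mask] + eps) - torch.log(target[valid_mask] + eps)
    T = g.numel()
    return torch.sqrt((g ** 2).sum() / T - lam * g.sum() ** 2 / T ** 2)
```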

Table 1 Quantitative evaluation on the KITTI dataset using the test split of Eigen et al. [7]
Fig. 5

Qualitative comparison with other state-of-the-art methods on the KITTI benchmark dataset

4 Experiments

4.1 Datasets

KITTI [60] is a large-scale outdoor dataset that contains RGB and depth image pairs from autonomous driving scenes. Depth maps are generated by accumulating LiDAR measurements over the sequence. We use the KITTI dataset to validate the performance of the proposed model on the monocular depth estimation task. The test set and training set are divided according to the criteria proposed by Eigen et al. [7]. We use 23K images from 32 scenes for training and 697 images from the remaining 29 scenes for testing, with depth capped at 80 m for evaluation.

4.2 Evaluation Metrics

We follow the standard evaluation scheme of previous work [10] and use the following quantitative evaluation metrics in our experiments: mean absolute relative error (Abs Rel), mean squared relative error (Sq Rel), root mean squared error (RMSE), root mean squared log error (RMSE log), and the accuracy under threshold (\(\delta _{i}<1.25^{i}\), i = 1, 2, 3). These metrics are defined as:

$$\begin{aligned} Abs\ Rel= & {} \frac{1}{T} \sum _{i=1}^T \frac{||d_{i}-d_{i}^{*}||}{d_{i}^{*}}, \end{aligned}$$
(10)
$$\begin{aligned} Sq\ Rel= & {} \frac{1}{T} \sum _{i=1}^T \frac{||d_{i}-d_{i}^{*}||^{2}}{d_{i}^{*}}, \end{aligned}$$
(11)
$$\begin{aligned} RMSE= & {} \sqrt{\frac{1}{T} \sum _{i=1}^T {(d_{i}-d_{i}^{*}})^{2}}, \end{aligned}$$
(12)
$$\begin{aligned} RMSE\ log= & {} \sqrt{\frac{1}{T} \sum _{i=1}^T ||log(d_{i})-log(d_{i}^{*})||^{2}}, \end{aligned}$$
(13)
$$\begin{aligned} \delta= & {} max\left( \frac{d_{i}}{d_{i}^{*}},\frac{d_{i}^{*}}{d_{i}}\right)<t, \quad t=1.25^{i}\ (i=1,2,3), \end{aligned}$$
(14)

where \(d_{i}^{*}\) and \(d_{i}\) are the ground-truth depth and predicted depth at pixel i, respectively, and T is the total number of pixels of the test images.
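The metrics in Eqs. (10)-(14) can be computed as follows; this NumPy sketch assumes `pred` and `gt` are flattened arrays containing only valid pixels, already clipped to the 0-80 m evaluation range.

```python
import numpy as np


def compute_metrics(pred, gt):
    """Standard depth metrics (Eqs. 10-14) over valid pixels only."""
    thresh = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = [(thresh < 1.25 ** i).mean() for i in (1, 2, 3)]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, d1=d1, d2=d2, d3=d3)
```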

4.3 Implementation Details

We implement our proposed architecture in the PyTorch framework, and the experiments are performed on 2 NVIDIA RTX A4000 GPUs. The images in the KITTI [60] dataset are cropped to a size of 320 \(\times \) 320. For training, we use the Adam optimizer with a one-cycle learning rate policy. The training schedule is divided into two halves, with the learning rate increasing from 3e-5 to 1e-4 in the first half and decreasing from 1e-4 back to 3e-5 in the second half. We set the batch size to 12, and the model converges after around 25 epochs.
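The described schedule corresponds closely to PyTorch's built-in OneCycleLR scheduler; the sketch below shows one way to reproduce it. The model stand-in, the number of steps per epoch, and the linear annealing strategy are placeholders/assumptions, not values taken from the paper.

```python
import torch

# Placeholders: a stand-in model and an assumed number of steps per epoch.
model = torch.nn.Conv2d(3, 1, 3, padding=1)
epochs, steps_per_epoch = 25, 1000

optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,
    total_steps=epochs * steps_per_epoch,
    pct_start=0.5,                 # first half up, second half down
    div_factor=1e-4 / 3e-5,        # start the cycle at 3e-5
    final_div_factor=1.0,          # end the cycle back at 3e-5
    anneal_strategy="linear",
)
# scheduler.step() is then called once after every training batch.
```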

Table 2 Ablation study of the HFM module on the KITTI dataset

4.4 Comparison to the State-of-the-Art

We compare our PCTDepth with state-of-the-art methods on the KITTI dataset. We present results from both quantitative and qualitative perspectives, illustrated in Table 1 and Fig. 5, respectively. The proposed PCTDepth is trained and evaluated with depth values in the range [0 m, 80 m]. As can be seen from Table 1, PCTDepth outperforms existing depth estimation methods on both error and accuracy metrics, except for the Sq Rel error and the \(\delta _{2}\) accuracy, where our scores are marginally worse than AdaBins [13] by 0.002 and 0.001, respectively. The RMSE and RMSE log metrics measure the squared difference between the estimated depth and the corresponding ground truth, which amplifies the error of clearly erroneous estimates. The lower the RMSE or RMSE log, the better the scene structure is recovered. As can be seen from Table 1, the proposed PCTDepth achieves the best results on both the RMSE and RMSE log metrics. Our RMSE is 3.3% lower than the second-best result, and our RMSE log is 10.2% lower than the second-best. Additionally, our \(\delta _{1}<1.25\) score is 0.009 higher than that of TransDepth [22], which adopts the serial CNN-Transformer structure among the latest methods.

As shown in Fig. 5, our proposed PCTDepth can accurately estimate depth maps for complex urban scenes, including fine objects such as railings and road signs, as well as dynamic objects such as cars and pedestrians. For fine objects (\(1^{st}\),\(2^{nd}\), and \(5^{th}\) rows), our method produces depth maps with more complete details and smoother boundaries compared with other state-of-the-art methods, such as TransDepth [22], BTS [12], AdaBins [13], and DPT [21]. Moreover, for the cars and pedestrians (\(3^{rd}\) and \(4^{th}\) rows), our estimated depth maps better reflect the surface information of the multiple overlapping cars and people, with more complete and smoother contours. These qualitative results demonstrate the effectiveness of our proposed PCTDepth in recovering the scene structure of urban environments.

Fig. 6

Qualitative analysis to verify the validity of the architecture on the KITTI dataset. a RGB images, b Swin Transformer network, c ResNet network, d Layered interactive parallel fusion architecture of Swin Transformer and ResNet

4.5 Ablation Study

This section validates the effectiveness of our method by conducting ablation studies on the KITTI dataset to assess the individual contribution of each module. As shown in Table 2, Base, B, H, and DAM denote ResNet34 combined with Swin Transformer, the pre-activation block, the hierarchical fusion module, and the dual attention module, respectively. Table 2 demonstrates that the parallel strategy of combining ResNet with the Transformer significantly reduces errors, resulting in an Abs Rel error of 0.055. The HFM fusion module further reduces the various errors and increases the accuracy under the thresholds. The DAM attention module improves the \(\delta _{3}\) (\(\delta <1.25^{3}\)) metric to a state-of-the-art level of 0.999.

4.5.1 Verify the Effectiveness of the CNNs-Transformer Architecture

To verify the effectiveness of our proposed parallel network structure, we conduct ablation experiments on the Transformer and CNN branches, as shown in Table 3. We compare the effect of combining different ResNet networks with the Transformer on the KITTI dataset, where R and Swin denote ResNet and Swin Transformer, respectively. The experiments show that the error of either the CNN network alone or the Transformer network alone is greater than that of their combination. Comparing the different ResNet networks, we observe that the combination of Swin Transformer and ResNet34 yields the best results. Additionally, Fig. 6 shows the results of our qualitative analysis, which validate the effectiveness of the proposed hierarchical interaction parallel fusion architecture.

Fig. 7

Validation of the fusion module HFM on the KITTI dataset. a RGB images, b Swin Transformer + ResNet, c Swin Transformer + ResNet + HFM. The edges of objects are more clearly defined in the depth map with the addition of the HFM fusion module

Fig. 8

Validation of the DAM module on the KITTI dataset. a attention module with grouping, b attention module with the addition of hierarchical features, c attention module with layering

Table 3 Ablation study with Swin Transformer combined with ResNet on the KITTI dataset
Table 4 Percentage drop in loss compared to the baseline. Baseline: ResNet34 combined with Swin Transformer, HFM: hierarchical fusion module
Table 5 Ablation study of the DAM module on the KITTI dataset, where * denotes using CA channel attention for features F1 and F2 and SA spatial attention for features F3 and F4 in Fig. 8c

4.5.2 Verify the Effectiveness of the HFM Module

The comparison between depth maps generated with and without the HFM module is illustrated in Fig. 7. The results show that the module-free variant can capture large targets such as cars and railings, but its depth maps have more blurred boundaries, and small targets such as utility poles and street signs are captured poorly or not at all. In contrast, with the HFM module the network not only captures the shape and size of large objects but also preserves the details of small targets such as utility poles. To verify the effectiveness of the HFM fusion module with respect to target size and edge clarity, we compare three kinds of objects: cars (large), billboards (medium), and railings (small). The experiments show that adding the HFM module reduces the depth loss by 6.81% compared to the parallel architecture using only the Swin Transformer combined with CNNs (baseline). Additionally adding the attention module reduces the depth loss by 7.62% compared to the baseline. These results are shown in Table 4.

4.5.3 Verify the Effectiveness of the DAM Module

In the decoder stage, we designed three attention module schemes, as shown in Fig. 8 and Table 5. We compare the three design options presented in Fig. 8a-c against the variant without any attention module.

As can be seen from Table 5, all three strategies improve the metrics. Overall, however, strategy (c), which directly applies our CA and SA design, yields a significant improvement on all metrics.

4.6 FLOPs, Params, and Epochs

In addition to the performance evaluation, we also compare the FLOPs and the number of parameters with some state-of-the-art methods, since our goal is to reduce parameters and FLOPs as much as possible without significantly reducing accuracy. We use the torchstat package to calculate the FLOPs and parameters of the model, which helps analyze its complexity. FLOPs (floating-point operations) measure the computational cost of an algorithm or model, as shown in Table 6. Our method has slightly higher FLOPs than DPT [21] and DenseDepth [61], and slightly more parameters than DenseDepth [61]. Although DPT [21] has fewer parameters than our method, our parameters and FLOPs are significantly lower than those of the remaining compared methods.
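For reference, torchstat reports per-layer and total FLOPs and parameter counts with a single call; in the sketch below the model is a placeholder, and the input resolution follows the 320 \(\times \) 320 crop described in Sect. 4.3.

```python
import torch
from torchstat import stat

# Placeholder model; in practice this would be the full depth network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(inplace=True),
    torch.nn.Conv2d(64, 1, 3, padding=1),
)

# Prints per-layer and total parameters, FLOPs, and memory usage
# for a 3-channel input at the 320 x 320 training resolution.
stat(model, (3, 320, 320))
```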

Table 6 The FLOPs and number of parameters for some complex models, compared with Hu [16], Chen [62], Yin [14], BTS [12], DPT [21], AdaBins [13], DenseDepth [61] and ACAN [63]
Table 7 The number of epochs that the model tends to converge in the training stage, compared with BTS [12], DORN [33], TransDepth [22], PGA-Net [55] and Song et al. [23]

As shown in Table 7, we also compare the number of training epochs in which the model tended to converge during the training stage. Compared to some of the current methods, our approach achieves model convergence in fewer epochs.

5 Conclusion

In this paper, we propose a new parallel CNN-Transformer hierarchical interactive fusion architecture with dual attention, which extracts dense features in the encoder stage and avoids the loss of edge and detail features in the decoder stage. Specifically, we introduce an efficient HFM module and a DAM module to achieve an effective fusion of global Transformer features with local CNN features and to better recover high-resolution details. We validate the effectiveness of the proposed architecture on the KITTI dataset, and the results show that our approach is competitive with state-of-the-art results. Although complex models may offer superior performance, they require significant computational resources, leading to increased training and inference times and costs. Therefore, in future research we aim to explore model pruning and lightweight network design to address these limitations and enable easy deployment in practical applications.