DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation

This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that the long-range correlation is essential for accurate depth estimation. Therefore, we propose to leverage the Transformer to model this global context with an effective attention mechanism. We also adopt an additional convolution branch to preserve the local information as the Transformer lacks the spatial inductive bias in modeling such contents. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features via element-wise interaction and model the affinity between the Transformer and the CNN features in a set-to-set translation manner. Due to the unbearable memory cost caused by global attention on high-resolution feature maps, we introduce the deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins. Notably, it achieves the most competitive result on the highly competitive KITTI depth estimation benchmark. Our codes and models are available at https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox.


Introduction
Monocular depth estimation plays a critical role in three-dimensional reconstruction and perception. Since the groundbreaking work of (He et al., 2016), the convolutional neural network (CNN) has been the primary workhorse for depth estimation, typically within an encoder-decoder architecture (Fu et al., 2018; Lee et al., 2019; Bhat et al., 2021). Although numerous works have focused on the decoder design (Fu et al., 2018; Bhat et al., 2021), recent studies suggest that the encoder is even more pivotal for accurate depth estimation (Lee et al., 2019; Ranftl et al., 2021). Due to the lack of depth cues, fully exploiting both the long-range correlation (i.e., distance relationship among objects) and the local information (i.e., consistency of the same object) are critical capabilities of an effective encoder (Saxena et al., 2005). Therefore, the potential bottleneck of current depth estimation methods may lie in the encoder, where the convolution operators can scarcely model the long-range correlation with a limited receptive field (Ranftl et al., 2021).
In terms of CNN, there have been great efforts to overcome the above limitation, roughly grouped into two categories: manipulating the convolution operation and integrating the attention mechanism. The former applies advanced variations, including multi-scale fusion (Ronneberger et al., 2015), atrous convolutions (Chen et al., 2017) and feature pyramids (Zhao et al., 2017), to improve the effectiveness of convolution operators. The latter introduces the attention module (Vaswani et al., 2017) to model the global interactions of all pixels in the feature map. There are also several general approaches (Fu et al., 2018; Lee et al., 2019; Huynh et al., 2020; Bhat et al., 2021) that explore the combination of both strategies. Though the performance is improved significantly, the dilemma persists.
As an alternative to CNN, the Vision Transformer (ViT) (Dosovitskiy et al., 2021), which achieves tremendous success on image recognition, demonstrates advantages when serving as the encoder for depth estimation. Benefiting from the attention mechanism, the Transformer is more adept at modeling the long-range correlation with a global receptive field. However, our pilot study (Sec. 3.1) indicates that the ViT encoder cannot produce satisfactory performance due to the lack of spatial inductive bias in modeling the local information (Yang et al., 2021).
To mitigate these issues, we propose a novel monocular depth estimation framework, DepthFormer (illustrated in Fig. 1), which boosts model performance by incorporating the advantages of both the Transformer and the CNN. The principle of DepthFormer is that the Transformer branch models the long-range correlation while the additional convolution branch preserves the local information. We argue that the integration of these two types of features can help achieve more accurate depth estimation. However, independent branches with late fusion lead to insufficient feature aggregation for the decoder. To bridge this gap, we design the Hierarchical Aggregation and Heterogeneous Interaction (HAHI) module to combine the best part of both branches. Specifically, it consists of a self-attention module to enhance the features among hierarchical layers of the Transformer branch via element-wise interaction and a cross-attention module to model the affinity between 'heterogeneous' features (i.e., Transformer and CNN features) in a set-to-set translation manner. Since global attention on high-resolution feature maps leads to an unbearable memory cost, we propose to leverage the deformable scheme (Dai et al., 2017; Zhu et al., 2021) that only attends to a limited set of key sampling vectors in a learnable manner to alleviate this problem.
The main contributions of this work are three-fold: (1) We apply the Transformer as the image encoder to exploit the long-range correlation and adopt an additional convolution branch to preserve the local information. (2) We design the HAHI to enhance features via element-wise interaction and model the affinity in a set-to-set translation manner. (3) Our proposed approach DepthFormer significantly outperforms state-of-the-art methods with prominent margins on the KITTI (Geiger et al., 2013), NYU (Silberman et al., 2012) and SUN RGB-D (Song et al., 2015) datasets. Furthermore, it achieves the most competitive result on the highly competitive KITTI depth estimation benchmark.

Related Work
Estimating depth from RGB images is an ill-posed problem. Lack of cues, scale ambiguities, and translucent or reflective materials all lead to ambiguous cases where appearance cannot infer the spatial construction. With the rapid development of deep learning, CNN has become a key component of mainstream methods to provide reasonable depth maps from a single RGB input.
Monocular depth estimation has drawn much attention in recent years. Among numerous effective methods, we consider DPT (Ranftl et al., 2021), AdaBins (Bhat et al., 2021) and TransDepth (Yang et al., 2021) as the three most important competitors.
DPT proposes to utilize ViT as the encoder and pre-train models on larger-scale depth estimation datasets. AdaBins uses adaptive bins that dynamically change depending on representations of the input scene and proposes to embed the mini-ViT at a high resolution (after the decoder). TransDepth embeds ViT at the bottleneck to avoid the Transformer losing the local information and presents an attention gate decoder to fuse multi-level features. We focus on comparing these (and many other) methods in this paper.
Encoder-decoder is commonly used in monocular depth estimation (Eigen et al., 2014; Fu et al., 2018; Hu et al., 2019; Lee et al., 2019; Huynh et al., 2020; Yang et al., 2021; Bhat et al., 2021). In terms of the encoder, mainstream feature extractors, including EfficientNet (Tan and Le, 2019), ResNet (He et al., 2016) and DenseNet (Huang et al., 2017), are adopted to learn representations. The decoder frequently consists of successive convolutions and upsampling operators to aggregate encoder features in a late fusion manner, recover the spatial resolution and estimate the depth. In this paper, we utilize the baseline decoder architecture in (Alhashim and Wonka, 2018a). It allows us to more explicitly study the performance attribution of key contributions of this work, which are independent of the decoder.
Neck modules between the encoder and the decoder are proposed to enhance features. Many previous methods only focus on the bottleneck feature but ignore the lower-level ones, limiting the effectiveness (Fu et al., 2018; Lee et al., 2019; Huynh et al., 2020; Yang et al., 2021). In this work, we propose the HAHI module to enhance all the multi-level hierarchical features. When another branch is available, it can model the affinity between the two-branch features as well, which helps the decoder aggregate the heterogeneous information.
Transformer networks are gaining greater interest in the computer vision community (Dosovitskiy et al., 2021; Liu et al., 2021; Carion et al., 2020; Zheng et al., 2021). Following the success of recent trends that apply the Transformer to solve computer vision tasks, we propose to leverage the Transformer as the encoder to model long-range correlations. In Sec. 3.1, we discuss our motivation and present differences between our method and several related works (Ranftl et al., 2021; Yang et al., 2021; Bhat et al., 2021) that adopt the Transformer in monocular depth estimation.

Methodology
In this section, we present the motivation of this work and introduce the key components of DepthFormer: (1) an encoder consisting of a Transformer branch and a convolution branch and (2) the hierarchical aggregation and heterogeneous interaction (HAHI) module. An overview of DepthFormer is shown in Fig. 2.

Motivation
To indicate the necessity of this work, we conduct a meticulous pilot study to investigate the limitations of existing methods that utilize pure CNN or ViT as the encoder in monocular depth estimation.
Pilot Study: We first present several failure cases of the state-of-the-art CNN-based monocular depth estimation methods on the NYU dataset in Fig. 3. The depth results at the wall decorations and carpets are unexpectedly incorrect. Due to the pure convolutional encoder for feature extraction, it is hard for them to model the global context and capture the long-range distance relationship among objects through limited receptive fields. Such large-area counter-intuitive failures severely impair the model performance.
To solve the above issue, ViT can serve as a proper alternative that is superior in modeling the long-range correlation with a global receptive field. Therefore, we experiment to analyze the performance of ViT- and CNN-based methods on the KITTI dataset. Specifically, based on DPT (Ranftl et al., 2021), we adopt ViT-Base and ResNet-50 as the encoder to extract features, respectively. The results shown in Tab. 1 show that the models applying ViT as the encoder outperform those using ResNet-50 on distant object depth estimation. However, opposite results appear on the near objects. Since the depth values exhibit a long-tail distribution and there are many more near objects in scenes (Jiao et al., 2018), the overall results of the models applying ViT are significantly inferior. More experimental details and results are reported in the experimental section.
Analysis: In general, it is tougher to estimate the depth of distant objects directly. Benefiting from modeling the long-range correlation, the ViT-based model can accomplish it more reliably via reference pixels in a global context. The knowledge of distance relationships among objects results in better performance on distant object depth estimation. As for the inferior near object depth estimation result, there are many potential explanations. We highlight two major concerns: (1) The Transformer lacks spatial inductive bias in modeling the local information (Yang et al., 2021). As for depth estimation, the local information is reflected in the detailed context that is crucial for consistent and sharp estimation results. However, this detailed content tends to be lost during the patch-wise interaction of the Transformer. Since objects appearing nearer are larger with higher texture quality (Saxena et al., 2005), the Transformer will lose more details at these locations, which severely deteriorates the model performance at a near range and leads to unsatisfying results. (2) Visual elements vary substantially in scale (Liu et al., 2021). In general, a U-Net (Ronneberger et al., 2015) shape architecture is applied for depth estimation, where the multi-scale skip connections are pivotal for exploiting multi-level information. Since the tokens in ViT are all of a fixed scale, the consecutive non-hierarchical forward propagation makes the multi-scale property ambiguous, which may also limit the performance.
In this paper, we propose to leverage an encoder consisting of Transformer and convolution branches to exploit both the long-range correlation and local information. Different from DPT (Ranftl et al., 2021), which directly utilizes ViT as the encoder, we introduce a convolution branch to make up for the deficiencies of spatial inductive bias in the Transformer branch. Furthermore, we replace ViT with Swin Transformer (Liu et al., 2021) so that the Transformer encoder can provide hierarchical features and reduce the computational complexity. Unlike previous methods which embed the Transformer into the CNN (Bhat et al., 2021; Yang et al., 2021), we adopt the Transformer to encode images directly, which can fully exert the advantages of the Transformer and avoid the CNN discarding crucial information before the global context modeling. Moreover, due to the independence of these two branches, the simple late fusion of the decoder leads to insufficient feature aggregation and marginal performance improvement. To bridge this gap, we design the HAHI module to enhance features and model affinities via feature interaction, which alleviates the deficiency and helps combine the best part of both branches.

Transformer and CNN Feature Extraction
We propose to extract image features via an encoder consisting of a Transformer branch and a light-weight convolution branch, thus fully exploiting the long-range correlation and the local information.
Transformer branch first splits the input image I into non-overlapping patches by a patch partition module. The initial feature representation of each patch is set as a concatenation of the pixel RGB values. A linear embedding layer then projects this initial representation to an arbitrary dimension; the result serves as the input of the first Transformer layer and is denoted as $z^0$. After that, L Transformer layers are applied to extract features. In general, each layer consists of a multi-head self-attention (MSA) module, followed by a multi-layer perceptron (MLP). A LayerNorm (LN) is applied before the MSA and the MLP, and a residual connection is utilized for each module. Therefore, the process of layer l is formulated as
$$\hat{z}^l = \mathrm{MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \qquad z^l = \mathrm{MLP}(\mathrm{LN}(\hat{z}^l)) + \hat{z}^l, \tag{1}$$
where $\hat{z}^l$ and $z^l$ denote the output features of the MSA module and the MLP module for layer l, respectively. The structure of a Transformer layer is illustrated in the left part of Fig. 2. Following DPT (Ranftl et al., 2021), we sample and reassemble N feature maps from the N selected Transformer layers as the output of the Transformer branch and symbolize them as $F = \{f^n\}_{n=1}^{N}$. Notably, our framework is compatible with a variety of Transformer structures. In this paper, we prefer to utilize Swin Transformer (Liu et al., 2021) to provide hierarchical representations and reduce the computational complexity. The main differences from the standard Transformer layers lie in the local attention mechanism, the shifted window scheme, and the patch merging strategy.
Convolution branch contains a standard ResNet encoder to extract the local information, which is commonly used in depth estimation methods. Only the first block of the ResNet is used here to exploit the local information, which avoids the low-level features being washed out by consecutive multiplications (Yang et al., 2021) and greatly reduces the computational time. The output feature map with $C_g$ channels is denoted as $G \in \mathbb{R}^{C_g \times H_g \times W_g}$.
Upon acquiring Transformer features F and convolution features G, we feed them to the HAHI module for further processing. Compared to TransDepth (Yang et al., 2021), we adopt an additional convolution branch to preserve the local information. It avoids the discarding of crucial information by CNN and enables us to predict sharper depth maps without artifacts, as shown in Fig. 4.

HAHI Module
To alleviate the limitation of insufficient aggregation, we introduce the HAHI module to enhance the Transformer features and further model the affinity of the Transformer and the CNN features in a set-to-set translation manner. It is motivated by Deform-DETR (Zhu et al., 2021) and attempts to apply attention modules to the fusion of heterogeneous features.
We consider a set of hierarchical features as the inputs for feature enhancement. Since we use the Swin Transformer layers to extract the features, the reassembled feature maps exhibit different sizes and channels, as shown in Fig. 5. Many previous works have to downsample the multi-level features to the resolution of the bottleneck feature and can only enhance the bottleneck feature with simple concatenation or latent kernel schemes (Yang et al., 2021; Zheng et al., 2021; Lee et al., 2019). In contrast, we aim to enhance all the features without downsampling operators that may lead to information loss. Specifically, we first utilize 1×1 convolutions to project all the hierarchical features to the same channel dimension $C_h$, denoted as $F_h = \{f_h^n\}_{n=1}^{N}$. Then, we unfold (i.e., flatten and concatenate) the feature maps to a two-dimensional matrix X, where each row is a $C_h$-dimensional feature vector of one pixel from the hierarchical features. After that, we compute the Q (query), K (key) and V (value) by linear projections of X as
$$Q = X P_Q, \quad K = X P_K, \quad V = X P_V, \tag{2}$$
where $P_Q$, $P_K$, and $P_V$ are linear projections. We attempt to apply the self-attention module to enhance the features. However, the extremely large number of feature vectors leads to an unbearable memory cost. To alleviate this issue, we adopt the deformable attention scheme (Zhu et al., 2021), which attends only to a limited set of key sampling vectors, for the feature enhancement. Let q and v index an element with representation feature $x_q \in Q$ and $x_v \in V$, respectively, and let $p_q$ represent the location of the query vector $x_q$. The processing can be formulated as
$$\mathrm{DeformAttn}(x_q, p_q, x_v) = \sum_{k \in \Omega_k} A_{qk} \cdot x_v(p_q + \Delta p_{qk}), \tag{3}$$
where the attention weight $A_{qk}$ and the sampling offset $\Delta p_{qk}$ of the k-th sampling point are obtained via linear projection over the query feature $x_q$. The weights $A_{qk}$ are normalized so that $\sum_{k \in \Omega_k} A_{qk} = 1$. As $p_q + \Delta p_{qk}$ is fractional, bilinear interpolation is applied as in (Dai et al., 2017) in computing $x_v(p_q + \Delta p_{qk})$. We also add a hierarchical embedding to identify which feature level each query pixel lies in. The output, denoted as $\hat{X}$, is folded (i.e., split and reshaped) back to the original resolutions to get the hierarchical enhanced features $F_{enh}$. After fusing $F_{enh}$ and F via channel-wise concatenations followed by 1×1 convolutions, we obtain the output $F_o = \{f_o^n\}_{n=1}^{N}$ and achieve the feature enhancement.
When the additional convolution branch is available, we consider a feature map G as the second input of the HAHI for affinity modeling. Similar to the first input F, G can be any other type of representation. We utilize a 1×1 convolution to project G to $G_h$ with a channel dimension $C_h$ and then flatten $G_h$ to a two-dimensional query matrix Q. Applying X as K and V, we calculate the cross-attention to model the affinity. As before, the unbearable memory cost persists. We apply the deformable attention module in Eq. 3 to alleviate this issue, where the reference point locations $p_q$ are dynamically predicted from the affinity query embedding via a learnable linear projection followed by a sigmoid function. After reshaping the result to the original resolution to form the attentive representation $G_{att}$, we fuse $G_{att}$ and G by a channel-wise concatenation and a 1×1 convolution, getting another output of HAHI, denoted as $G_o$. This process achieves the affinity modeling and the feature interaction between the Transformer and the CNN branches.
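To make the sampling step of Eq. 3 concrete, the following is a minimal single-query, single-head sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bilinear_sample`, `deform_attn_single_query`) are hypothetical, the offsets and attention logits are passed in directly rather than predicted from the query by a linear projection, the weights are normalized with a softmax (one common way to make them sum to 1), and samples falling outside the map are treated as zeros.

```python
import math


def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a map of feature vectors at a fractional (x, y).

    fmap[row][col] is a list of floats; x indexes columns, y indexes rows.
    Neighbours outside the map contribute zeros (an assumption of this sketch).
    """
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * c
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                for ch in range(c):
                    out[ch] += wy * wx * fmap[yi][xi][ch]
    return out


def softmax(logits):
    """Normalize logits so that the resulting weights sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deform_attn_single_query(value_map, p_q, offsets, logits):
    """Compute sum_k A_qk * x_v(p_q + dp_qk) for one query location p_q.

    offsets: list of (dx, dy) sampling offsets, one per sampling point.
    logits:  unnormalized attention scores, one per sampling point.
    """
    weights = softmax(logits)
    out = [0.0] * len(value_map[0][0])
    for a, (dx, dy) in zip(weights, offsets):
        sample = bilinear_sample(value_map, p_q[0] + dx, p_q[1] + dy)
        for ch in range(len(out)):
            out[ch] += a * sample[ch]
    return out


# A 2x2 single-channel value map, one query at (0, 0), two sampling points.
value_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
result = deform_attn_single_query(
    value_map, p_q=(0.0, 0.0), offsets=[(0.5, 0.5), (1.0, 0.0)], logits=[0.0, 0.0]
)
```

Because each query touches only a fixed number of sampling points, the cost grows linearly with the number of queries instead of quadratically as in dense global attention, which is the motivation given in the text.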
All the outputs of the HAHI (i.e., $F_o$ and $G_o$) are sent to the baseline decoder (Alhashim and Wonka, 2018a; Li et al., 2021b) for depth estimation, which consists of several consecutive UpConv layers, as illustrated in the right part of Fig. 2. The network optimization loss, updated from (Eigen et al., 2014), is
$$\mathcal{L} = \alpha \sqrt{\frac{1}{T} \sum_i h_i^2 - \frac{\lambda}{T^2} \Big(\sum_i h_i\Big)^2}, \tag{4}$$
where $h_i = \log \hat{d}_i - \log d_i$ with the ground truth depth $d_i$ and predicted depth $\hat{d}_i$. T denotes the number of pixels having valid ground truth values. Following (Bhat et al., 2021), we use λ = 0.85 and α = 10 for all experiments.
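The loss can be sketched in a few lines of plain Python. This is a hedged illustration that assumes the scaled scale-invariant form used by AdaBins (Bhat et al., 2021), i.e. α times the square root of mean(h²) minus λ times mean(h)²; the function name `silog_loss` and the convention that a ground-truth value of 0 or less marks an invalid pixel are assumptions made for this example.

```python
import math


def silog_loss(pred, target, lam=0.85, alpha=10.0):
    """Scale-invariant log loss over pixels with valid ground truth.

    pred, target: flat sequences of positive depths in meters.
    Pixels with target <= 0 are skipped (assumed invalid-pixel convention).
    """
    h = [math.log(p) - math.log(t) for p, t in zip(pred, target) if t > 0]
    n = len(h)
    mean_sq = sum(v * v for v in h) / n   # (1/T) * sum h_i^2
    mean = sum(h) / n                     # (1/T) * sum h_i
    # Clamp at zero to guard against tiny negative values from rounding.
    return alpha * math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


perfect = silog_loss([1.0, 2.0, 5.0], [1.0, 2.0, 5.0])
doubled = silog_loss([2.0, 4.0], [1.0, 2.0])
```

With λ = 1 a global scale error would not be penalized at all; with λ = 0.85 a prediction that is uniformly twice the ground truth still incurs a loss, but a smaller one than the plain log RMSE would give.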

Experiment Results

Datasets
KITTI is a dataset that provides stereo images and corresponding 3D laser scans of outdoor scenes captured by equipment mounted on a moving vehicle (Geiger et al., 2013). The RGB images have a resolution of around 1241 × 376, while the corresponding ground truth depth maps are of low density. Following the standard Eigen training/testing split (Eigen et al., 2014), we use around 26K images from the left view for training and 697 frames for testing. For evaluation, we use the crop defined by Garg et al. (Garg et al., 2016) and upsample the prediction to the ground truth resolution.
For the online KITTI depth prediction, we use the official benchmark split (Uhrig et al., 2017), which contains around 72K training data, 6K selected validation data and 500 test data without the ground truth.
NYU-Depth-v2 provides images and depth maps for different indoor scenes captured at a pixel resolution of 640×480 (Silberman et al., 2012). Following previous works, we train our network on a 50K RGB-Depth pairs subset. The predicted depth maps of DepthFormer have a resolution of 320 × 240 and an upper bound of 10 meters. We upsample them by 2× to match the ground truth resolution during both training and testing. We evaluate the results on the predefined center cropping by Eigen et al. (Eigen et al., 2014).
SUN RGB-D is an indoor dataset consisting of around 10K images with high scene diversity collected with four different sensors (Song et al., 2015; Xiao et al., 2013; Janoch et al., 2013). We apply this dataset for generalization evaluation. Specifically, we cross-evaluate our NYU pre-trained models on the official test set of 5050 images without further fine-tuning. The depth upper bound is set to 10 meters. Note that this dataset is only for evaluation. We do not train on this dataset.

Evaluation Metrics
In our experiments, we follow the standard evaluation protocol of the prior work (Eigen et al., 2014) to confirm the effectiveness of DepthFormer. For the NYU, KITTI Eigen split and SUN RGB-D datasets, we utilize the accuracy under the threshold ($\delta_i < 1.25^i$, i = 1, 2, 3), mean absolute relative error (AbsRel), mean squared relative error (SqRel), root mean squared error (RMSE), root mean squared log error (RMSElog), and mean log10 error (log10) to evaluate our methods. In terms of the online KITTI benchmark (Uhrig et al., 2017), we use the scale-invariant logarithmic error (SILog), percentage of AbsRel and SqRel (absErrorRel, sqErrorRel), and root mean squared error of the inverse depth (iRMSE).
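For readers unfamiliar with these metrics, the sketch below computes them in plain Python using their commonly used definitions from the Eigen et al. (2014) protocol; in particular, the threshold accuracy is taken as the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25^i. The function name `depth_metrics` and the use of a non-positive ground-truth value as an invalid-pixel marker are assumptions of this example, and dataset-specific cropping and depth caps are omitted.

```python
import math


def depth_metrics(pred, gt):
    """Standard monocular depth metrics over pixels with valid ground truth.

    pred, gt: flat sequences of depths in meters; pixels with gt <= 0 are
    skipped (assumed invalid-pixel convention). Predictions must be positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {
        f"delta{i}": sum(r < 1.25 ** i for r in ratios) / n for i in (1, 2, 3)
    }
    metrics["abs_rel"] = sum(abs(p - g) / g for p, g in pairs) / n
    metrics["sq_rel"] = sum((p - g) ** 2 / g for p, g in pairs) / n
    metrics["rmse"] = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    metrics["rmse_log"] = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    metrics["log10"] = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    return metrics


# Two exact predictions and one that is twice the ground truth.
m = depth_metrics(pred=[1.0, 2.0, 4.0], gt=[1.0, 2.0, 2.0])
```

The threshold accuracies are "higher is better", while all error metrics are "lower is better".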

Implementation Details
Since we find there is no commonly used codebase for the monocular depth estimation task, we develop a unified benchmark based on MMSegmentation (Contributors, 2020). We believe it can further boost the development of this field and enable fair comparisons. We train the entire network with a batch size of 2 and a learning rate of 1e-4 for 38.4k iterations on a single node with 8 NVIDIA V100 32GB GPUs, which takes around 5 hours. The linear learning rate warm-up strategy is applied for the first 30% of iterations following (Bhat et al., 2021). The cosine annealing learning rate strategy is adopted for the learning rate decay. Following (Ranftl et al., 2021; Liu et al., 2021), we sample N = 4 results from the Transformer features as the output of the Transformer branch. The number of reference points in deformable attention modules and $C_h$ are empirically set to 8 and the median value of the channel dimension of F, respectively. Following (Zhu et al., 2021), we adopt 8 deformable attention heads. The default patch size of ViT-Base and window size of Swin Transformer are 16 and 12, respectively. Following previous works, our encoders are pre-trained on the ImageNet dataset (Krizhevsky et al., 2012) and then fine-tuned on depth datasets.
As for the pilot study, the baseline model consists of an encoder and a decoder. We adopt the decoder in (Alhashim and Wonka, 2018a) as a default setting and mainly focus on the influence of encoder choices. For the pure convolution encoder, we utilize the standard ResNet-50 (He et al., 2016). For the pure Transformer encoder, we adopt the ViT-B (Dosovitskiy et al., 2021) following the design of the DPT (Ranftl et al., 2021) and Swin-T (Liu et al., 2021). Notably, our encoders are pre-trained on the ImageNet classification, which is the standard protocol of supervised monocular depth estimation. During training, we adopt the AdamW optimizer. The weight decay is set to 0.01. We empirically use the 1-cycle policy with the learning rate lr = 6e-5 for the Transformer-based model and lr = 1e-4 for the ResNet-based model. We also apply a linear warm-up scheduler for the first 500 iterations. The cosine annealing learning rate strategy is adopted for the learning rate decay.
Table 5: Results of models trained on the NYU-Depth-v2 dataset and tested on the SUN RGB-D dataset (Song et al., 2015) without fine-tuning. The reported numbers are from (Bhat et al., 2021).
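The main training schedule described above (linear warm-up over the first 30% of 38.4k iterations, then cosine annealing from a peak of 1e-4) can be sketched as a pure function of the iteration index. The function name `learning_rate`, the warm-up starting near zero, and the final learning rate of 0 are assumptions of this sketch; the text does not state the start and end values.

```python
import math


def learning_rate(step, total_steps=38400, base_lr=1e-4, warmup_frac=0.3, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down to min_lr.

    step is the 0-based iteration index. min_lr and the warm-up start value
    are assumptions; only base_lr, total_steps and warmup_frac come from the text.
    """
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few iterations.
schedule = {s: learning_rate(s) for s in (0, 5000, 11519, 11520, 25000, 38400)}
```

In a PyTorch training loop the same shape is usually obtained from a built-in scheduler rather than a hand-written function; the explicit form here is only meant to show how the two phases join at the peak.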
For evaluation, we divide the depth range into 0-20m, 20-60m and 60-80m. The results of 0-20m and 60-80m indicate the model performance in predicting the depth of near and distant objects, respectively. We also present more detailed results in Fig. 13, where the model performance at each tick is shown in a curve. When visualizing the results, we utilize the color map of jet and reversed magma for NYU and KITTI, respectively.
[Figure panel labels: RGB, DenseDepth (Alhashim and Wonka, 2018b), BTS (Lee et al., 2019), DPT (Ranftl et al., 2021), AdaBins (Bhat et al., 2021), Ours]

Comparison to State-of-the-Arts
We compare the proposed methods with the leading monocular depth estimation models. Primarily, we choose AdaBins (Bhat et al., 2021) as our main competitor, which is a solid counterpart and achieved state-of-the-art results on all of the datasets we consider. We reproduce the AdaBins code and load the pre-trained models provided by the authors to get the resulting depth images. Other results are from their official code.
NYU-Depth-v2: Tab. 2 lists the performance comparison results on the NYU-Depth-v2 dataset. While the performance of the state-of-the-art models tends to approach saturation, DepthFormer outperforms all the competitors with prominent margins in all metrics. This indicates the effectiveness of our proposed methods. Qualitative comparisons can be seen in Fig. 7. DepthFormer achieves more accurate and sharper depth estimation results. We combine camera parameters and predicted depth maps to back-project the 2D images into the 3D world. As shown in Fig. 8, our reconstructed scenes are satisfying, with sharp object boundaries and reasonable depth estimations.
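The back-projection of a predicted depth map into 3D can be illustrated with a short plain-Python sketch. It assumes an ideal pinhole camera without lens distortion and with known intrinsics; the function name `back_project` and the parameter names `fx`, `fy`, `cx`, `cy` (focal lengths and principal point in pixels) are choices made for this example, since the text does not specify the procedure in detail.

```python
def back_project(depth, fx, fy, cx, cy):
    """Lift a depth map to 3D camera-frame points under a pinhole model.

    depth: list of rows, each a list of depths in meters (values <= 0 are skipped).
    Returns a list of (X, Y, Z) tuples, one per valid pixel, in row-major order.
    """
    points = []
    for v, row in enumerate(depth):          # v: pixel row index
        for u, z in enumerate(row):          # u: pixel column index
            if z <= 0:
                continue                     # no valid depth at this pixel
            x = (u - cx) * z / fx            # X = (u - cx) * Z / fx
            y = (v - cy) * z / fy            # Y = (v - cy) * Z / fy
            points.append((x, y, z))
    return points


# A 2x2 toy depth map with one invalid pixel.
cloud = back_project([[2.0, 0.0], [2.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Pairing each 3D point with the RGB value of its source pixel gives a colored point cloud of the kind typically used for such reconstructions.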

KITTI: We evaluate on the Eigen split (Eigen et al., 2014) and report the results in Tab. 3. DepthFormer significantly outperforms all the leading methods. Qualitative comparisons can be seen in Fig. 9. We then train our model on the training set of the standard KITTI benchmark split and submit the prediction results of the testing set to the online website. We report the results in Tab. 4. While a saturation phenomenon persists in sqErrorRel, DepthFormer still achieves a 16% improvement on this metric and achieves the most competitive result on the highly competitive benchmark as of the submission time of Nov. 16th, 2021. We report some qualitative comparison results in Fig. 10.
SUN RGB-D: Following AdaBins (Bhat et al., 2021), we conduct a cross-dataset evaluation by training our models on the NYU-Depth-v2 dataset and evaluating them on the test set of the SUN RGB-D dataset without any fine-tuning. As shown in Tab. 5, significant improvements in all the metrics indicate an outstanding generalization performance of DepthFormer. Qualitative results are shown in Fig. 11. It is encouraging that DepthFormer generalizes strongly in cross-dataset evaluation. In particular, our method predicts accurate depth for extremely dark areas, which are very hard to handle without training on the corresponding dataset.

Ablation Studies
For our ablation study, we conduct evaluations with each component of DepthFormer to prove the effectiveness of our method on the NYU and KITTI datasets.
Effectiveness of key components: We first validate the effectiveness of the key components of DepthFormer. From the baseline network (i.e., ResNet-50, Swin-T), we reinforce the network with our proposed methods and evaluate the improvement of the model performance. The results are reported in Tab. 6. As the additional convolution branch and the HAHI are adopted, the overall performance is significantly improved, which demonstrates the effectiveness of our methods. Moreover, following previous methods (Yang et al., 2021; Ranftl et al., 2021), we utilize a larger-scale dataset (i.e., ImageNet-22K) to pre-train our encoder. The results (+LP) indicate that the Transformer encoder can better benefit from the larger model capacity and the larger-scale pre-training dataset compared with the CNN encoder.
Fine-grained evaluation on convolution branch: The standard CNN encoder can be divided into several sequential blocks. We further scrutinize the influence of different level convolution features on the model performance. Following the default setting, we adopt ResNet-50 as the additional convolution branch. Results are shown in Fig. 12. Interestingly, the model achieves the best performance with only one convolutional block and then degrades if more blocks are added. A possible explanation for this might be that the consecutive convolutions wash out low-level features, and the gradually reducing spatial resolution discards the fine-grained information (Yang et al., 2021). Adopting the first block achieves a win-win scenario: it optimizes accuracy by preserving crucial local information while reducing complexity. This can reduce the training time by 2.5× or more and likewise decrease memory consumption, enabling us to easily scale our Transformer branch to large models.
Fine-grained evaluation on HAHI: Since the HAHI consists of a deformable self-attention module (DSA) for hierarchical aggregation and a deformable cross-attention module (DCA) for heterogeneous interaction, we conduct more detailed ablation studies on both of these two modules.
For fair comparison, we choose the Swin-T with CB as the default backbone. The results are reported in Tab. 7. We propose to apply the attention mechanism on all the hierarchical features (multi-level DSA) for sufficient aggregation. Compared with the variant where only each single-layer feature is considered in the attention module, denoted as single-level DSA, the multi-level aggregation strategy obtains a 4.9% improvement on RMS. This demonstrates that the multi-level aggregation strategy is much more effective. When DCA is added without the multi-level DSA, the model performance is seriously impaired. However, with the multi-level DSA, DCA achieves a 2.2% improvement on RMS, verifying the importance of both the multi-level DSA and the DCA for heterogeneous interaction. We infer that the reason is the large discrepancy between the heterogeneous features. Multi-level DSA achieves the alignment of the features, which propels the affinity modeling. All the results demonstrate the effectiveness of our proposed HAHI module.
Fig. 12: Effect of the convolution branch block on NYU depth estimation performance. We can observe that performance decreases after the first block of R-50 (He et al., 2016).
Details about pilot study results: We have discussed that the CNN branch can provide local information lost in the Transformer branch and the HAHI further promotes the depth estimation via feature enhancement and affinity modeling. They improve the model performance, especially on near object depth estimation. Tab. 8 demonstrates the effectiveness of our methods. Moreover, we draw more fine-grained results in Fig. 13. Interestingly, the Swin Transformer-based model achieves better performance compared with ResNet50-based ones on near object depth estimation and satisfactory results compared with ViT-based ones on distant object depth estimation. We infer that the hierarchical design of the Swin Transformer benefits the extraction of the local information, and the special attention mechanism successfully models the long-range correlation. To compare the model performance in a more direct manner, we also present qualitative comparison results in Fig. 6. One can observe that sharper and more accurate results can be achieved with our proposed CNN branch and HAHI module.
Inference time evaluation: Besides the accuracy (δ1) of the depth prediction, the inference speed is important as well. We thus evaluate the inference time of DepthFormer w/o HAHI on the KITTI validation set. The reso-

Fig. 13 Fine-grained quantitative results of our pilot study on the KITTI dataset. Axes: Abs Rel (lower is better) versus depth interval (m). We divide the depth range (0m - 80m) into 80 intervals. A point (i, j) in the plot means that the Abs Rel of the model is j on the depth interval (i, i + 1] m. Our method achieves a trade-off between long- and short-range estimation.
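The per-interval evaluation described in the Fig. 13 caption can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function name `abs_rel_per_interval`, the flat-list input format, and the rule of skipping pixels whose ground truth lies outside (0, 80] m are assumptions made for the example.

```python
import math


def abs_rel_per_interval(pred, gt, max_depth=80):
    """Mean absolute relative error on each depth interval (i, i + 1] m.

    pred, gt: flat sequences of predicted / ground-truth depths in metres.
    Pixels whose ground truth lies outside (0, max_depth] are skipped
    (assumed validity rule). Returns a list of length max_depth; entry i is
    the Abs Rel on (i, i + 1], or None if no valid pixel falls into it.
    """
    sums = [0.0] * max_depth
    counts = [0] * max_depth
    for p, g in zip(pred, gt):
        if not (0.0 < g <= max_depth):
            continue
        i = math.ceil(g) - 1  # g in (i, i + 1] maps to bin index i
        sums[i] += abs(p - g) / g
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy usage with made-up values: two valid intervals, two invalid pixels.
gt = [1.0, 2.0, 2.0, 0.0, 100.0]
pred = [1.5, 1.0, 3.0, 5.0, 5.0]
curve = abs_rel_per_interval(pred, gt)
print(curve[:3])
```

Plotting entry i of the returned list against i gives one curve of the kind shown in Fig. 13. Intervals without valid ground truth are left as None so that they can be omitted from the plot and are not drawn as zero error.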

Conclusion
We have presented DepthFormer, a novel framework for accurate monocular depth estimation. Our method fully exploits the long-range correlations and the local information through an encoder consisting of a Transformer branch and a CNN branch. Since independent branches with late fusion lead to insufficient feature aggregation for the decoder, we propose the hierarchical aggregation and heterogeneous interaction module to enhance the multi-level features and further model the feature affinity. DepthFormer achieves significant improvements over state-of-the-art methods on the most popular and challenging datasets. We hope our study can encourage more works applying the Transformer architecture to monocular depth estimation and inspire the framework design of other tasks.
Potential impact: Beyond the direct application of our work to autonomous driving or spatial reconstruction, there are several avenues that warrant future investigation. For example, the common dense global attention in the Transformer might be excessive. In terms of depth estimation, several key points that indicate the scene structure could be enough to provide crucial long-range information. Designing a more dedicated attention mechanism would improve the effectiveness of the Transformer branch. Furthermore, the HAHI is input-agnostic, and including other modalities such as sparse LiDAR would enhance performance and generalization. Finally, due to the lack of theoretical guarantees, future work to improve the applicability of DepthFormer might consider challenges of explainability and transparency.

Fig. 2
Fig. 2 An overview of DepthFormer. It comprises three major components: an encoder consisting of a Transformer branch and a convolution branch, a hierarchical aggregation and heterogeneous interaction (HAHI) module, and a standard decoder. The HAHI enhances the Transformer features F and models the affinity between the Transformer and the convolution features G.

Fig. 3
Fig. 3 Failure cases of previous methods on the NYU dataset caused by the limited receptive field of the convolution operator.

Fig. 4
Fig. 4 Demonstration of artifacts and the loss of local information. Our method provides consistent and sharp depth estimation.

Fig. 5
Fig. 5 Illustration of our proposed HAHI. The deformable self-attention module enhances the input Transformer features F. The deformable cross-attention module models the affinity between the Transformer features F and the convolution features G in a set-to-set translation manner. The outputs of the HAHI, F_o and G_o, are sent to the decoder for the final aggregation. Because Swin Transformer layers are adopted to extract the hierarchical features, F exhibits different sizes and channels.

Fig. 6
Fig. 6 Qualitative comparisons in our pilot study.

Fig. 10
Fig. 10 Qualitative comparison with the state-of-the-art on the KITTI benchmark, better viewed by zooming on screen. Deeper red pixels in the error maps indicate higher errors. Deeper blue means lower errors. The figures are from the official KITTI benchmark website.

Fig. 11
Fig. 11 Qualitative comparison on the SUN RGB-D dataset.

Fig. 14
Fig. 14 Accuracy δ1 vs. inference time on the KITTI validation set. The speed is measured using a single RTX 3060 GPU.

Table 1
Pilot study results on the KITTI dataset. Overall means the measurements are made from 0m to 80m.

Table 2
Comparison of performances on the NYU-Depth-v2 dataset. The reported numbers are from the corresponding original papers.

Table 3
Comparison of performances on the KITTI validation dataset. The reported numbers are from the corresponding original papers. Measurements are made for the depth range from 0m to 80m. Best / second-best results are marked bold / underlined. R-50 and E-B5 are short for ResNet-50 and EfficientNet-B5 (Tan and Le, 2019), respectively. C_i represents the i-th block of the ResNet-50 network.

Table 4
Comparison of performances on the KITTI depth estimation benchmark test set. Reported numbers are from the official benchmark website.

Table 6
Ablation study results on the NYU dataset. CB: Convolution Branch. LP: Larger-scale pre-training dataset (22K ImageNet) for boosting the model performance. For fair comparison, we utilize the 22K-ImageNet pre-trained ResNet-50-x3 provided by

Table 7
Ablation study of the HAHI module on the NYU dataset. DSA, DCA: Deformable self-attention and deformable cross-attention.

Table 8
More detailed quantitative ablation results on the KITTI dataset.