1 Introduction

In the field of environment perception for navigation, computer vision tasks such as point cloud recognition, 3D reconstruction, and point cloud semantic segmentation often rely on the depth information provided by depth estimation algorithms [1,2,3,4,5,6]. Many advanced monocular depth estimation methods rely heavily on annotated datasets [7, 8]. Unfortunately, obtaining accurate depth labels for real scenes often requires a significant amount of time and resources. Therefore, unsupervised monocular depth estimation has become a popular research direction in the field of depth estimation [9, 10]. Several unsupervised methods have been proposed, such as self-supervision using video sequences [11] and domain adaptation, which generalizes network models trained on richly annotated data to unannotated real data. The common steps of UDA depth estimation include style transfer, feature extraction, and depth estimation. Earlier UDA depth estimation used a one-way translation network to convert synthetic images to the real domain, and then trained the task network in that domain [12, 13]. In recent years, the approach has been gradually improved by using a CycleGAN [14] network to achieve bidirectional symmetric style transfer between the two domains [14,15,16,17,18]. The two types of translated results are fed into the synthetic domain and the real domain, respectively, and each uses an independent task network for learning. According to statistics from recent review papers, 33 of the 49 papers in the field of unsupervised domain adaptation in recent years are GAN-based [19, 20]. The effectiveness of the CycleGAN model is particularly prominent: with the advantages of dual-cycle generative adversarial learning, it achieves high semantic and phase consistency between the target domain and the source domain, leading to better domain transfer. This advantage makes it the mainstream method for unsupervised domain adaptation [19,20,21,22]. While the above strategies have achieved good results in domain-adaptive depth estimation, most of the research focuses on optimizing the style transfer between the two domain inputs [16]. They ignore the style correction of the two cycle branches at the end, resulting in missing depth information at the edges of object instances in the predicted depth map, as well as depth holes (predicted distances close to infinity) caused by strong light reflections from glass, car windows, mirrors, etc. This issue also affects their prediction accuracy metrics. In our work, we use a GAN-based UDA model, but specifically optimize the network for the final depth estimation task. Some research has shown that the self-attention mechanism is conducive to discovering contextual connections within a single image, making it useful for pixel-wise image processing tasks such as depth estimation [23,24,25,26]. Edge constraints can also preserve more shape semantic information on top of geometric constraints [24, 27,28,29], which effectively improves the quality of predicted depth maps. Therefore, we make full use of these two features. First, in the single-domain depth estimation task network, we insert a self-attention module to enhance the pixel-wise feature extraction capability of the network and strengthen the pixel correlations within a single image, filling depth holes [30].
Second, after the depth map outputs of the respective networks in the two domains, we add an edge prediction network to guide attention to high-frequency boundary information throughout the network training process, so that the depth estimation network outputs a depth map with more accurate target contours. We also use a deep learning algorithm to extract edge information from raw images in the target domain, which serves as an unsupervised prior for the unlabeled dataset in the real domain. A loss function is formed with the output of the edge prediction network in each of the two domains. Edge consistency constraints on similar targets in different domains are thereby established to narrow the gap between domains, solving the problem of missing depth on large-area targets caused by domain-adaptation deviations. Our contributions can be summarized as:

  1.

    A new UDA depth estimation framework is proposed that establishes edge consistency constraints between domains, which reduces the domain gap and improves domain adaptation performance.

  2.

    We effectively introduce an edge-guided self-attention mechanism into the single-domain depth estimation tasks to fill depth map holes and missing areas, resulting in high-quality predicted depth maps.

  3.

    Through extensive experiments, we validate the efficacy of our approach on the KITTI dataset [31] and demonstrate its capability to generalize well to the Make3D dataset [32].

2 Related Work

2.1 UDA Depth Estimation

Monocular depth estimation algorithms train a neural network model on the relationships between pixel values, enabling it to estimate depth information from a single image [5, 33, 34]. However, collecting the datasets needed to train such models is challenging. To address this issue, unsupervised domain-adaptive depth estimation algorithms have been introduced.

The approach uses synthetic data and its labels as the source domain, with real but unlabeled data as the target domain. The objective is to train a depth estimation task network in the source domain in a supervised manner and then generalize it to the target domain, creating a network model that can predict depth in the target domain [1, 35,36,37]. The UDA depth estimation algorithm has undergone multiple iterative improvements.

Earlier, Atapour et al. [38] developed a two-stage framework in which they first trained an adversarial network to translate natural images into synthetic images and then trained a depth estimation network in the synthetic image domain in a supervised manner. Then Kundu et al. [13] proposed a content-congruent regularization method to tackle the mode collapse issue caused by domain adaptation in high-dimensional feature space. Zheng et al. [12] designed a wide-spectrum input translation network in T\(^2\)Net to further unify real images with synthetic images, leading to more robust translations. Zhao et al. [15] proposed a novel geometry-aware symmetric domain adaptation network, i.e., GASDA, which exploits the epipolar geometry of stereo images, thereby suppressing undesired distortions during the image translation stage and obtaining better depth map quality.

GASDA [15] adopts an advanced UDA algorithm framework based on generative adversarial learning [14, 17, 37, 39, 40]. However, there is still potential for improvement in certain aspects of its network structure, and we have developed our own algorithm framework based on it.

2.2 Attention Mechanism in Depth Estimation

The attention mechanism is a powerful tool for deep learning image processing tasks, allowing the network model to focus on extracting effective information and improving the distinguishability of target features [6, 23,24,25, 41,42,43]. Woo et al. proposed the CBAM algorithm [44], which guides the model to focus on spatial hierarchies and channel-wise spatial fusion hierarchies. Through the CBAM module, the model can dynamically learn channel and spatial attention weights in each convolutional block, adjusting the importance of different channels and spatial locations based on the content of the input feature map. In the context of pixel-wise monocular depth estimation, Xu et al. [43] proposed a multiscale spatial attention-guided block to enhance the saliency of small objects and built a double-path prediction network to estimate the depth map and semantic labels simultaneously. CBAM tends to focus on the importance of local regions and lacks global connections. Self-attention mechanisms, by contrast, with their per-pixel correlation mechanism, can consider both the global and local aspects of feature extraction, making them more suitable for pixel-wise image tasks with stronger requirements on contextual relationships. Chen et al. [41] proposed a pixel-wise self-attention model for monocular depth estimation that captures contextual information associated with each pixel. Huang et al. [24] effectively restored depth map integrity by using a self-attention mechanism.

At present, attention mechanisms are widely used in supervised depth estimation networks [45].

Therefore, to address the issue of depth missing holes and large-area target depth loss in unsupervised domain adaptive depth estimation [12, 15, 30], we draw inspiration from the attention mechanism’s characteristics and add a self-attention module to the depth estimation network, resulting in excellent performance.

2.3 Edge Information for Pixel-wise Image Tasks

In deep learning-based image tasks, Xu et al. [46] observed that convolutional neural networks tend to focus more on learning low-frequency content than high-frequency information in the image. Pixel-wise image tasks place higher demands on the ability to learn high-frequency regions such as edge contours [24, 28]. Liu et al. [27] developed a contour adaptation network to extract marginal information of brain tumors in magnetic resonance imaging (MRI) slices, aiding the task of cross-modality brain tumor segmentation.

In the above pixel-wise image tasks, edge information is usually superimposed on the input to enhance high-frequency information within a single domain [27, 29, 36, 47]. For domain-adaptive depth estimation, we instead prefer to strengthen the connection between the two domains and use edge information to establish edge consistency constraints at the output stage.

3 Methodology

3.1 Review of GASDA and Modifications

Fig. 1

An overview of the algorithm framework. The framework consists of four main parts: image style transfer, source domain task, target domain task, and unsupervised cues. (i) The style transfer network contains two generators and two discriminators ( \(i.e. , \textrm{G}_{\textrm{T} \rightarrow \textrm{S}}, \textrm{G}_{\textrm{S} \rightarrow \textrm{T}}, \textrm{D}_{\textrm{T}}, \textrm{D}_{\textrm{S}}\) ); the transfer follows CycleGAN [14]. (ii) The source domain and the target domain have a symmetric task network structure, including the improved attention-mechanism depth estimation networks (i.e.,  SA-F\(_{\textrm{S}}\) and SA-F\(_\textrm{T}\)) and auxiliary edge prediction networks (i.e.,  Unet-s and Unet-t). (iii) Unsupervised cues include the stereo image pairs and edge GT (edge ground truth). (iv) The green arrows denote the source domain data flow path and the blue arrows the target domain data flow path; GC is the geometric consistency loss, and L\(_{\textrm{1}}\) is the L1 loss

GASDA [15] is an advanced UDA monocular depth estimation algorithm based on adversarial training. In the style transfer stage, GASDA [15] uses the CycleGAN [14] network to obtain, through adversarial training, the style mappings from the target domain to the source domain and from the source domain to the target domain. In the depth estimation stage, the source domain and the target domain have independent encoder-decoders, and the input of each encoder contains both the images of its own domain and the images translated from the other domain.

We adopt the advanced UDA framework of the aforementioned GASDA algorithm [15] and incorporate some of its constraints. For the depth predictions of the source domain image and its style-transferred counterpart, the ground-truth depth map of the source domain is used to construct an L1 loss for each. For the depth predictions of the target domain image and its style-transferred counterpart, since there is no direct ground truth, we introduce a cross-domain consistency loss and binocular stereo image pairs as constraints. Meanwhile, we design our own encoder-decoder. To give the encoder more powerful feature extraction capabilities, we replace the original encoder with EfficientNet-B5 [48] and insert a self-attention mechanism module. As shown in Fig. 1, we add an edge detection network at the end of the depth estimation network and introduce additional edge information to construct inter-domain edge consistency constraints, which further enhances the training effect and improves network performance. We provide more details on each additional module in the following sections.

Fig. 2

Self-attention architecture (Sect. 3.2), the feature layer size is represented as a tensor in the colored square, e.g., H\(\times \)W\(\times \)176 means the number of channels is 176, the resolution of a single channel is H\(\times \)W, 1\(\times \)1 conv represents a standard convolution process with the kernel size of 1\(\times \)1. (Because the batch parameter does not participate in the calculation, the batch parameter is omitted here)

3.2 Edge-guided Self-attention Mechanism

As a form of attention mechanism, the self-attention mechanism is commonly used to optimize pixel-wise tasks such as semantic segmentation and depth estimation. It emphasizes the correlation between different pixels within a single channel, thereby facilitating the exploration of contextual connections between pixels and resulting in an output with superior detail quality. This mechanism has proven particularly effective for the depth completion task [23,24,25, 41, 45, 49]. Meanwhile, attending to edge information during network prediction helps generate depth maps with clear boundaries and complete structure [27, 28]. In UDA depth estimation, we attempt to incorporate self-attention guided by edge information into the network to address the problem of depth map holes. Therefore, as shown in Fig. 2, we draw inspiration from the non-local network structure [50] to develop a self-attention block and insert it between the encoder and decoder of each domain's depth estimation task network (SA-F\(_{\textrm{S}}\) and SA-F\(_\textrm{T}\)). First, the output of the encoder is passed through three 1\(\times \)1 convolutions to obtain three feature vectors, namely query (q), key (k), and value (v). The query vector and key vector are then input into the pairing function, which can be expressed as:

$$\begin{aligned} f\left( x_i, x_j\right) =e^{\left( W_q x_i\right) ^T\left( W_k x_j\right) } \end{aligned}$$
(1)

where x is the output of the encoder, i is a position in the input feature map, j indexes all positions, and \(W_q\) and \(W_k\) are the weight matrices learned by the 1\(\times \)1 convolutions. The batch matrix multiplication computes the correlation between the i-th pixel position and all other positions. We then use the softmax function to limit the output value of \(f\left( x_i, x_j\right) \) to between 0 and 1, making the training gradient more stable. After performing a second batch matrix multiplication between the correlation map and the value feature vector, a set of pixel features integrated with the self-attention mechanism is obtained. The overall function is expressed as:

$$\begin{aligned} y={softmax}\left( x^T W_q^T W_k x\right) g(x) \end{aligned}$$
(2)

where \(g(x)=W_v x\) is the value feature vector. After the channel dimension of the result is restored through a 1\(\times \)1 convolution, an element-wise addition with the shortcut connection is performed to obtain the output of the module.
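For concreteness, the following is a minimal PyTorch sketch of such a non-local self-attention block. The class name, the reduced channel width, and the defaults are illustrative assumptions; only the query/key/value 1\(\times \)1 convolutions, the softmax-normalized pairing function of Eqs. (1)-(2), the channel-restoring convolution, and the shortcut addition follow the description above.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Minimal non-local self-attention block (sketch of Sect. 3.2).

    The reduced channel width is an illustrative choice; the paper inserts
    this block between the EfficientNet-B5 encoder and the decoder.
    """
    def __init__(self, in_channels, reduced_channels=None):
        super().__init__()
        reduced_channels = reduced_channels or max(in_channels // 2, 1)
        # 1x1 convolutions producing query, key and value feature maps
        self.query = nn.Conv2d(in_channels, reduced_channels, kernel_size=1)
        self.key = nn.Conv2d(in_channels, reduced_channels, kernel_size=1)
        self.value = nn.Conv2d(in_channels, reduced_channels, kernel_size=1)
        # 1x1 convolution that restores the original channel count
        self.out = nn.Conv2d(reduced_channels, in_channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # B x HW x C'
        k = self.key(x).flatten(2)                      # B x C' x HW
        v = self.value(x).flatten(2).transpose(1, 2)    # B x HW x C'
        # pairing function f(x_i, x_j), normalised with softmax (Eqs. 1-2)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)   # B x HW x HW
        y = torch.bmm(attn, v)                          # B x HW x C'
        y = y.transpose(1, 2).reshape(b, -1, h, w)
        # restore channels and add the shortcut connection
        return x + self.out(y)
```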

Then we add an auxiliary edge prediction network (Unet-t and Unet-s) at the end of the depth prediction network in each domain, as shown in Fig. 3. This encourages the preceding network and the self-attention module to prioritize the boundaries in the image, resulting in high-quality, clear boundary contours in the output depth map.
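A compact U-Net of the kind sketched in Fig. 3 could look as follows. The channel widths, the number of stages (only two shown), and the sigmoid output are illustrative assumptions; the actual Unet-t/Unet-s follow the configuration in Fig. 3.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions with ReLU, as in a standard U-Net stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class EdgeUNet(nn.Module):
    """Small auxiliary edge prediction network (sketch of Unet-t / Unet-s).

    Takes a single-channel depth map and predicts an edge probability map.
    Assumes even spatial dimensions so the skip connection shapes match.
    """
    def __init__(self, in_ch=1, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 1, kernel_size=1)

    def forward(self, depth_map):
        e1 = self.enc1(depth_map)            # full-resolution features
        e2 = self.enc2(self.pool(e1))        # half-resolution features
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.head(d1))  # edge probability map
```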

3.3 Consistency Constraint Between Two Domains

In previous domain-adaptive depth estimation approaches, since there are no supervised labels available in the target domain, the target-domain image is fed into the source-domain task network, and a depth consistency loss with the target-domain output is established as a form of supervision. Nevertheless, this approach lacks structural constraints, making it susceptible to producing distorted boundaries and even missing target depth over large areas, ultimately affecting depth map quality. GASDA [15] addresses this issue by leveraging binocular stereo image pairs to create unsupervised prior constraints that suppress undesired distortions. Building upon this, within each single domain we perform depth map edge prediction using an encoder-decoder structure, i.e., Unet-t and Unet-s, and use the edge ground truth to establish constraints, further strengthening structural constraints and enhancing depth estimation quality. In the real dataset, however, we only have input images and lack the depth maps used to generate the edge ground truth. Using classical operators such as the Sobel algorithm [51] directly on the input image generates a large number of internal textures that can be mistaken for real edges, hurting edge consistency. Deep learning edge detection networks, by contrast, can effectively suppress these shallow texture details and represent deep feature information well [52]. Therefore, we use the RCF-Net [53] edge detection network to extract only the entity outlines in images, ignoring internal texture features. The extracted results are used as ground-truth supervision for training the edge prediction networks. At the same time, a boundary consistency loss function is constructed from the outputs of the independent edge predictions in both domains, which helps establish attention associations across networks and narrows the gap between the two domains. This makes transferring between domains much easier.
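As a rough sketch of how such edge pseudo-labels might be produced offline, consider the snippet below. The `edge_detector` argument stands in for a frozen, pretrained deep edge detection network (RCF-Net in our pipeline); its loading code is omitted, and the binarization threshold is an illustrative choice rather than a prescribed value.

```python
import torch

@torch.no_grad()
def make_edge_pseudo_labels(images, edge_detector, threshold=0.5):
    """Generate edge pseudo ground truth for unlabeled target-domain images.

    `edge_detector` is assumed to be a frozen, pretrained deep edge detector
    that returns a per-pixel edge probability map in [0, 1] (B x 1 x H x W).
    """
    edge_detector.eval()
    edge_prob = edge_detector(images)          # per-pixel edge probabilities
    return (edge_prob > threshold).float()     # binary edge map used as z
```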

Fig. 3

Edge detection network structure diagram (Unet-t and Unet-s). The blue blocks are the feature layers of the downsampling process, the red blocks are the feature layers of the upsampling process, and the green blocks are the downsampling feature layers that participate in the skip connections. The horizontal numbers denote the numbers of feature layers, and the vertical numbers denote the numbers of feature channels. Regarding the arrows, Max pool, Conv2d, and ConvTranspose2d represent the standard maximum pooling, convolution, and deconvolution operations in the PyTorch library, respectively; 2\(\times \)2 and 3\(\times \)3 are the convolution kernel sizes, and ReLU is the activation function

3.4 Loss Function

In terms of loss function, we continue to use some of the same loss functions as GASDA [15], including the adversarial loss in style transfer:

$$\begin{aligned} \begin{aligned}&L_{g a n}\left( G_{s 2 t}, D_t, X_t, X_s\right) ={\mathbb {E}}_{x_t \sim X_t}\left[ D_t\left( x_t\right) -1\right] +{\mathbb {E}}_{x_s \sim X_s}\left[ D_t\left( G_{s 2 t}\left( x_s\right) \right) \right] \\&L_{g a n}\left( G_{t 2 s}, D_s, X_t, X_s\right) ={\mathbb {E}}_{x_s \sim X_s}\left[ D_s\left( x_s\right) -1\right] +{\mathbb {E}}_{x_t \sim X_t}\left[ D_s\left( G_{t 2 s}\left( x_t\right) \right) \right] \end{aligned} \end{aligned}$$
(3)

Among them, the generator \(G_{s 2 t}\left( G_{t 2 s}\right) \) and the corresponding discriminator \(D_t\left( D_s\right) \) constitute a bidirectional style transfer network. \(X_t\) and \(X_s\) represent the target domain and the source domain, respectively, while \(x_t\left( x_s\right) \) is the corresponding input image. The overall loss function used is the GAN loss [14].

We establish a cycle consistency loss \(L_{c y c l e}\left( G_{t 2\,s}, G_{s 2 t}\right) \) to prevent mode collapse, and adopt the geometric consistency loss \(L_{i nf}\left( G_{t 2\,s}, G_{s 2 t}, X_s, X_t\right) \), which encourages the generators to preserve geometric information.

The loss of the entire style transfer part can be summarized as:

$$\begin{aligned} \begin{aligned} L_{style }\left( G_{t 2 s}, G_{s 2 t}, D_s, D_t\right) =&L_{g a n}\left( G_{s 2 t}, D_t, X_t, X_s\right) +L_{g a n}\left( G_{t 2 s}, D_s, X_t, X_s\right) \\&+ \lambda _1 L_{c y cle}\left( G_{t 2 s}, G_{s 2 t}\right) +\lambda _2 L_{i nf}\left( G_{t 2 s}, G_{s 2 t}, X_t, X_s\right) \\ \end{aligned} \end{aligned}$$
(4)

where \(\lambda _1\) and \(\lambda _2\) are the trade-off parameters.

For the depth estimation task, we feed the source domain image into the source domain depth estimation network to obtain the depth map. Simultaneously, after the generator transfers the same source domain image to the target domain, we input it into a target domain depth estimation network to obtain a depth map from an alternative pathway. We then use these estimated depth maps with the ground truth of the source domain dataset to build the L1 loss \(L_{stask}\left( L_{ttask}\right) \). The source domain supervised depth estimation loss \(L_{task}\) can be summarized as:

$$\begin{aligned} L_{task}\left( F_t, F_s, G_{s 2 t}\right) =L_{ttask}\left( F_t, G_{s 2 t}\right) +L_{s task}\left( F_s\right) \end{aligned}$$
(5)
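A minimal sketch of this supervised term, assuming the depth networks and the generator are callable modules and that the synthetic ground truth is dense at the prediction resolution:

```python
import torch.nn.functional as F

def task_loss(net_s, net_t, gen_s2t, x_s, depth_gt_s):
    """Supervised source-domain depth loss (sketch of Eq. 5).

    The synthetic image x_s is predicted directly by the source network net_s,
    and its translated version gen_s2t(x_s) by the target network net_t; both
    predictions are compared to the synthetic ground truth with an L1 loss.
    """
    loss_s = F.l1_loss(net_s(x_s), depth_gt_s)
    loss_t = F.l1_loss(net_t(gen_s2t(x_s)), depth_gt_s)
    return loss_s + loss_t
```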

Following GASDA [15], we also exploit the epipolar geometry of real stereo images and unsupervised cues to implement stereo geometry constraints. We employ an L1 loss and single-scale SSIM [54] to construct a geometric consistency loss \(L_{sgc}\left( L_{tgc}\right) \) in the source (target) domain.

$$\begin{aligned} \begin{aligned}&L_{t g c}\left( F_t\right) =\eta \frac{1-{SSIM}\left( x_t, x_{t 2 t}^{\prime }\right) }{2} +\mu \left\| x_t-x_{t 2 t}^{\prime }\right\| _1 \\&L_{s g c}\left( F_s, G_{t 2 s}\right) =\eta \frac{1-{SSIM}\left( x_t, x_{t 2 s}^{\prime }\right) }{2} +\mu \left\| x_t-x_{t 2 s}^{\prime }\right\| _1 \\&L_{g c}\left( F_t, F_s, G_{t 2 s}\right) =L_{t g c}\left( F_t\right) +L_{s g c}\left( F_s, G_{t 2 s}\right) \end{aligned} \end{aligned}$$
(6)

where \(L_{g c}\) is the full geometric consistency loss. In the stereo image pairs of the target domain, \(x_{t 2 t}^{\prime }\left( x_{t 2\,s}^{\prime }\right) \) is the warp of the right image based on the estimated depth map of the left image, using bilinear sampling. \(\mu \) is set to 0.85, and \(\eta \) to 0.15.
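The photometric part of this term can be sketched as follows. The 3\(\times \)3 average-pooling SSIM window and the clamping are common simplifications not specified above, and the bilinear warping of the right image from the predicted depth (e.g., via grid sampling) is omitted; `x_warped` is assumed to be that warped result.

```python
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 average-pooling window (an assumption,
    not a value taken from the paper); inputs are image tensors in [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def geometry_consistency_loss(x_left, x_warped, eta=0.15, mu=0.85):
    """One term of Eq. 6: photometric loss between the left image and the
    right image warped with the predicted depth."""
    ssim_term = (1.0 - ssim(x_left, x_warped)) / 2.0
    l1_term = (x_left - x_warped).abs()
    return (eta * ssim_term + mu * l1_term).mean()
```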

To suppress local jumps of depth in target edge regions of the depth map, we establish an edge-aware depth smoothness loss [1, 15]:

$$\begin{aligned} \begin{aligned} L_{\text {smooth}_{\text {t }}}\left( F_t\right)&={\mathbb {E}}_{x_t \sim X_t}\left[ \vert \partial _x F_t \left( x_t\right) \vert e^{-\vert \partial _x x_t\vert }\right] \\&+{\mathbb {E}}_{x_t \sim X_t}\left[ \vert \partial _y F_t\left( x_t\right) \vert e^{-\vert \partial _y x_t\vert }\right] \\ L_{\text {smooth}_{\text {t2s }}}\left( F_s,G_{t2s}\right)&={\mathbb {E}}_{x_t \sim X_t}\left[ \vert \partial _x F_s \left( G_{t 2 s}\left( x_t\right) \right) \vert e^{-\vert \partial _x x_t\vert }\right] \\&+{\mathbb {E}}_{x_t \sim X_t}\left[ \vert \partial _y F_s\left( G_{t 2 s}\left( x_t\right) \right) \vert e^{-\vert \partial _y x_t\vert }\right] \\ L_{smooth}\left( F_t, F_s, G_{t 2 s}\right)&=L_{\text {smooth}_{\text {t }}}\left( F_t\right) +L_{\text {smooth}_{\text {t2s}}}\left( F_s,G_{t2s}\right) \end{aligned} \end{aligned}$$
(7)
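A sketch of Eq. (7) for one domain, assuming the depth and image tensors share the same spatial resolution; averaging the image gradients over colour channels is an implementation choice, not something specified above.

```python
import torch

def edge_aware_smoothness(depth, image):
    """Edge-aware depth smoothness (sketch of Eq. 7): depth gradients are
    down-weighted where the input image itself has strong gradients."""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    # image gradients, averaged over colour channels
    i_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```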

The depth consistency loss established by estimated depth map of the target domain is as follows:

$$\begin{aligned} L_{d c}\left( F_t, F_s, G_{t 2 s}\right) =\left\| y_{t 2 t}-y_{t 2 s}\right\| _1 \end{aligned}$$
(8)

The depth maps of the same real-scene image predicted in both domains, \(y_{t 2 t}\) and \(y_{t 2 s}\), construct this constraint through the L1 loss function.

Additionally, we incorporate a boundary consistency loss function. Since edges represent obvious high-frequency information, we only require pixel-wise alignment constraints using the L1 loss. Thus, the supervision of the single-domain edge prediction networks and the edge consistency constraints between domains are realized:

$$\begin{aligned} \begin{aligned}&L_{ {tedge }}\left( F_t, U_t\right) =\left\| z-z_{t 2 t}\right\| _1 \\&L_{ {sedge }}\left( F_s, G_{t 2 s}, U_s\right) =\left\| z-z_{t 2 s}\right\| _1 \\&L_{ {edge }}\left( F_t, F_s, G_{t 2 s}, U_t, U_s\right) =\left\| z-z_{t 2 t}\right\| _1+\left\| z-z_{t 2 s}\right\| _1 \end{aligned} \end{aligned}$$
(9)

where \(L_{edge }\) is the overall edge consistency loss, and \(L_{tedge }\) and \(L_{sedge }\) are the edge losses in the respective domains; \(U_t\) and \(U_s\) are the edge detection networks in the target domain and the source domain, respectively. The input image of the target domain generates the edge ground-truth label z through RCF-Net [53], and a loss constraint is established with the output of the edge detection network \(z_{t 2 t}\left( z_{t 2\,s}\right) \) in each of the two domains.
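As a minimal sketch, Eq. (9) reduces to two L1 terms against the same pseudo ground truth z:

```python
import torch.nn.functional as F

def edge_consistency_loss(z_gt, z_t2t, z_t2s):
    """Edge consistency constraint (sketch of Eq. 9): the edge maps predicted
    in both domains are aligned to the same RCF-Net pseudo ground truth z."""
    return F.l1_loss(z_t2t, z_gt) + F.l1_loss(z_t2s, z_gt)
```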

Finally, integrating all the above loss functions, we can get:

$$\begin{aligned} L\left( F_t, F_s, G_{t 2 s}, G_{s 2 t}, D_t, D_s, U_t, U_s\right)&=L_{{style }}\left( G_{t 2 s}, G_{s 2 t}, D_t, D_s\right) +\gamma _1 L_{task}\left( F_t, F_s, G_{s 2 t}\right) \nonumber \\&+\gamma _2 L_{g c}\left( F_t, F_s, G_{t 2 s}\right) +\gamma _3 L_{d c}\left( F_t, F_s, G_{t 2 s}\right) \nonumber \\&+\gamma _4 L_{smooth}\left( F_t, F_s, G_{t 2 s}\right) \nonumber \\&+\gamma _5 L_{{edge }}\left( F_t, F_s, G_{t 2 s}, U_t, U_s\right) \end{aligned}$$
(10)

where \(\gamma _n(n \in \{1,2,3,4,5\})\) are trade-off factors.
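A minimal sketch of how Eq. (10) combines the individual terms; the dictionary layout is an implementation convenience, and the default weights shown are the second-stage values reported in Sect. 4.3.

```python
def total_loss(losses, gammas=(50.0, 50.0, 50.0, 0.5, 100.0)):
    """Weighted sum of Eq. 10. `losses` holds the already-computed style,
    task, geometric-consistency, depth-consistency, smoothness and edge
    terms as scalar tensors."""
    g1, g2, g3, g4, g5 = gammas
    return (losses["style"] + g1 * losses["task"] + g2 * losses["gc"]
            + g3 * losses["dc"] + g4 * losses["smooth"] + g5 * losses["edge"])
```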

4 Experiments

4.1 Network Architecture

We construct a style transfer network based on CycleGAN [14], utilize EfficientNet-B5 [48] pre-trained on ImageNet [55] as the encoder in the depth estimation network (i.e.,  SA-F\(_{\textrm{S}}\) and SA-F\(_\textrm{T}\)), and follow the decoder in T\(^2\)net [12]. A self-attention mechanism module is inserted after the output of the last layer of EfficientNet-B5 [48]. At the end of the depth estimation network, a basic auxiliary U-net [56] (i.e.,  Unet-s and Unet-t) is designed to perform the edge prediction task. To obtain the edge ground truth of the target domain image, we use RCF-NET [53], which directly loads the open-source pre-training weights based on BSDS500 [47] and is not involved in network training.

4.2 Datasets

Our source domain dataset is the standard vKITTI dataset [7], which consists of 21,260 image-depth pairs with depth labels extracted from 50 synthetic videos of size 375\(\times \)1242. The target domain dataset is the KITTI dataset [31], which contains 42,382 stereo image pairs of size 375\(\times \)1242 with corresponding depth label information collected from real-world scenes. In our experiment, these labels are only used to evaluate the accuracy of the algorithm on the test dataset. We select 32 scenes from the KITTI dataset [31], using 22,600 images for training and 888 images for validation. We use 697 pictures selected from other 29 scenes to test the model performance. The 200 training images from the KITTI stereo 2015 [57] dataset are also used for further result testing, and 134 test images from the Make3D [32] dataset are used to evaluate the model’s generalization performance.

4.3 Training Details

Using PyTorch, our network is trained on a single NVIDIA GeForce RTX 3090 with 22GB memory in two phases. The first phase involves warm-up training to give the single-domain networks basic depth prediction and edge detection capabilities. We introduce the pre-trained weights of CycleGAN [14] and fix the weight parameters of this part. Then, we train SA-F\(_\textrm{T}\) and Unet-t on \(\left\{ X_t, G_{s 2 t}\left( x_s\right) \right\} \), and SA-F\(_{\textrm{S}}\) and Unet-s on \(\left\{ X_s, G_{t 2 s}\left( x_t\right) \right\} \), for 20 epochs, setting the momentum terms to \(\beta _1\)=0.9 and \(\beta _2\)=0.999 and the initial learning rate to \(\alpha \) = 0.0001 using the ADAM solver [58]. Hyperparameters \(\gamma _1\) = 1.0, \(\gamma _2\) = 1.0, \(\gamma _4\) = 0.01 and \(\gamma _5\) = 1.0 are set in this step.

In the second step, the same training mode as GASDA [15] is used to train the global network. The network models trained in the warm-up stage (SA-F\(_{\textrm{T}}\), Unet-t, SA-F\(_{\textrm{S}}\), Unet-s) are frozen, and \(G_{s 2 t}\) and \(G_{t 2 s}\) in the style transfer network are trained for m batches. Then, we freeze them and train the remaining networks for n batches. We set m = 3, n = 7, and repeat the whole process for about 55 batches until the network converges. In this step, we set the hyperparameters to \(\beta _1\)=0.9 and \(\beta _2\)=0.999, and set \(\alpha \) = 0.000002 in the first part of training and \(\alpha \) = 0.00001 in the latter part. The trade-off factors are set to \(\lambda _1\) = 10, \(\lambda _2\) = 30, \(\gamma _1\) = 50, \(\gamma _2\) = 50, \(\gamma _3\) = 50, \(\gamma _4\) = 0.5, \(\gamma _5\) = 100. The original KITTI image data is downscaled to a size of 640 \(\times \) 192, and the same data augmentation strategy as GASDA [15] is employed throughout the training phase.
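The alternating schedule described above can be sketched as follows; the helper names and the per-batch granularity are illustrative assumptions.

```python
def set_requires_grad(module, flag):
    # enable or disable gradient updates for all parameters of a module
    for p in module.parameters():
        p.requires_grad = flag

def alternate_training_step(batch_idx, generators, task_nets, m=3, n=7):
    """Alternating schedule of Sect. 4.3 (sketch): within each cycle of
    m + n batches, first the two style-transfer generators are updated while
    the task networks stay frozen, then the roles are swapped."""
    train_generators = (batch_idx % (m + n)) < m
    for g in generators:
        set_requires_grad(g, train_generators)
    for t in task_nets:
        set_requires_grad(t, not train_generators)
```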

4.4 Ablation Study

In order to effectively demonstrate the impact of the various improvements in our model on the UDA depth estimation algorithm's performance, we conduct several ablation experiments on the KITTI dataset [31], isolating individual components. Building upon the GASDA [15] baseline model, we design the following models: (1) a baseline model that only incorporates the self-attention mechanism; (2) a baseline model that only integrates the edge constraints; (3) a complete model with edge constraints supervised by an L2 loss; and (4) a complete model with edge constraints supervised by an L1 loss. To keep other factors such as training methods and hyperparameters consistent, we train these four models separately and obtain the results below.

Table 1 Ablation study on different variants of our models on KITTI

Table 1 shows that the baseline model performs the worst. Introducing the self-attention mechanism or the edge consistency constraint alone does not significantly improve performance. However, when we introduce both improvements simultaneously and use the complete improved model, depth estimation performance is greatly enhanced. These observations suggest that the attention mechanism module enables the model to achieve pixel-level contextual relevance, while the edge constraint on its own helps optimize depth map contours but provides limited improvement. By combining the advantages of the attention mechanism and the edge constraints, the two reinforce each other, allowing the model to focus more on high-frequency edge information that is ignored by traditional convolution structures and to strengthen the cross-domain association. For the loss function in the key edge-constraint improvement, we additionally conduct an L2 loss ablation experiment. The results show that the final model accuracy with the L2 loss is lower than with the L1 loss. Studies [6, 15] have shown that the L1 loss handles abnormal outliers well because its penalty for outliers grows only linearly, while the L2 loss performs better when handling small errors. The purpose of our additional edge constraint supervision is to solve the problems of poor edge quality and loss of local depth information in the predicted depth map; such problems are caused by abnormally large errors in some areas, so the L1 loss gives better results in this case. Ultimately, the entire network far exceeds the baseline model on multiple metrics, achieving significant improvements and outperforming current state-of-the-art techniques in some aspects.

4.5 Comparison with State-of-the-Art Methods

In Table 2, we use the Eigen split [2] as the evaluation protocol to compare with other existing algorithms; the test split consists of 697 images from a total of 29 traffic scenes. We take the range where the ground-truth depth is less than 80 m as the evaluation area.

Our algorithm is based on the GASDA algorithm and achieves a convincing improvement over both classical algorithms and the GASDA [15] algorithm. Compared with state-of-the-art algorithms, the DESC algorithm proposed by Mikolajczyk et al. [36] requires additional semantic segmentation of the source domain and ground-truth labels of edge images to effectively improve performance. This process is complex and resource-intensive. In contrast, we surpass it on several metrics while using only simpler and less prior information. The following five metrics are used to measure the performance of the algorithm:

Abs Rel: \(\frac{1}{T} \sum _{k \in T} \frac{\left\| d_k-d_k^{g t}\right\| }{d_k^{g t}}\)       Sq Rel: \(\frac{1}{T} \sum _{k \in T} \frac{\left\| d_k-d_k^{g t}\right\| ^2}{d_k^{g t}}\)

RMSE: \(\sqrt{\frac{1}{T} \sum _{k \in T}\left\| d_k-d_k^{g t}\right\| ^2}\)       RMSE Log: \(\sqrt{\frac{1}{T} \sum _{k \in T}\left\| \log d_k-\log d_k^{g t}\right\| ^2}\)

Threshold: percentage of \(d_k\) such that \(\delta =\max \left( \frac{d_k}{d_k^{g t}}, \frac{d_k^{g t}}{d_k}\right) < thr\)

Here, T represents the total number of pixels across all test images; \(d_k\) and \(d_k^{g t}\) represent the predicted depth and ground-truth depth, respectively, for the k-th pixel.
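For reference, a sketch of how these metrics are typically computed on flattened prediction and ground-truth arrays, with the 80 m cap from the evaluation protocol applied as a mask; the 1.25, 1.25\(^2\), 1.25\(^3\) thresholds for \(\delta \) are the standard Eigen-split settings.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=80.0):
    """Standard evaluation metrics on flat depth arrays (sketch).

    Pixels without valid ground truth or beyond max_depth are masked out;
    depths are assumed to be strictly positive for the log terms.
    """
    mask = (gt > 0) & (gt < max_depth)
    pred, gt = pred[mask], gt[mask]
    err = pred - gt
    abs_rel = np.mean(np.abs(err) / gt)
    sq_rel = np.mean(err ** 2 / gt)
    rmse = np.sqrt(np.mean(err ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)          # threshold accuracy, thr = 1.25
    delta2 = np.mean(ratio < 1.25 ** 2)
    delta3 = np.mean(ratio < 1.25 ** 3)
    return abs_rel, sq_rel, rmse, rmse_log, delta1, delta2, delta3
```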

Table 2 Quantitative results of our approach and comparison with other existing method on KITTI
Table 3 The test results on the KITTI 2015 stereo 200 training set [57]

The KITTI 2015 stereo 200 training set [57] provides denser and more accurate ground-truth labels than the laser depth annotations in KITTI. As seen from the first two experiments, our algorithm demonstrates excellent performance in terms of depth structure integrity, and dense annotations on instances such as cars and buildings better highlight its strengths. Therefore, in the evaluation results presented in Table 3, the accuracy gap between our algorithm and GASDA [15] is larger than that in Table 1; compared with the semi-supervised DESC algorithm, which uses complex semantic segmentation, edge contours, and height pseudo-labels, our algorithm also shows advantages on several metrics.

Table 4 Generalization performance test results on the Make3D [32] dataset

In Table 4, we conduct a model generalization evaluation on the Make3D [32] dataset, which includes 134 test images. Despite the large domain gap between the datasets, good algorithmic performance is still achieved. Although we do not train on this dataset, our performance is superior to that of some methods trained on it, indicating the strong generalization capability of our model. However, as our philosophy emphasizes bridging the gap between the target and source domains, the lack of training on this data results in weaker performance on some metrics compared to classical algorithms.

As shown in Fig. 4, we compare our predicted depth maps with those of several classical algorithms. To address the issue of unmeasurable depth in images containing sky, we occlude the corresponding area at the top using the same processing method as T\(^2\)net [12] and GASDA [15]. The yellow boxes show enlarged views of the details in the red boxes. Note that the details of some image samples belonging to the T\(^2\)net [12] algorithm are not included in the comparison due to partial occlusion of the sky. Our results demonstrate that our algorithm outperforms the first two algorithms in producing clearer and more complete representations of the buildings in the red boxes in the first and second rows of depth maps. Specifically, the parts of the large vehicles indicated by the red boxes exhibit obvious large-area depth loss in the previous algorithms, while in our algorithm these regions are filled in. Moreover, in the last five rows, the leaves, glass, and brightly illuminated areas in the red boxes show a significant number of abnormal depth holes in the results of past algorithms. By contrast, our algorithm solves this problem well, resulting in better quality depth maps.

Fig. 4

Qualitative depth results on different road scene samples from the KITTI dataset [31], Eigen split [2]. From left to right: a input image, b results of the T\(^2\)net algorithm [12], c results of the GASDA algorithm [15], d results of our algorithm

5 Conclusion

In this paper, we present a method to enhance the performance of UDA depth estimation by utilizing an attention mechanism and edge consistency constraints. Specifically, we introduce edge-guided self-attention mechanisms into the depth estimation networks of both the source and target domains. Moreover, we establish consistency constraints between the ground-truth edge contours of the target domain samples and the edge prediction outcomes of the two domains. By employing these techniques, we minimize the gap between the source and target domains, forcing the model to pay more attention to high-frequency edge information while suppressing geometric distortions in the depth prediction process. This solves the problem of incomplete depth perception of objects such as vehicles and buildings in depth maps, as well as the frequent occurrence of depth holes, improving the accuracy and completeness of depth maps. Our experimental results demonstrate that our proposed approach outperforms existing state-of-the-art methods on the KITTI and KITTI stereo 2015 datasets. Additionally, ablation studies validate the effectiveness of each component of our approach. Our model also exhibits strong generalization performance on the Make3D dataset, further validating its efficacy. In future work, we would like to extend the idea of combining attention mechanisms with edge information to unsupervised multi-task domain adaptation.