DSC-MVSNet: attention aware cost volume regularization based on depthwise separable convolution for multi-view stereo

Deep learning has recently been shown to deliver excellent performance in multi-view stereo (MVS). However, deep learning-based MVS approaches struggle to balance efficiency and effectiveness. To this end, we propose DSC-MVSNet, a novel coarse-to-fine, end-to-end framework for more efficient and more accurate depth estimation in MVS. In particular, we propose an attention-aware 3D UNet-shaped network, which is the first to use depthwise separable convolutions for cost volume regularization. This mechanism enables effective aggregation of information and significantly reduces the model parameters and computation by decomposing the ordinary convolution on the cost volume into a depthwise convolution and a pointwise convolution. In addition, a 3D-Attention module is proposed to alleviate the feature mismatching problem in cost volume regularization and to aggregate the important information of the cost volume along three dimensions (i.e. channel, space, and depth). Moreover, we propose an efficient Feature Transfer Module to upsample the low-resolution (LR) depth map to a high-resolution (HR) depth map to achieve higher accuracy. With extensive experiments on two benchmark datasets, i.e. DTU and Tanks & Temples, we demonstrate that the parameters of our model are significantly reduced to 25% of those of the state-of-the-art model MVSNet. Moreover, our method outperforms or is on par with the state-of-the-art models in accuracy. Our source code is available at https://github.com/zs670980918/DSC-MVSNet.


Introduction
Multi-view stereo (MVS) has been extensively studied and widely applied in augmented reality and 3D reconstruction [1][2][3][4][5][6]. The goal of MVS is to reconstruct 3D scenes from a series of camera-calibrated 2D images by establishing dense correspondences, which can be formulated as an optimization problem. Optimization methods such as Markov discrete optimization [7] and spatial patch diffusion [8] have therefore been applied to solve this problem. However, these methods may produce incomplete surfaces in scenes with weak textures or non-Lambertian surfaces [1,9].
With the development of deep learning in recent years, Yao et al. [10] showed promising results by performing MVS with a cost volume regularization process and using deep learning to solve the optimization problem. The cost volume is composed of the matching confidence between features at different depths, and the regularization process optimizes the cost volume to obtain the depth probability distribution, which is then used to obtain the depth map. Note that regularization here is not the widely used strategy for avoiding overfitting in machine learning, but a term denoting the optimization of the cost volume in the MVS domain. Because the accuracy of the regularized cost volume directly determines the quality of the final depth map obtained from the regression, improving the cost volume regularization network has been the focus of later studies, such as P-MVSNet [11] and Cascade-MVSNet [12]. These methods achieve better reconstruction results by fully exploiting the image information in multiple dimensions, but they still suffer from low efficiency. Therefore, efficient frameworks such as Fast-MVSNet [13] and UCS-Net [14] were proposed. They introduce lightweight strategies to reduce computation (e.g. Sparse High-Resolution Representation, Adaptive Thin Volume). Nevertheless, all of the aforementioned methods inevitably incur a high computation cost because of their 3D cost volume regularization structure. Other methods such as R-MVSNet [15] and D²HC-RMVSNet [16] divide the original 3D cost volume into smaller or lower-dimensional pieces, i.e. channel slice-based cost maps or depth slice-based cost maps. RNN [17] and LSTM [18] are then used to establish the connections between these cost maps. These methods effectively reduce the computation cost by converting the cost volume regularization into the regularization of a group of cost maps. However, they cannot incorporate as much context information over the cost volume as a 3D-UNet, and their effectiveness needs further improvement.
Therefore, our main research problem is how to significantly reduce the computation while maintaining effectiveness. First, 2D depthwise separable convolution has become a standard module in mainstream vision tasks for improving efficiency [19][20][21]. It decomposes the ordinary convolution into a depthwise convolution on channel-independent feature maps and a pointwise convolution that establishes the relations between these feature maps, which resembles the RNN-like mechanism used in MVS. Because it fully considers both the information and the relations of the feature maps, 2D depthwise separable convolution can achieve performance similar to that of the ordinary convolution at a much lower computation cost. However, in the MVS domain [8,10,22-25], 3D depthwise separable convolution has not been explored, and its use needs to be studied in combination with the specific application scenario. Inspired by these observations, we apply depthwise separable convolution to regularization to construct our 3D UNet-shaped network, which extends the 2D depthwise separable convolution to a 3D task. Second, features at different positions may easily be mismatched in cost volume regularization because they are visually similar, which causes similar confidence at the same position at different depths. This feature mismatching problem seriously affects the quality of the depth map obtained by depth regression. Attention is a practical mechanism for addressing this problem, but conventional attention, which only performs convolution on the whole 3D volume without a dimension-wise process, does not effectively consider the mutual information between different dimensions, e.g. depth and space. Thus, we propose a 3D-Attention module to aggregate the more important multi-dimensional matching information (channel, space, and depth) of the cost volume and alleviate the feature mismatching problem. Third, the quality of the depth map directly affects the final reconstruction. Therefore, to achieve better performance, we propose a feature transfer module to upsample the low-resolution (LR) depth map to a high-resolution (HR) depth map. In addition, the feature extraction module obtains multi-level feature information by simultaneously incorporating low-level and high-level information learned by the CNN, which enables accurate localization of 3D points. We term our effective and efficient coarse-to-fine framework DSC-MVSNet.
The remainder of this paper is organized as follows. In Sect. "Related work", we introduce related work, followed by an overview of our method in Sect. "Methodology". Section "Experiments" presents our experimental results on two challenging datasets. Section "Limitation analysis" discusses the limitations of our proposed method, followed by some concluding remarks in Sect. "Conclusion".
In summary, our main contributions are as follows:
• We propose a 3D UNet-shaped network that is the first to use depthwise separable convolution for 3D cost volume regularization, which effectively improves the model efficiency while maintaining performance.
• We propose a 3D-Attention module that enhances cost volume regularization by fully aggregating the valuable information of the cost volume and alleviating the problem of feature mismatching.
• We propose an effective and efficient feature transfer module that upsamples the LR depth map into an HR depth map to achieve higher-quality reconstruction.
• With extensive experiments on two benchmarks, our method demonstrates reconstruction results comparable to or better than those of the state-of-the-art methods at a much lower computation cost. For instance, compared to the state-of-the-art method MVSNet, our model reduces the memory by 49% while improving the accuracy by 20%.

Traditional MVS reconstruction
MVS is achieved by creating dense correspondences from multiple images with calibrated camera poses, which can be considered an optimization problem [24]. Many optimization methods have been proposed. According to how the scene is represented, these methods can be divided into three categories: voxel-based [7,26-28], patch-based [8,24,29,30], and depth map-based [10,13,15,22,23,25]. For example, Markov discrete optimization [7] updates the state of chain voxels under constraints including luminosity consistency, smoothness, and visibility. Spatial patch diffusion [8] considers each pixel in space as a patch and optimally expands the number of patches. Non-linear optimization [24] optimizes the depth and normal vector of given seed points using stochastic gradient descent and least squares to estimate the depth map of the image. The depth map-based methods use the depth map as an intermediate representation, which decouples the complex dense reconstruction problem into multiple simple subproblems and enables more flexible scene reconstruction. Many recent deep learning-based MVS methods [10,15,31] are also based on the depth map.

Deep learning-based MVS
Recently, to overcome the shortcomings of traditional MVS methods, many deep learning-based methods [10-13, 15, 31, 32] have been introduced. For example, MVSNet [10] proposes an end-to-end MVS framework that extracts features from multiple views with CNNs to construct the matching cost volume. It then uses 3D CNNs to regularize the cost volume and obtain the final depth map estimate. P-MVSNet [11] proposes a hybrid 3D U-Net to infer a probability volume from the cost volume and estimate the depth maps. These methods achieve good results because they fully consider the multi-dimensional information of the images, but they are not efficient enough. To improve efficiency, Fast-MVSNet [13], which builds on Point-MVSNet [32], uses a sparse high-resolution depth map representation and several efficient modules. However, using 3D CNNs to regularize the cost volume inevitably results in a high computation cost. Thus, some methods slice the cost volume into cost maps and achieve higher efficiency by converting the cost volume regularization into the regularization of a group of cost maps. For example, R-MVSNet [15] uses convolutional GRUs instead of 3D CNNs to regularize the 2D cost maps. D²HC-RMVSNet [16] slices the cost volume into cost maps along the depth direction and uses a hybrid architecture, DHU-LSTM, which combines the merits of LSTM [18] and U-Net to reduce the computation cost. However, structures such as RNN [17] or LSTM [18] inherently suffer from a forgetting problem. They cannot fully consider the correlation between the cost maps and do not aggregate the multi-dimensional information of the cost volume well. With the further aim of improving the quality of the final reconstructed point cloud, DeepFusion [31] proposes a novel fusion strategy that accurately fuses all depth maps into a high-quality point cloud by balancing the geometric consistency and the predicted confidence.

Depthwise separable convolutions
The depthwise separable convolution is a useful lightweight strategy for building light and efficient networks. It was first proposed by Laurent Sifre and applied to an AlexNet for image classification [19,33], where it achieves performance similar to that of ordinary convolution at a lower computation cost. A similar idea was then widely applied in other frameworks for object detection [34,35] and semantic segmentation [36,37], such as MobileNetV1 [20] and MobileNetV2 [21]. The depthwise separable convolution decomposes an ordinary convolution into a depthwise convolution and a pointwise convolution. It computes each feature map independently with a channel-independent depthwise convolution and then uses a pointwise convolution to correlate the channels of the feature maps and obtain the final feature map. This mechanism reduces the computation cost while achieving performance similar to that of the ordinary convolution. It is very similar to the strategies used in the above light MVS methods [15,16], which slice the cost volume into cost maps and use networks such as RNN [17] and LSTM [18] to correlate the maps.

Attention mechanism
It is well known that the attention mechanism plays an important role in deep learning. Beyond natural language processing [38], the attention mechanism has been widely explored in many visual problems including scene segmentation [39][40][41], panoptic segmentation [42], and image classification [43]. As research has progressed, attention mechanisms incorporating convolution operations have been proposed. SE Block [44] adds a residual connection between different convolutions that assigns weights to different channels. CBAM [45] adds a spatial attention block on top of SE Block [44] to achieve fine-grained allocation and processing of spatial information. However, these attention mechanisms only focus on the channel and spatial information, whereas a 3D cost volume also contains depth information. Because the values of the cost volume indicate the similarity between features, similar features may produce similar confidences at different depths in the same spatial location of the same channel. The above attention mechanisms alone cannot focus on the more important depth information of the cost volume. Therefore, we propose a depth attention mechanism combined with the original attention mechanisms, so that the regularization network can better optimize the matching information of the cost volume, which allows us to obtain better depth maps and thus higher-quality point cloud reconstruction results.

Methodology
Our proposed DSC-MVSNet framework is a coarse-to-fine and end-to-end framework for estimating a target depth map Dr of the reference image of size H × W × 3. We achieve this task with four subprocesses: Feature Extraction, Cost Volume Regularization, Depth Map Upsampling, and Depth Map Refinement. The overall architecture of DSC-MVSNet is shown in Fig. 1.
In the cost volume regularization, we propose a DSC-Attention 3D UNet network based on depthwise separable convolution to significantly reduce the time and memory consumption while maintaining the performance. Moreover, to obtain a high-quality depth map, we also propose a feature transfer module to upsample the LR depth map.

3D depthwise separable convolution (3D-DSC)
Inspired by the mechanism of 2D depthwise separable convolution, we decrease the computation of 3D cost volume regularization by proposing 3D-DSC to replace ordinary 3D CNNs. Because this is a 3D task, the 3D convolution can be divided in several ways. However, the cost volume is constructed from the matching similarities between feature points at different spatial positions in different views at different depths. Thus, we divide the 3D CNN into a 3D depthwise convolution (depthwise refers to the depth dimension; it performs cost aggregation for the cost volume information along the depth dimension) and a 3D pointwise convolution (pointwise refers to the space dimension; it performs cost aggregation for the cost volume information along the spatial dimension), which is consistent with the form of the cost volume. The schematic of 3D-DSC is shown in the lower left part of Fig. 1.
(1) 3D depthwise convolution. The 3D depthwise convolution is performed over the cost volume in each channel independently to obtain the channel-independent intermediate feature maps, as defined in Eq. (1), where W_1 represents the weight of the 3D depthwise convolution, V ∈ R^{C×D×H×W} represents the cost volume, i, j, u represent the position index, K, L, M denote the kernel size of the convolution, and ⊙ denotes the element-wise product.
(2) 3D pointwise convolution. The 3D pointwise convolution acts on these channel-independent feature maps to aggregate the channel-wise information, as defined in Eq. (2), where W_2 represents the weight of the 3D pointwise convolution, the intermediate feature maps lie in R^{C×D×H×W}, and N denotes the kernel size of the convolution.
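The two steps above can be sketched in plain Python on nested lists. This is a minimal illustration under stated assumptions, not the authors' implementation: zero padding, stride 1, and a 1×1×1 pointwise kernel are assumed, and the helper names (`full`, `depthwise_conv3d`, `pointwise_conv3d`, `dsc_conv3d`) are hypothetical.

```python
def full(shape, value):
    """Build a nested list of the given shape filled with `value`."""
    if len(shape) == 1:
        return [value] * shape[0]
    return [full(shape[1:], value) for _ in range(shape[0])]


def depthwise_conv3d(volume, kernels):
    """Filter each channel with its own 3D kernel (assumed: zero padding, stride 1).

    volume is indexed [c][d][h][w]; kernels is indexed [c][k][l][m].
    """
    C, D, H, W = len(volume), len(volume[0]), len(volume[0][0]), len(volume[0][0][0])
    K, L, M = len(kernels[0]), len(kernels[0][0]), len(kernels[0][0][0])
    out = full((C, D, H, W), 0.0)
    for c in range(C):  # channels never mix in the depthwise step
        for i in range(D):
            for j in range(H):
                for u in range(W):
                    acc = 0.0
                    for k in range(K):
                        for l in range(L):
                            for m in range(M):
                                d, h, w = i + k - K // 2, j + l - L // 2, u + m - M // 2
                                if 0 <= d < D and 0 <= h < H and 0 <= w < W:
                                    acc += kernels[c][k][l][m] * volume[c][d][h][w]
                    out[c][i][j][u] = acc
    return out


def pointwise_conv3d(volume, weights):
    """Mix channels at every voxel (assumed 1x1x1 kernel); weights is [c_out][c_in]."""
    C, D, H, W = len(volume), len(volume[0]), len(volume[0][0]), len(volume[0][0][0])
    out = full((len(weights), D, H, W), 0.0)
    for o, row in enumerate(weights):
        for i in range(D):
            for j in range(H):
                for u in range(W):
                    out[o][i][j][u] = sum(row[c] * volume[c][i][j][u] for c in range(C))
    return out


def dsc_conv3d(volume, kernels, weights):
    """Depthwise step followed by pointwise step."""
    return pointwise_conv3d(depthwise_conv3d(volume, kernels), weights)


V = full((2, 2, 2, 2), 1.0)                  # toy cost volume: C=2, D=H=W=2
W1 = full((2, 3, 3, 3), 1.0)                 # one 3x3x3 kernel per channel
W2 = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.0]]    # 3 output channels from 2 input channels
out = dsc_conv3d(V, W1, W2)
```

In PyTorch, the same structure is usually expressed as an `nn.Conv3d` whose `groups` equals the number of input channels, followed by an `nn.Conv3d` with kernel size 1.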
The two convolutions are performed sequentially to form a complete convolution, and the combined formulation is defined as Eq. (3). Here we compare our 3D-DSC regularization scheme with other mainstream regularization schemes theoretically to demonstrate the effectiveness of our scheme. We display the four regularization schemes in Fig. 2: (a) Spatial Regularization (SR) [46] is a cost aggregation method that filters the cost volume at different depths. However, its small receptive field strongly affects its regularization results; (b) 3D CNN Regularization (3D-CNN) [10] is a CNN-based method that uses 3D CNNs to obtain a larger receptive field for cost volume regularization, but at a much higher computation cost; (c) Recurrent Regularization [15] is an RNN-based method that uses sequential processing to divide the cost volume into depth-independent cost maps to reduce the computation cost; (d) our 3D-DSC Regularization is a DSC-based method: we split the cost volume into intermediate feature maps and then apply a pointwise convolution to establish the relations between these intermediate feature maps and maintain the performance of the model. Our method obtains a larger receptive field than SR. While 3D CNN regularization can obtain better performance, it also incurs a higher computational cost, whereas our scheme achieves similar performance at a lower cost. Moreover, the recurrent regularization scheme and our scheme are different but related ideas: both split the cost volume into intermediate feature maps to reduce the computation cost. Therefore, we conclude that adopting 3D-DSC as our regularization scheme is both feasible and effective.
Fig. 1 The architecture of the DSC-MVSNet. In the first part, we use an informative feature extraction network to extract features to build the coarse cost volume. In the second part, we use our DSC-Attention 3D UNet to regularize the cost volume. In the third part, we use the FTM to upsample the LR depth map. In the fourth part, we use the Gauss-Newton layer [13] to further refine the depth map. The two bottom parts are used for cost volume regularization. The lower left part is the schematic of our 3D depthwise separable convolution. The lower right part is the schematic of our 3D-Attention module.
Fig. 2 Illustration of different regularization schemes. We denote the receptive field of voxels in cyan during the regularization. Horizontal is the depth dimension and vertical is the channel dimension. H and W denote the height and width, respectively. In this figure, we merge H and W into one dimension.
Then we compare the efficiency of our 3D-DSC and 3D-CNN. Assuming that the input cost volume is V ∈ R^{C×D×H×W}, the target cost volume lies in R^{Ĉ×D×H×W}, and the convolution kernel size is K, the computation costs of the ordinary 3D convolution and our proposed 3D depthwise separable convolution are shown in Table 1. The results show that the computation cost of the ordinary 3D convolution is (K^3 × Ĉ)/(K^3 + Ĉ) times that of the 3D depthwise separable convolution. For instance, when K = 3 and Ĉ = 32, the computation cost of our 3D-DSC is around 1/14 of that of 3D-CNN. Thus, our 3D-DSC regularization scheme is more efficient than 3D-CNN-based models. In summary, we have analyzed effectiveness and efficiency separately, which demonstrates the feasibility of 3D-DSC as a regularization scheme.
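The stated ratio can be checked with a short calculation. The sketch below assumes the usual multiply-accumulate counts for an ordinary 3D convolution and for a depthwise step followed by a 1×1×1 pointwise step; the input channel count and the volume size (C = 8, D = 48, H = 128, W = 160) are hypothetical values chosen only for illustration, and both cancel out of the ratio.

```python
def conv3d_cost(c_in, c_out, k, d, h, w):
    """Multiply-accumulate count of an ordinary 3D convolution (stride 1, same size)."""
    return k ** 3 * c_in * c_out * d * h * w


def dsc3d_cost(c_in, c_out, k, d, h, w):
    """Multiply-accumulate count of a depthwise step plus a 1x1x1 pointwise step."""
    depthwise = k ** 3 * c_in * d * h * w
    pointwise = c_in * c_out * d * h * w
    return depthwise + pointwise


C, C_HAT, K = 8, 32, 3        # C is a hypothetical input channel count
D, H, W = 48, 128, 160        # hypothetical cost volume size

ratio = conv3d_cost(C, C_HAT, K, D, H, W) / dsc3d_cost(C, C_HAT, K, D, H, W)
closed_form = (K ** 3 * C_HAT) / (K ** 3 + C_HAT)
print(round(ratio, 2), round(closed_form, 2))  # 14.64 14.64
```

With K = 3 and Ĉ = 32 the ratio is 864/59 ≈ 14.6, in line with the "around 1/14" figure quoted above.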

3D-attention module (3DA)
Although the cost volume information can be effectively aggregated by the 3D-DSC, a feature mismatching problem still affects the cost volume quality. The feature mismatching problem occurs when features from different key points are mistakenly matched, which causes similar confidences at different depths of the cost volume and finally results in inaccurate depth estimation. Specifically, as shown in Fig. 3, a reference feature matches two similar source features at different depths (the two hands of the Buddha statue), and the confidences at the different depths are similar in the cost volume. These similar confidences affect the quality of the depth map regressed by Eq. (8).
Since attention mechanisms can highlight important information by calculating different weights, we use an attention mechanism to address the feature mismatching problem. We propose a 3D-Attention module, which alleviates this problem by computing an attention weight from the information of the whole cost volume to enhance or weaken similar confidences at different depths. The schematic of the module is depicted in the lower right part of Fig. 1, and it consists of two blocks.
(1) Channel attention block. A channel attention block performs attention on channel-wise information. It is built from a multi-layer perceptron (MLP) that acts on the channel dimension of the cost volume V ∈ R^{C×D×H×W} to obtain the channel attention enhancement weights Ŵ. We multiply the channel weights Ŵ with the cost volume V to obtain the channel-refined cost volume, which also lies in R^{C×D×H×W}. The channel attention block is defined as Eq. (4), where MaxPool is max pooling, AvgPool is average pooling, Ŵ ∈ R^C denotes the channel attention enhancement weights, and the two pooling branches share the weights of the MLP.
(2) Spatial depth attention block. A spatial depth attention block is proposed to alleviate the problem of similar confidences. Unlike ordinary attention, which uses full perception (without distinguishing between space and depth), the spatial depth attention block perceives cost information according to the composition of the cost volume in two different dimensions, i.e. space and depth, respectively. First, we use a spatial-oriented anisotropic [11] convolution with a kernel size of 1 × 7 × 7 (different positions at the same depth) to filter the cost volume along the spatial direction, reducing noise while maintaining useful matching information at the same depth. This provides more accurate spatial information for the subsequent depth-oriented convolution. Then a depth-oriented anisotropic convolution with a kernel size of 7 × 1 × 1 (different depths at the same position) acts on the depth dimension; it effectively enhances or weakens the matching information at different depths at the same spatial location (illustrated in Fig. 3). Finally, we use an isotropic [11] convolution with a kernel size of 7 × 7 × 7 that acts on multiple dimensions (space and depth) to fully aggregate the information from the above processes. The spatial depth attention block is defined as Eq. (5), where σ is the activation function; W ∈ R^{1×D×H×W} is the spatial depth weight; f^{1×7×7} is the spatial-oriented convolution, f^{7×1×1} is the depth-oriented convolution, and f^{7×7×7} is the overall convolution.
We form the 3D-Attention module by cascading these two blocks. As shown in Fig. 3, the confidence at the correct depth is enhanced by our module. The 3D-Attention module is defined as Eq. (6), where the output is the attention-weighted cost volume in R^{C×D×H×W}.
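As a hedged sketch of how the two blocks and their cascade could be written, consistent with the symbol definitions above and with CBAM-style channel attention [45]: the primes on V, the use of σ in the channel branch, and the way the channel dimension is reduced to 1 inside the convolutions are assumptions made for illustration, not the paper's exact formulation.

```latex
% Channel attention (assumed CBAM-style form; the MLP is shared by both pooling branches)
\hat{W} = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(V)) + \mathrm{MLP}(\mathrm{MaxPool}(V))\big),
\qquad V' = \hat{W} \odot V

% Spatial depth attention (spatial-oriented, then depth-oriented, then isotropic convolution)
W = \sigma\Big(f^{7\times7\times7}\big(f^{7\times1\times1}\big(f^{1\times7\times7}(V')\big)\big)\Big),
\qquad V'' = W \odot V'
```

Here $\hat{W} \in \mathbb{R}^{C}$ is broadcast over depth and space, $W \in \mathbb{R}^{1\times D\times H\times W}$ is broadcast over channels, and $V''$ stands for the attention-weighted cost volume.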
After regularization, we apply a softmax operation (Eq. 7) along the depth direction to normalize all values into [0, 1] and form our probability volume P for depth estimation. Finally, we multiply the depth hypothesis plane values with the probability volume P and sum along the depth direction to obtain the LR depth map Ds, as defined in Eq. (8).
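For a single pixel, the softmax and depth regression steps can be sketched as follows. This is a minimal illustration in the style of the soft argmin used by MVSNet [10]; the function names and the depth hypothesis values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the depth hypotheses of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regress_depth(scores, depth_values):
    """Expected depth under the per-pixel probability distribution."""
    probs = softmax(scores)
    return sum(p * d for p, d in zip(probs, depth_values))


depth_planes = [425.0, 435.0, 445.0, 455.0]   # hypothetical depth hypotheses

# Equal scores give a uniform distribution, so the result is the mean plane depth.
print(regress_depth([0.0, 0.0, 0.0, 0.0], depth_planes))  # 440.0

# A strongly peaked score concentrates the estimate near the second plane.
peaked = regress_depth([0.0, 50.0, 0.0, 0.0], depth_planes)
```

The uniform case shows why mismatched features are harmful: when two depths receive similar confidence, the regressed depth falls between them rather than on either surface.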

Feature transfer module
The high-resolution (HR) depth map obtained by upsampling directly affects the quality of the point cloud results. To obtain a high-resolution and precise depth map, we propose a Feature Transfer Module (FTM) for upsampling the low-resolution (LR) depth map. The third part of Fig. 1 shows the framework of our FTM.
The inputs of FTM are a three-channel reference image and the LR depth map Ds. To unify the scale of the inputs, we first use the bicubic interpolation algorithm [47] to upsample the LR depth map Ds to a larger-scale depth map, and we downsample the reference image into a 16-channel image with a downsample layer. After this unification, we propose a common offset and weight extraction backbone to obtain the offset and the weight of the reference image and of the LR depth map, respectively. This backbone contains a feature extraction network of seven convolutions, an offset convolution, a weight convolution, and a sigmoid layer. The backbone is defined as Eq. (9), where f_FE represents the feature extraction network, f_oc represents the offset convolution, f_wc represents the weight convolution, and sigmoid represents the sigmoid layer.
Then we use the OWC Block to compute the weight for guiding depth map upsampling, where k is a hyperparameter that we set to k = 12. In detail, we multiply the corresponding offsets p_{I0}, p_{Ds} and weights w_{I0}, w_{Ds}, and then pass the results through PixelShuffle to get the target offset q and weight w. We then use the offset to guide feature sampling and multiply the sampled features with the weight to obtain the final result. Finally, we obtain the HR depth map through a residual addition block. This process is defined as Eq. (10), where f_ps represents the PixelShuffle [48] operation of PyTorch, f_gs represents the grid_sample function of PyTorch, D_res represents the depth residual, and ⊙ denotes the element-wise product.

Informative feature extraction network
In the feature extraction process, many previous methods [10,11,13,15,49] only use sequential convolution operations to extract the feature maps from the input images {I_i}_{i=0}^{N}, so the feature maps only contain high-level semantic information. The loss of low-level spatial information affects the quality of the reconstruction results. Thus, we propose an informative feature extraction network that uses skip connections to propagate low-level spatial information and aggregate multi-level feature information. This network has three components (Encoder, Decoder, Adjuster), and the architecture details are provided in Table 2.

Cost volume construction
Following previous methods [12,13,15,32,50], to build the cost volume V, we use the same differentiable homography to warp all feature maps into different fronto-parallel planes of the reference camera and construct N feature volumes {V^f_i}_{i=1}^{N}. Then we adopt the same cost metric [15] to aggregate them into the cost volume V. The cost metric is defined as Eq. (11), where V̄_i is the average volume of all feature volumes.
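The aggregation step can be illustrated with a variance-based metric, which is the form commonly used by MVSNet-style methods [10,15]. The sketch below is an assumption about the metric's form rather than the paper's exact Eq. (11); it treats each feature volume as a flattened list, and the function name is hypothetical.

```python
def variance_cost(feature_volumes):
    """Element-wise variance across N flattened feature volumes.

    feature_volumes is a list of N equally sized lists of floats.
    """
    n = len(feature_volumes)
    size = len(feature_volumes[0])
    # Average volume over all views.
    mean = [sum(v[i] for v in feature_volumes) / n for i in range(size)]
    # Mean squared deviation from the average volume.
    return [sum((v[i] - mean[i]) ** 2 for v in feature_volumes) / n
            for i in range(size)]


# Two views that disagree on the first element and agree on the second.
cost = variance_cost([[1.0, 2.0], [3.0, 2.0]])
print(cost)  # [1.0, 0.0]
```

A low value means the warped features from all views agree at that depth hypothesis, which is what a correct depth should produce; the metric also accepts any number of input views.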

Depth map refinement
The quality of the HR depth map Dd obtained in the previous step is insufficient and needs to be refined. The Gauss-Newton layer is an effective and efficient module for depth map refinement in Fast-MVSNet [13]. Therefore, we use a Gauss-Newton layer to refine the HR depth map Dd and obtain the refined depth map Dr.

Training loss
Following previous methods [10,32], we use the mean absolute error between the predicted depth maps and the ground truth depth map as our training loss, as defined in Eq. (12), where Dd denotes the HR depth map, Dr denotes the refined depth map, D denotes the ground truth depth map, p_valid denotes the set of valid points of the ground truth depth map, and λ balances loss_1(p) and loss_2(p). In the training process, we usually set λ to 1.0.
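A minimal sketch of such a loss on flattened depth maps is given below. It assumes that loss_1 compares the HR depth map with the ground truth, that loss_2 compares the refined depth map with the ground truth, and that the sum is averaged over the valid points; the function name and the sample values are hypothetical.

```python
def depth_loss(d_hr, d_refined, d_gt, valid, lam=1.0):
    """Mean absolute depth error over valid pixels for both predicted maps.

    d_hr, d_refined, d_gt: flattened depth maps (lists of floats).
    valid: list of booleans marking pixels with ground truth depth.
    lam: weight of the refined-depth term (lambda in the text).
    """
    total, count = 0.0, 0
    for dd, dr, gt, ok in zip(d_hr, d_refined, d_gt, valid):
        if ok:
            total += abs(gt - dd) + lam * abs(gt - dr)
            count += 1
    return total / count if count else 0.0


loss = depth_loss(
    d_hr=[1.0, 2.0, 5.0],
    d_refined=[1.5, 2.0, 9.0],
    d_gt=[1.0, 3.0, 0.0],
    valid=[True, True, False],   # the third pixel has no ground truth
)
print(loss)  # 1.25
```

The invalid pixel contributes nothing, so large errors in regions without ground truth do not affect training.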

Experiments
In this section, we first introduce the experimental settings in this paper, then quantitatively and qualitatively demonstrate the performance on the DTU dataset, and finally verify the generalization ability of the proposed work on the TnT dataset.

Dataset
The DTU dataset [51] is a large-scale dataset captured with precise camera poses and lighting conditions using a robot arm in the laboratory. The dataset consists of the images, real point clouds, and camera parameters of 128 scenes under 7 different lighting conditions. Each scene has 49 or 64 images with a resolution of 1600 × 1200 and the corresponding intrinsic and extrinsic camera parameters for training. The dataset provides calibrated images and real point clouds, and Yao et al. [10] divide it into a training set, a validation set, and a test set.
The Tanks & Temples (TnT) dataset [9] is captured with real sensors in outdoor environments, which differs from DTU [51]. These outdoor scenes contain a variety of lighting conditions, reflection conditions, and other outdoor factors that make the TnT dataset more complex than the DTU dataset, which is captured under controlled conditions. The intermediate set used for evaluation contains eight scenes, namely Family, Francis, Horse, Lighthouse, M60, Panther, Playground, and Train.

Implementation details
Training. The proposed DSC-MVSNet is implemented using PyTorch and trained on the DTU training set. The ground truths for evaluation in DTU are represented as real point clouds. The depth maps for training our framework are obtained using the screened Poisson surface reconstruction algorithm (SPSR) [52]. In the training process, the input image resolution is set to 640 × 512, and the number of training views is set to N = 3. The selection of reference images and source images is the same as in MVSNet [10].
Testing. The model obtained in the training process is tested on the DTU test set [51]. We use 5 adjacent images of 1280 × 960 as the input. The number of hypothetical depth planes for testing is set to D = 128. The evaluation on the DTU dataset [51] is performed by converting the output depth maps into a predicted point cloud using the method of Yao et al. [10], and then comparing it with the ground truth point cloud using the official Matlab code.

Evaluation metrics
To obtain comprehensive conclusions, we use three metrics to evaluate the performance and three metrics to evaluate the efficiency of our model. The performance evaluation metrics (Acc, Comp, and Overall) are all defined in DTU [51]. Acc is measured as the distance from the MVS reconstruction to the structured light reference, encapsulating the quality of the reconstructed MVS points. A lower Acc value indicates more accurate positioning of the points in the point clouds. Com-

Comparison of the models performance
We compare our DSC-MVSNet with two groups of state-of-the-art methods: traditional MVS methods, e.g. Camp [55], Furu [8], Tola [25] and Gipuma [23]; and deep learning-based MVS methods, e.g. MVSNet [10], R-MVSNet [15], Fast-MVSNet [13], CVP-MVSNet [53], UCS-Net [14], DeepFusion [31] and PatchmatchNet [59]. Table 3 shows the results on the DTU [51] dataset. We make the following observations. Our method establishes state-of-the-art Overall performance across both groups of methods. For instance, DSC-MVSNet improves Overall by 50.5% (Camp), 55.7% (Furu), 55.1% (Tola), 40.4% (Gipuma), 25.5% (MVSNet), 17.5% (R-MVSNet), 7.0% (Fast-MVSNet), 2.1% (CVP-MVSNet), 19.8% (DeepFusion) and 2.3% (PatchmatchNet). This indicates that our model reconstructs a sufficient number of surfaces and that the spatial locations of the points on these surfaces are accurate enough. On the more challenging Acc metric, our method achieves notable gains over state-of-the-art methods, reaching 0.316. Although Gipuma [23] has the best Acc, its Comp is much higher than that of our method (0.873 vs 0.372). Among deep learning-based methods, our method achieves results comparable to CVP-MVSNet [53] (0.316 vs 0.296). This shows that our network accurately estimates the position of each reconstructed point. Our DSC-MVSNet is comparable to or better than SOTA methods in terms of Comp. Although PatchmatchNet [59] has the lowest Comp, its Acc and Overall are higher than those of our method (0.427 vs 0.316; 0.352 vs 0.344). This indicates that our method can reconstruct enough of the target surfaces to reach a low Comp. Thus, these results demonstrate that our proposed method performs better than or comparably to the majority of state-of-the-art methods.
Figure 4 shows the qualitative comparison (Scan 77 in DTU [51]) between DSC-MVSNet and most state-of-the-art methods (Tola [25], Gipuma [23], Furu [8], Camp [55], MVSNet [10], R-MVSNet [15], P-MVSNet [11]). As the colored boxes (red, yellow, green) in the figure show, DSC-MVSNet reconstructs a more complete point cloud, which is consistent with the Comp value in Table 3. We attribute the improvement in completeness to the 3DA, which alleviates the feature mismatch problem and thereby improves the quality of the depth map.
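The relation between the quoted scores can be checked with a short calculation. The sketch below uses only numbers stated in the text (Acc 0.316, Comp 0.372 for DSC-MVSNet; Overall 0.352 for PatchmatchNet) and the definition of Overall as the average of Acc and Comp. The relative-improvement formula is an assumption: the text does not spell it out, although it reproduces the quoted 2.3% figure. The function names are illustrative.

```python
def overall(acc_mm: float, comp_mm: float) -> float:
    """Overall score: the average of Acc and Comp (lower is better)."""
    return (acc_mm + comp_mm) / 2.0


def relative_improvement(reference: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better score relative to a reference.

    Assumption: this is how the percentage gains are derived; the text
    does not state the formula explicitly.
    """
    return (reference - ours) / reference


# Values quoted in the text for DSC-MVSNet and PatchmatchNet.
dsc_overall = overall(0.316, 0.372)
gain_vs_patchmatchnet = relative_improvement(0.352, dsc_overall)

print(f"Overall = {dsc_overall:.3f} mm")
print(f"Gain over PatchmatchNet = {gain_vs_patchmatchnet:.1%}")
```

Under these assumptions the script gives an Overall of 0.344 mm and a gain of 2.3% over PatchmatchNet, matching the values reported above.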
We further compare our DSC-MVSNet with R-MVSNet [15] on some scenes (Scan 1, Scan 75, Scan 110, Scan 114) of DTU [51], because R-MVSNet can handle large-scale scenarios for 3D model reconstruction [54]. Figure 5 visualizes several reconstructed point cloud models from the DTU dataset. The comparisons reveal that our DSC-MVSNet produces considerably fewer outliers than R-MVSNet. This shows that our DSC-MVSNet accurately estimates the position of each point to be reconstructed, which is consistent with the Acc value in Table 3. Furthermore, it is worth mentioning that our network occupies less memory and runs faster than R-MVSNet. We attribute these improvements to the introduction of the 3D-DSC.

Comparison of model efficiency
We compared the efficiency of different methods by reporting their model parameters, memory consumption, and runtime (some results are obtained from official reports). Table 3 and Table 4 show that our framework has fewer model parameters, lower memory consumption, and shorter runtime than most state-of-the-art deep learning methods, with very competitive performance. Although our method has a slower runtime, it uses less memory and fewer parameters (5.5 GB and 253,585, respectively). We also compared our network with various state-of-the-art methods, such as Fast-MVSNet [13], Cascade-MVSNet [12], PVA-MVSNet [53], UCS-Net [14], and D²HC-RMVSNet [16]. Table 4 shows that DSC-MVSNet achieves lower or comparable parameter counts, memory consumption, and runtime compared to SOTA methods. Memory consumption directly affects the environment setting for model training. In terms of [...] [53] and achieve faster runtime than D²HC-RMVSNet [16] (5.5 GB vs 17.3 GB; 0.74 s vs 29.15 s). Similarly, UCS-Net [14] is comparable to our method in terms of memory and time, but we reduce parameters by 73% compared to UCS-Net [14] on the DTU [51] dataset. In conclusion, our proposed method has better or comparable efficiency than most state-of-the-art methods.
Fig. 5 Visualization of several scenes on the DTU dataset between R-MVSNet [15] (left) and our DSC-MVSNet (right). The point cloud results clearly show that our method DSC-MVSNet achieves better reconstruction results even with far fewer parameters
We then discuss the memory and time consumption of the inference phase. The size of the inputs is H × W = 1600 × 1152, and the number of hypothetical depth planes is set to D = 96. Table 5 shows the inference results on the DTU [51] dataset w.r.t. the number of source images. It demonstrates that both the memory occupied by inference and the inference time increase linearly with the number of source images.

Ablation experiments
The ablation experiments are also conducted on the DTU dataset to illustrate our method's efficiency and effectiveness. The network with only the 3D UNet-shape network for cost volume regularization is taken as the baseline for the ablation experiments. The results are shown in Table 6.
Effectiveness of DSC: Our novel contribution is to explore the feasibility of 3D depthwise separable convolution as a cost volume regularization scheme in the MVS domain. As shown in Table 6, comparing Row 2 (Baseline + 3D CNNs) with Row 3 (Baseline + DSC), we observe that replacing 3D CNN with 3D DSC in the 3D UNet does not cause a sharp decline in model performance, e.g. Acc changes from 0.391 to 0.398. Meanwhile, our model greatly reduces the number of parameters, memory consumption and time. Therefore, it is feasible to use 3D DSC in the MVS domain. Based on this observation, we think that the regularization scheme we designed for the cost volume plays a key role in the model. We divide 3D DSC into a 3D pointwise convolution and a 3D depthwise convolution, which together perceive multi-dimensional cost information and aggregate it along the depth and spatial dimensions. This mechanism is similar to the 3D CNN-based mechanism (as shown in Fig. 2b and d), so our model can still maintain an impressive performance, which proves the feasibility of using 3D DSC in the MVS domain.
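To make the parameter saving concrete, the following sketch counts the weights of a standard 3D convolution and of a depthwise plus pointwise pair. It is an illustration only: the kernel size of 3, the omission of biases, and the channel widths are assumptions chosen for the example and are not taken from the network definition.

```python
def conv3d_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a standard k x k x k 3D convolution (bias omitted)."""
    return k ** 3 * c_in * c_out


def dsc3d_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a depthwise (k x k x k per channel) plus pointwise (1 x 1 x 1) pair."""
    depthwise = k ** 3 * c_in   # one spatial-depth filter per input channel
    pointwise = c_in * c_out    # 1 x 1 x 1 mixing across channels
    return depthwise + pointwise


# Hypothetical layer widths, chosen only for illustration.
for c_in, c_out in [(8, 16), (16, 32), (32, 64)]:
    full = conv3d_params(c_in, c_out)
    dsc = dsc3d_params(c_in, c_out)
    print(f"{c_in:>2} -> {c_out:>2}: standard={full:>6}, "
          f"separable={dsc:>5}, ratio={dsc / full:.3f}")
```

The ratio of the two counts is 1/c_out + 1/k³, so the saving grows with the number of output channels and is bounded by the kernel volume.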
Effectiveness of DSC 3D UNet: As shown in Table 6, compared to the baseline (+3D CNNs), the baseline using the DSC 3D UNet effectively reduces the model parameters, memory consumption and training time, while Acc, Comp, and Overall are largely maintained. This means that 3D depthwise separable convolution achieves a significant reduction in parameters without much accuracy loss.
Effectiveness of 3D-Attention module: As shown in Table 6, the Acc, Comp, and Overall metrics can all be improved with only a slight increase in computation and memory consumption by adding the 3D-Attention module to the baseline + DSC. This means that adding the attention layer is effective and helps to improve the information extraction of our proposed separable convolution.
Regarding the problem of similarity confidence mentioned in Sect. "3D-Attention module (3DA)", we discuss the effectiveness of the 3DA module in solving it. Figure 7 illustrates the confidence line charts over different depths at one spatial location with 3DA (red line chart) and without 3DA (blue line chart). The blue chart shows that the confidence of the GT depth is very similar to the confidence of the error depth. As a result, the predicted depth value (the blue dashed line) calculated via Eq. (8) is far from the GT depth. After adding the 3DA module, the red line chart shows that the confidence of the GT depth has been enhanced and the confidence of the error depth has been weakened, so the predicted depth value (the red dashed line) is close to the GT depth value. This is also reflected in the higher accuracy of the ablation experiment with baseline + DSC + 3DA in Table 6.
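The effect described above can be illustrated with a small numerical sketch. Eq. (8) is not reproduced in this section, so the code assumes it is the probability-weighted expectation over depth hypotheses (soft argmin) commonly used in MVSNet-style networks. The depth values and confidences are hypothetical and are not read from Fig. 7.

```python
def expected_depth(depths, confidences):
    """Confidence-weighted mean of the depth hypotheses (soft argmin)."""
    total = sum(confidences)
    return sum(d * c for d, c in zip(depths, confidences)) / total


# Hypothetical depth hypotheses in mm; the ground-truth depth is 500 mm.
depths = [450.0, 500.0, 550.0, 600.0, 650.0, 700.0]
gt_depth = 500.0

# Without attention: a wrong depth (700 mm) is as confident as the GT depth.
ambiguous = [0.05, 0.40, 0.05, 0.05, 0.05, 0.40]
# With attention (illustrative): GT confidence enhanced, wrong peak weakened.
sharpened = [0.05, 0.75, 0.05, 0.05, 0.05, 0.05]

err_ambiguous = abs(expected_depth(depths, ambiguous) - gt_depth)
err_sharpened = abs(expected_depth(depths, sharpened) - gt_depth)

print(f"error with two similar peaks:  {err_ambiguous:.1f} mm")
print(f"error with one dominant peak:  {err_sharpened:.1f} mm")
```

With two equally confident peaks the expectation lands between them, far from either; suppressing the wrong peak pulls the estimate back towards the GT depth.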
Effectiveness of Informative Feature Extraction Network: As shown in Table 6, our baseline + DSC combined with the Informative Feature Extraction Network achieves better performance with a small increase in the number of model parameters, memory, and time.
Effectiveness of Feature Transfer Module: We use a Feature Transfer Module in the baseline + DSC to upsample the LR depth map. Table 6 shows that the FTM can further improve the performance of our network with a small increase in model parameters, memory, and time.
The ablation reconstruction results for scan 118 of DTU [51] when adding different modules of our method are shown in Fig. 6. As the areas marked by rectangles in Fig. 6 show, our baseline achieves higher completeness and richer detail when combined with the different modules.

Generalization on TnT dataset
The Tanks & Temples (TnT) dataset [9] is widely used as a benchmark in previous methods [10,12,13,15,31,32]. Therefore, to evaluate the generalization of our DSC-MVSNet, we test on TnT and evaluate the results by uploading the point clouds to the official website. We use the best model trained on DTU, without fine-tuning, to evaluate on the TnT dataset [9], and we use 5 adjacent images with a resolution of 1920 × 1080 as the input. Meanwhile, the number of depth hypothesis planes is set to D = 128. As shown in Table 7, our model exhibits comparable results with lower consumption. Compared to traditional multi-view stereo methods (Colmap, Pix4D, OpenMVG + OpenMVS), our DSC-MVSNet obtains better reconstruction scores on all scenes. Besides, our DSC-MVSNet outperforms all listed learning-based MVS methods with a 53.48 mean F-score on the Tanks and Temples intermediate set [9]. We also achieve generalization performance comparable to the state-of-the-art methods; for example, DSC-MVSNet achieves the highest accuracy on several scenes, i.e., Family, Lighthouse, M60, Panther, and Train. Figure 8 shows the error visualization calculated against the corresponding ground truth point clouds. Our DSC-MVSNet significantly improves the precision of reconstructions compared to the recent work PatchmatchNet [59]. For example, as shown in the red boxes in Fig. 8, PatchmatchNet has more incorrect points and noise. Our method obtains more accurate point positions while reducing noise, which benefits from our proposed 3DA and FTM.
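For readers unfamiliar with the F-score reported by the benchmark, the sketch below shows how it combines precision and recall at a distance threshold. It is a simplified illustration, not the official evaluation, which runs on the benchmark server: the function name, the threshold, and the nearest-neighbour distances are all hypothetical.

```python
def f_score(dist_rec_to_gt, dist_gt_to_rec, threshold):
    """Return precision, recall and F-score (in percent) at a distance threshold.

    dist_rec_to_gt: distance of each reconstructed point to its nearest GT point.
    dist_gt_to_rec: distance of each GT point to its nearest reconstructed point.
    """
    precision = 100.0 * sum(d < threshold for d in dist_rec_to_gt) / len(dist_rec_to_gt)
    recall = 100.0 * sum(d < threshold for d in dist_gt_to_rec) / len(dist_gt_to_rec)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2.0 * precision * recall / (precision + recall)


# Hypothetical nearest-neighbour distances (arbitrary units).
rec_to_gt = [0.001, 0.002, 0.004, 0.010, 0.020]
gt_to_rec = [0.001, 0.003, 0.006, 0.008]

precision, recall, score = f_score(rec_to_gt, gt_to_rec, threshold=0.005)
print(f"precision={precision:.1f}  recall={recall:.1f}  F-score={score:.2f}")
```

Because the F-score is a harmonic mean, a reconstruction needs both few incorrect points (precision) and good surface coverage (recall) to score well.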

Limitation analysis
Although our model exhibits better or comparable performance than most state-of-the-art methods on the two benchmarks [9,51], it still has some limitations.
(1) For complex environmental factors (e.g. lighting and reflection conditions) that have not been encountered before, the accuracy of the reconstruction is still limited. Therefore, we consider improving the generalization ability of the model in future work. (2) As we use several images as input, the memory consumption of our model is still higher than that of the best method, as shown in Table 4. This motivates us to explore high-quality reconstruction with limited input images.

Conclusion
We proposed DSC-MVSNet, a novel coarse-to-fine and end-to-end framework for efficient and accurate depth estimation in MVS. Firstly, we use depthwise separable convolution to construct our attention-aware 3D UNet-shaped network for cost volume regularization with fewer parameters and lower memory cost. Additionally, we introduce a 3D-Attention module to focus on more critical information and alleviate the feature-mismatching problem. Furthermore, we propose an efficient and effective Feature Transfer Module to upsample the LR depth map. The experimental results verify the effectiveness and efficiency of our method.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

(1) Feature extraction (in Sect. "Informative feature extraction network"): we use an informative feature extraction network to extract the corresponding feature $F_i \in \mathbb{R}^{C \times \frac{H}{4} \times \frac{W}{4}}$ for each image $I_i$, where $I_0$ and $\{I_i\}_{i=1}^{N}$ denote the reference image and source images, respectively. (2) Cost volume regularization (in Sects. "3D depthwise separable convolution (3D-DSC)", and "3D-attention module (3DA)"): we propose a DSC-Attention 3D UNet to regularize the coarse cost volume $V \in \mathbb{R}^{C \times D \times \frac{1}{8}H \times \frac{1}{8}W}$, which is constructed from the reference feature $F_0$ and the source features $\{F_i\}_{i=1}^{N}$. (3) Depth map upsampling (in Sect. "Feature transfer module"): we propose a feature transfer module to upsample the LR depth map $D_s \in \mathbb{R}^{1 \times \frac{1}{8}H \times \frac{1}{8}W}$ to a HR depth map $D_d \in \mathbb{R}^{1 \times \frac{1}{4}H \times \frac{1}{4}W}$. (4) Depth map refinement (in Sect. "Depth map refinement"): a Gauss-Newton layer is utilized to obtain the refined depth map $D_r \in \mathbb{R}^{1 \times \frac{1}{4}H \times \frac{1}{4}W}$ by using the input images $\{I_i\}_{i=0}^{N}$ and the HR depth map $D_d$. Finally, we fuse the refined depth maps to obtain point clouds as the result.
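As a quick way to keep track of the resolutions in the four stages, the sketch below computes the tensor shapes for a given input size. It assumes the scales as read from the overview (features and HR/refined depth maps at 1/4 resolution, cost volume and LR depth map at 1/8 resolution). The function name, the channel count C = 32 and the 512 × 640 input are hypothetical values used only for illustration; D = 48 is the first-stage training setting.

```python
def stage_shapes(height: int, width: int, channels: int, num_depths: int) -> dict:
    """Tensor shapes (without batch dimension) at each pipeline stage.

    Assumes height and width are divisible by 8.
    """
    return {
        "feature F_i": (channels, height // 4, width // 4),
        "cost volume V": (channels, num_depths, height // 8, width // 8),
        "LR depth map D_s": (1, height // 8, width // 8),
        "HR depth map D_d": (1, height // 4, width // 4),
        "refined depth map D_r": (1, height // 4, width // 4),
    }


# Hypothetical input size and channel count.
shapes = stage_shapes(height=512, width=640, channels=32, num_depths=48)
for name, shape in shapes.items():
    print(f"{name:<22} {shape}")
```

Regularizing the cost volume at 1/8 resolution keeps the number of voxels a factor of four below what the 1/4-resolution feature maps would require, with the upsampling stage recovering the finer depth map afterwards.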

Fig. 3
Fig. 3 Illustration of the problem of similarity confidence at different depths and the use of 3DA to alleviate it. Red voxels represent the similarity confidence; the channel dimension is omitted in the representation of the cost volume; light red indicates that the confidence is weakened
layer represents a block of convolution, batch normalization (BN) and ReLU. 'sp' means skip connection.
The number of virtual hypothetical depth planes is set to D = 48 and D = 96 for training, and the depth values are sampled within the range [425 mm, 921 mm]. We use the RMSProp optimizer with an initial learning rate of 0.0008 and a decay weight of 0.002 every epoch. The batch size is set to 16 and the network is trained on 6 NVIDIA RTX 2080 Ti GPUs. Our best model is trained in two stages: (1) We use 48 virtual hypothetical depth planes for training, set 6 epochs for end-to-end training with the DSC-Attention 3D UNet and Feature Transfer Module, and use 12 epochs for overall training. (2) We retrain our network from the best model obtained in the first stage for 10 epochs with 96 hypothetical depth planes. The best model of the second stage is selected as our evaluation model.
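The depth hypotheses used in the two training stages can be sketched as follows. The text gives only the range [425 mm, 921 mm] and the plane counts D = 48 and D = 96; uniform spacing with both endpoints included is an assumption made here for illustration, and the function name is hypothetical.

```python
def depth_hypotheses(depth_min: float, depth_max: float, num_planes: int) -> list:
    """Uniformly spaced depth values in mm, endpoints included (assumed)."""
    step = (depth_max - depth_min) / (num_planes - 1)
    return [depth_min + i * step for i in range(num_planes)]


# Stage 1 (D = 48) and stage 2 (D = 96) over the same depth range.
coarse = depth_hypotheses(425.0, 921.0, 48)
fine = depth_hypotheses(425.0, 921.0, 96)

coarse_step = coarse[1] - coarse[0]
fine_step = fine[1] - fine[0]
print(f"D=48: {len(coarse)} planes, spacing {coarse_step:.2f} mm")
print(f"D=96: {len(fine)} planes, spacing {fine_step:.2f} mm")
```

Doubling the number of planes over a fixed range roughly halves the spacing between hypotheses, which is why the second stage can resolve depth more finely at the cost of a larger cost volume.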

Fig. 4
Fig. 4 Visualization of the reconstructed point cloud models for scan 77 in the DTU dataset by different methods. The results are directly cited from the paper P-MVSNet [11]. Three important parts: cover (yellow), handle (red) and base (green) are highlighted. Although the reference image

Fig. 6
Fig. 6 Reconstruction results of scan 118 of the DTU dataset [51]. Two important parts: top (red) and bottom (red) are highlighted. The point cloud results show the effectiveness of each module

Fig. 7 Fig. 8
Fig. 7 We show the similar confidences for an example from scan 77. On the top, we show an RGB reference image and an RGB source image. The red point in the right image is the matching point, and the green point is the mismatching point. On the bottom, we show the correspond-

Table 2
Summary of the informative feature extraction network. Input image size: 3 × H × W

Table 3
Quantitative results. Bold values mean the best values compared to all listed values of each column. Underlined values mean the second lowest values compared to all listed Acc values. Our method DSC-MVSNet outperforms all deep learning-based MVS methods in terms of reconstruction accuracy, and has a better result in terms of Overall. The top group of methods are traditional MVS methods, and the bottom group are deep learning-based methods
Completeness is measured as the distance from the reference to the MVS reconstruction, encapsulating how much of the surface is captured by the MVS reconstruction. A lower Comp value means that we reconstruct more point cloud surfaces. Acc and Comp are calculated using the official Matlab code provided by DTU [51]. Overall is calculated as the average of Acc and Comp to evaluate overall reconstruction quality. The metrics used to evaluate efficiency are Parameters, Memory, and Time, which are widely adopted in previous methods [13,53,54].
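The direction of the two distance metrics can be illustrated with a toy example. This is a simplified sketch, not the official DTU Matlab evaluation, which this code does not attempt to reproduce. It assumes Acc is the mean distance from the reconstruction to the reference and Comp is the mean distance from the reference to the reconstruction; the function name and the point clouds are hypothetical.

```python
import math


def mean_nearest_distance(source_points, target_points):
    """Mean distance from each source point to its nearest target point (brute force)."""
    total = 0.0
    for p in source_points:
        total += min(math.dist(p, q) for q in target_points)
    return total / len(source_points)


# Hypothetical toy clouds: the reconstruction misses the reference point at x = 3.
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

acc = mean_nearest_distance(reconstruction, reference)   # reconstruction -> reference
comp = mean_nearest_distance(reference, reconstruction)  # reference -> reconstruction

print(f"Acc  = {acc:.3f}")
print(f"Comp = {comp:.3f}")
```

In this toy case every reconstructed point lies exactly on the reference, so Acc is zero, while the missing surface point raises Comp; this is the sense in which a lower Comp means more of the surface has been reconstructed.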

Table 4
Comparison of parameters, memory and time consumption on the DTU evaluation dataset [51]

Table 6
Ablation on the DTU evaluation dataset [51], which demonstrates the effectiveness of the different modules of our method, where model parameters, memory, and time are recorded during training
Columns: Method, Acc. (mm) ↓, Comp. (mm) ↓, Overall (mm) ↓, Parameters ↓, Memory (MB) ↓, Time (s) ↓

Table 7
Generalization results on the Tanks & Temples benchmark [9]
Columns: Methods, Family ↑, Francis ↑, Horse ↑, Lighthouse ↑, M60 ↑, Panther ↑, Playground ↑, Train ↑, Intermediate mean ↑
Bold values mean the best values compared to all listed values of each column. We achieve comparable F-score results with many state-of-the-art methods. The top part of the table shows the comparison results with traditional MVS methods, and the bottom part shows the comparison results with deep learning-based methods. The input resolution is 1920 × 1080 and the number of depth hypothesis planes is set to D = 128.