Introduction

Multi-view stereo (MVS) has been extensively studied and widely applied in augmented reality and 3D reconstruction [1,2,3,4,5,6]. The goal of MVS is to reconstruct 3D scenes from a series of camera-calibrated 2D images by establishing dense correspondences, which can be formulated as an optimization problem. Thus, optimization methods such as Markov discrete optimization [7] and spatial patch diffusion [8] have been applied to solve it. However, these methods may produce incomplete surfaces in scenes with weak textures or non-Lambertian surfaces [1, 9].

With the development of deep learning in recent years, Yao et al. [10] show promising results by formulating MVS as a cost volume regularization process and using deep learning to solve the optimization problem. The cost volume is composed of the matching confidences between features at different depths, and the regularization process optimizes the cost volume to obtain a depth probability distribution, from which the depth map is regressed. Note that regularization here does not refer to the strategy widely used to avoid overfitting in machine learning, but is the MVS term for the optimization of the cost volume. Because the accuracy of the regularized cost volume directly determines the quality of the final depth map obtained from the regression, improving the cost volume regularization network has been the major focus of subsequent studies, such as P-MVSNet [11] and Cascade-MVSNet [12]. These methods achieve better reconstruction results by fully exploiting the information of the images in multiple dimensions, but they still suffer from low efficiency. Therefore, efficient frameworks such as Fast-MVSNet [13] and UCS-MVSNet [14] propose lightweight strategies to reduce computation (e.g. sparse high-resolution representations, adaptive thin volumes). Nevertheless, all of the aforementioned methods inevitably incur a high computation cost because of their 3D cost volume regularization structures. Other methods such as R-MVSNet [15] and \( \text {D}^{2} \)HC-RMVSNet [16] divide the original 3D cost volume into smaller or lower-dimensional pieces, i.e. channel-sliced or depth-sliced cost maps, and use an RNN [17] or LSTM [18] to establish connections between these cost maps. These methods effectively reduce the computation cost by converting cost volume regularization into the regularization of a group of cost maps. However, they cannot incorporate as much context information over the cost volume as a 3D U-Net, and their effectiveness still needs improvement.

Therefore, our main research problem is how to significantly reduce the computation while maintaining effectiveness. First, 2D depthwise separable convolution has become a standard module for improving efficiency in mainstream vision tasks [19,20,21]. It decomposes an ordinary convolution into a depthwise convolution applied to channel-independent feature maps and a pointwise convolution that establishes the relations between these feature maps, which resembles the RNN-like mechanisms used in MVS. Because it fully considers both the information in the feature maps and the relations between them, 2D depthwise separable convolution achieves performance similar to ordinary convolution at a much lower computation cost. However, in the MVS domain [8, 10, 22,23,24,25], the use of 3D depthwise separable convolution has not yet been explored and needs to be studied in combination with the specific application scenario. Inspired by these observations, we use depthwise separable convolution to construct our 3D UNet-shape regularization network, extending 2D depthwise separable convolution to a 3D task. Second, features at different positions may easily be mismatched during cost volume regularization because of their visual similarity, which produces similar confidences at the same position across different depths. This feature mismatching problem seriously affects the quality of the depth map obtained by depth regression. Attention is a practical mechanism for addressing this issue, but conventional attention (which only convolves the 3D volume as a whole, without a dimension-wise process) does not effectively consider the mutual information between different dimensions, e.g. depth and space. Thus, we propose a 3D-Attention module to aggregate the more important multi-dimensional matching information (channel, space, and depth) of the cost volume and alleviate the feature mismatching problem. Third, the quality of the depth map directly affects the final reconstruction. Therefore, to achieve better performance, we propose a feature transfer module to upsample the low-resolution (LR) depth map to a high-resolution (HR) depth map. In addition, our feature extraction module obtains multi-level features by simultaneously incorporating low-level and high-level information learned by the CNN, which enables accurate 3D point localization. We term our effective and efficient coarse-to-fine framework DSC-MVSNet.

The remainder of this paper is organized as follows. In Sect. “Related work”, we introduce related work, followed by an overview of our method in Sect. “Methodology”. Section “Experiments” presents our experimental results on two challenging datasets. Section “Limitation analysis” discusses the limitations of our proposed method, followed by some concluding remarks in Sect. “Conclusion”.

In summary, our main contributions are as follows:

  • We propose a 3D UNet-shape network and are the first to use depthwise separable convolution for 3D cost volume regularization, which effectively improves model efficiency while maintaining performance.

  • We propose a 3D-Attention module to enhance cost volume regularization, fully aggregating the valuable information of the cost volume and alleviating the feature mismatching problem.

  • We propose an effective and efficient feature transfer module that upsamples the LR depth map into an HR depth map to achieve higher-quality reconstruction.

  • With extensive experiments on two benchmarks, our method demonstrates comparable or even better reconstruction results than state-of-the-art methods at a much lower computation cost. For instance, compared to the state-of-the-art method MVSNet, our model reduces memory by 49% while improving accuracy by 20%.

Related work

Traditional MVS reconstruction

MVS establishes dense correspondences across multiple images with calibrated camera poses, which can be considered an optimization problem [24]. Many optimization methods have been proposed for it. According to the scene representation used, these methods can be divided into three categories: voxel-based [7, 26,27,28], patch-based [8, 24, 29, 30], and depth map-based [10, 13, 15, 22, 23, 25]. For example, Markov discrete optimization is applied in [7] by updating the state of chain voxels under constraints including photometric consistency, smoothness, and visibility. Spatial patch diffusion [8] considers each pixel in space as a patch and optimally expands the set of patches. Non-linear optimization [24] refines the depth and normal vector of given seed points using stochastic gradient descent and least squares to estimate the depth map of the image. Depth map-based methods use the depth map as an intermediate representation, which decouples the complex dense reconstruction problem into multiple simple subproblems and enables more flexible scene reconstruction. Many recent deep learning-based MVS methods [10, 15, 31] are also based on the depth map.

Deep learning-based MVS

Recently, to overcome the shortcomings of traditional MVS methods, many deep learning-based methods [10,11,12,13, 15, 31, 32] have been introduced. For example, MVSNet [10] proposes an end-to-end MVS framework that extracts features from multiple views with CNNs to construct the matching cost volume, and then uses 3D CNNs to regularize the cost volume and obtain the final depth map estimate. P-MVSNet [11] proposes a hybrid 3D U-Net to infer a probability volume from the cost volume and estimate the depth maps. These methods achieve good results owing to their full consideration of the multi-dimensional information of the images, but they are not efficient enough. To improve efficiency, Fast-MVSNet [13], built on Point-MVSNet [32], uses a sparse high-resolution depth map representation and several efficient modules. However, using 3D CNNs to regularize the cost volume inevitably results in a high computation cost. Thus, some methods slice the cost volume into cost maps and achieve higher efficiency by converting cost volume regularization into the regularization of a group of cost maps. For example, R-MVSNet [15] uses convolutional GRUs instead of 3D CNNs to regularize the 2D cost maps. \(D^{2}\)HC-RMVSNet [16] slices the cost volume into cost maps along the depth direction and uses a hybrid DHU-LSTM architecture, which absorbs the merits of both LSTM [18] and U-Net, to reduce the consumption cost. However, structures such as RNNs [17] or LSTMs [18] inherently suffer from a forgetting problem; they cannot fully consider the correlation between the cost maps and do not aggregate the multi-dimensional information of the cost volume well. Also aiming to improve the quality of the final reconstructed point cloud, DeepFusion [31] proposes a novel fusion strategy that accurately fuses all depth maps into a high-quality point cloud by balancing geometric consistency and predicted confidence.

Depthwise separable convolutions

The depthwise separable convolution is a useful lightweight strategy for building light and efficient networks. It was first proposed by Laurent Sifre and applied in an AlexNet-style network for image classification [19, 33], achieving performance similar to ordinary convolution at a lower computation cost. A similar idea has since been widely applied in frameworks for object detection [34, 35] and semantic segmentation [36, 37], such as MobileNetV1 [20] and MobileNetV2 [21]. Unlike ordinary convolution, depthwise separable convolution factorizes the operation into a depthwise convolution and a pointwise convolution: it computes each feature map independently with a channel-independent depthwise convolution and then uses a pointwise convolution to correlate the channels and obtain the final feature map. This mechanism reduces the computation cost while retaining performance similar to ordinary convolution. It is very similar to the strategies used in the lightweight MVS methods above [15, 16], which slice the cost volume into cost maps and use networks such as RNNs [17] and LSTMs [18] to correlate the maps.

Attention mechanism

It is well known that the attention mechanism plays an important role in deep learning. Beyond natural language processing [38], attention has been widely explored in many visual problems, including scene segmentation [39,40,41], panoptic segmentation [42], and image classification [43]. As research has progressed, attention mechanisms incorporating convolution operations have been proposed. The SE Block [44] adds a squeeze-and-excitation branch to convolutions that assigns weights to different channels. CBAM [45] adds a spatial attention block on top of the SE Block [44] to achieve fine-grained allocation and processing of spatial information. However, these attention mechanisms only focus on channel and spatial information, whereas a 3D cost volume also contains depth information. Since the values of the cost volume indicate the similarity between features, similar confidences may appear at different depths of the same spatial location in the same channel due to visually similar features, and the above attention mechanisms cannot emphasize the more important depth information of the cost volume. Therefore, we combine a depth attention mechanism with the original attention mechanisms so that the regularization network can better optimize the matching information of the cost volume, which yields better depth maps and thus higher-quality point cloud reconstructions.

Fig. 1

The architecture of DSC-MVSNet. In the first part, we use an informative feature extraction network to extract features and build the coarse cost volume. In the second part, we use our DSC-Attention 3D UNet to regularize the cost volume. In the third part, we use the FTM to upsample the LR depth map. In the fourth part, we use the Gauss-Newton layer [13] to further refine the depth map. The two bottom parts illustrate cost volume regularization: the lower left part is the schematic of our 3D depthwise separable convolution, and the lower right part is the schematic of our 3D-Attention module

Methodology

Our proposed DSC-MVSNet framework is a coarse-to-fine, end-to-end framework that estimates a target depth map \( {\tilde{D}}_{r} \) of the reference image \(I_{0}\) from \( N+1 \) input images \( \left\{ I_{i} \right\} _{i=0}^{N} \) of size \( H \times W \times 3 \). We achieve this with four sub-processes: Feature Extraction, Cost Volume Regularization, Depth Map Upsampling, and Depth Map Refinement. The overall architecture of DSC-MVSNet is shown in Fig. 1.

In the cost volume regularization, we propose a DSC-Attention 3D UNet based on depthwise separable convolution to significantly reduce time and memory consumption while maintaining performance. Moreover, to obtain a high-quality depth map, we also propose a feature transfer module to upsample the LR depth map.

Pipeline description

  1. Feature extraction (in Sect. “Informative feature extraction network”): we use an informative feature extraction network to extract the corresponding feature \( F_{i} \in {\mathbb {R}}^{C\times \frac{H}{4} \times \frac{W}{4}} \) for each image \( I_{i} \), where \( I_{0} \) and \( \left\{ I_{i} \right\} _{i=1}^{N} \) denote the reference image and source images, respectively.

  2. Cost volume regularization (in Sects. “3D depthwise separable convolution (3D-DSC)” and “3D-attention module (3DA)”): we propose a DSC-Attention 3D UNet to regularize the coarse cost volume \( V \in {\mathbb {R}}^{C\times D \times \frac{1}{8}H \times \frac{1}{8}W} \), which is constructed from the reference feature \( F_{0} \) and the source features \( \left\{ F_{i} \right\} _{i=1}^{N}\).

  3. Depth map upsampling (in Sect. “Feature transfer module”): we propose a feature transfer module to upsample the LR depth map \( {\tilde{D}}_{s} \in {\mathbb {R}}^{1\times \frac{1}{8}H\times \frac{1}{8}W} \) to an HR depth map \( {\tilde{D}}_{d} \in {\mathbb {R}}^{1\times \frac{1}{4}H\times \frac{1}{4}W} \).

  4. Depth map refinement (in Sect. “Depth map refinement”): a Gauss–Newton layer is utilized to obtain the refined depth map \( {\tilde{D}}_{r} \in {\mathbb {R}}^{1\times \frac{1}{4}H\times \frac{1}{4}W} \) from the input images \( \left\{ I_{i} \right\} _{i=0}^{N} \) and the HR depth map \( {\tilde{D}}_{d} \). Finally, we fuse the refined depth maps to obtain the resulting point cloud.

3D depthwise separable convolution (3D-DSC)

Inspired by the mechanism of 2D depthwise separable convolution, we reduce the computation of 3D cost volume regularization by proposing 3D-DSC to replace ordinary 3D CNNs. Since this is a 3D task, different strategies exist for factorizing the 3D convolution. The cost volume is built from the matching similarities between feature points at different spatial positions in different views at different depths. Thus, we divide the 3D CNN into a 3D depthwise convolution (which performs cost aggregation over the cost volume information in the depth dimension) and a 3D pointwise convolution (which performs cost aggregation over the cost volume information in the spatial dimension), consistent with the structure of the cost volume. The schematic of 3D-DSC is shown in the lower left part of Fig. 1.

  1. 3D depthwise convolution. The 3D depthwise convolution is performed over the cost volume in each channel independently to obtain the channel-independent intermediate feature maps, as defined in Eq. (1):

    $$\begin{aligned} Conv_{\text{ Depth }}(V)_{(i,j,u)} =\sum _{k, l, m}^{K, L, M} {W_{1}}_{(k, l, m)} \odot V_{(i+k, j+l, u+m)} \end{aligned}$$
    (1)

    where \(W_{1}\) represents the weight of the 3D depthwise convolution, \(V \in {\mathbb {R}}^{C\times D \times H \times W}\) represents the cost volume, \(i, j, u\) denote the position indices, K, L, M denote the kernel sizes of the convolution, and \(\odot \) denotes the element-wise product.

  2. 3D pointwise convolution. The 3D pointwise convolution acts on these channel-independent feature maps to aggregate the channel-wise information, as defined in Eq. (2):

    $$\begin{aligned} Conv_{\text{ Point }}({\hat{V}})_{(i,j,u)} =\sum _{n}^{N} {W_{2}}_{(n)} \cdot {\hat{V}}_{(i,j,u,n)} \end{aligned}$$
    (2)

    where \(W_{2}\) represents the weight of the 3D pointwise convolution, \({\hat{V}} \in {\mathbb {R}}^{C \times D \times H \times W}\) represents the intermediate feature maps, and N denotes the number of channels of the intermediate feature maps.

The two convolutions are performed sequentially to form a complete convolution, as formulated in Eq. (3):

$$\begin{aligned} \begin{aligned} Conv_{\text{ SepConv }}\left( V\right) = Conv_{\text{ Point }} \left( Conv_{\text{ Depth }} \left( V\right) \right) \end{aligned} \end{aligned}$$
(3)
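To make the factorization concrete, the following is a minimal PyTorch sketch of Eqs. (1)–(3), assuming a kernel size of 3, "same" padding, and no bias terms; the class and argument names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """Sketch of 3D-DSC: a per-channel depthwise convolution over (D, H, W)
    (Eq. (1)) followed by a 1x1x1 pointwise convolution that mixes channels
    (Eq. (2)); Eq. (3) is their composition."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # groups=in_channels -> every channel is filtered independently
        self.depthwise = nn.Conv3d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels, bias=False)
        # 1x1x1 convolution aggregates the channel-wise information
        self.pointwise = nn.Conv3d(in_channels, out_channels, 1, bias=False)

    def forward(self, volume):
        # volume: (B, C, D, H, W) cost volume
        return self.pointwise(self.depthwise(volume))


if __name__ == "__main__":
    v = torch.randn(1, 8, 48, 32, 40)        # toy cost volume
    layer = DepthwiseSeparableConv3d(8, 16)
    print(layer(v).shape)                    # torch.Size([1, 16, 48, 32, 40])
```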
Fig. 2

Illustration of different regularization schemes. Cyan voxels denote the receptive field during regularization. The horizontal axis is the depth dimension and the vertical axis is the channel dimension. H and W denote the height and width, respectively; in this figure, H and W are collapsed into a single dimension

Here we compare our 3D-DSC regularization scheme with other mainstream regularization schemes theoretically, to demonstrate the effectiveness of our scheme. We display the four regularization schemes in Fig. 2: (a) Spatial Regularization (SR) [46] is a cost aggregation method that filters the cost volume at different depths; however, due to its small receptive field, the regularization results of SR are limited. (b) 3D CNN Regularization (3D-CNN) [10] is a CNN-based method that uses 3D CNNs to obtain a larger receptive field for cost volume regularization, but it incurs a much higher computation cost. (c) Recurrent Regularization [15] is an RNN-based method that processes the cost volume sequentially, dividing it into depth-independent cost maps to reduce the computation cost. (d) Our 3D-DSC Regularization is a DSC-based method: we split the cost volume into intermediate feature maps and then apply a pointwise convolution to establish the relations between these intermediate feature maps, maintaining the performance of the model. Our method obtains a larger receptive field than SR. While 3D CNN regularization can achieve better performance, it also incurs a higher computational cost, whereas our scheme achieves similar performance at a lower cost. Moreover, recurrent regularization and our scheme follow different but related ideas: both split the cost volume into intermediate feature maps to reduce the computation cost. Therefore, we conclude that adopting 3D-DSC as our regularization scheme is both feasible and effective.

Then we compare the efficiency of our 3D-DSC and 3D-CNN. Assuming the cost volume is \( V \in {\mathbb {R}}^{C\times D \times H \times W} \) and the goal cost volume is \( {\hat{V}} \in {\mathbb {R}}^{{\hat{C}} \times D \times H \times W} \), and the convolution kernel size is K, the computation cost of the ordinary 3D convolution and our proposed 3D depthwise separable convolution is shown in Table 1. We can see from the results that the computation cost of ordinary 3D convolution is \( (K^3 \times {\hat{C}}) / (K^3 + {\hat{C}}) \) times that of 3D depthwise separable convolution. For instance, when \(K=3\) and \({\hat{C}}=32\), the computation cost of our 3D-DSC convolution is around \(\frac{1}{14}\) of 3D-CNN. Thus, our regularization scheme 3D-DSC will be more efficient than 3D-CNN based models. In summary, we have analyzed the effectiveness and efficiency separately, which demonstrates the feasibility of our 3D-DSC as a regularization scheme.
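For a concrete sense of this ratio, a short illustrative calculation of the multiply-accumulate counts is given below; it assumes the depthwise step keeps the channel count C and the spatial resolution unchanged, and the tensor sizes are placeholders.

```python
def conv3d_macs(C, C_hat, K, D, H, W):
    """MACs of an ordinary 3D convolution producing C_hat output channels."""
    return K**3 * C * C_hat * D * H * W

def dsc3d_macs(C, C_hat, K, D, H, W):
    """MACs of depthwise (K^3 per channel) plus pointwise (C_hat per channel)."""
    return (K**3 * C + C * C_hat) * D * H * W

C, C_hat, K = 32, 32, 3
D, H, W = 48, 128, 160
ratio = conv3d_macs(C, C_hat, K, D, H, W) / dsc3d_macs(C, C_hat, K, D, H, W)
print(f"3D-CNN / 3D-DSC cost ratio: {ratio:.1f}x")   # ~14.6x for K=3, C_hat=32
```

The ratio simplifies to \( (K^3 \times {\hat{C}}) / (K^3 + {\hat{C}}) \), matching the analysis in the text.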

Table 1 Comparison of the ordinary 3D convolution (3D-CNN), depthwise convolution (Depthwise-Conv), pointwise convolution (Pointwise-Conv) and 3D depthwise separable convolution (3D-DSC)
Fig. 3

Illustration of the problem of similar confidences at different depths and of using 3DA to alleviate it. Red voxels represent the similarity confidence; for the representation of the cost volume, the channel dimension is omitted; light red indicates that the confidence is weakened

3D-attention module (3DA)

Although the cost volume information can be effectively aggregated after the 3D-DSC, a feature mismatching problem still affects the cost volume quality. Feature mismatching happens when features from different key points are mistakenly matched, which causes similar confidences at different depths of the cost volume and finally results in inaccurate depth estimation. Specifically, as shown in Fig. 3, a reference feature matches two similar source features at different depths (the two hands of the Buddha statue), so the confidences at different depths are similar in the cost volume. These similar confidences degrade the quality of the depth map regressed by Eq. (8).

Since attention mechanisms can highlight important information by calculating different weights, we here use an attention mechanism to address the feature mismatching problem. We propose a 3D-Attention module, which alleviates this problem by computing an attention weight using the information of the whole cost volume to enhance or weaken similar confidence in different depths. The schematic of the module is depicted in the lower right part of Fig. 1, and it consists of two blocks.

  1. Channel attention block. A channel attention block performs attention over channel-wise information. It is constructed by a multi-layer perceptron (MLP) that acts on the channels of the cost volume \( V \in {\mathbb {R}}^{C\times D \times H \times W} \) to obtain the channel attention enhancement weights \( {\hat{W}} \). We multiply the channel weights \( {\hat{W}} \) with the cost volume V to obtain the channel-refined cost volume \( V' \in {\mathbb {R}}^{C\times D \times H \times W} \). The formula of the channel attention block is defined as Eq. (4):

    $$\begin{aligned} {\hat{W}}=\sigma (MLP(MaxPool(V)) + MLP(AvgPool(V))) \end{aligned}$$
    (4)

    where MaxPool is max pooling and AvgPool is average pooling. \( {\hat{W}} \in {\mathbb {R}}^{C} \) denotes the channel attention enhancement weights, and both branches share the weights of the MLP.

  2. Spatial depth attention block. A spatial depth attention block is proposed to alleviate the problem of similar confidences. Unlike ordinary attention, which uses full perception without distinguishing between space and depth, the spatial depth attention block perceives the cost information along two different dimensions of the cost volume, i.e. space and depth. First, we use a spatial-oriented anisotropic [11] convolution with a kernel size of \( 1 \times 7 \times 7 \) (different positions at the same depth) to filter the cost volume along the spatial direction, reducing noise while retaining useful matching information at the same depth; it provides more accurate spatial information for the subsequent depth-oriented convolution. Then a depth-oriented anisotropic convolution with a kernel size of \( 7 \times 1 \times 1 \) (different depths at the same position) acts on the depth dimension and effectively enhances or weakens the matching information at different depths of the same spatial location (illustrated in Fig. 3). Finally, we use an isotropic [11] convolution with a kernel size of \( 7 \times 7 \times 7 \) acting on multiple dimensions (space and depth) to fully aggregate the information from the above processes. The formula of the spatial depth attention block is defined as Eq. (5):

    $$\begin{aligned} {\tilde{W}}=\sigma \left( f^{7 \times 7 \times 7}\left( f^{7 \times 1 \times 1}\left( f^{1 \times 7 \times 7}\left( V'\right) \right) \right) \right) \end{aligned}$$
    (5)

where \( \sigma \) is the activation function; \( {\tilde{W}} \in {\mathbb {R}}^{1\times D \times H \times W} \) is the spatial depth weight; \( f^{1 \times 7 \times 7} \) is the spatial oriented convolution, \( f^{7 \times 1 \times 1} \) is the depth oriented convolution and \( f^{7 \times 7 \times 7} \) is the overall convolution.

We form a 3D-Attention module by cascading these two blocks. As shown in Fig. 3, the confidence of the correct depth is enhanced by our module. The formula of the 3D-Attention module is defined as Eq. (6):

$$\begin{aligned} \begin{aligned}&V' = {V} \times {\hat{W}} \\&V'' = V' \times {\tilde{W}} \end{aligned} \end{aligned}$$
(6)

where \( V'' \in {\mathbb {R}}^{C \times D \times H \times W} \) is the attention-weighted cost volume.
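A minimal PyTorch sketch of the module described above follows. It assumes sigmoid activations, a channel-reduction ratio of 4 in the shared MLP, and that the spatial-oriented convolution reduces the C channels to a single channel; these choices, and the class name, are illustrative assumptions, not our exact implementation of Eqs. (4)–(6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention3D(nn.Module):
    """Sketch of the 3D-Attention module: a channel attention block (Eq. (4))
    cascaded with a spatial depth attention block built from 1x7x7, 7x1x1 and
    7x7x7 convolutions (Eqs. (5)-(6))."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Shared MLP (1x1x1 convolutions) applied to both pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1, bias=False))
        # Spatial depth branch; reducing C channels to 1 here is an assumption
        self.spatial = nn.Conv3d(channels, 1, (1, 7, 7), padding=(0, 3, 3))
        self.depth = nn.Conv3d(1, 1, (7, 1, 1), padding=(3, 0, 0))
        self.full = nn.Conv3d(1, 1, 7, padding=3)

    def forward(self, v):
        # v: (B, C, D, H, W) cost volume
        w_hat = torch.sigmoid(self.mlp(F.adaptive_max_pool3d(v, 1)) +
                              self.mlp(F.adaptive_avg_pool3d(v, 1)))
        v1 = v * w_hat                      # channel-refined volume V'
        w_tilde = torch.sigmoid(self.full(self.depth(self.spatial(v1))))
        return v1 * w_tilde                 # attention-weighted volume V''
```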

After regularization, we apply a softmax operation (Eq. 7) along the depth direction to normalize the values into [0, 1], forming the probability volume P for depth estimation. Finally, we sum the depth hypothesis plane values weighted by the probability volume P to obtain the LR depth map \( {\tilde{D}}_{s} \), as defined in Eq. (8):

$$\begin{aligned} P = \text {softmax}(V'') \end{aligned}$$
(7)
$$\begin{aligned} {\tilde{D}}_{s} = \sum _{d=d_{\min }}^{d_{\max }} d \times P(d) \end{aligned}$$
(8)
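The soft-argmin regression of Eqs. (7)–(8) can be sketched as follows, assuming the regularized volume has already been reduced to a single channel; the function name is illustrative.

```python
import torch

def regress_depth(volume, depth_values):
    """Soft-argmin depth regression of Eqs. (7)-(8).

    volume:       (B, D, H, W) regularized cost volume (channel dim squeezed)
    depth_values: (D,) depth hypothesis plane values in [d_min, d_max]
    """
    prob = torch.softmax(volume, dim=1)                      # probability volume P, Eq. (7)
    depth = (prob * depth_values.view(1, -1, 1, 1)).sum(1)   # expected depth, Eq. (8)
    return depth, prob
```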

Feature transfer module

The high-resolution (HR) depth map obtained by upsampling directly affects the quality of the point cloud results. To obtain a high resolution and precise depth map, we propose a Feature Transfer Module (FTM) for the low-resolution (LR) depth map upsampling. The third part of Fig. 1 shows the framework of our FTM module.

The inputs of the FTM are a three-channel reference image \( I_{0} \in {\mathbb {R}}^{3 \times H \times W} \) and a single-channel LR depth map \( {\tilde{D}}_{s} \in {\mathbb {R}}^{1 \times \frac{1}{8}H \times \frac{1}{8}W}\). To unify the scale of the inputs, we first use the bicubic interpolation algorithm [47] to upsample the LR depth map \( {\tilde{D}}_{s}\) to a larger-scale depth map \( {\tilde{D}}_{s}' \in {\mathbb {R}}^{1 \times \frac{1}{4}H \times \frac{1}{4}W} \), and we downsample the reference image into a 16-channel feature map \( I_{0}' \in {\mathbb {R}}^{16 \times \frac{1}{4}H \times \frac{1}{4}W} \) with a downsampling layer. After unification, we propose a shared offset and weight extraction backbone to obtain the offset \( \Delta p_{I_{0}'} \in {\mathbb {R}}^{k^{2} \times \frac{1}{16}H \times \frac{1}{16}W} \) and weight \( \Delta w_{I_{0}'} \in {\mathbb {R}}^{k \times \frac{1}{16}H \times \frac{1}{16}W} \) of the reference image, and the offset \( \Delta p_{{\tilde{D}}_{s}'} \in {\mathbb {R}}^{k^{2} \times \frac{1}{16}H \times \frac{1}{16}W} \) and weight \( \Delta w_{{\tilde{D}}_{s}'} \in {\mathbb {R}}^{k \times \frac{1}{16}H \times \frac{1}{16}W} \) of the LR depth map, respectively. This backbone contains a seven-layer convolutional feature extraction network, an offset convolution, a weight convolution, and a sigmoid layer. The equation of this backbone is defined as Eq. (9):

$$\begin{aligned} \begin{aligned}&\Delta p_{input} = f_{oc}(f_{FE}(input)), \,\, input \, \in \, \{I_{0}', {\tilde{D}}_{s}'\} \\&\Delta w_{input} = sigmoid(f_{wc}(f_{FE}(input))) \\ \end{aligned} \end{aligned}$$
(9)

where \( f_{FE} \) represents the extraction network, \(f_{oc}\) the offset convolution, \( f_{wc} \) the weight convolution, and sigmoid the sigmoid layer.

Then we use the OWC Block to compute the weight \( \Delta w \in {\mathbb {R}}^{\frac{k^{2}}{16} \times \frac{1}{4}H \times \frac{1}{4}W} \) and offset \( \Delta q \in {\mathbb {R}}^{\frac{k^{2}}{8} \times \frac{1}{4}H \times \frac{1}{4}W} \) that guide the depth map upsampling, where k is a hyperparameter set to \(k=12\). In detail, we multiply the corresponding offsets \(\Delta p_{I_{0}'}, \Delta p_{{\tilde{D}}_{s}'}\) and weights \(\Delta w_{I_{0}'},\Delta w_{{\tilde{D}}_{s}'}\), and then pass the results through PixelShuffle to get the goal offset \(\Delta q\) and weight \( \Delta w \). We then use the offset to guide feature sampling and multiply the sampled features with the weight. Finally, we obtain the HR depth map through a residual addition block. The equations of the above process are defined as Eq. (10):

$$\begin{aligned} \begin{aligned}&\Delta q = f_{ps}(\Delta p_{I_{0}'} \odot \Delta p_{{\tilde{D}}_{s}'}) \\&\Delta w = f_{ps}(\Delta w_{I_{0}'} \odot \Delta w_{{\tilde{D}}_{s}'}) \\&D_{res} = \Delta w \odot f_{gs}(\Delta q, {\tilde{D}}_{s}') \\&{\tilde{D}}_{d} = D_{res} + {\tilde{D}}_{s}' \end{aligned} \end{aligned}$$
(10)

where \( f_{ps} \) represents the PixelShuffle [48] operation of PyTorch, \( f_{gs} \) represents the grid_sample function of PyTorch, \( D_{res} \) represents the depth residual, and \(\odot \) denotes the element-wise product.
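The following is a simplified PyTorch illustration of Eq. (10). It assumes the offsets have already been reduced to a 2-channel flow field and the weight to a single channel (rather than the \(k^{2}\)-channel offsets produced by the OWC Block), so it only conveys the principle of offset-guided resampling with a weighted residual, not our exact module.

```python
import torch
import torch.nn.functional as F

def ftm_upsample(depth_lr, offset, weight):
    """Simplified sketch of Eq. (10): resample the bicubically upsampled LR
    depth map at offset-guided positions, scale by a learned weight, and add
    the result back as a residual.

    depth_lr: (B, 1, h, w)  LR depth map
    offset:   (B, 2, H, W)  per-pixel (x, y) sampling offsets in pixels
    weight:   (B, 1, H, W)  per-pixel residual weights in [0, 1]
    """
    B, _, H, W = offset.shape
    # Bicubic upsampling to the target resolution (D~'_s in the text)
    depth_up = F.interpolate(depth_lr, size=(H, W), mode='bicubic',
                             align_corners=True)
    # Base sampling grid in normalized [-1, 1] coordinates
    ys = torch.linspace(-1, 1, H, device=depth_lr.device)
    xs = torch.linspace(-1, 1, W, device=depth_lr.device)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    base = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and sample (f_gs in Eq. (10))
    norm = torch.stack((offset[:, 0] * 2 / (W - 1),
                        offset[:, 1] * 2 / (H - 1)), dim=-1)
    sampled = F.grid_sample(depth_up, base + norm, mode='bilinear',
                            align_corners=True)
    # Weighted residual plus the upsampled depth map
    return weight * sampled + depth_up
```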

Other modules

Informative feature extraction network

In the feature extraction process, many previous methods [10, 11, 13, 15, 49] only use sequential convolution operations to extract feature maps from the input images \( \left\{ I_{i} \right\} _{i=0}^{N} \), which therefore contain only high-level semantic information, and the loss of low-level spatial information degrades the quality of the reconstruction results. Thus, we propose an informative feature extraction network that uses skip connections to propagate low-level spatial information and aggregate multi-level features. The network has three components (Encoder, Decoder, Adjuster), and the architectural details are provided in Table 2.

Table 2 Summary of the informative feature extraction network

Cost volume construction

Following the previous methods [12, 13, 15, 32, 50], to build the cost volume V, we use the same differentiable homography to warp all feature maps into different fronto-parallel planes of the reference camera to construct N feature volumes \(\{ V^{f}_{i} \}_{i=1}^{N}\). Then we adopt the same cost metric [15] to aggregate them into the cost volume V. The equation of cost metric is defined as Eq. (11):

$$\begin{aligned} V = \frac{\sum _{i=1}^{N}\left( V_{i}-\overline{V_{i}}\right) ^{2}}{N} \end{aligned}$$
(11)

\(\overline{V_{i}}\) is the average volume of all feature volumes.
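A minimal sketch of this variance-based cost metric, assuming the N warped feature volumes are available as tensors of identical shape; the function name is illustrative.

```python
import torch

def build_cost_volume(feature_volumes):
    """Variance-based cost metric of Eq. (11).

    feature_volumes: list of N warped feature volumes, each (B, C, D, H, W)
    """
    volumes = torch.stack(feature_volumes, dim=0)       # (N, B, C, D, H, W)
    mean = volumes.mean(dim=0, keepdim=True)            # average volume
    return ((volumes - mean) ** 2).mean(dim=0)          # (B, C, D, H, W)
```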

Depth map refinement

The quality of the HR depth map \( {\tilde{D}}_{d} \) obtained in the previous step is still insufficient, so it needs to be refined. The Gauss–Newton layer is an effective and efficient module for depth map refinement in Fast-MVSNet [13]. Therefore, we use a Gauss–Newton layer to refine the HR depth map \( {\tilde{D}}_{d} \) using the input images \( \left\{ I_{i} \right\} _{i=0}^{N} \) and obtain the refined depth map \( {\tilde{D}}_{r} \in {\mathbb {R}}^{1 \times \frac{1}{4}H\times \frac{1}{4}W} \) for MVS reconstruction.

Training loss

Following the previous methods [10, 32], we compute the average absolute error between the predicted depth maps and the ground truth depth map as our training loss, as defined in Eq. (12):

$$\begin{aligned} \text{ Loss } = \sum _{p \in {\textbf{p}}_{\text{ valid } }} \left\| {\tilde{D}}_{d}(p)-{\hat{D}}(p)\right\| _{2} + \lambda \cdot \left\| {\tilde{D}}_{r}(p)-{\hat{D}}(p)\right\| _{2} \end{aligned}$$
(12)

where \( {\tilde{D}}_{d} \) denotes the HR depth map, \( {\tilde{D}}_{r} \) denotes the refined depth map, \( {\hat{D}} \) denotes the ground truth depth map, \({\textbf{p}}_{\text{ valid } }\) denotes the set of valid points of the ground truth depth map, and \( \lambda \) balances the two loss terms. In the training process, we set \( \lambda \) to 1.0.
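A minimal sketch of this loss, assuming validity is indicated by positive ground truth depths, that the ground truth maps are provided at the corresponding resolutions, and that the error is averaged over valid pixels (the formula writes a sum over the valid set); the function name is illustrative.

```python
import torch

def dsc_mvs_loss(depth_hr, depth_refined, depth_gt, lam=1.0):
    """Eq. (12): absolute depth errors of the HR and refined depth maps on
    valid ground-truth pixels, balanced by lambda. depth_gt > 0 is assumed
    to mark valid pixels."""
    valid = depth_gt > 0
    loss_hr = (depth_hr[valid] - depth_gt[valid]).abs().mean()
    loss_ref = (depth_refined[valid] - depth_gt[valid]).abs().mean()
    return loss_hr + lam * loss_ref
```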

Experiments

In this section, we first introduce the experimental settings in this paper, then quantitatively and qualitatively demonstrate the performance on the DTU dataset, and finally verify the generalization ability of the proposed work on the TnT dataset.

Experimental settings

Dataset

The DTU dataset [51] is a large-scale dataset captured in the laboratory with precise camera poses and lighting conditions controlled by a robot arm. The dataset consists of images, ground truth point clouds, and the corresponding camera parameters of 128 scenes under 7 different lighting conditions. Each scene has 49 or 64 images with a resolution of \( 1600 \times 1200 \) and corresponding intrinsic and extrinsic camera parameters for training. The dataset provides calibrated images and ground truth point clouds, and Yao et al. [10] divide it into a training set, a validation set, and a test set.

The Tanks & Temples (TnT) dataset [9] is captured with real outdoor sensors, unlike DTU [51]. These outdoor scenes contain a variety of lighting conditions, reflections, and other outdoor factors, which makes the TnT dataset more complex than DTU, which is captured under controlled conditions. The intermediate set used for evaluation contains eight different scenes, namely Family, Francis, Horse, Lighthouse, M60, Panther, Playground, and Train.

Table 3 Quantitative results of different methods on DTU’s evaluation set [51] (lower is better)

Implementation details

Training The proposed DSC-MVSNet is implemented in PyTorch and trained on the DTU training set. The ground truths for evaluation on DTU are provided as real point clouds, and the depth maps used to train our framework are obtained with the screened Poisson surface reconstruction algorithm (SPSR) [52]. In the training process, the input image resolution is set to \( 640 \times 512 \) and the number of training views is set to \( {\textbf{N}}=3 \). The selection of reference and source images is the same as in MVSNet [10]. The number of virtual hypothetical depth planes is set to \( {\textbf{D}}=48 \) and \( {\textbf{D}}=96 \) for training, and the depth values are sampled within the range \( [425\,mm, 921\,mm] \). We use the RMSProp optimizer with an initial learning rate of 0.0008, decayed by a weight of 0.002 every epoch. The batch size is set to 16 and training is performed on \( 6 \times \) NVIDIA RTX 2080 Ti GPUs. Our best model is trained in two stages: (1) we use 48 virtual hypothetical depth planes, with 6 epochs of end-to-end training of the DSC-Attention 3D UNet and Feature Transfer Module and 12 epochs of overall training; (2) we retrain the network from the best model of the first stage for 10 epochs with 96 hypothetical depth planes. The best model of the second stage is selected as our evaluation model.

Testing The model obtained in the training process is tested on the DTU test set [51]. We use 5 adjacent images of \( 1280 \times 960 \) as the input, and the number of hypothetical depth planes for testing is set to \( {\textbf{D}}=128 \). The evaluation on the DTU dataset [51] is performed by converting the output depth maps into a predicted point cloud following Yao et al. [10], and then comparing it with the ground truth point cloud using the official Matlab code.

Fig. 4

Visualization of the reconstructed point cloud models for scan77 of the DTU dataset by different methods. The results are directly cited from the P-MVSNet paper [11]. Three important parts are highlighted: cover (yellow), handle (red), and base (green). Although the reference image sequences contain many reflective regions, which are hard for 3D model reconstruction, our DSC-MVSNet reconstructs a more complete and more accurate point cloud than most existing methods

Evaluation metrics

To obtain comprehensive conclusions, we use three metrics for evaluating the performance and three metrics for evaluating the efficiency of our model. The performance metrics (Acc, Comp, and Overall) are all defined in DTU [51]. Acc is measured as the distance from the MVS reconstruction to the structured-light reference and captures the quality of the reconstructed MVS points; a lower Acc value indicates more accurately positioned points in the point cloud. Completeness (Comp) is measured as the distance from the reference to the MVS reconstruction and captures how much of the surface is covered by the MVS reconstruction; a lower Comp value means that more of the surface is reconstructed. Acc and Comp are calculated with the official Matlab code provided by DTU [51]. Overall is the average of Acc and Comp and evaluates the overall reconstruction quality. The metrics used to evaluate efficiency are Parameters, Memory, and Time, which are widely adopted in previous methods [13, 53, 54].
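As a rough illustration, Acc and Comp can be approximated as mean nearest-neighbour distances between the two point clouds; the sketch below assumes small point clouds and omits the distance thresholds and observability masks applied by the official DTU evaluation code.

```python
import torch

def accuracy_completeness(pred_pts, gt_pts):
    """Simplified Acc / Comp: mean nearest-neighbour distance from the
    reconstruction to the reference (Acc) and from the reference to the
    reconstruction (Comp).

    pred_pts: (M, 3) reconstructed point cloud
    gt_pts:   (K, 3) structured-light reference point cloud
    """
    d = torch.cdist(pred_pts, gt_pts)        # (M, K) pairwise distances
    acc = d.min(dim=1).values.mean()         # reconstruction -> reference
    comp = d.min(dim=0).values.mean()        # reference -> reconstruction
    overall = (acc + comp) / 2
    return acc, comp, overall
```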

Evaluation on DTU dataset

Comparison of the models performance

We compare our DSC-MVSNet with two groups of state-of-the-art methods: traditional MVS methods, e.g. Camp [55], Furu [8], Tola [25], Gipuma [23]; and deep learning-based MVS methods, e.g. MVSNet [10], R-MVSNet [15], Fast-MVSNet [13], CVP-MVSNet [53], UCS-Net [14], DeepFusion [31], PatchmatchNet [59]. Table 3 shows the results on the DTU [51] dataset. We make the following observations. Compared with both groups of methods, our method achieves state-of-the-art overall performance. For instance, DSC-MVSNet achieves significant improvements in Overall performance: 50.5% (Camp), 55.7% (Furu), 55.1% (Tola), 40.4% (Gipuma), 25.5% (MVSNet), 17.5% (R-MVSNet), 7.0% (Fast-MVSNet), 2.1% (CVP-MVSNet), 19.8% (DeepFusion), and 2.3% (PatchmatchNet). This indicates that our model can reconstruct a sufficient number of surfaces and that the spatial locations of the points on these surfaces are accurate. On the more challenging Acc metric, our method also achieves notable gains over state-of-the-art methods, reaching an Acc of 0.316. Although Gipuma [23] has the best Acc, its Comp is much worse than that of our method (0.873 vs 0.372). Compared with deep learning-based methods, our method achieves results comparable to CVP-MVSNet [53] (0.316 vs 0.296), which shows that our network accurately estimates the position of each reconstructed point. Our DSC-MVSNet is also comparable to or better than SOTA methods in terms of Comp. Although PatchmatchNet [59] has the lowest Comp, its Acc and Overall are worse than those of our method (0.427 vs 0.316; 0.352 vs 0.344), which indicates that our method can reconstruct more of the target surfaces while keeping Comp low. Thus, these results demonstrate that our method has better or comparable performance relative to the majority of state-of-the-art methods.

Figure 4 shows the qualitative comparison (Scan 77 in DTU [51]) between DSC-MVSNet and most state-of-the-art methods (Tola [25], Gipuma [23], Furu [8], Camp [55], MVSNet [10], R-MVSNet [15], P-MVSNet [11]). As shown by the colored boxes (red, yellow, green) in the figure, our DSC-MVSNet reconstructs a more complete point cloud, which is consistent with the Comp values in Table 3. We attribute the improvement in completeness to the introduction of 3DA, which alleviates the feature mismatching problem and improves the quality of the depth maps.

We further compare our DSC-MVSNet with R-MVSNet [15] on several scenes (Scan 1, Scan 75, Scan 110, Scan 114) of DTU [51], since R-MVSNet can handle large-scale scenarios for 3D model reconstruction [54]. Figure 5 shows the visualization of the reconstructed point cloud models on the DTU dataset. The comparison reveals that our DSC-MVSNet produces considerably fewer outliers than R-MVSNet, showing that DSC-MVSNet accurately estimates the position of each reconstructed point, which is consistent with the Acc values in Table 3. Furthermore, it is worth mentioning that our network occupies less memory and runs faster than R-MVSNet. We attribute these improvements to the introduction of 3D-DSC.

Fig. 5

Visualization of several scenes of the DTU dataset for R-MVSNet [15] (left) and our DSC-MVSNet (right). The point cloud results clearly show that our DSC-MVSNet achieves better reconstruction results even with far fewer parameters

Table 4 Comparison on the parameters, memory and time consumption on the evaluation DTU [51] dataset
Table 5 VRAM and time consumption of the inference on DTU [51] dataset

Comparison of the models efficiency

We compare the efficiency of different methods by reporting their model parameters, memory consumption, and runtime (some results are obtained from official reports). Table 3 and Table 4 show that our framework has fewer model parameters, lower memory consumption, and shorter runtime than most state-of-the-art deep learning methods, with very competitive performance. Although our method has a longer runtime, it uses less memory and fewer parameters (5.5 GB, 253,585). We also compare our network with various state-of-the-art methods, such as Fast-MVSNet [13], Cascade-MVSNet [12], PVA-MVSNet [53], UCS-Net [14], and \(D^{2}\)HC-RMVSNet [16]. Table 4 shows that DSC-MVSNet achieves lower or comparable efficiency results compared to SOTA methods. Memory consumption directly affects the environment required for model training. In terms of memory consumption, Fast-MVSNet and Cascade-MVSNet achieve the lowest memory among SOTA methods; our method has similar memory consumption (5.3 GB vs 5.5 GB) while reducing the parameters by 72% compared to Cascade-MVSNet and by 44% compared to Fast-MVSNet. Although PVA-MVSNet and \(D^{2}\)HC-RMVSNet are similar to DSC-MVSNet in terms of model parameters, we reduce the memory consumption by 68% compared to PVA-MVSNet [53] and achieve a much faster runtime than \(D^{2}\)HC-RMVSNet [16] (5.5 GB vs 17.3 GB; 0.74 s vs 29.15 s). Similarly, UCS-Net [14] is comparable to our method in terms of memory and time, but we reduce the parameters by 73% compared to UCS-Net [14] on the DTU [51] dataset. In conclusion, our proposed method has better or comparable efficiency to most state-of-the-art methods.

We then discuss the memory and time consumption of the inference phase. The size of the inputs is \( H \times W = 1600 \times 1152 \), and the number of hypothetical depth planes is set to \( {\textbf{D}}=96 \). Table 5 shows the results of inference on the DTU [51] dataset w.r.t. the number of source images. It demonstrates that the memory occupied by inference and the inference time increase linearly with the number of source images.

Table 6 Ablation study on the DTU evaluation dataset [51], which demonstrates the effectiveness of different modules of our method, where model parameters, memory, and time are recorded during training

Ablation experiments

The ablation experiments are also conducted on the DTU dataset to illustrate our method’s efficiency and effectiveness. The network with only the 3D UNet-shape network for cost volume regularization is taken as the baseline for the ablation experiments. The results are shown in Table 6.

Effectiveness of DSC: Our main novel contribution is to explore the feasibility of 3D depthwise separable convolution as a cost volume regularization scheme in the MVS domain. As shown in Table 6, comparing Row 2 (Baseline + 3D CNNs) and Row 3 (Baseline + DSC), we observe that replacing the 3D CNNs with 3D-DSC in the 3D UNet does not cause a sharp decline in model performance, e.g. Acc changes from 0.391 to 0.398. Meanwhile, our model greatly reduces the number of parameters, the memory consumption, and the time. Therefore, it is feasible to use 3D-DSC in the MVS domain. Based on this observation, we believe the regularization scheme we designed for the cost volume plays a key role in the model. We divide 3D-DSC into a 3D pointwise convolution and a 3D depthwise convolution, which perceive multi-dimensional cost information and aggregate it over the depth and spatial dimensions. This mechanism is similar to the 3D CNN-based mechanism (as shown in Fig. 2b and d), so our model still maintains impressive performance, which proves the feasibility of using 3D-DSC in the MVS domain.

Effectiveness of DSC 3D UNet: As shown in Table 6, compared to the baseline (+3D CNNs), the baseline using the DSC 3D UNet effectively reduces the model parameters, memory consumption, and training time, while the Acc, Comp, and Overall are largely maintained. This means that a significant reduction in parameters without much accuracy loss can be achieved using the 3D depthwise separable convolution.

Effectiveness of 3D-Attention module: As shown in Table 6, the Acc, Comp, and Overall metrics are all improved with only a slight increase in computation and memory consumption when the 3D-Attention module is added to baseline + DSC. This means that adding the attention layer is effective and helps improve the information aggregation of our proposed separable convolution.

Regarding the problem of similar confidences mentioned in Sect. “3D-attention module (3DA)”, we discuss the effectiveness of the 3DA module in alleviating it. Figure 7 shows the confidence line charts for different depths at a spatial location with 3DA (red line chart) and without 3DA (blue line chart). The charts show that, without 3DA, the confidence of the GT depth is very similar to the confidence of the wrong depth, which leads to incorrect depth estimates when the predicted depth value (the blue dashed line) is computed via Eq. (8), yielding depth values far from the GT depth. After adding the 3DA module, the red line chart shows that the confidence of the GT depth is enhanced and the confidence of the wrong depth is weakened, so the predicted depth value (the red dashed line) is close to the GT depth. This is also reflected in the higher accuracy of the ablation with baseline + DSC + 3DA in Table 6.

Fig. 6

Ablation reconstruction results for scan118 of the DTU dataset [51]. Two important parts, the top (red) and the bottom (red), are highlighted. The point cloud results show the effectiveness of each module

Effectiveness of Informative Feature Extraction Network: As shown in Table 6, our baseline + DSC combined with the Informative Feature Extraction Network achieves better performance with only a small increase in the number of model parameters, memory, and time.

Effectiveness of Feature Transfer Module: We use the Feature Transfer Module in baseline + DSC to upsample the LR depth map. Table 6 shows that the FTM further improves the performance of our network with only a small increase in model parameters, memory, and time.

The ablation reconstruction results for scan 118 of DTU [51] when adding the different modules of our method are shown in Fig. 6. As shown by the areas marked with rectangles in Fig. 6, our baseline achieves higher completeness and richer detail when combined with the different modules.

Table 7 Generalization results on the Tanks & Temples benchmark [9]

Generalization on TnT dataset

The Tanks & Temples (TnT) dataset [9] is widely used as a benchmark in previous methods [10, 12, 13, 15, 31, 32]. Therefore, to evaluate the generalization of our DSC-MVSNet, we test on TnT and evaluate the results by uploading the point clouds to the official website. We use the best model trained on DTU without fine-tuning to evaluate the TnT dataset [9], with 5 adjacent images at a resolution of \( 1920 \times 1080 \) as the input. The number of depth hypothesis planes is set to \( {\textbf{D}}=128\).

Fig. 7

Illustration of the similar confidences for an example from scan 77. On the top, we show an RGB reference image and an RGB source image. The red point in the right image is the matching point, and the green point is the mismatched point. On the bottom, we show the corresponding confidence line charts for the two examples with 3DA (red line chart) and without 3DA (blue line chart). The red dashed line represents the predicted depth value of the red line chart, and the blue dashed line represents the predicted depth value of the blue line chart

Fig. 8

Error Visualization of Francis, Horse and Playground in the Tanks and Temples intermediate dataset [9], compared with PatchmatchNet [59]

As shown in Table 7, our model exhibits comparable results with lower consumption. Compared to traditional multi-view stereo methods (Colmap, Pix4D, OpenMVG+OpenMVS), our DSC-MVSNet obtains better reconstruction scores on all scenes. Besides, our DSC-MVSNet outperforms all listed learning-based MVS methods with a mean F-score of 53.48 on the Tanks and Temples intermediate set [9], and it achieves generalization performance comparable to the state-of-the-art methods, e.g. DSC-MVSNet achieves the highest accuracy on several scenes, i.e. Family, Lighthouse, M60, Panther, and Train. Figure 8 shows the error visualization computed against the corresponding ground truth point clouds. Our DSC-MVSNet significantly improves the precision of the reconstructions compared to the recent work PatchmatchNet [59]. For example, as shown in the red boxes in Fig. 8, PatchmatchNet has more incorrect points and noise, whereas our method obtains more accurate point positions while reducing noise, which benefits from our proposed 3DA and FTM modules.

Limitation analysis

Although our model exhibits better or comparable performance than most state-of-the-art methods on the two benchmarks [9, 51], it still has some limitations. (1) For complex environmental factors that have never been observed before (e.g. lighting conditions, reflection conditions), the accuracy of the reconstruction is still limited. Therefore, we plan to improve the generalization ability of the model in future work. (2) Since we use several images as input, our model still consumes more memory than the best method, as shown in Table 4. This motivates us to explore high-quality reconstruction with a limited number of input images.

Conclusion

Our proposed DSC-MVSNet is a novel coarse-to-fine, end-to-end framework for efficient and accurate depth estimation in MVS. First, we use depthwise separable convolution to construct our attention-aware 3D UNet-shaped network for cost volume regularization with fewer parameters and lower memory cost. Additionally, we introduce a 3D-Attention module that focuses on the more critical information and alleviates the feature mismatching problem. Furthermore, we propose an efficient and effective Feature Transfer Module to upsample the LR depth map. The experimental results verify the effectiveness and efficiency of our method.