Abstract
We propose deep depth from focal stack (DDFS), which takes a focal stack as input of a neural network for estimating scene depth. Defocus blur is a useful cue for depth estimation. However, the size of the blur depends on not only scene depth but also camera settings such as focus distance, focal length, and f-number. Current learning-based methods without any defocus models cannot estimate a correct depth map if camera settings differ between training and test times. Our method takes a plane sweep volume as input to impose a constraint between scene depth, defocus images, and camera settings, and this intermediate representation enables depth estimation with different camera settings at training and test times. This camera-setting invariance enhances the applicability of DDFS. The experimental results also indicate that our method is robust against a synthetic-to-real domain gap.
1 Introduction
In computer vision, depth estimation from two-dimensional (2D) images is an important task used for many applications such as VR, AR, and autonomous driving. Defocus blur is a useful cue for such depth estimation because the size of the blur depends on scene depth. Depth from focus/defocus (DfFD) takes defocus images as input for depth estimation. Typical inputs for DfFD are stacked images, i.e., a focal stack, each image of which is captured with a different focus distance. In this paper, we use the term depth from focal stack to specify a method taking a focal stack as input.
DfFD methods are roughly divided into two categories, model-based and learning-based. Model-based methods use a thin-lens model for modeling defocus blurs (Suwajanakorn et al., 2015; Kim et al., 2016; Tang et al., 2017) or define focus measures (Pertuz et al., 2013; Surh et al., 2017) to estimate scene depth. One of the drawbacks of such methods is difficulty in estimating scene depth on textureless surfaces. Learning-based methods have been proposed to tackle the above drawback (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). For example, Hazirbas et al. (2018) proposed a convolutional neural network (CNN) taking a focal stack as input without any explicit defocus model. This is an end-to-end method that allows efficient depth estimation. It also enables the depth estimation of textureless surfaces with learned semantic cues.
General learning-based methods often have limited generalization due to a domain gap between training and test data. Learning-based DfFD methods suffer from differences in the capture settings of a camera at training and test times (Ceruso et al., 2021). The amount of a defocus blur depends on not only scene depth but also camera settings such as focus distance, focal length, and f-number. Different depths and camera settings can generate defocus images with the same appearance; thus, this difference cannot be compensated for with commonly used domain adaptation methods such as neural style transfer (Li et al., 2017; Zhu et al., 2017). If camera settings differ at training and test times, the estimated depth has some ambiguity, which is similar to the scale-ambiguity in monocular depth estimation (Hu et al., 2019). Current learning-based DfFD methods (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021) do not take into account the latent defocus model; thus, the estimated depth is not correct if the camera settings at test time differ from those at training time, as shown in Fig. 1c. On the other hand, this problem does not matter for model-based methods with explicit defocus models under given camera settings.
We propose learning-based DfFD with a lens defocus model. Our method also takes a focal stack as input, i.e., deep depth from focal stack (DDFS). Our method is inspired by recent learning-based multi-view stereo (MVS) (Wang & Shen, 2018), where a cost volume is constructed on the basis of a plane sweep volume (Collins, 1996). The proposed method also constructs a cost volume, which is passed through a CNN to estimate scene depth. Each defocus image in a focal stack is deblurred at each sampled depth in the plane sweep volume, then the consistency is evaluated between the deblurred images. We found that scene depth is effectively learned from the cost volume in DDFS. Our method has several advantages over other learning-based methods that directly take a focal stack as input without an explicit defocus model (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). First, the output depth satisfies the defocus model because the cost volume imposes an explicit constraint among the scene depth, defocus images, and camera settings. Second, the camera settings, such as focus distances and f-number, are absorbed into the cost volume as an intermediate representation. This enables depth estimation with different camera settings at training and test times, as shown in Fig. 1d.
The primary contributions of this paper are summarized as follows:

We combine a learning framework and model-based DfFD through a plane sweep volume for camera-setting invariance.

Our method with camera-setting invariance can be applied to datasets with different camera settings at training and test times, which improves the applicability of DDFS.
2 Related Work
Depth from focus/defocus Depth from focus/defocus (DfFD) estimates scene depth from focus or defocus cues in captured images and is a major task in computer vision. In general, depth from focus takes many images captured with different focus distances and determines scene depth from the image with the best focus. On the other hand, depth from defocus aims to estimate scene depth from a small number of images, which do not necessarily need to include focused images (Xiong & Shafer, 1993). Recently, learning-based depth estimation from a focal stack implicitly uses both focus and defocus cues; thus, we use the unified terminology DDFS.
Traditional DfFD methods propose focus measures to evaluate the amount of a defocus blur (Zhuo & Sim, 2011; Pertuz et al., 2013; Moeller et al., 2015; Surh et al., 2017). If we have a focal stack as input, we can simply refer to the image with noticeable edges and its focus distance. Other methods formulate the amount of defocus blur with a lens defocus model and solve an optimization problem to obtain a depth map together with an all-in-focus image (Suwajanakorn et al., 2015; Kim et al., 2016). We refer to these methods as model-based methods. One of the drawbacks of such methods is difficulty in estimating scene depth on textureless surfaces. Learning-based methods have been proposed to tackle these issues (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). These methods enable depth estimation at textureless surfaces, and the depth estimation is achieved efficiently in an end-to-end manner. Other learning-based methods have leveraged defocus cues as additional information (Anwar et al., 2017; Carvalho et al., 2018) or supervision (Srinivasan et al., 2018; Gur & Wolf, 2019) for monocular depth estimation.
However, current DDFS methods, which directly take a focal stack as input, do not take into account the latent defocus model (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). For example, Hazirbas et al. (2018) proposed a CNN that directly takes a focal stack as input. Maximov et al. (2020) and Wang et al. (2021) simply used focus distances as intermediate inputs of neural networks. Yang et al. (2022) proposed a CNN that outputs a focus probability volume, which is multiplied by the focus distances for depth estimation. These methods require the same camera settings at training and test times to obtain a correct depth map due to the lack of explicit defocus models. This characteristic reduces the applicability of DDFS. On the other hand, our method is a combination of model-based and learning-based methods through a cost volume, which is computed with a lens defocus model, allowing depth estimation with camera-setting invariance.
Learning from cost volume Learning from a cost volume is effective in many applications. A cost volume is constructed by sampling the solution space and evaluating costs at each sampled point. Examples of learning-based methods with a cost volume are optical flow (Ilg et al., 2017; Sun et al., 2018) and disparity (Mayer et al., 2016; Kendall et al., 2017) estimation. Learning-based MVS methods (Wang & Shen, 2018; Yao et al., 2018; Long et al., 2021; Duzceker et al., 2021) are also major examples, where a cost volume is constructed on the basis of a plane sweep volume (Collins, 1996). Our method also constructs a plane sweep volume and evaluates consistency between the defocus images in an input focal stack. We found that learning from a cost volume is also effective for DDFS.
3 Deep Depth from Focal Stack
Our method combines a learning framework and model-based DfFD through a cost volume for depth estimation with camera-setting invariance. We first give an overview of the proposed method, then describe the lens defocus model and the ambiguity of the estimated depth in DfFD, followed by details of the cost volume construction. This cost volume as an intermediate representation enables depth estimation with different camera settings at training and test times. The network architecture and loss function are discussed at the end of this section.
3.1 Overview
Figure 2 shows an overview of the proposed method. Our method is inspired by recent learning-based MVS (Wang & Shen, 2018), where a cost volume is constructed on the basis of a plane sweep volume (Collins, 1996). Our cost volume is constructed from an input focal stack by evaluating deblurred images at each depth hypothesis. This intermediate representation absorbs the difference in camera settings. The computed cost volume and an additional defocus image are passed through a CNN with an encoder-decoder architecture. At the decoder part, the cost volume is gradually upsampled for coarse-to-fine estimation. Output depth maps are obtained by applying a differentiable soft argmin operator (Kendall et al., 2017) to the intermediate refined cost volumes. Each upsample block includes a cost aggregation module for learning local structures adaptively.
3.2 Lens Defocus Model
Our cost volume construction is based on a lens defocus model, with which the size of a defocus blur is formulated as a circle of confusion (CoC) (Zhuo & Sim, 2011), as shown in Fig. 3. Let d and \(d_f\) be the scene depth and focus distance of a camera, respectively. The CoC can be computed as
$$\begin{aligned} c = b\,\frac{f^2}{N(d_f - f)}\,\frac{|d - d_f|}{d}, \end{aligned}$$
(1)
where f is the focal length of the lens and N is the f-number. b [px/m] converts the unit of the CoC from [m] to [px]. When d is equal to \(d_f\), the light rays from the scene point converge on the image plane; otherwise, a defocus blur arises whose size is the diameter of the CoC. The blurred image can be computed as a convolution of an all-in-focus image with the point spread function (PSF), the kernel size of which corresponds to the size of the CoC.
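For concreteness, the thin-lens CoC computation and the corresponding disk-shaped PSF can be sketched as follows. This is a minimal NumPy sketch; the function names `coc_px` and `disk_psf` are our own, and the constant factors follow the standard thin-lens model:

```python
import numpy as np

def coc_px(d, d_f, f, N, b):
    """Diameter of the circle of confusion in pixels (thin-lens model).

    d   : scene depth [m]
    d_f : focus distance [m]
    f   : focal length [m]
    N   : f-number
    b   : [px/m] conversion from meters to pixels on the sensor
    """
    return b * (f ** 2 / (N * (d_f - f))) * abs(d - d_f) / d

def disk_psf(diameter_px):
    """Normalized disk-shaped PSF whose support matches the CoC diameter."""
    r = max(diameter_px / 2.0, 0.5)
    size = int(2 * np.ceil(r) + 1)
    y, x = np.mgrid[:size, :size] - size // 2
    k = (x ** 2 + y ** 2 <= r ** 2).astype(float)
    return k / k.sum()
```

A scene point exactly at the focus distance yields a zero-diameter CoC, and the CoC grows as the point moves away from the focal plane.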
The CoC can be computed from the scene depth d and the camera settings f, \(d_f\), N, and b in Eq. (1). Note that these parameters can easily be extracted from EXIF properties (Maximov et al., 2020) or calibrated beforehand (Tang et al., 2017), and state-of-the-art methods assume these parameters are known (Maximov et al., 2020; Wang et al., 2021); thus, this paper also follows the same assumption. Our method realizes depth estimation with camera-setting invariance with respect to these parameters, and this improves the applicability of DDFS because our method can be applied to datasets with different camera settings at training and test times.
In DfFD, the number of unknown parameters is two, i.e., an all-in-focus image and a depth map; thus, if we have two images captured with different camera settings and these parameters are known, we can solve the problem by using defocus cues. However, if there exist unknown camera parameters, the estimated depth has ambiguity. Now, we discuss two ambiguities in DfFD due to the camera settings. The first one is scale-ambiguity. From Eq. (1), the following relationship holds:
$$\begin{aligned} c(d; f, d_f, N, b) = c(d^*; f^*, d_f^*, N, b^*), \end{aligned}$$
(2)
where \(d^* = \sigma d\), \(d_f^* = \sigma d_f\), \(f^* = \sigma f\), and \(b^* = b/\sigma \) for any \(\sigma > 0\). This means that scaled camera settings and depth give the same CoC as that of the original ones.
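This scale relation can be verified numerically. The sketch below assumes the standard thin-lens CoC form; `coc` is our own illustrative helper:

```python
import numpy as np

def coc(d, d_f, f, N, b):
    # Thin-lens CoC diameter (assumed form consistent with the scale relation).
    return b * f ** 2 / (N * (d_f - f)) * abs(d - d_f) / d

sigma = 2.5
d, d_f, f, N, b = 1.2, 1.0, 0.05, 2.0, 4000.0
c_orig = coc(d, d_f, f, N, b)
c_scaled = coc(sigma * d, sigma * d_f, sigma * f, N, b / sigma)
# The scaled scene and camera are indistinguishable from the original.
assert np.isclose(c_orig, c_scaled)
```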
The other ambiguity is affine-ambiguity. From Eq. (1), we can obtain
$$\begin{aligned} c = \left| A(f, d_f, N)\,\frac{1}{d} + B(f, d_f, N) \right| , \end{aligned}$$
(3)
where \(A(f, d_f, N)\) and \(B(f, d_f, N)\) are constants. Thus, different camera settings and inverse depths can give the same CoC as follows:
$$\begin{aligned} A\,\frac{1}{d} + B = A'\,\frac{1}{d'} + B', \quad \frac{1}{d'} = \frac{A}{A'}\,\frac{1}{d} + \frac{B - B'}{A'}, \end{aligned}$$
(4)
where \(A'\) and \(B'\) are the constants of the other camera settings. This means that the estimated inverse depth has affine-ambiguity (a similar discussion can be found in a previous study (Garg et al., 2019)). In the experiments, we evaluate the proposed method with respect to the scale-ambiguity in the depth space and the affine-ambiguity in the inverse depth space.
3.3 Cost Volume
The proposed method computes a cost volume from the focal stack as the input of a CNN to impose a constraint between the defocus images and scene depth. This has several advantages over current learning-based methods that directly take a focal stack as input (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). First, the output depth satisfies the lens defocus model because the cost volume imposes an explicit constraint between the defocus images and scene depth. Second, the camera settings are absorbed into the cost volume. This enables inference with camera settings that differ from those at training time, and even in this case, the output depth satisfies the lens defocus model without any ambiguities.
Figure 4 shows a diagram of our cost volume construction. We first sample the 3D space in the camera coordinate system by sweeping a fronto-parallel plane. To evaluate each depth hypothesis, we deblur each image in the input focal stack. Let the cost volume be \(C:\{1,\cdots ,W\}\times \{1,\cdots ,H\}\times \{ 1,\cdots ,D\} \rightarrow {\mathbb {R}}\), and the focal stack be \(\{ I_{d_i}\}_{i=1}^{F}\), where \(I_{d_i}\) is a captured image with focus distance \(d_i\). Each element of the cost volume C is computed as follows:
$$\begin{aligned} C(u, v, d) = \rho \left( (k(d, d_1) *^{-1} I_{d_1})(u, v), \cdots , (k(d, d_F) *^{-1} I_{d_F})(u, v) \right) , \end{aligned}$$
(5)
where \(k(d,d_i)\) is a blur kernel, the size of which is defined by Eq. (1) with the scene depth d and focus distance \(d_i\). We used a disk-shaped PSF (Watanabe & Nayar, 1998; Shi et al., 2015), while any type of PSF can be used at training and test times. The operator \(*^{-1}\) indicates a deblurring process applied to each color channel of the input image. We used Wiener–Hunt deconvolution (Orieux et al., 2010) for this process. The function \(\rho \) evaluates the consistency between the deblurred images. We adopt the standard deviation for \(\rho \), which allows an arbitrary number of inputs.
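The cost volume construction can be sketched as follows. This is a simplified sketch: we substitute a plain frequency-domain Wiener filter for the Wiener–Hunt deconvolution used in the paper, and `kernel_fn` stands in for the PSF whose size is defined by the lens defocus model:

```python
import numpy as np

def psf_otf(psf, shape):
    """Embed a small PSF into a full-size array and shift its center to the
    origin so that FFT-based filtering introduces no phase offset."""
    out = np.zeros(shape)
    h, w = psf.shape
    out[:h, :w] = psf
    return np.fft.fft2(np.roll(out, (-(h // 2), -(w // 2)), axis=(0, 1)))

def wiener_deblur(img, psf, snr=100.0):
    """Frequency-domain Wiener deconvolution (a simple stand-in for the
    Wiener-Hunt deconvolution used in the paper)."""
    H = psf_otf(psf, img.shape)
    W = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(W * np.fft.fft2(img)))

def cost_volume(stack, focus_dists, depth_samples, kernel_fn):
    """C(u, v, d): per-pixel standard deviation across the images deblurred
    under each depth hypothesis, following the construction described above."""
    vols = []
    for d in depth_samples:
        deblurred = [wiener_deblur(img, kernel_fn(d, d_i))
                     for img, d_i in zip(stack, focus_dists)]
        vols.append(np.std(np.stack(deblurred, axis=0), axis=0))
    return np.stack(vols, axis=-1)  # shape (H, W, D)
```

At the true depth, all deblurred images should agree, so the standard deviation (cost) is low; at wrong hypotheses, the deblurring is inconsistent across focus distances and the cost rises.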
The process mentioned above is the essential part of our cost volume construction. However, differing from a learning-based MVS method (Wang & Shen, 2018), which is based on differentiable image warping, our cost volume construction requires careful design because the difference between images due to focus distances is smaller than that due to camera positions in the MVS setup. Thus, for robustness and learning stability, the standard deviation in Eq. (5) is computed considering neighboring pixels as follows:
where \({\mathcal {N}}(u,v)\) is a set of neighboring pixels centered at (u, v) and \(\gamma _{u',v'}\) is a 2D spatial Gaussian weight. Figure 5 shows an example of the depth estimated only from the index of the minimum cost. In (b)–(d), we computed costs with different standard deviations (SD) of the 2D spatial Gaussian weights. The neighboring information can reduce noise, especially for the real captured data, while a large standard deviation causes over-smoothing. In our experiments, we set the standard deviation to 1.0.
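One plausible implementation of the neighborhood-weighted standard deviation applies a Gaussian spatial filter to the first and second moments across the deblurred images. This is an assumption on our part; the paper's exact weighting may differ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def neighborhood_std(deblurred, sd=1.0):
    """Std-dev across deblurred images, pooled over a Gaussian-weighted
    spatial neighborhood (one plausible reading of the construction).

    deblurred : list of (H, W) arrays, one per focal-stack image
    sd        : standard deviation of the 2D spatial Gaussian weight
    """
    x = np.stack(deblurred, axis=0)                   # (F, H, W)
    m1 = gaussian_filter(x.mean(axis=0), sd)          # spatially pooled mean
    m2 = gaussian_filter((x ** 2).mean(axis=0), sd)   # pooled second moment
    return np.sqrt(np.maximum(m2 - m1 ** 2, 0.0))
```

Increasing `sd` pools over a larger neighborhood, which suppresses noise at the cost of over-smoothing, matching the trend reported above.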
We also remove outliers by applying a nonlinear function \(f(\cdot )\) that bounds the cost by 1 after computing Eq. (5). We use a \(\tanh \)-like function as follows:
where \(C_{max}\) is the upper bound of the cost. f(x) converges to \(f_1\) as x approaches \(C_{max}\). In our experiments, \(C_{max} \ge 0.3\) with \(f_1=0.999\) gave effective results; thus, we set \(C_{max} = 0.3\) and \(f_1=0.999\). Finally, the cost f(C(u, v, d)) at each pixel is normalized to [0, 1]. As shown in Fig. 6, the outlier removal (OR) and normalization (Norm.) produce a sharp peak at the ground-truth depth. However, this normalization includes the possibility that such sharp peaks also appear at textureless pixels, where defocus cues are not effective, and thus may have negative effects on training. Nevertheless, we found that our network automatically learns effective regions and dramatically improves the accuracy of the estimated depth. We describe the ablation study on this in Sect. 4.4.
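The outlier bounding and per-pixel normalization can be sketched as follows. Note that the exact saturating function is an assumption: we use \(f(x)=\tanh (\textrm{atanh}(f_1)\,x/C_{max})\), which is bounded by 1 and reaches \(f_1\) at \(x = C_{max}\):

```python
import numpy as np

def bound_and_normalize(C, c_max=0.3, f1=0.999):
    """Saturate costs with a tanh-like function, then normalize the cost
    profile of each pixel to [0, 1] along the depth axis (last axis).

    The saturating function here is our own assumption; any bounded
    tanh-like function that reaches f1 at c_max fits the description.
    """
    f = np.tanh(np.arctanh(f1) * C / c_max)          # bounded by 1
    lo = f.min(axis=-1, keepdims=True)
    hi = f.max(axis=-1, keepdims=True)
    return (f - lo) / np.maximum(hi - lo, 1e-8)      # per-pixel [0, 1]
```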
Now, we discuss the difference between the design of our cost volume and those of previous cost-volume-based DfFD methods. The major design of cost volume construction is based on focus measures (Pertuz et al., 2013; Surh et al., 2017; Yang et al., 2022), where costs are computed at the focus positions in an input focal stack. Wang et al. (2021) introduced the framework of the cost volume at the output layer of the CNN. On the other hand, our cost volume, which is designed as the input of the network, is based on image deblurring, which enables us to compute costs at depths that are not contained in the input focus distances. A cost volume construction similar to ours was also proposed in model-based methods (Suwajanakorn et al., 2015; Kim et al., 2016). However, these methods require an all-in-focus image, which leads to iterative optimization for the scene depth and all-in-focus image; thus, these methods cannot be directly incorporated into sequential learning frameworks.
3.4 Architecture and Loss Function
As shown in Fig. 2, the cost volume and an additional defocus image, which helps the network learn semantic cues (Wang & Shen, 2018), are concatenated and passed through the network. The input image is selected from the focal stack, and we found that the selection of the input image does not affect the performance of the proposed method. During the training of our model, we selected the image with the farthest focus distance.
The cost volume and input image are passed through the encoder, the architecture of which is the same as for MVDepthNet (Wang & Shen, 2018). The outputs of the decoder are refined cost volumes \(C_{out}^s\) at different resolutions \(s\in \{ 1/8, 1/4, 1/2, 1\}\).
At each upsample block, we implement an adaptive cost aggregation module inspired by Wang et al. (2021) to aggregate neighboring information, and this enables depth estimation with clear boundaries by aggregating focus cues at edge pixels. The cost aggregation module is given as
$$\begin{aligned} {\tilde{C}}(u, v, d) = \sum _{j} w_j \, C(u + \Delta u_j, v + \Delta v_j, d), \end{aligned}$$
(11)
where the weight \(w_j\) and offset \((\Delta u_j, \Delta v_j)\) are parameters to aggregate neighboring information, which are estimated from intermediate features in the decoder. As shown in Fig. 2, our upsample block first upsamples the input cost volume by a scale factor of 2. The feature map from the encoder is then concatenated to this upsampled cost volume. From this volume, offsets and weights for adaptive cost aggregation are estimated together with a refined cost volume. The final cost volume is obtained by aggregating the neighboring costs following Eq. (11). Figure 7 shows an example of the estimated offsets and output depth with the cost aggregation module, which yields clear boundaries in the estimated depth.
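A simplified version of this aggregation, with global integer offsets and weights instead of the per-pixel fractional ones predicted by the network, can be sketched as:

```python
import numpy as np

def aggregate_costs(C, offsets, weights):
    """Adaptive-style cost aggregation: sum_j w_j * C(u+du_j, v+dv_j, d).

    C       : (H, W, D) cost volume
    offsets : list of integer (du_j, dv_j) pairs; the paper predicts
              per-pixel fractional offsets sampled bilinearly, which we
              simplify to a global integer gather here
    weights : list of aggregation weights w_j
    """
    out = np.zeros_like(C)
    for (du, dv), w in zip(offsets, weights):
        # np.roll wraps at the borders; a real implementation would clamp
        shifted = np.roll(np.roll(C, -du, axis=0), -dv, axis=1)
        out += w * shifted
    return out
```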
The refined cost volume at each resolution is obtained through softmax layers. Thus, the output depth at each resolution can be computed by applying a differentiable soft argmin operator (Kendall et al., 2017) as follows:
$$\begin{aligned} d_s(u, v) = \sum _{i=1}^{D} d_i \, C_{out}^s(u, v, d_i). \end{aligned}$$
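The soft argmin operator can be sketched as a softmax over negated costs followed by an expectation over the depth hypotheses (a sketch; in the paper the softmax is part of the network):

```python
import numpy as np

def soft_argmin_depth(C, depth_samples):
    """Differentiable soft argmin (Kendall et al., 2017): softmax over the
    negated cost, then the expected depth over the hypotheses.

    C             : (H, W, D) cost volume (low cost = likely depth)
    depth_samples : (D,) sampled depth hypotheses
    """
    z = -C
    p = np.exp(z - z.max(axis=-1, keepdims=True))    # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return (p * depth_samples).sum(axis=-1)          # (H, W) depth map
```

Unlike a hard argmin, this expectation is differentiable and can interpolate between the sampled depth values.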
Training loss The training loss is defined as the sum of the L1 losses between the estimated depth maps \(d_s\) and ground-truth depth maps \(d_s^*\) at different resolutions as follows:
$$\begin{aligned} {\mathcal {L}} = \sum _{s} \left\| d_s - d_s^* \right\| _1. \end{aligned}$$
4 Experiments
We evaluated the proposed method for its camera-setting invariance and compared it with state-of-the-art DDFS methods. Our method can be applied to datasets with camera settings that differ from those of the training dataset.
4.1 Implementation
Our network was implemented in PyTorch. The training was done on an NVIDIA RTX 3090 GPU with 24 GB of memory. The size of a mini-batch was 8 for the training of our model. We trained our network from scratch, and the optimizer was Adam (Kingma & Ba, 2015) with a learning rate of \(1.0\times 10^{-4}\).
During the cost volume construction, we uniformly sampled the depth between 0.1 and 3 and set the number of samples to \(D=64\). In our experiments, there was no significant accuracy difference between sampling in the depth and inverse depth spaces.
4.2 Dataset
This section describes the datasets for training and evaluation. We used three datasets with metadata of full camera settings.
DefocusNet dataset (Maximov et al., 2020) This dataset consists of synthetic images, which were generated with physics-based rendering shaders on Blender. The released subset of this dataset has 400 and 100 samples for training and evaluation, respectively. The focal stack of each sample has five images with \(256\times 256\) resolution. Note that all models were trained only on this synthetic dataset unless otherwise noted.
NYU Depth V2 (Silberman et al., 2012) synthetically blurred by Carvalho et al. (2018) Carvalho et al. (2018) generated this dataset by adding synthetic blurs to the NYU Depth V2 dataset (Silberman et al., 2012), which consists of pairs of RGB and depth images. The defocus model was based on Eq. (1) and takes into account object occlusions. The official training and test splits of the NYU Depth V2 dataset are 795 and 654 samples. We extracted \(256\times 256\) patches from the original \(640\times 480\) images and finally obtained 9540 and 7848 samples for training and evaluation, respectively. As with Maximov et al. (2020), we rescaled the depth range from [0, 10] to [0, 3]. Table 1 lists the camera settings of the DefocusNet dataset (Maximov et al., 2020) and this NYU Depth V2 dataset (Carvalho et al., 2018). Note that the camera settings of the NYU Depth V2 dataset are rescaled by 0.3 to match the rescaled depth range. Figure 8 shows the plots of CoCs at sampled depths with different focus distances for each dataset.
Mobile Depth (Suwajanakorn et al., 2015) This dataset consists of real focal stacks captured with a mobile phone camera. The images in each focal stack were aligned, and the authors estimated the camera parameters and depth (i.e., there are no actual ground-truth depth maps). This dataset contains only several scenes; thus, we used it only for evaluation.
4.3 Data Augmentation
In the DefocusNet dataset, defocus cues are effective only at a short distance from the camera (Maximov et al., 2020). Therefore, we found that our cost volume learned on this dataset is effective only at small depth indices. To enhance the scalability of our cost volume, we scaled the depth maps in the DefocusNet dataset by a scale factor of \(\sigma \in \{1.0,1.5,2.0,\cdots ,9.0\}\) when we trained our model on this dataset. We should also scale the camera parameters together with the depth map, i.e., if each data sample consists of \(\{ \{I_{d_1},\cdots ,I_{d_F}\}, \{d_1, \cdots , d_F \}, f, N, d^*, b \}\), the scaled sample is \(\{ \{I_{d_1},\cdots ,I_{d_F} \}, \{\sigma d_1,\cdots , \sigma d_F \}, \sigma f, N, \sigma d^*, b/\sigma \}\). Note that in both samples, the depth and camera parameters give the same amount of defocus blur; thus, the original focal stack can be used in the scaled sample. This data augmentation is essential for applying our method to other datasets.
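This augmentation can be sketched as a simple transformation of each data sample (the dictionary keys are our own illustrative names):

```python
def scale_sample(sample, sigma):
    """Depth-scale augmentation: scale depth-like quantities by sigma and
    b by 1/sigma so that the rendered defocus blur is unchanged, meaning
    the original focal-stack images can be reused as-is."""
    return {
        "stack":    sample["stack"],                       # images reused
        "focus_ds": [sigma * d for d in sample["focus_ds"]],
        "f":        sigma * sample["f"],
        "N":        sample["N"],                           # f-number unchanged
        "depth_gt": sigma * sample["depth_gt"],
        "b":        sample["b"] / sigma,
    }
```

Because the CoC is invariant under this transformation, the scaled sample is physically consistent without re-rendering any image.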
4.4 Ablation Study
Table 2 lists the results of the ablation study on the cost volume construction. We separately computed the RMSE of the predicted depth on the DefocusNet dataset with different scale factors of the data augmentation. The best values are in bold. The experimental results demonstrate that normalization dramatically improved the accuracy of depth estimation. Outlier removal also improved the accuracy under a large \(\sigma \), i.e., large depth-scale scenes captured with large focus distances. Figure 8 shows that CoCs diverge immediately at small depths with large focus distances. This leads to artifacts in the images deblurred with the Wiener filter, which causes outliers in the cost volume. Figure 9 shows an example of costs with \(\sigma =1.0\) and \(\sigma =5.0\), where outlier removal is needed to produce a sharp peak when \(\sigma =5.0\).
Figure 10 shows the comparison with respect to the number of depth samples D of the cost volume. Note that we kept the D of the output cost volume before the softmax layer at 64 in this experiment. We computed the RMSE of the predicted depth on the DefocusNet dataset with scale factors \(\sigma =1.0, 5.0, 9.0\). \(D\ge 16\) provided effective results on this dataset.
4.5 Evaluation on Different Camera Settings
We then evaluated the performance of depth estimation with different camera settings at training and test times. Table 3 lists the experimental results on the DefocusNet dataset. DefocusNet (Maximov et al., 2020), a state-of-the-art DDFS method, was compared with our method. We first decomposed each focal stack into two subsets, one with focus distances \(\{0.1,0.3,1.5\}\) and the other with \(\{0.15,0.7\}\). Both methods were trained only on the subset with focus distances \(\{0.1,0.3,1.5\}\) and evaluated on the other subset with different focus distances. The best value is in bold. Our method outperformed DefocusNet, demonstrating its camera-setting invariance.
We also evaluated the proposed method on the NYU Depth V2 dataset, which has different scene statistics and different camera settings from the DefocusNet dataset, as shown in Table 1. Table 4 and Fig. 12 show the experimental results comparing the proposed method with other state-of-the-art learning-based methods, i.e., DDFF (Hazirbas et al., 2018), AiFDepthNet (Wang et al., 2021), DFV-DFF (Yang et al., 2022), and DefocusNet (Maximov et al., 2020). For AiFDepthNet, we used the authors' model trained in a supervised manner, and the other methods were retrained on the DefocusNet dataset. The parameters of DDFF were initialized with VGG16 (Simonyan & Zisserman, 2015) as in the original paper (Hazirbas et al., 2018). For each method, we provide the training epochs except for AiFDepthNet because we could not obtain this information from their code and paper. Figure 11 shows the RMSE on the NYU Depth V2 dataset during training on the DefocusNet dataset for DefocusNet and our method. We also tested VDFF (Moeller et al., 2015), which is a model-based method. Note that VDFF did not work on the original focal stacks because the number of defocus images in each focal stack is small; thus, we additionally synthesized two defocus images with focus distances \(\{1,6\}\) for the input of VDFF. We fit a polynomial function to convert the output focus index to a depth value. For error metrics, we used MAE, RMSE, absolute relative L1 error (Abs Rel), scale-invariant error (sc-inv) (Eigen et al., 2014), and the affine (scale and shift) invariant error in the inverse depth space denoted by ssitrim (Ranftl et al., 2020). The table consists of three subtables with different experimental settings, and the best values in each subtable are in bold.
As shown in the upper part of Table 4, our method outperformed the other methods trained on the DefocusNet dataset by large margins on most evaluation metrics and is comparable to DefocusNet on the affine-invariant error metric in the inverse depth space (ssitrim). This is because the camera settings of the DefocusNet and NYU Depth V2 datasets are different. The other methods cannot handle this difference, and their estimated depths have ambiguity. Although the model-based depth from focus method (VDFF) is not affected by the difference in camera settings and gives acceptable results, it requires more input images than the learning-based methods to obtain plausible results. The failure cases of such a model-based depth from focus method are further discussed in Fig. 15.
We also computed the errors on the depths rescaled by the median of the ratios between the output and ground-truth depths, following Maximov et al. (2020), to compensate for the scale-ambiguity. This compensation was also applied to our results for a fair comparison. The errors are presented in the middle part of the table. Our method also outperformed the other methods in this comparison. In addition, our method without scaling (Ours) still outperformed the rescaled previous methods (\(^*\)) on most evaluation metrics. Figure 12 shows examples of the estimated depths. Note that the output depths of our method were not rescaled, i.e., our method can estimate depths without any ambiguities. The bottom part of the table shows the experimental results when trained on the NYU Depth V2 dataset. Although DefocusNet performed better than our method, the accuracy of both methods improved dramatically, as shown in Fig. 12g, h, and DefocusNet is heavily affected by the difference in camera settings between the training and test datasets.
Figure 13 shows the experimental results on Mobile Depth with real focal stacks. We set the size of an input focal stack to 3 except for AiFDepthNet (Wang et al., 2021), which used 10 images per focal stack; we show the results of two AiFDepthNet models: (d) trained on the synthetically blurred FlyingThings3D dataset (Mayer et al., 2016) in a supervised manner and (e) trained on Mobile Depth in an unsupervised manner. The figure shows the qualitative comparison with the state-of-the-art learning-based methods, the output depths of which were rescaled by the median of the ratios between them and the outputs of Suwajanakorn et al. (2015) (\(^*\)), following Maximov et al. (2020). Note that the output depths of our method were not rescaled. The output depths of our method are qualitatively plausible and satisfy the defocus model under different camera settings. Figure 14 shows the quantitative errors between our method and Suwajanakorn et al. (2015) under different sizes of input focal stacks, demonstrating that a few images are enough to obtain effective results with our method.
Finally, we show an example of applying the proposed method to real focal stacks captured with our own camera, a Nikon D5300 with an f-number of 1.8. The focal stacks were captured with "Focus Stacking Simple" in digiCamControl (http://digicamcontrol.com/). All parameters required for the cost volume computation were extracted from EXIF properties, and the focal stack size was 3. Figure 15 shows the qualitative comparison with existing methods, including a model-based depth from focus method (RDF (Surh et al., 2017)) and state-of-the-art learning-based methods (DefocusNet (Maximov et al., 2020) and DFV-DFF (Yang et al., 2022)). The values of the estimated depth maps are in meters. One of the advantages of the learning-based methods is their applicability with a few input images. In this experiment, RDF did not work because the focal stack size was 3. The performance of the learning-based methods other than ours also degraded due to the difference in camera settings at training and test times. These results indicate the applicability of our method to real focal stacks and its camera-setting invariance.
4.6 Computation Time
Table 5 shows the runtime comparison. We measured the processing time for each test sample in the DefocusNet dataset (Maximov et al., 2020). The cost volume construction was done on an AMD EPYC 7232P@3.1 GHz with 128 GB of RAM, with multi-threading for the Wiener deconvolution of each image in a focal stack. The depth estimation was done on an NVIDIA RTX 3090 GPU. The number of depth samples in the cost volume is 64, and the image resolution is \(256 \times 256\). Although the cost volume construction takes a few seconds, the costs at different depth slices in our cost volume can be computed in parallel to further reduce the computation time.
4.7 Limitations
We finally discuss the limitations of the proposed method that stem from the explicit lens defocus model.
Requirement of camera parameters Our method requires full camera parameters for constructing a cost volume. Although we demonstrated the applicability to real data, this requirement makes it difficult to evaluate our method on other DfFD datasets proposed in previous studies (Hazirbas et al., 2018; Herrmann et al., 2020; Wang et al., 2021).
Shape of blur kernel Herrmann et al. (2020) pointed out several challenges of the autofocus problem in real scenes; for example, a manually designed PSF is unrealistic. Our method uses a predetermined shape of the blur kernel for constructing the cost volume. In the experiments, our method was evaluated across different blur models: the blurs in the DefocusNet dataset were physically rendered with Blender, and the blurs in the NYU depth dataset were synthesized with disk-shaped kernels. We also applied our method to real data, demonstrating the applicability to real PSFs, although the generality on real data needs to be further investigated.
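The disk-shaped kernel mentioned above can be sketched as follows. This is a minimal NumPy sketch, assuming a circle-of-confusion radius in pixels; real lens PSFs deviate from this ideal shape, which is exactly the limitation discussed here.

```python
import numpy as np

def disk_psf(coc_radius_px, size=None):
    """Disk-shaped blur kernel for a given circle-of-confusion radius.

    The kernel is 1 inside the disk and 0 outside, then normalized so
    that convolving with it preserves image brightness."""
    if size is None:
        size = 2 * int(np.ceil(coc_radius_px)) + 1
    r = (size - 1) / 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x ** 2 + y ** 2 <= coc_radius_px ** 2).astype(float)
    k /= k.sum()
    return k
```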
Dynamic scenes and focus breathing Similar to AiFDepthNet (Wang et al., 2021), our cost volume computation assumes static scenes. Focus breathing also affects our method (Herrmann et al., 2020). However, as mentioned in Wang et al. (2021), a simple alignment in preprocessing can solve this problem (in the experiments with real data (Fig. 13), we used aligned focal stacks).
Trade-off between defocus and semantic cues We finally discuss the trade-off between model-based and learning-based approaches. Table 6 and Fig. 16 show the results on the DefocusNet dataset. The other learning-based methods outperformed our method. This is because the defocus cues in the DefocusNet dataset are effective only at a short distance from the camera, as mentioned in Sect. 4.3; the other learning-based methods handle this limitation through semantic cues. In the red box region in Fig. 16, our method fails to estimate the depth of the background object. On the other hand, DefocusNet (Maximov et al., 2020) and AiFDepthNet (Wang et al., 2021) can estimate the depths of both the foreground and background objects by learning ordering cues. Although our method also learns semantic cues, it is more affected by this limitation because of its explicit lens defocus model.
5 Conclusion
We proposed DDFS with a lens defocus model, which combines a learning framework and the defocus model through the construction of a cost volume. The cost volume absorbs differences in camera settings, which allows the method to estimate scene depth from a focal stack with different camera settings at training and test times. The experimental results indicate that our model trained only on a synthetic dataset can be applied to other datasets, including real focal stacks, with different camera settings. This camera-setting invariance will enhance the applicability of DDFS.
Data Availability
The data that support the findings of this study are all publicly available in online repositories. DefocusNet (Maximov et al., 2020): https://github.com/dvl-tum/defocus-net. NYU depth v2 with synthetic blurs (Carvalho et al., 2018): https://github.com/marcelampc/d3net_depth_estimation. Mobile Depth dataset (Suwajanakorn et al., 2015): https://www.supasorn.com/dffdownload.html.
References
Anwar, S., Hayder, Z., & Porikli, F. (2017). Depth estimation and blur removal from a single out-of-focus image. In: BMVC.
Carvalho, M., Le Saux, B., Trouve-Peloux, P., Almansa, A., & Champagnat, F. (2018). Deep depth from defocus: How can defocus blur improve 3D estimation using dense neural networks? In: ECCVW. https://github.com/marcelampc/d3net_depth_estimation (GPLv3 license).
Ceruso, S., Bonaque-González, S., Oliva-García, R., & Rodríguez-Ramos, J. M. (2021). Relative multi-scale deep depth from focus. Signal Processing: Image Communication, 99, 116417.
Collins, R. T. (1996). A space-sweep approach to true multi-image matching. In: CVPR (pp. 358–363).
digiCamControl. http://digicamcontrol.com/
Duzceker, A., Galliani, S., Vogel, C., Speciale, P., Dusmanu, M., & Pollefeys, M. (2021). DeepVideoMVS: Multi-view stereo on video with recurrent spatio-temporal fusion. In: CVPR (pp. 15324–15333).
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (vol. 2, pp. 2366–2374).
Garg, R., Wadhwa, N., Ansari, S., & Barron, J. T. (2019). Learning single camera depth estimation using dual-pixels. In: ICCV (pp. 7628–7637).
Gur, S., & Wolf, L. (2019). Single image depth estimation trained via depth from defocus cues. In: CVPR (pp. 7683–7692).
Hazirbas, C., Soyer, S. G., Staab, M. C., Leal-Taixé, L., & Cremers, D. (2018). Deep depth from focus. In: ACCV. https://github.com/soyers/ddff-pytorch (GNU General Public License v3.0).
Herrmann, C., Bowen, R. S., Wadhwa, N., Garg, R., He, Q., Barron, J. T., & Zabih, R. (2020). Learning to autofocus. In: CVPR.
Hu, J., Ozay, M., Zhang, Y., & Okatani, T. (2019). Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1043–1051).
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (pp. 2462–2470).
Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., & Bry, A. (2017). End-to-end learning of geometry and context for deep stereo regression. In: ICCV.
Kim, H., Richardt, C., & Theobalt, C. (2016). Video depth-from-defocus. In: International Conference on 3D Vision (3DV) (pp. 370–379).
Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In: ICLR.
Li, Y., Wang, N., Liu, J., & Hou, X. (2017). Demystifying neural style transfer. In: IJCAI (pp. 2230–2236).
Long, X., Liu, L., Li, W., Theobalt, C., & Wang, W. (2021). Multi-view depth estimation using epipolar spatio-temporal networks. In: CVPR (pp. 8258–8267).
Maximov, M., Galim, K., & Leal-Taixé, L. (2020). Focus on defocus: Bridging the synthetic to real domain gap for depth estimation. In: CVPR (pp. 1071–1080). https://github.com/dvl-tum/defocus-net (MIT License).
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (pp. 4040–4048).
Moeller, M., Benning, M., Schönlieb, C., & Cremers, D. (2015). Variational depth from focus reconstruction. IEEE Transactions on Image Processing, 24(12), 5369–5378.
Orieux, F., Giovannelli, J.-F., & Rodet, T. (2010). Bayesian estimation of regularization and point spread function parameters for Wiener-Hunt deconvolution. Journal of the Optical Society of America A, 27(7), 1593–1607.
Pertuz, S., Puig, D., & Garcia, M. A. (2013). Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 46(5), 1415–1432.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI.
Shi, J., Tao, X., Xu, L., & Jia, J. (2015). Break Ames room illusion: Depth from general single images. ACM TOG, 34(6), 1–11.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGB-D images. In: ECCV (pp. 746–760).
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In: ICLR.
Srinivasan, P. P., Garg, R., Wadhwa, N., Ng, R., & Barron, J. T. (2018). Aperture supervision for monocular depth estimation. In: CVPR (pp. 6393–6401).
Sun, D., Yang, X., Liu, M.-Y., & Kautz, J. (2018). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR (pp. 8934–8943).
Surh, J., Jeon, H.-G., Park, Y., Im, S., Ha, H., & Kweon, I. S. (2017). Noise robust depth from focus using a ring difference filter. In: CVPR (pp. 6328–6337).
Suwajanakorn, S., Hernandez, C., & Seitz, S. M. (2015). Depth from focus with your mobile phone. In: CVPR (pp. 3497–3506). https://www.supasorn.com/dffdownload.html
Tang, H., Cohen, S., Price, B., Schiller, S., & Kutulakos, K. N. (2017). Depth from defocus in the wild. In: CVPR (pp. 2740–2748).
Wang, K., & Shen, S. (2018). MVDepthNet: Real-time multi-view depth estimation neural network. In: International Conference on 3D Vision (3DV) (pp. 248–257).
Wang, F., Galliani, S., Vogel, C., Speciale, P., & Pollefeys, M. (2021). PatchmatchNet: Learned multi-view patchmatch stereo. In: CVPR (pp. 14194–14203).
Wang, N.-H., Wang, R., Liu, Y.-L., Huang, Y.-H., Chang, Y.-L., Chen, C.-P., & Jou, K. (2021). Bridging unsupervised and supervised depth from focus via all-in-focus supervision. In: ICCV. https://github.com/albert100121/AiFDepthNet
Watanabe, M., & Nayar, S. K. (1998). Rational filters for passive depth from defocus. IJCV, 27(3), 203–225.
Xiong, Y., & Shafer, S. A. (1993). Depth from focusing and defocusing. In: CVPR (pp. 68–73).
Yang, F., Huang, X., & Zhou, Z. (2022). Deep depth from focus with differential focus volume. In: CVPR (pp. 12642–12651).
Yao, Y., Luo, Z., Li, S., Fang, T., & Quan, L. (2018). MVSNet: Depth inference for unstructured multi-view stereo. In: ECCV (pp. 767–783).
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (pp. 2223–2232).
Zhuo, S., & Sim, T. (2011). Defocus map estimation from a single image. Pattern Recognition, 44(9), 1852–1858.
Acknowledgements
This work was supported by the Japan Society for the Promotion of Science KAKENHI Grant Number 22K17911.
Additional information
Communicated by Kong Hui.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A Network architecture
Tables 7 and 8 show the architectures of the encoder and decoder, respectively. The encoder is the same as that of MVDepthNet (Wang & Shen, 2018). Each convolution layer is followed by batch normalization and then by a rectified linear unit (ReLU) activation. In the decoder, each offset_conv layer has no activation, and each weight_conv and _score layer has softmax activation.
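One encoder block as described above can be sketched in PyTorch. The channel widths and the helper name are placeholders for illustration; the actual layer sizes are those listed in Tables 7 and 8.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size=3, stride=1):
    """One encoder block: convolution, then batch normalization, then
    ReLU, as described in the text. Bias is omitted from the convolution
    because batch normalization makes it redundant."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```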
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fujimura, Y., Iiyama, M., Funatomi, T. et al. Deep Depth from Focal Stack with Defocus Model for Camera-Setting Invariance. Int J Comput Vis 132, 1970–1985 (2024). https://doi.org/10.1007/s11263-023-01964-x