Abstract
We propose deep depth from focal stack (DDFS), which takes a focal stack as input of a neural network for estimating scene depth. Defocus blur is a useful cue for depth estimation. However, the size of the blur depends on not only scene depth but also camera settings such as focus distance, focal length, and f-number. Current learning-based methods without any defocus models cannot estimate a correct depth map if camera settings differ between training and test times. Our method takes a plane sweep volume as input to impose a constraint between scene depth, defocus images, and camera settings, and this intermediate representation enables depth estimation with different camera settings at training and test times. This camera-setting invariance enhances the applicability of DDFS. The experimental results also indicate that our method is robust against a synthetic-to-real domain gap.
1 Introduction
In computer vision, depth estimation from two-dimensional (2D) images is an important task used for many applications such as VR, AR, and autonomous driving. Defocus blur is a useful cue for such depth estimation because the size of the blur depends on scene depth. Depth from focus/defocus (DfFD) takes defocus images as input for depth estimation. Typical inputs for DfFD are stacked images, i.e., a focal stack, each image of which is captured with a different focus distance. In this paper, we use the term depth from focal stack to specify a method taking a focal stack as input.
DfFD methods are roughly divided into two categories, model-based and learning-based. Model-based methods use a thin-lens model for modeling defocus blurs (Suwajanakorn et al., 2015; Kim et al., 2016; Tang et al., 2017) or define focus measures (Pertuz et al., 2013; Surh et al., 2017) to estimate scene depth. One of the drawbacks of such methods is difficulty in estimating scene depth on textureless surfaces. Learning-based methods have been proposed to tackle the above drawback (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). For example, Hazirbas et al. (2018) proposed a convolutional neural network (CNN) taking a focal stack as input without any explicit defocus model. This is an end-to-end method that allows efficient depth estimation. It also enables the depth estimation of textureless surfaces with learned semantic cues.
General learning-based methods often have limited generalization due to a domain gap between training and test data. Learning-based DfFD methods suffer from differences in the capture settings of a camera at training and test times (Ceruso et al., 2021). The amount of a defocus blur depends on not only scene depth but also camera settings such as focus distance, focal length, and f-number. Different depths and camera settings can generate defocus images with the same appearance; thus, this difference cannot be compensated for with commonly used domain adaptation methods such as neural style transfer (Li et al., 2017; Zhu et al., 2017). If camera settings differ at training and test times, the estimated depth has some ambiguity, which is similar to the scale-ambiguity in monocular depth estimation (Hu et al., 2019). Current learning-based DfFD methods (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021) do not take into account the latent defocus model; thus, the estimated depth is not correct if the camera settings at test time differ from those at training time, as shown in Fig. 1c. On the other hand, this problem does not matter for model-based methods with explicit defocus models under given camera settings.
We propose learning-based DfFD with a lens defocus model. Our method also takes a focal stack as input, i.e., deep depth from focal stack (DDFS). Our method is inspired by recent learning-based multi-view stereo (MVS) (Wang & Shen, 2018), where a cost volume is constructed on the basis of a plane sweep volume (Collins, 1996). The proposed method also constructs a cost volume, which is passed through a CNN to estimate scene depth. Each defocus image in a focal stack is deblurred at each sampled depth in the plane sweep volume, then the consistency is evaluated between the deblurred images. We found that scene depth is effectively learned from the cost volume in DDFS. Our method has several advantages over other learning-based methods that directly take a focal stack as input without an explicit defocus model (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). First, the output depth satisfies the defocus model because the cost volume imposes an explicit constraint among the scene depth, defocus images, and camera settings. Second, the camera settings, such as focus distances and f-number, are absorbed into the cost volume as an intermediate representation. This enables depth estimation with different camera settings at training and test times, as shown in Fig. 1d.
The primary contributions of this paper are summarized as follows:

We combine a learning framework and model-based DfFD through a plane sweep volume for camera-setting invariance.

Our method with camera-setting invariance can be applied to datasets with different camera settings at training and test times, which improves the applicability of DDFS.
2 Related Work
Depth from focus/defocus Depth from focus/defocus (DfFD) estimates scene depth from focus or defocus cues in captured images and is a major task in computer vision. In general, depth from focus takes many images captured with different focus distances and determines scene depth from the image with the best focus. On the other hand, depth from defocus aims to estimate scene depth from a small number of images, which do not necessarily need to include focused images (Xiong & Shafer, 1993). Recently, learning-based depth estimation from a focal stack implicitly uses both focus and defocus cues; thus, we use the unified terminology DDFS.
Traditional DfFD methods propose focus measures to evaluate the amount of a defocus blur (Zhuo & Sim, 2011; Pertuz et al., 2013; Moeller et al., 2015; Surh et al., 2017). If we have a focal stack as input, we can simply refer to the image with noticeable edges and its focus distance. Other methods formulate the amount of defocus blur with a lens defocus model and solve an optimization problem to obtain a depth map together with an all-in-focus image (Suwajanakorn et al., 2015; Kim et al., 2016). We refer to these methods as model-based methods. One of the drawbacks of such methods is difficulty in estimating scene depth on textureless surfaces. Learning-based methods have been proposed to tackle these issues (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). These methods enable depth estimation at textureless surfaces, and the depth estimation is achieved efficiently in an end-to-end manner. Other learning-based methods have leveraged defocus cues as additional information (Anwar et al., 2017; Carvalho et al., 2018) or supervision (Srinivasan et al., 2018; Gur & Wolf, 2019) for monocular depth estimation.
However, current DDFS methods, which directly take a focal stack as input, do not take into account the latent defocus model (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). For example, Hazirbas et al. (2018) proposed a CNN that directly takes a focal stack as input. Maximov et al. (2020) and Wang et al. (2021) simply used focus distances as intermediate inputs of neural networks. Yang et al. (2022) proposed a CNN that outputs a focus probability volume, which is multiplied by the focus distances for depth estimation. These methods require the same camera settings at training and test times to obtain a correct depth map due to the lack of explicit defocus models. This characteristic reduces the applicability of DDFS. On the other hand, our method is a combination of model-based and learning-based methods through a cost volume, which is computed with a lens defocus model, allowing depth estimation with camera-setting invariance.
Learning from cost volume Learning from a cost volume is effective in many applications. A cost volume is constructed by sampling the solution space and evaluating costs at each sampled point. Examples of learning-based methods with a cost volume are optical flow (Ilg et al., 2017; Sun et al., 2018) and disparity (Mayer et al., 2016; Kendall et al., 2017) estimation. Learning-based MVS methods (Wang & Shen, 2018; Yao et al., 2018; Long et al., 2021; Duzceker et al., 2021) are also major examples, where a cost volume is constructed on the basis of a plane sweep volume (Collins, 1996). Our method also constructs a plane sweep volume and evaluates consistency between the defocus images in an input focal stack. We found that learning from a cost volume is also effective for DDFS.
3 Deep Depth from Focal Stack
Our method combines a learning framework and model-based DfFD through a cost volume for depth estimation with camera-setting invariance. We first give an overview of the proposed method, then describe the lens defocus model and the ambiguity of the estimated depth in DfFD, followed by details of the cost volume construction. This cost volume as an intermediate representation enables depth estimation with different camera settings at training and test times. The network architecture and loss function are discussed at the end of this section.
3.1 Overview
Figure 2 shows an overview of the proposed method. Our method is inspired by recent learning-based MVS (Wang & Shen, 2018), where a cost volume is constructed on the basis of a plane sweep volume (Collins, 1996). Our cost volume is constructed from an input focal stack by evaluating deblurred images at each depth hypothesis. This intermediate representation absorbs the difference in camera settings. The computed cost volume and an additional defocus image are passed through a CNN with an encoder-decoder architecture. At the decoder part, the cost volume is gradually upsampled for coarse-to-fine estimation. Output depth maps are obtained by applying a differentiable soft argmin operator (Kendall et al., 2017) to the intermediate refined cost volumes. Each upsample block includes a cost aggregation module for learning local structures adaptively.
3.2 Lens Defocus Model
Our cost volume construction is based on a lens defocus model, with which the size of a defocus blur is formulated as a circle of confusion (CoC) (Zhuo & Sim, 2011), as shown in Fig. 3. Let d and \(d_f\) be the scene depth and focus distance of a camera, respectively. The CoC can be computed as
$$\begin{aligned} c = b\,\frac{f^2}{N(d_f - f)}\,\frac{|d - d_f|}{d}, \end{aligned}$$
(1)
where f is the focal length of the lens and N is the f-number. b [px/m] converts the unit of the CoC from [m] to [px]. When d is equal to \(d_f\), the light rays from the scene point converge on the image plane; otherwise, a defocus blur arises whose size is the diameter of the CoC. The blurred image can be computed as a convolution of an all-in-focus image with the point spread function (PSF), the kernel size of which corresponds to the size of the CoC.
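For concreteness, the thin-lens CoC computation and the corresponding disk-shaped PSF can be sketched as follows. This is a minimal NumPy sketch; the function names `coc_px` and `disk_psf` are our own, and the constant factors follow the standard thin-lens model:

```python
import numpy as np

def coc_px(d, d_f, f, N, b):
    """Diameter of the circle of confusion in pixels (thin-lens model).

    d   : scene depth [m]
    d_f : focus distance [m]
    f   : focal length [m]
    N   : f-number
    b   : [px/m] conversion from meters to pixels on the sensor
    """
    return b * (f ** 2 / (N * (d_f - f))) * abs(d - d_f) / d

def disk_psf(diameter_px):
    """Normalized disk-shaped PSF whose support matches the CoC diameter."""
    r = max(diameter_px / 2.0, 0.5)
    size = int(2 * np.ceil(r) + 1)
    y, x = np.mgrid[:size, :size] - size // 2
    k = (x ** 2 + y ** 2 <= r ** 2).astype(float)
    return k / k.sum()
```

A scene point exactly at the focus distance yields a zero-diameter CoC, and the CoC grows as the point moves away from the focal plane.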
The CoC can be computed from the scene depth d and the camera settings f, \(d_f\), N, and b in Eq. (1). Note that these parameters can easily be extracted from EXIF properties (Maximov et al., 2020) or calibrated beforehand (Tang et al., 2017), and state-of-the-art methods assume these parameters are known (Maximov et al., 2020; Wang et al., 2021); thus, this paper also follows the same assumption. Our method realizes depth estimation with camera-setting invariance with respect to these parameters, and this improves the applicability of DDFS because our method can be applied to datasets with different camera settings at training and test times.
In DfFD, the number of unknown parameters is two, i.e., an all-in-focus image and a depth map; thus, if we have two images captured with different camera settings and these parameters are known, we can solve the problem by using defocus cues. However, if there exist unknown camera parameters, the estimated depth has ambiguity. Now, we discuss two ambiguities in DfFD due to the camera settings. The first one is scale-ambiguity. From Eq. (1), the following relationship holds:
$$\begin{aligned} c(d; f, d_f, N, b) = c(d^*; f^*, d_f^*, N, b^*), \end{aligned}$$
(2)
where \(d^* = \sigma d\), \(d_f^* = \sigma d_f\), \(f^* = \sigma f\), and \(b^* = b/\sigma \) for any \(\sigma > 0\). This means that scaled camera settings and depth give the same CoC as that of the original ones.
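This scale relation can be verified numerically. The sketch below assumes the standard thin-lens CoC form; `coc` is our own illustrative helper:

```python
import numpy as np

def coc(d, d_f, f, N, b):
    # Thin-lens CoC diameter (assumed form consistent with the scale relation).
    return b * f ** 2 / (N * (d_f - f)) * abs(d - d_f) / d

sigma = 2.5
d, d_f, f, N, b = 1.2, 1.0, 0.05, 2.0, 4000.0
c_orig = coc(d, d_f, f, N, b)
c_scaled = coc(sigma * d, sigma * d_f, sigma * f, N, b / sigma)
# The scaled scene and camera are indistinguishable from the original.
assert np.isclose(c_orig, c_scaled)
```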
The other ambiguity is affine-ambiguity. From Eq. (1), we can obtain
$$\begin{aligned} c = \left| A(f, d_f, N)\,\frac{1}{d} + B(f, d_f, N) \right| , \end{aligned}$$
(3)
where \(A(f, d_f, N)\) and \(B(f, d_f, N)\) are constants. Thus, different camera settings and inverse depths can give the same CoC as follows:
$$\begin{aligned} A\,\frac{1}{d} + B = A'\,\frac{1}{d'} + B', \quad \frac{1}{d'} = \frac{A}{A'}\,\frac{1}{d} + \frac{B - B'}{A'}, \end{aligned}$$
(4)
where \(A'\) and \(B'\) are the constants of the other camera settings. This means that the estimated inverse depth has affine-ambiguity (a similar discussion can be found in a previous study (Garg et al., 2019)). In the experiments, we evaluate the proposed method with respect to the scale-ambiguity in the depth space and the affine-ambiguity in the inverse depth space.
3.3 Cost Volume
The proposed method computes a cost volume from the focal stack as the input of a CNN to impose a constraint between the defocus images and scene depth. This has several advantages over current learning-based methods that directly take a focal stack as input (Hazirbas et al., 2018; Maximov et al., 2020; Wang et al., 2021). First, the output depth satisfies the lens defocus model because the cost volume imposes an explicit constraint between the defocus images and scene depth. Second, the camera settings are absorbed into the cost volume. This enables inference with camera settings that differ from those at training time, and even in this case, the output depth satisfies the lens defocus model without any ambiguities.
Figure 4 shows a diagram of our cost volume construction. We first sample the 3D space in the camera coordinate system by sweeping a fronto-parallel plane. To evaluate each depth hypothesis, we deblur each image in the input focal stack. Let the cost volume be \(C:\{1,\cdots ,W\}\times \{1,\cdots ,H\}\times \{ 1,\cdots ,D\} \rightarrow {\mathbb {R}}\), and the focal stack be \(\{ I_{d_i}\}_{i=1}^{F}\), where \(I_{d_i}\) is a captured image with focus distance \(d_i\). Each element of the cost volume C is computed as follows:
$$\begin{aligned} C(u, v, d) = \rho \left( (k(d, d_1) *^{-1} I_{d_1})(u, v), \cdots , (k(d, d_F) *^{-1} I_{d_F})(u, v) \right) , \end{aligned}$$
(5)
where \(k(d,d_i)\) is a blur kernel, the size of which is defined by Eq. (1) with the scene depth d and focus distance \(d_i\). We used a disk-shaped PSF (Watanabe & Nayar, 1998; Shi et al., 2015), while any type of PSF can be used at training and test times. The operator \(*^{-1}\) indicates a deblurring process applied to each color channel of the input image. We used Wiener–Hunt deconvolution (Orieux et al., 2010) for this process. The function \(\rho \) evaluates the consistency between the deblurred images. We adopt the standard deviation for \(\rho \), which allows an arbitrary number of inputs.
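The cost volume construction can be sketched as follows. This is a simplified sketch: we substitute a plain frequency-domain Wiener filter for the Wiener–Hunt deconvolution used in the paper, and `kernel_fn` stands in for the PSF whose size is defined by the lens defocus model:

```python
import numpy as np

def psf_otf(psf, shape):
    """Embed a small PSF into a full-size array and shift its center to the
    origin so that FFT-based filtering introduces no phase offset."""
    out = np.zeros(shape)
    h, w = psf.shape
    out[:h, :w] = psf
    return np.fft.fft2(np.roll(out, (-(h // 2), -(w // 2)), axis=(0, 1)))

def wiener_deblur(img, psf, snr=100.0):
    """Frequency-domain Wiener deconvolution (a simple stand-in for the
    Wiener-Hunt deconvolution used in the paper)."""
    H = psf_otf(psf, img.shape)
    W = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(W * np.fft.fft2(img)))

def cost_volume(stack, focus_dists, depth_samples, kernel_fn):
    """C(u, v, d): per-pixel standard deviation across the images deblurred
    under each depth hypothesis, following the construction described above."""
    vols = []
    for d in depth_samples:
        deblurred = [wiener_deblur(img, kernel_fn(d, d_i))
                     for img, d_i in zip(stack, focus_dists)]
        vols.append(np.std(np.stack(deblurred, axis=0), axis=0))
    return np.stack(vols, axis=-1)  # shape (H, W, D)
```

At the true depth, all deblurred images should agree, so the standard deviation (cost) is low; at wrong hypotheses, the deblurring is inconsistent across focus distances and the cost rises.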
The process mentioned above is the essential part of our cost volume construction. However, differing from a learning-based MVS method (Wang & Shen, 2018), which is based on differentiable image warping, our cost volume construction requires careful design because the difference between images due to focus distances is smaller than that due to camera positions in the MVS setup. Thus, for robustness and learning stability, the standard deviation in Eq. (5) is computed considering neighboring pixels as follows:
where \({\mathcal {N}}(u,v)\) is a set of neighboring pixels centered at (u, v) and \(\gamma _{u',v'}\) is a 2D spatial Gaussian weight. Figure 5 shows an example of the depth estimated only from the index of the minimum cost. In (b)–(d), we computed costs with different standard deviations (SD) of the 2D spatial Gaussian weights. The neighboring information can reduce noise, especially for the real captured data, while a large standard deviation causes over-smoothing. In our experiments, we set the standard deviation to 1.0.
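One plausible implementation of the neighborhood-weighted standard deviation applies a Gaussian spatial filter to the first and second moments across the deblurred images. This is an assumption on our part; the paper's exact weighting may differ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def neighborhood_std(deblurred, sd=1.0):
    """Std-dev across deblurred images, pooled over a Gaussian-weighted
    spatial neighborhood (one plausible reading of the construction).

    deblurred : list of (H, W) arrays, one per focal-stack image
    sd        : standard deviation of the 2D spatial Gaussian weight
    """
    x = np.stack(deblurred, axis=0)                   # (F, H, W)
    m1 = gaussian_filter(x.mean(axis=0), sd)          # spatially pooled mean
    m2 = gaussian_filter((x ** 2).mean(axis=0), sd)   # pooled second moment
    return np.sqrt(np.maximum(m2 - m1 ** 2, 0.0))
```

Increasing `sd` pools over a larger neighborhood, which suppresses noise at the cost of over-smoothing, matching the trend reported above.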
We also remove outliers by applying a nonlinear function \(f(\cdot )\) that bounds the cost by 1 after computing Eq. (5). We use a \(\tanh \)-like function as follows:
where \(C_{max}\) is the upper bound of the cost. f(x) converges to \(f_1\) as x approaches \(C_{max}\). In our experiments, \(C_{max} \ge 0.3\) with \(f_1=0.999\) gave effective results; thus, we set \(C_{max} = 0.3\) and \(f_1=0.999\). Finally, the cost f(C(u, v, d)) at each pixel is normalized to [0, 1]. As shown in Fig. 6, the outlier removal (OR) and normalization (Norm.) produce a sharp peak at the ground-truth depth. However, this normalization includes the possibility that such sharp peaks also appear at textureless pixels, where defocus cues are not effective, and thus may have negative effects on training. Nevertheless, we found that our network automatically learns effective regions and dramatically improves the accuracy of the estimated depth. We describe the ablation study on this in Sect. 4.4.
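The outlier bounding and per-pixel normalization can be sketched as follows. Note that the exact saturating function is an assumption: we use \(f(x)=\tanh (\textrm{atanh}(f_1)\,x/C_{max})\), which is bounded by 1 and reaches \(f_1\) at \(x = C_{max}\):

```python
import numpy as np

def bound_and_normalize(C, c_max=0.3, f1=0.999):
    """Saturate costs with a tanh-like function, then normalize the cost
    profile of each pixel to [0, 1] along the depth axis (last axis).

    The saturating function here is our own assumption; any bounded
    tanh-like function that reaches f1 at c_max fits the description.
    """
    f = np.tanh(np.arctanh(f1) * C / c_max)          # bounded by 1
    lo = f.min(axis=-1, keepdims=True)
    hi = f.max(axis=-1, keepdims=True)
    return (f - lo) / np.maximum(hi - lo, 1e-8)      # per-pixel [0, 1]
```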
Now, we discuss the difference between the design of our cost volume and those of previous cost-volume-based DfFD methods. The major design of cost volume construction is based on focus measures (Pertuz et al., 2013; Surh et al., 2017; Yang et al., 2022), where costs are computed at the focus positions in an input focal stack. Wang et al. (2021) introduced the framework of the cost volume at the output layer of the CNN. On the other hand, our cost volume, which is designed as the input of the network, is based on image deblurring, which enables us to compute costs at depths that are not contained in the input focus distances. A cost volume construction similar to ours was also proposed in model-based methods (Suwajanakorn et al., 2015; Kim et al., 2016). However, these methods require an all-in-focus image, which leads to iterative optimization for the scene depth and all-in-focus image; thus, these methods cannot be directly incorporated into sequential learning frameworks.
3.4 Architecture and Loss Function
As shown in Fig. 2, the cost volume and an additional defocus image, which helps the network learn semantic cues (Wang & Shen, 2018), are concatenated and passed through the network. The input image is selected from the focal stack, and we found that the selection of the input image does not affect the performance of the proposed method. During the training of our model, we selected the image with the farthest focus distance.
The cost volume and input image are passed through the encoder, the architecture of which is the same as for MVDepthNet (Wang & Shen, 2018). The outputs of the decoder are refined cost volumes \(C_{out}^s\) at different resolutions \(s\in \{ 1/8, 1/4, 1/2, 1\}\).
At each upsample block, we implement an adaptive cost aggregation module inspired by Wang et al. (2021) to aggregate neighboring information, and this enables depth estimation with clear boundaries by aggregating focus cues at edge pixels. The cost aggregation module is given as
$$\begin{aligned} {\tilde{C}}(u, v, d) = \sum _{j} w_j \, C(u + \Delta u_j, v + \Delta v_j, d), \end{aligned}$$
(11)
where the weight \(w_j\) and offset \((\Delta u_j, \Delta v_j)\) are parameters to aggregate neighboring information, which are estimated from intermediate features in the decoder. As shown in Fig. 2, our upsample block first upsamples the input cost volume by a scale factor of 2. The feature map from the encoder is then concatenated to this upsampled cost volume. From this volume, offsets and weights for adaptive cost aggregation are estimated together with a refined cost volume. The final cost volume is obtained by aggregating the neighboring costs following Eq. (11). Figure 7 shows an example of the estimated offsets and output depth with the cost aggregation module, which yields clear boundaries in the estimated depth.
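A simplified version of this aggregation, with global integer offsets and weights instead of the per-pixel fractional ones predicted by the network, can be sketched as:

```python
import numpy as np

def aggregate_costs(C, offsets, weights):
    """Adaptive-style cost aggregation: sum_j w_j * C(u+du_j, v+dv_j, d).

    C       : (H, W, D) cost volume
    offsets : list of integer (du_j, dv_j) pairs; the paper predicts
              per-pixel fractional offsets sampled bilinearly, which we
              simplify to a global integer gather here
    weights : list of aggregation weights w_j
    """
    out = np.zeros_like(C)
    for (du, dv), w in zip(offsets, weights):
        # np.roll wraps at the borders; a real implementation would clamp
        shifted = np.roll(np.roll(C, -du, axis=0), -dv, axis=1)
        out += w * shifted
    return out
```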
The refined cost volume at each resolution is obtained through softmax layers. Thus, the output depth at each resolution can be computed by applying a differentiable soft argmin operator (Kendall et al., 2017) as follows:
$$\begin{aligned} d_s(u, v) = \sum _{i=1}^{D} d_i \, C_{out}^s(u, v, d_i). \end{aligned}$$
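The soft argmin operator can be sketched as a softmax over negated costs followed by an expectation over the depth hypotheses (a sketch; in the paper the softmax is part of the network):

```python
import numpy as np

def soft_argmin_depth(C, depth_samples):
    """Differentiable soft argmin (Kendall et al., 2017): softmax over the
    negated cost, then the expected depth over the hypotheses.

    C             : (H, W, D) cost volume (low cost = likely depth)
    depth_samples : (D,) sampled depth hypotheses
    """
    z = -C
    p = np.exp(z - z.max(axis=-1, keepdims=True))    # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return (p * depth_samples).sum(axis=-1)          # (H, W) depth map
```

Unlike a hard argmin, this expectation is differentiable and can interpolate between the sampled depth values.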
Training loss The training loss is defined as the sum of the L1 losses between the estimated depth maps \(d_s\) and ground-truth depth maps \(d_s^*\) at different resolutions as follows:
$$\begin{aligned} {\mathcal {L}} = \sum _{s} \left\| d_s - d_s^* \right\| _1. \end{aligned}$$
4 Experiments
We evaluated the proposed method for its camera-setting invariance and compared it with state-of-the-art DDFS methods. Our method can be applied to datasets with camera settings that differ from those of the training dataset.
4.1 Implementation
Our network was implemented in PyTorch. The training was done on an NVIDIA RTX 3090 GPU with 24 GB of memory. The size of a mini-batch was 8 for the training of our model. We trained our network from scratch, and the optimizer was Adam (Kingma & Ba, 2015) with a learning rate of \(1.0\times 10^{-4}\).
During the cost volume construction, we uniformly sampled the depth between 0.1 and 3 and set the number of samples to \(D=64\). In our experiments, there was no significant accuracy difference between sampling in the depth and inverse depth spaces.
4.2 Dataset
This section describes the datasets for training and evaluation. We used three datasets with metadata of full camera settings.
DefocusNet dataset (Maximov et al., 2020) This dataset consists of synthetic images, which were generated with physics-based rendering shaders on Blender. The released subset of this dataset has 400 and 100 samples for training and evaluation, respectively. The focal stack of each sample has five images with \(256\times 256\) resolution. Note that all models were trained only on this synthetic dataset unless otherwise noted.
NYU Depth V2 (Silberman et al., 2012) synthetically blurred by Carvalho et al. (2018) Carvalho et al. (2018) generated this dataset by adding synthetic blurs to the NYU Depth V2 dataset (Silberman et al., 2012), which consists of pairs of RGB and depth images. The defocus model was based on Eq. (1) and takes into account object occlusions. The official training and test splits of the NYU Depth V2 dataset are 795 and 654 samples. We extracted \(256\times 256\) patches from the original \(640\times 480\) images and finally obtained 9540 and 7848 samples for training and evaluation, respectively. As with Maximov et al. (2020), we rescaled the depth range from [0, 10] to [0, 3]. Table 1 lists the camera settings of the DefocusNet dataset (Maximov et al., 2020) and this NYU Depth V2 dataset (Carvalho et al., 2018). Note that the camera settings of the NYU Depth V2 dataset are rescaled by 0.3 to match the rescaled depth range. Figure 8 shows the plots of CoCs at sampled depths with different focus distances for each dataset.
Mobile Depth (Suwajanakorn et al., 2015) This dataset consists of real focal stacks captured with a mobile phone camera. The images in each focal stack were aligned, and the authors estimated the camera parameters and depth (i.e., there are no actual ground-truth depth maps). This dataset contains only several scenes; thus, we used it only for evaluation.
4.3 Data Augmentation
In the DefocusNet dataset, defocus cues are effective only at a short distance from the camera (Maximov et al., 2020). Therefore, we found that our cost volume learned on this dataset is effective only at small depth indices. To enhance the scalability of our cost volume, we scaled the depth maps in the DefocusNet dataset by a scale factor of \(\sigma \in \{1.0,1.5,2.0,\cdots ,9.0\}\) when we trained our model on this dataset. We should also scale the camera parameters together with the depth map, i.e., if each data sample consists of \(\{ \{I_{d_1},\cdots ,I_{d_F}\}, \{d_1, \cdots , d_F \}, f, N, d^*, b \}\), the scaled sample is \(\{ \{I_{d_1},\cdots ,I_{d_F} \}, \{\sigma d_1,\cdots , \sigma d_F \}, \sigma f, N, \sigma d^*, b/\sigma \}\). Note that in both samples, the depth and camera parameters give the same amount of defocus blur; thus, the original focal stack can be used in the scaled sample. This data augmentation is essential for applying our method to other datasets.
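This augmentation can be sketched as a simple transformation of each data sample (the dictionary keys are our own illustrative names):

```python
def scale_sample(sample, sigma):
    """Depth-scale augmentation: scale depth-like quantities by sigma and
    b by 1/sigma so that the rendered defocus blur is unchanged, meaning
    the original focal-stack images can be reused as-is."""
    return {
        "stack":    sample["stack"],                       # images reused
        "focus_ds": [sigma * d for d in sample["focus_ds"]],
        "f":        sigma * sample["f"],
        "N":        sample["N"],                           # f-number unchanged
        "depth_gt": sigma * sample["depth_gt"],
        "b":        sample["b"] / sigma,
    }
```

Because the CoC is invariant under this transformation, the scaled sample is physically consistent without re-rendering any image.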
4.4 Ablation Study
Table 2 lists the results of the ablation study on the cost volume construction. We separately computed the RMSE of the predicted depth on the DefocusNet dataset with different scale factors of the data augmentation. The best values are in bold. The experimental results demonstrate that normalization dramatically improved the accuracy of depth estimation. Outlier removal also improved the accuracy under a large \(\sigma \), i.e., large depth-scale scenes captured with large focus distances. Figure 8 shows that CoCs diverge immediately at small depths with large focus distances. This leads to artifacts in the images deblurred with the Wiener filter, which causes outliers in the cost volume. Figure 9 shows an example of costs with \(\sigma =1.0\) and \(\sigma =5.0\), where outlier removal is needed to produce a sharp peak when \(\sigma =5.0\).
Figure 10 shows the comparison with respect to the number of depth samples D of the cost volume. Note that we kept the D of the output cost volume before the softmax layer at 64 in this experiment. We computed the RMSE of the predicted depth on the DefocusNet dataset with scale factors \(\sigma =1.0, 5.0, 9.0\). \(D\ge 16\) provided effective results on this dataset.
4.5 Evaluation on Different Camera Settings
We then evaluated the performance of depth estimation with different camera settings at training and test times. Table 3 lists the experimental results on the DefocusNet dataset. DefocusNet (Maximov et al., 2020), a state-of-the-art DDFS method, was compared with our method. We first decomposed each focal stack into two subsets, one with focus distances \(\{0.1,0.3,1.5\}\) and the other with \(\{0.15,0.7\}\). Both methods were trained only on the subset with focus distances \(\{0.1,0.3,1.5\}\) and evaluated on the other subset with different focus distances. The best value is in bold. Our method outperformed DefocusNet, demonstrating its camera-setting invariance.
We also evaluated the proposed method on the NYU Depth V2 dataset, which has different scene statistics and different camera settings from the DefocusNet dataset, as shown in Table 1. Table 4 and Fig. 12 show the experimental results comparing the proposed method with other state-of-the-art learning-based methods, i.e., DDFF (Hazirbas et al., 2018), AiFDepthNet (Wang et al., 2021), DFV-DFF (Yang et al., 2022), and DefocusNet (Maximov et al., 2020). For AiFDepthNet, we used the authors' model trained in a supervised manner, and the other methods were retrained on the DefocusNet dataset. The parameters of DDFF were initialized with VGG16 (Simonyan & Zisserman, 2015) as in the original paper (Hazirbas et al., 2018). For each method, we provide the training epochs except for AiFDepthNet because we could not obtain this information from their code and paper. Figure 11 shows the RMSE on the NYU Depth V2 dataset during training on the DefocusNet dataset for DefocusNet and our method. We also tested VDFF (Moeller et al., 2015), which is a model-based method. Note that VDFF did not work on the original focal stacks because the number of defocus images in each focal stack is small; thus, we additionally synthesized two defocus images with focus distances \(\{1,6\}\) for the input of VDFF. We fit a polynomial function to convert the output focus index to a depth value. For error metrics, we used MAE, RMSE, absolute relative L1 error (Abs Rel), scale-invariant error (sc-inv) (Eigen et al., 2014), and the affine (scale and shift) invariant error in the inverse depth space denoted by ssitrim (Ranftl et al., 2020). The table consists of three subtables with different experimental settings, and the best values in each subtable are in bold.
As shown in the upper part of Table 4, our method outperformed the other methods trained on the DefocusNet dataset by large margins on most evaluation metrics and is comparable to DefocusNet on the affine-invariant error metric in the inverse depth space (ssitrim). This is because the camera settings of the DefocusNet and NYU Depth V2 datasets are different. The other methods cannot handle this difference, and their estimated depths have ambiguity. Although the model-based depth from focus method (VDFF) is not affected by the difference in camera settings and gives acceptable results, it requires more input images than the learning-based methods to obtain plausible results. The failure cases of such a model-based depth from focus method are further discussed in Fig. 15.
We also computed the errors on the depths rescaled by the median of the ratios between the output and ground-truth depths, following Maximov et al. (2020), to compensate for the scale-ambiguity. This compensation was also applied to our results for a fair comparison. The errors are presented in the middle part of the table. Our method also outperformed the other methods in this comparison. In addition, our method without scaling (Ours) still outperformed the rescaled previous methods (\(^*\)) on most evaluation metrics. Figure 12 shows examples of the estimated depths. Note that the output depths of our method were not rescaled, i.e., our method can estimate depths without any ambiguities. The bottom part of the table shows the experimental results when trained on the NYU Depth V2 dataset. Although DefocusNet performed better than our method, the accuracy of both methods improved dramatically, as shown in Fig. 12g, h, and DefocusNet is heavily affected by the difference in camera settings between the training and test datasets.
Figure 13 shows the experimental results on Mobile Depth with real focal stacks. We set the size of an input focal stack to 3 except for AiFDepthNet (Wang et al., 2021), which used 10 images per focal stack; we show the results of two AiFDepthNet models: (d) trained on the synthetically blurred FlyingThings3D dataset (Mayer et al., 2016) in a supervised manner and (e) trained on Mobile Depth in an unsupervised manner. The figure shows the qualitative comparison with the state-of-the-art learning-based methods, the output depths of which were rescaled by the median of the ratios between them and the outputs of Suwajanakorn et al. (2015) (\(^*\)), following Maximov et al. (2020). Note that the output depths of our method were not rescaled. The output depths of our method are qualitatively plausible and satisfy the defocus model under different camera settings. Figure 14 shows the quantitative errors between our method and Suwajanakorn et al. (2015) under different sizes of input focal stacks, demonstrating that a few images are enough to obtain effective results with our method.
Finally, we show an example of applying the proposed method to real focal stacks captured with our own camera, a Nikon D5300 with an f-number of 1.8. The focal stacks were captured with "Focus Stacking Simple" in digiCamControl (http://digicamcontrol.com/). All parameters required for the cost volume computation were extracted from EXIF properties, and the focal stack size was 3. Figure 15 shows the qualitative comparison with existing methods, including a model-based depth from focus method (RDF (Surh et al., 2017)) and state-of-the-art learning-based methods (DefocusNet (Maximov et al., 2020) and DFV-DFF (Yang et al., 2022)). The values of the estimated depth maps are in meters. One of the advantages of the learning-based methods is their applicability with a few input images. In this experiment, RDF did not work because the focal stack size was 3. The performance of the learning-based methods other than ours also degraded due to the difference in camera settings at training and test times. These results indicate the applicability of our method to real focal stacks and its camera-setting invariance.
4.6 Computation Time
Table 5 shows the runtime comparison. We measured the processing time for each test sample in the DefocusNet dataset (Maximov et al., 2020). The cost volume construction was done on an AMD EPYC 7232P@3.1 GHz with 128 GB of RAM, with multi-threading for the Wiener deconvolution of each image in a focal stack. The depth estimation was done on an NVIDIA RTX 3090 GPU. The number of depth samples in the cost volume is 64, and the image resolution is \(256 \times 256\). Although the cost volume construction takes a few seconds, the costs at different depth slices in our cost volume can be computed in parallel to further reduce the computation time.
4.7 Limitations
We finally discuss the limitations of the proposed method that stem from the explicit lens defocus model.
Requirement of camera parameters Our method requires full camera parameters for constructing a cost volume. Although we demonstrated the applicability to real data, this requirement makes it difficult to evaluate our method on other DfFD datasets proposed in previous studies (Hazirbas et al., 2018; Herrmann et al., 2020; Wang et al., 2021).
Shape of blur kernel Herrmann et al. (2020) pointed out several challenges of the autofocus problem in real scenes; for example, a manually designed PSF is unrealistic. Our method uses a predetermined shape of the blur kernel for constructing the cost volume. In the experiments, our method was evaluated across different blur models: the blurs in the DefocusNet dataset were physically rendered with Blender, and the blurs in the NYU depth dataset were synthesized with disk-shaped kernels. We also applied our method to real data, demonstrating the applicability to real PSFs, although the generality on real data needs to be further investigated.
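The disk-shaped kernel mentioned above can be sketched as follows. This is a minimal NumPy sketch, assuming a circle-of-confusion radius in pixels; real lens PSFs deviate from this ideal shape, which is exactly the limitation discussed here.

```python
import numpy as np

def disk_psf(coc_radius_px, size=None):
    """Disk-shaped blur kernel for a given circle-of-confusion radius.

    The kernel is 1 inside the disk and 0 outside, then normalized so
    that convolving with it preserves image brightness."""
    if size is None:
        size = 2 * int(np.ceil(coc_radius_px)) + 1
    r = (size - 1) / 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x ** 2 + y ** 2 <= coc_radius_px ** 2).astype(float)
    k /= k.sum()
    return k
```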
Dynamic scenes and focus breathing Similar to AiFDepthNet (Wang et al., 2021), our cost volume computation assumes static scenes. Focus breathing also affects our method (Herrmann et al., 2020). However, as mentioned in Wang et al. (2021), a simple alignment in preprocessing can solve this problem (in the experiments with real data (Fig. 13), we used aligned focal stacks).
Trade-off between defocus and semantic cues We finally discuss the trade-off between model-based and learning-based approaches. Table 6 and Fig. 16 show the results on the DefocusNet dataset. The other learning-based methods outperformed our method. This is because the defocus cues in the DefocusNet dataset are effective only at a short distance from the camera, as mentioned in Sect. 4.3; the other learning-based methods handle this limitation through semantic cues. In the red box region in Fig. 16, our method fails to estimate the depth of the background object. On the other hand, DefocusNet (Maximov et al., 2020) and AiFDepthNet (Wang et al., 2021) can estimate the depths of both the foreground and background objects by learning ordering cues. Although our method also learns semantic cues, it is more affected by this limitation because of its explicit lens defocus model.
5 Conclusion
We proposed DDFS with a lens defocus model, which combines a learning framework and the defocus model through the construction of a cost volume. The cost volume absorbs differences in camera settings, which allows the method to estimate scene depth from a focal stack with different camera settings at training and test times. The experimental results indicate that our model trained only on a synthetic dataset can be applied to other datasets, including real focal stacks, with different camera settings. This camera-setting invariance will enhance the applicability of DDFS.
Data Availability
The data that support the findings of this study are all publicly available in online repositories. DefocusNet (Maximov et al., 2020): https://github.com/dvl-tum/defocus-net. NYU depth v2 with synthetic blurs (Carvalho et al., 2018): https://github.com/marcelampc/d3net_depth_estimation. Mobile Depth dataset (Suwajanakorn et al., 2015): https://www.supasorn.com/dffdownload.html.
References
Anwar, S., Hayder, Z., & Porikli, F. (2017). Depth estimation and blur removal from a single out-of-focus image. In: BMVC.
Carvalho, M., Le Saux, B., Trouve-Peloux, P., Almansa, A., & Champagnat, F. (2018). Deep depth from defocus: How can defocus blur improve 3D estimation using dense neural networks? In: ECCVW. https://github.com/marcelampc/d3net_depth_estimation (GPLv3 license).
Ceruso, S., Bonaque-González, S., Oliva-García, R., & Rodríguez-Ramos, J. M. (2021). Relative multi-scale deep depth from focus. Signal Processing: Image Communication, 99, 116417.
Collins, R. T. (1996). A space-sweep approach to true multi-image matching. In: CVPR (pp. 358–363).
digiCamControl. http://digicamcontrol.com/
Duzceker, A., Galliani, S., Vogel, C., Speciale, P., Dusmanu, M., & Pollefeys, M. (2021). DeepVideoMVS: Multi-view stereo on video with recurrent spatio-temporal fusion. In: CVPR (pp. 15324–15333).
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (vol. 2, pp. 2366–2374).
Garg, R., Wadhwa, N., Ansari, S., & Barron, J. T. (2019). Learning single camera depth estimation using dual-pixels. In: ICCV (pp. 7628–7637).
Gur, S., & Wolf, L. (2019). Single image depth estimation trained via depth from defocus cues. In: CVPR (pp. 7683–7692).
Hazirbas, C., Soyer, S. G., Staab, M. C., Leal-Taixé, L., & Cremers, D. (2018). Deep depth from focus. In: ACCV. https://github.com/soyers/ddff-pytorch (GNU General Public License v3.0).
Herrmann, C., Bowen, R. S., Wadhwa, N., Garg, R., He, Q., Barron, J. T., & Zabih, R. (2020). Learning to autofocus. In: CVPR.
Hu, J., Ozay, M., Zhang, Y., & Okatani, T. (2019). Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1043–1051).
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (pp. 2462–2470).
Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., & Bry, A. (2017). End-to-end learning of geometry and context for deep stereo regression. In: ICCV.
Kim, H., Richardt, C., & Theobalt, C. (2016). Video depth-from-defocus. In: International Conference on 3D Vision (3DV) (pp. 370–379).
Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In: ICLR.
Li, Y., Wang, N., Liu, J., & Hou, X. (2017). Demystifying neural style transfer. In: IJCAI (pp. 2230–2236).
Long, X., Liu, L., Li, W., Theobalt, C., & Wang, W. (2021). Multi-view depth estimation using epipolar spatio-temporal networks. In: CVPR (pp. 8258–8267).
Maximov, M., Galim, K., & Leal-Taixé, L. (2020). Focus on defocus: Bridging the synthetic to real domain gap for depth estimation. In: CVPR (pp. 1071–1080). https://github.com/dvl-tum/defocus-net (MIT License).
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (pp. 4040–4048).
Moeller, M., Benning, M., Schönlieb, C., & Cremers, D. (2015). Variational depth from focus reconstruction. IEEE Transactions on Image Processing, 24(12), 5369–5378.
Orieux, F., Giovannelli, J.-F., & Rodet, T. (2010). Bayesian estimation of regularization and point spread function parameters for Wiener-Hunt deconvolution. Journal of the Optical Society of America A, 27(7), 1593–1607.
Pertuz, S., Puig, D., & Garcia, M. A. (2013). Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 46(5), 1415–1432.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI.
Shi, J., Tao, X., Xu, L., & Jia, J. (2015). Break Ames room illusion: Depth from general single images. ACM TOG, 34(6), 1–11.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGB-D images. In: ECCV (pp. 746–760).
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In: ICLR.
Srinivasan, P. P., Garg, R., Wadhwa, N., Ng, R., & Barron, J. T. (2018). Aperture supervision for monocular depth estimation. In: CVPR (pp. 6393–6401).
Sun, D., Yang, X., Liu, M.-Y., & Kautz, J. (2018). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR (pp. 8934–8943).
Surh, J., Jeon, H.-G., Park, Y., Im, S., Ha, H., & Kweon, I. S. (2017). Noise robust depth from focus using a ring difference filter. In: CVPR (pp. 6328–6337).
Suwajanakorn, S., Hernandez, C., & Seitz, S. M. (2015). Depth from focus with your mobile phone. In: CVPR (pp. 3497–3506). https://www.supasorn.com/dffdownload.html
Tang, H., Cohen, S., Price, B., Schiller, S., & Kutulakos, K. N. (2017). Depth from defocus in the wild. In: CVPR (pp. 2740–2748).
Wang, K., & Shen, S. (2018). MVDepthNet: Real-time multi-view depth estimation neural network. In: International Conference on 3D Vision (3DV) (pp. 248–257).
Wang, F., Galliani, S., Vogel, C., Speciale, P., & Pollefeys, M. (2021). PatchmatchNet: Learned multi-view patchmatch stereo. In: CVPR (pp. 14194–14203).
Wang, N.-H., Wang, R., Liu, Y.-L., Huang, Y.-H., Chang, Y.-L., Chen, C.-P., & Jou, K. (2021). Bridging unsupervised and supervised depth from focus via all-in-focus supervision. In: ICCV. https://github.com/albert100121/AiFDepthNet
Watanabe, M., & Nayar, S. K. (1998). Rational filters for passive depth from defocus. IJCV, 27(3), 203–225.
Xiong, Y., & Shafer, S. A. (1993). Depth from focusing and defocusing. In: CVPR (pp. 68–73).
Yang, F., Huang, X., & Zhou, Z. (2022). Deep depth from focus with differential focus volume. In: CVPR (pp. 12642–12651).
Yao, Y., Luo, Z., Li, S., Fang, T., & Quan, L. (2018). MVSNet: Depth inference for unstructured multi-view stereo. In: ECCV (pp. 767–783).
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (pp. 2223–2232).
Zhuo, S., & Sim, T. (2011). Defocus map estimation from a single image. Pattern Recognition, 44(9), 1852–1858.
Acknowledgements
This work was supported by the Japan Society for the Promotion of Science KAKENHI Grant Number 22K17911.
Additional information
Communicated by Kong Hui.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A Network architecture
Tables 7 and 8 show the architectures of the encoder and decoder, respectively. The encoder is the same as that of MVDepthNet (Wang & Shen, 2018). Each convolution layer is followed by batch normalization and then by a rectified linear unit (ReLU) activation. In the decoder, each offset_conv layer has no activation, and each weight_conv and _score layer has softmax activation.
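One encoder block as described above can be sketched in PyTorch. The channel widths and the helper name are placeholders for illustration; the actual layer sizes are those listed in Tables 7 and 8.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size=3, stride=1):
    """One encoder block: convolution, then batch normalization, then
    ReLU, as described in the text. Bias is omitted from the convolution
    because batch normalization makes it redundant."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```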
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fujimura, Y., Iiyama, M., Funatomi, T. et al. Deep Depth from Focal Stack with Defocus Model for Camera-Setting Invariance. Int J Comput Vis 132, 1970–1985 (2024). https://doi.org/10.1007/s11263-023-01964-x