Deep Depth from Focal Stack with Defocus Model for Camera-Setting Invariance

We propose a learning-based depth from focus/defocus (DFF) method, which takes a focal stack as input for estimating scene depth. Defocus blur is a useful cue for depth estimation. However, the size of the blur depends not only on scene depth but also on camera settings such as focus distance, focal length, and f-number. Current learning-based methods without any defocus models cannot estimate a correct depth map if the camera settings differ at training and test times. Our method takes a plane sweep volume as input to impose the constraint among scene depth, defocus images, and camera settings, and this intermediate representation enables depth estimation with different camera settings at training and test times. This camera-setting invariance enhances the applicability of learning-based DFF methods. The experimental results also indicate that our method is robust against a synthetic-to-real domain gap and exhibits state-of-the-art performance.


Introduction
In computer vision, depth estimation from two-dimensional (2D) images is an important task used in many applications such as VR, AR, and autonomous driving. Defocus blur is a useful cue for such depth estimation because the size of the blur depends on scene depth. Depth from focus/defocus (DFF) takes defocus images as input for depth estimation. A typical input for DFF is a stack of images, i.e., a focal stack, each image of which is captured with a different focus distance.
DFF methods are roughly divided into two categories: model-based and learning-based. Model-based methods use a thin-lens model to model defocus blur [13,29,30] or define focus measures [21,28] to estimate scene depth. One drawback of such methods is the difficulty of estimating the depth of texture-less surfaces. Learning-based methods have been proposed to tackle this drawback [9,17,33]. For example, Hazirbas et al. [9] proposed a convolutional neural network (CNN) taking a focal stack as input without any explicit defocus model. This is an end-to-end method that allows efficient depth estimation. It also enables depth estimation of texture-less surfaces with learned semantic cues.
General learning-based methods often have limited generalization due to a domain gap between training and test data. Learning-based DFF methods suffer from differences in camera capture settings at training and test times. The amount of defocus blur depends not only on scene depth but also on camera settings such as focus distance, focal length, and f-number. Different depths and camera settings can generate defocus images with the same appearance; thus, this difference cannot be compensated for with often-used domain adaptation methods such as neural style transfer [15,37]. If the camera settings differ at training and test times, the estimated depth has an ambiguity, which is similar to the scale-ambiguity in monocular depth estimation [10]. Current learning-based DFF methods [9,17,33] do not take into account the latent defocus model; thus, the estimated depth is not correct if the camera settings at test time differ from those at training time, as shown in Fig. 1(c). On the other hand, this problem does not matter for model-based methods with explicit defocus models under given camera settings.
We propose learning-based DFF with a lens defocus model. Our method is inspired by recent learning-based multi-view stereo (MVS) [32], where a cost volume is constructed on the basis of a plane sweep volume [4]. The proposed method also constructs a cost volume, which is passed through a CNN to estimate scene depth. Each defocus image in a focal stack is deblurred at each sampled depth in the plane sweep volume, then the consistency is evaluated between the deblurred images. We found that scene depth is effectively learned from the cost volume in DFF. Our method has several advantages over other learning-based methods that directly take a focal stack as input without an explicit defocus model [9,17,33]. First, the output depth satisfies the defocus model because the cost volume imposes an explicit constraint among the scene depth, defocus images, and camera settings. Second, the camera settings, such as focus distance and f-number, are absorbed into the cost volume as an intermediate representation. This enables depth estimation with different camera settings at training and test times, as shown in Fig. 1(d).
The primary contributions of this paper are summarized as follows:
• To the best of our knowledge, this is the first study to combine a learning framework and model-based DFF through a plane sweep volume.
• Our method with camera-setting invariance can be applied to datasets with different camera settings at training and test times, which improves the applicability of learning-based DFF methods.
• Similar to the previous learning-based method [17], our method is also robust against a synthetic-to-real domain gap and achieves state-of-the-art performance.

Related work
Depth from focus/defocus Depth from focus/defocus (DFF) estimates scene depth from focus or defocus cues in captured images and is a major task in computer vision. In general, depth from focus takes many images captured with different focus distances and determines the scene depth from the image with the best focus. On the other hand, depth from defocus aims to estimate scene depth from a small number of images, which do not necessarily need to include focused images [35]. Recent depth estimation from a focal stack implicitly uses both focus and defocus cues; thus, we use the unified terminology depth from focus/defocus. Traditional DFF methods define focus measures to evaluate the amount of defocus blur [19,21,28,38]. If we have a focal stack as input, we can simply refer to the image with noticeable edges and its focus distance. Other methods formulate the amount of defocus blur with a lens defocus model and solve an optimization problem to obtain a depth map together with an all-in-focus image [13,29]. We refer to these methods as model-based methods. One drawback of such methods is the difficulty of estimating the depth of texture-less surfaces. Learning-based methods have been proposed to tackle these issues [9,17,33]. These methods enable depth estimation at texture-less surfaces, and the depth estimation is achieved efficiently in an end-to-end manner. Other learning-based methods leverage defocus cues as additional information [2,3] or supervision [8,26] for monocular depth estimation.
However, current learning-based DFF methods, which directly take a focal stack as input, do not take into account the latent defocus model [9,17,33]. For example, Hazirbas et al. [9] proposed a CNN that directly takes a focal stack as input. Maximov et al. [17] and Wang et al. [33] simply used focus distances as intermediate inputs of neural networks. These methods require the same camera settings at training and test times to obtain a correct depth map due to the lack of explicit defocus models. This characteristic reduces the applicability of learning-based DFF methods. On the other hand, our method is a combination of model-based and learning-based methods through a cost volume, which is computed with a lens defocus model, allowing depth estimation with camera-setting invariance.
Learning from cost volume Learning from a cost volume is efficient in many applications. A cost volume is constructed by sampling the solution space and evaluating costs at each sampled point. Examples of learning-based methods with a cost volume are optical flow [11,27] and disparity [12,18] estimation. Learning-based MVS methods [5,16,32,36] are also major examples, where a cost volume is constructed on the basis of a plane sweep volume [4]. Our method also constructs a plane sweep volume and evaluates the consistency between defocus images in an input focal stack. We found that learning from a cost volume is also efficient for learning-based DFF.

Deep depth from focal stack
Our method combines a learning framework and model-based DFF through a cost volume for depth estimation with camera-setting invariance. We first give an overview of the proposed method, then describe the lens defocus model and the ambiguity of estimated depth in DFF, followed by the details of cost volume construction. This cost volume as an intermediate representation enables depth estimation with different camera settings at training and test times. The network architecture and loss function are discussed at the end of this section.

Overview
Figure 2 shows an overview of the proposed method. Our method is inspired by recent learning-based MVS [32], where a cost volume is constructed on the basis of a plane sweep volume [4]. Our cost volume is constructed from an input focal stack by evaluating deblurred images at each depth hypothesis. This intermediate representation absorbs the difference in camera settings. The computed cost volume and an additional defocus image are passed through a CNN with an encoder-decoder architecture. At the decoder, the cost volume is gradually upsampled for coarse-to-fine estimation. Output depth maps are obtained by applying a differentiable soft argmin operator [12] to the intermediate refined cost volumes. Each upsample block includes a cost aggregation module for learning local structures adaptively.

Lens defocus model
Our cost volume construction is based on a lens defocus model, with which the size of a defocus blur is formulated as a circle of confusion (CoC) [38], as shown in Fig. 3. Let d and d_f be the scene depth and focus distance of a camera, respectively. The CoC can be computed as

c = (b f^2 / N) |d - d_f| / (d (d_f - f)),    (1)

The CoC can thus be computed from the scene depth d and the camera settings f, d_f, N, and b in Eq. (1). Note that these parameters can easily be extracted from EXIF properties [17] or calibrated beforehand [30], and the state-of-the-art methods assume these parameters are known [17,33]; thus, this paper follows the same assumption. Our method realizes depth estimation that is invariant to these camera settings, which improves the applicability of learning-based DFF methods because our method can be applied to datasets with different camera settings at training and test times. We now discuss two ambiguities in DFF due to the camera settings. The first one is scale-ambiguity. From Eq. (1), the following relationship holds:
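Eq. (1) is simple to evaluate directly. A minimal sketch (the function name `coc_pixels` and the numeric camera values are ours, not from the paper):

```python
import numpy as np

def coc_pixels(d, f, d_f, N, b):
    """Diameter of the circle of confusion (CoC) in pixels, Eq. (1).

    d: scene depth [m], f: focal length [m], d_f: focus distance [m],
    N: f-number, b: [px/m] conversion factor.
    """
    return b * (f**2 / N) * np.abs(d - d_f) / (d * (d_f - f))

# A scene point at the focus distance yields no blur.
print(coc_pixels(d=1.0, f=0.05, d_f=1.0, N=1.8, b=1e5))  # 0.0
```

Points in front of or behind the focus distance both yield a positive CoC, which grows with the depth mismatch.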

c(d; f, d_f, N, b) = c(sd; sf, sd_f, N, b/s),    (2)

where s > 0 is an arbitrary scale factor. This means scaled camera settings and depth give the same CoC as that of the original ones.
The other ambiguity is affine-ambiguity. From Eq. (1), we can obtain

c = A(f, d_f, N) (1/d) + B(f, d_f, N),    (3)

where A(f, d_f, N) and B(f, d_f, N) are constants, i.e., the CoC is affine in the inverse depth. Thus, different camera settings and inverse depths can give the same CoC:

A (1/d) + B = A' (1/d') + B'.    (4)

This means the estimated inverse depth has affine-ambiguity (a similar discussion can be found in the previous study [7]). In the experiments, we evaluate the proposed method with respect to the scale-ambiguity in the depth space and the affine-ambiguity in the inverse depth space.
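Both ambiguities can be checked numerically. A small sketch assuming the standard thin-lens CoC, c = b f^2 |d - d_f| / (N d (d_f - f)); the numeric camera values are ours:

```python
import numpy as np

def coc(d, f, d_f, N, b):
    # Thin-lens circle-of-confusion diameter in pixels.
    return b * f**2 / N * abs(d - d_f) / (d * (d_f - f))

d, f, d_f, N, b = 1.2, 0.05, 1.0, 1.8, 1e5
s = 2.0

# Scale-ambiguity: jointly scaled depth and camera settings give the same CoC.
assert np.isclose(coc(d, f, d_f, N, b), coc(s * d, s * f, s * d_f, N, b / s))

# Affine-ambiguity: for d > d_f, the CoC is affine in inverse depth, c = A/d + B.
A = -b * f**2 * d_f / (N * (d_f - f))
B = b * f**2 / (N * (d_f - f))
assert np.isclose(A / d + B, coc(d, f, d_f, N, b))
```

The constants A and B follow from expanding (d - d_f) / (d (d_f - f)) in powers of 1/d; the absolute value in the CoC makes the affine form hold as written only on one side of the focus distance.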

Cost volume
The proposed method computes a cost volume from the focal stack as the input of a CNN to impose a constraint between the defocus images and scene depth. This has several advantages over current learning-based methods that directly take a focal stack as input [9,17,33]. First, the output depth satisfies the lens defocus model because the cost volume imposes an explicit constraint between the defocus images and scene depth. Second, the camera settings are absorbed into the cost volume. This enables inference with camera settings that differ from those at training, and even in this case, the output depth satisfies the lens defocus model without any ambiguities.
Figure 4 shows a diagram of our cost volume construction. We first sample the 3D space in the camera coordinate system with fronto-parallel planes at sampled depths d and compute the cost as

C(u, v, d) = ρ({(I_i *^-1 k(d, d_i))(u, v)}_i),    (5)

where I_i is the i-th image in the focal stack and k(d, d_i) is a blur kernel, the size of which is defined by Eq. (1) with the scene depth d and focus distance d_i. We used a disk-shaped PSF [23,34], while any type of PSF can be used at training and test times. The operator *^-1 indicates a deblurring process applied to each color channel of the input image. We used Wiener-Hunt deconvolution [20] for this process. The function ρ evaluates the consistency between the deblurred images. We adopt the standard deviation for ρ, which allows an arbitrary number of inputs. Note that a similar cost volume computation was proposed in model-based methods [13,29]. However, these methods require an all-in-focus image, which leads to iterative optimization of the scene depth and all-in-focus image; thus, these methods cannot be directly incorporated into sequential learning frameworks. The process mentioned above is the essential part of our cost volume construction. However, differing from a learning-based MVS method [32], which is based on differentiable image warping, our cost volume construction requires careful design because the difference between images due to focus distances is smaller than that due to camera positions. Thus, for robustness and learning stability, the standard deviation in Eq. (5) is computed considering neighboring pixels as follows:

C(u, v, d) = Σ_{(u', v') ∈ N(u, v)} γ_{u', v'} ρ({(I_i *^-1 k(d, d_i))(u', v')}_i),    (6)

where N(u, v) is a set of neighboring pixels centered at (u, v) and γ_{u', v'} is a 2D spatial Gaussian weight. Figure 5 shows an example of the depth estimated only from the index of the minimum cost. The neighboring information can reduce noise, especially for real captured data. We also remove outliers by applying a nonlinear function f(·) that bounds the cost by 1 after computing Eq.
(5). We use a tanh-like function as follows:

f(x) = tanh(artanh(f_1) x / C_max),    (7)

where C_max is the upper bound of the cost; f(x) converges to f_1 as x approaches C_max. We set C_max = 0.3 and f_1 = 0.999. Finally, the cost f(C(u, v, d)) at each pixel is normalized to [0, 1]. As shown in Fig. 6, this post-processing produces a sharp peak at the ground-truth depth. However, with this normalization, such sharp peaks may also appear at texture-less pixels where defocus cues are not effective, which can have negative effects on training. Nevertheless, we found that our network automatically learns effective regions and dramatically improves the accuracy of the estimated depth. We describe the ablation study on this in Section 4.4.
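The whole pipeline — per-hypothesis deblurring, a standard-deviation cost over the stack, a tanh bound, and per-pixel normalization — can be sketched on a toy fronto-parallel scene. This is a simplified stand-in, not the paper's implementation: a plain frequency-domain Wiener filter replaces Wiener-Hunt deconvolution [20], the blur-radius model and all constants are ours, and the neighboring-pixel weighting is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def disk_psf(radius, size=15):
    # Disk-shaped PSF (radius in pixels), normalized to sum to 1.
    y, x = np.mgrid[:size, :size] - size // 2
    psf = ((x**2 + y**2) <= max(radius, 0.5)**2).astype(float)
    return psf / psf.sum()

def pad_psf(psf, shape):
    # Embed the PSF in a full-size array with its center at (0, 0) for FFT use.
    out = np.zeros(shape)
    out[:psf.shape[0], :psf.shape[1]] = psf
    return np.roll(out, (-(psf.shape[0] // 2), -(psf.shape[1] // 2)), axis=(0, 1))

def wiener_deblur(img, psf, nsr=1e-3):
    # Plain frequency-domain Wiener filter; a simple stand-in for Wiener-Hunt.
    H = np.fft.fft2(pad_psf(psf, img.shape))
    G = np.fft.fft2(img)
    return np.real(np.fft.ifft2(np.conj(H) * G / (np.abs(H)**2 + nsr)))

def radius(d, d_i):
    # Toy blur model: kernel radius grows with the inverse-depth gap.
    return 6.0 * abs(1.0 / d - 1.0 / d_i)

# Synthetic focal stack of a fronto-parallel scene at depth 1.0 m.
aif = rng.random((64, 64))
focus_dists = [0.5, 1.0, 2.0]
stack = [np.real(np.fft.ifft2(np.fft.fft2(aif) *
                              np.fft.fft2(pad_psf(disk_psf(radius(1.0, di)),
                                                  aif.shape))))
         for di in focus_dists]

def cost_slice(d):
    # Deblur every stack image under hypothesis d; consistency = std over stack.
    deb = np.stack([wiener_deblur(img, disk_psf(radius(d, di)))
                    for img, di in zip(stack, focus_dists)])
    return deb.std(axis=0)

hyps = [0.5, 0.7, 1.0, 1.5, 2.0]
raw = np.stack([cost_slice(d) for d in hyps])  # cost volume, shape (D, H, W)

# Outlier removal: bound the cost by 1 with a tanh-shaped function.
C_max, f1 = 0.3, 0.999
vol = np.tanh(np.arctanh(f1) * raw / C_max)
# Per-pixel normalization to [0, 1] along the depth axis.
vol = (vol - vol.min(axis=0)) / (vol.max(axis=0) - vol.min(axis=0) + 1e-8)

print("best hypothesis:", hyps[int(raw.mean(axis=(1, 2)).argmin())])
```

At the true depth, every deblurred image approximately recovers the all-in-focus image, so the standard deviation over the stack is small; wrong hypotheses leave residual blur or amplify noise, inflating the cost.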

Architecture and loss function
As shown in Fig. 2, the cost volume and an additional defocus image, which helps the network learn semantic cues [32], are concatenated and passed through the network. The input image is selected from the focal stack, and we found that this selection does not affect the performance of the proposed method. During the training of our model, we selected the image with the farthest focus distance.
The cost volume and input image are passed through the encoder, the architecture of which is the same as that of MVDepthNet [32]. The outputs of the decoder are refined cost volumes C^s_out at different resolutions s ∈ {1/8, 1/4, 1/2, 1}.
At each upsample block, we implement an adaptive cost aggregation module inspired by Wang et al. [31] to aggregate neighboring information; this enables depth estimation with clear boundaries by aggregating focus cues on edge pixels. The cost aggregation module is given as

C'(u, v, d) = Σ_j w_j C(u + Δu_j, v + Δv_j, d),    (11)

where the weight w_j and offset (Δu_j, Δv_j) are learnable parameters for aggregating neighboring information. As shown in Fig. 2, our upsample block first upsamples the input cost volume by a scale factor of 2. The feature map from the encoder is then concatenated with this upsampled cost volume. From this volume, the offsets and weights for adaptive cost aggregation are learned together with a refined cost volume. The final cost volume is obtained by aggregating the neighboring costs following Eq. (11). Figure 7 shows an example of the learned offsets and the output depth with the cost aggregation module, which yields clear boundaries in the estimated depth. The refined cost volume at each resolution is obtained through a softmax layer. Thus, the output depth at each resolution can be computed by applying a differentiable soft argmin operator [12] as follows:

d_s(u, v) = Σ_d d · C^s_out(u, v, d),    (12)

where C^s_out is the softmax-normalized cost volume at resolution s.

Training loss The training loss is defined as the sum of L1 losses between the estimated depth maps d_s and ground-truth depth maps d*_s at different resolutions:

L = Σ_s ||d_s - d*_s||_1.    (13)
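The soft argmin [12] and the multiscale L1 loss can be sketched as follows. This is a NumPy sketch of the common formulation (softmax over negated costs, then the expectation of the depth hypotheses); whether costs are negated before the softmax, and the per-resolution mean inside the L1 term, are our choices:

```python
import numpy as np

def soft_argmin_depth(cost_vol, depths):
    # Differentiable soft argmin [12]: softmax over negated costs along the
    # depth axis, then the expectation of the depth hypotheses.
    m = (-cost_vol).max(axis=0, keepdims=True)   # stabilize the softmax
    p = np.exp(-cost_vol - m)
    p /= p.sum(axis=0, keepdims=True)
    return (depths[:, None, None] * p).sum(axis=0)

def multiscale_l1(preds, gts):
    # Sum of per-resolution L1 losses between predictions and ground truth.
    return sum(np.abs(p - g).mean() for p, g in zip(preds, gts))

depths = np.linspace(0.1, 3.0, 64)
cost = np.ones((64, 8, 8))
cost[20] = -10.0                      # a sharp cost minimum at hypothesis 20
d_hat = soft_argmin_depth(cost, depths)
```

A sharp minimum in the cost volume drives the expectation toward the corresponding depth hypothesis, while the operator stays differentiable for end-to-end training.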

Experiments
We evaluated the proposed method for its camera-setting invariance and compared it with state-of-the-art learning-based DFF methods. Our method can be applied to datasets with camera settings that differ from those of the training dataset.

Implementation
Our network was implemented in PyTorch. The training was done on an NVIDIA RTX 3090 GPU with 24-GB memory. The size of a minibatch was 8 for the training of our model. We trained our network from scratch, and the optimizer was Adam [14] with a learning rate of 1.0 × 10^-4.
During the cost volume construction, we uniformly sampled the depth between 0.1 and 3, and set the number of samples to D = 64.

Dataset
This section describes the datasets for training and evaluation. We used three datasets with metadata of the full camera settings.
DefocusNet dataset [17] This dataset consists of synthetic images, which were generated with physics-based rendering shaders in Blender. The released subset of this dataset has 400 and 100 samples for training and evaluation, respectively. The focal stack of each sample has five images with 256 × 256 resolution. Note that all models were trained only on this synthetic dataset unless otherwise noted.
NYU Depth V2 [24] synthetically blurred by [3] Carvalho et al. [3] generated this dataset by adding synthetic blurs to the NYU Depth V2 dataset [24], which consists of pairs of RGB and depth images. The defocus model was based on Eq. (1) and takes into account object occlusions. The official training and test splits of the NYU Depth V2 dataset contain 795 and 654 samples, respectively. We extracted 256 × 256 patches from the original 640 × 480 images and finally obtained 9540 and 7848 samples for training and evaluation. As in [17], we rescaled the depth range from [0, 10] to [0, 3]. Table 1 lists the camera settings of the DefocusNet dataset [17] and this NYU Depth V2 dataset [3].
Mobile Depth [29] This dataset consists of real focal stacks captured with a mobile phone camera. The images in each focal stack were aligned, and the authors estimated the camera parameters and depth (i.e., there are no actual ground-truth depth maps). This dataset contains only several scenes; thus, we used it only for evaluation.

Data augmentation
In the DefocusNet dataset, defocus cues are effective only at a short distance from the camera [17]. Therefore, we found that our cost volume learned on this dataset is effective only at small depth indices. To enhance the scalability of our cost volume, we scaled the depth maps in the DefocusNet dataset by a scale factor of σ ∈ {1.0, 1.5, 2.0, ..., 9.0} when training our model on this dataset. We also scale the camera parameters together with the depth map so that the scaled depth and camera parameters give the same amount of defocus blur as the original ones; thus, the original focal stack can be used in the scaled sample. This data augmentation is essential for applying our method to other datasets.
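The augmentation can be sketched as follows; the exact bookkeeping of which parameters to scale (f, d_f, and b here) is our reading, chosen so that the CoC of Eq. (1) is unchanged:

```python
import numpy as np

def coc(d, f, d_f, N, b):
    # Thin-lens CoC diameter in pixels, as in Eq. (1).
    return b * f**2 / N * np.abs(d - d_f) / (d * (d_f - f))

def scale_sample(depth_map, f, d_f, N, b, sigma):
    # Scale the depth map and, consistently, the camera parameters so that
    # the CoC -- and hence the original focal stack -- is unchanged.
    return sigma * depth_map, sigma * f, sigma * d_f, N, b / sigma

depth = np.full((4, 4), 1.2)
f, d_f, N, b = 0.05, 1.0, 1.8, 1e5
sd, sf, sdf, sN, sb = scale_sample(depth, f, d_f, N, b, sigma=2.0)
assert np.allclose(coc(depth, f, d_f, N, b), coc(sd, sf, sdf, sN, sb))
```

Because the blur is unchanged, the scaled sample reuses the original focal stack with a rescaled ground-truth depth map, covering a wider range of depth indices during training.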

Ablation study
Table 2 lists the results of the ablation study on the cost volume construction. We separately computed the RMSE on the DefocusNet dataset for each scale factor of the data augmentation. The experimental results demonstrate that normalization (Norm.) dramatically improved the accuracy of depth estimation. Outlier removal (OR) also improved the accuracy, especially at a large depth scale, where depth estimation is more difficult than at a small depth scale, as mentioned in Section 4.3.

Evaluation on different camera settings
We then evaluated the performance of depth estimation with different camera settings at training and test times. Table 3 lists the experimental results on the DefocusNet dataset. DefocusNet [17], a state-of-the-art learning-based DFF method, was compared with our method. We first decomposed each focal stack into two subsets, one with focus distances {0.1, 0.3, 1.5} and the other with {0.15, 0.7}. Both methods were trained only on the subset with focus distances {0.1, 0.3, 1.5} and evaluated on the other subset with different focus distances. Our method outperformed DefocusNet, demonstrating the camera-setting invariance of our method.
We also evaluated the proposed method on the NYU Depth V2 dataset, which has different scene statistics and camera settings from the DefocusNet dataset, as shown in Table 1. Table 4 and Fig. 8 show the experimental results comparing the proposed method with other state-of-the-art learning-based methods, i.e., DDFF [9], AiFDepthNet [33], and DefocusNet [17]. For AiFDepthNet, we used the authors' trained model, and the other methods were re-trained on the DefocusNet dataset. The parameters of DDFF were initialized with VGG16 [25] as in the original paper [9]. For error metrics, we used MAE, RMSE, absolute relative L1 error (Abs Rel), scale-invariant error (sc-inv) [6], and the affine- (scale- and shift-) invariant error in the inverse depth space denoted by ssitrim [22].
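The two invariant metrics might be implemented as follows. This is a sketch: the trimming fraction and the least-squares alignment details of ssitrim are assumptions in the spirit of [22]:

```python
import numpy as np

def sc_inv(pred, gt):
    # Scale-invariant log error [6]; clamp tiny negative variance from rounding.
    e = np.log(pred) - np.log(gt)
    return np.sqrt(max(np.mean(e**2) - np.mean(e)**2, 0.0))

def ssitrim(pred, gt, trim=0.2):
    # Scale-and-shift-invariant trimmed error in inverse-depth space [22]:
    # least-squares align 1/pred to 1/gt, then average all but the worst 20%.
    p, g = 1.0 / pred.ravel(), 1.0 / gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    err = np.sort(np.abs(s * p + t - g))
    k = int(len(err) * (1 - trim))
    return err[:k].mean()

gt = np.array([1.0, 1.5, 2.0, 3.0])
print(sc_inv(2.0 * gt, gt))                      # invariant to a global scale
print(ssitrim(1.0 / (0.5 * (1.0 / gt) + 0.1), gt))  # invariant to affine 1/d
```

By construction, sc_inv vanishes for a globally rescaled prediction, and ssitrim vanishes for a prediction whose inverse depth is any affine transform of the ground-truth inverse depth.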
As shown in the upper part of Table 4, our method outperformed the other methods trained on the DefocusNet dataset by large margins on most evaluation metrics and is comparable to DefocusNet on the affine-invariant error metric in the inverse depth space (ssitrim). This is because the camera settings of the DefocusNet and NYU Depth V2 datasets are different. The other methods cannot handle this difference, and their estimated depths have ambiguity.
We also computed the errors on the depths rescaled by the median of the ratios between the output and ground-truth depths, following [17], to compensate for the scale-ambiguity. For fair comparison, this compensation was also applied to our results. The errors are presented in the middle part of the table. Our method also outperformed the other methods in this comparison. In addition, our method without scaling (Ours) still outperformed the rescaled previous methods (*) on most evaluation metrics. Figure 8 shows examples of the estimated depths. In this figure, the affine-ambiguity of the other methods is compensated by estimating the scales and biases in a least-squares manner (+). Note that the output depths of our method were not rescaled, i.e., our method can estimate depths without any ambiguities. The bottom part of the table shows the experimental results when trained on the NYU Depth V2 dataset. Although DefocusNet performed better than our method in this setting, the accuracy of both methods improved dramatically, as shown in Figs. 8(g) and (h), while DefocusNet is heavily affected by the difference in camera settings between training and test datasets.
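The median rescaling used for this comparison is a one-liner; a sketch of our reading of [17]:

```python
import numpy as np

def median_rescale(pred, gt):
    # Compensate the scale-ambiguity as in [17]: rescale the output by the
    # median ratio between the ground-truth and predicted depths.
    return pred * np.median(gt / pred)

gt = np.array([0.5, 1.0, 2.0, 2.5])
print(median_rescale(0.5 * gt, gt))
```

For a prediction that is off by a single global scale, the median ratio recovers that scale exactly, so the rescaled prediction matches the ground truth.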
Figure 9 shows the experimental results on Mobile Depth with real focal stacks. We set the size of an input focal stack to 3, except for AiFDepthNet [33], which used about 10 to 30 images per focal stack and whose model was trained on the synthetically blurred FlyingThings3D dataset [18]. The figure shows the qualitative comparison with the state-of-the-art learning-based methods, the output depths of which were rescaled by the median of the ratios between them and the outputs of Suwajanakorn et al. [29] (*), following [17]. Note that the output depths of our method were not rescaled. The output depths of our method are qualitatively plausible and satisfy the defocus model under different camera settings. Figure 10 shows the quantitative errors between our method and Suwajanakorn et al. [29] under different sizes of input focal stacks, demonstrating that a few images are enough to obtain effective results with our method. Finally, we show an example of applying the proposed method to real focal stacks captured with our camera, a Nikon D5300 with an f-number of 1.8. The focal stacks were captured with "Focus Stacking Simple" in digiCamControl [1]. All parameters required for the cost volume computation were extracted from EXIF properties, and the focal stack size was 3. Figure 11 shows the qualitative evaluation results. The values of the estimated depth maps are in meters. These results indicate the applicability of our method to real focal stacks.

Table 4. Experimental results on the blurred NYU Depth V2 dataset [3]. We computed errors on the output depth and its rescaled version * because the scales of the output depths of current learning-based methods largely differ due to the camera-setting difference at training and test times. Scale-invariant errors in the depth space (sc-inv [6]) and affine-invariant errors in the inverse depth space (ssitrim [22]) were also computed for fair comparison.

Computation time
Table 5 shows the runtime comparison. We measured the processing time for each test sample in the DefocusNet dataset [17]. The cost volume construction was done on an AMD EPYC 7232P @ 3.1 GHz with 128 GB RAM. The number of depth samples in the cost volume is 64, and the image resolution is 256 × 256. Although the cost volume construction takes a few seconds, the costs at different depth slices in our cost volume can be computed in parallel to reduce the computation time.

Limitations
We finally discuss the limitations of the proposed method due to the explicit lens defocus model.

Dynamic scenes and focus breathing Similar to AiFDepthNet [33], our cost volume computation allows only static scenes. Focus breathing also affects our method. However, as mentioned in [33], simple alignment as preprocessing can solve this problem (in the experiments with real data (Fig. 9), we used aligned focal stacks).
Trade-off between defocus and semantic cues We finally discuss the trade-off between model- and learning-based approaches. Table 6 and Fig. 12 show the results on the DefocusNet dataset. The other learning-based methods outperformed our method. This is because the defocus cues in the DefocusNet dataset are effective only at a short distance from the camera, as mentioned in Section 4.3. The other learning-based methods handle this limitation through semantic cues. Although our method also learns semantic cues, our method with the explicit lens defocus model is more affected by this limitation. For future work, a network architecture should be designed to effectively learn defocus and semantic cues simultaneously.

Conclusion
We proposed learning-based DFF with a lens defocus model. We combined a learning framework and defocus model through the construction of a cost volume. This cost volume serves as an intermediate representation that absorbs differences in camera settings, enabling depth estimation with different camera settings at training and test times, and our method is robust against a synthetic-to-real domain gap.

Figure 1. (a) One of the input images in a focal stack, (b) output depth of [29], (c) output depth of DefocusNet [17], and (d) our result. Our model and DefocusNet were trained on a dataset with camera settings that differed from those of the test data. Our method with camera-setting invariance can estimate the correct depth map.

Figure 2. Overview of our method. Our method takes a focal stack and camera settings as input, then constructs a cost volume as an intermediate representation, which absorbs differences in camera settings. A CNN takes this cost volume together with an additional image as input, then estimates refined cost volumes in a coarse-to-fine manner. Depth maps are computed by applying the soft argmin operator at each resolution. Each upsample block has an adaptive cost aggregation module.

where f is the focal length of the lens and N is the f-number. b [px/m] converts the unit of the CoC from [m] to [px]. When d is equal to d_f, the light rays from the scene point converge on the image plane; otherwise, defocus blur arises, with a size given by the diameter of the CoC. The blurred image can be computed as a convolution of an all-in-focus image with the point spread function (PSF), the kernel size of which corresponds to the size of the CoC.

Figure 4. Cost volume construction. We first sweep a fronto-parallel plane in the camera coordinate system. At each swept plane, each input image in the focal stack is deblurred with Wiener-Hunt deconvolution [20] on each color channel. The standard deviation is applied for computing the cost, which is followed by outlier removal and normalization.

Figure 5. Estimated depth from the index of the minimum cost (a) without and (b) with the neighboring pixels.

Figure 6. Cost plots (b) without and (c) with outlier removal and normalization at the green dot in (a). Red lines indicate the positions of the ground-truth depth indices.

Figure 7. Example of learned offsets for cost aggregation in the blue boxed region in (a). (b) At the beginning of training, the cost is aggregated from nearby grid points. (c) After training, the cost is adaptively aggregated by considering local structures. (e) This yields clear boundaries in the estimated depth.

Figure 8. Qualitative comparison on NYU Depth V2 [3]. In (a)-(f), all models were trained on the DefocusNet dataset [17]. In (g) and (h), both methods were trained on the NYU Depth V2 dataset. Superscript + means that the affine-ambiguity is compensated by estimating scales and biases in a least-squares manner between output and ground truth.

Figure 9. Experimental results on Mobile Depth [29]. Superscript * means that the depth is rescaled by the median of the ratios between the output and Suwajanakorn et al. [29].

Figure 10. Ablation study of focal stack size on Mobile Depth [29]. Horizontal and vertical axes represent RMSE and the size of the input focal stack.

Figure 12. Limitations of our method. (a) One of the input images in a focal stack, (b) ground-truth depth, (c) output depth of AiFDepthNet [33], and (d) that of our method. Defocus cues in the DefocusNet dataset are effective only at a short distance from the camera, and our method with the explicit defocus model is more affected by this limitation.

Table 1. Camera settings of the datasets.

Table 2. Ablation study for cost volume construction on the DefocusNet dataset [17]. The error metric is RMSE, and errors were computed on datasets with different scales of data augmentation.
* Rescaled by the median of ratios between output and ground-truth depths.

Table 6. Experimental results on the DefocusNet dataset.