1 Introduction

2D-to-3D conversion aims to estimate depth from 2D images and generate stereoscopic views from that depth; it is a key technology for producing 3D content [1]. Existing approaches fall mainly into two groups: automatic and semi-automatic methods.

Automatic methods try to create depth from 2D images using various depth cues, such as dark channel [2], motion [3], lighting bias [4], defocus [5], geometry [6], and boundary [7]. Each cue is only applicable to certain scenes [8], so these methods struggle to provide acceptable results on general content. Recently, neural networks have been employed to learn the implicit relation between depth and color values [9–12]. However, these learning-based methods are limited to the image types they were trained on [13].

Semi-automatic methods address these issues by introducing human interaction. Their objective is to produce a dense depth-map from user scribbles that indicate whether the labeled pixels are farther from or closer to the camera [14]. To alleviate the shortage of 3D content, many methods have been developed for depth estimation from user input. Guttmann et al. [15] employed user scribbles to train a support vector machine (SVM) classifier that assigns depth to image patches, but the results may be inaccurate due to misclassifications. Sýkora et al. [16] proposed an interactive method in which the user adds depth (in)equality constraints and formulated depth propagation as an optimization problem, but it may produce artifacts due to incorrect estimation of contour thickness. Rzeszutek et al. [17] utilized the random-walks (RW) algorithm to generate dense depth-maps from user input, but RW has difficulty preserving strong edges. Phan et al. [18] appended graph-cuts (GC) segmentation to the neighbor cost in RW to preserve depth boundaries. Xu et al. [19] proposed a similar method that replaces GC with a fast watershed segmentation. Zhang et al. [20] combined automatic depth estimation from multiple cues with interactive object segmentation to obtain the final depth. Zeng et al. [21] utilized occlusion cues and shape priors to obtain a rough approximation of depth and refined the estimate with an interactive ground-fitting step. These segmentation-based methods can preserve strong edges but may generate artifacts due to incorrect segments. Yuan et al. [22] incorporated non-local neighbors into the RW algorithm to improve depth quality. Liang et al. [23] extended this scheme to video conversion using spatial-temporal information. Wang et al. [24] propagated user-specified sparse depth into dense depth using an optimization method originally developed for colorization [25]. Wu et al. [26] improved this method with depth consistency between superpixels. Liao et al. [27] used a diffusion process to generate a depth map from coarse user annotations.

A depth-map typically consists of smooth regions separated by sharp transitions along the boundaries between objects [28]. Existing semi-automatic methods therefore require that user scribbles do not cross object boundaries; otherwise, the quality of the produced depth degrades significantly. As shown in Fig. 1, when user scribbles cross object boundaries, the state-of-the-art methods [18, 22, 24] produce depth artifacts. In 2D-to-3D conversion, cross-boundary scribbles are introduced by careless user input. For a cross-boundary scribble, the longer part is usually the intended input and the shorter part is unwanted. As Fig. 1 shows, the proposed method can remove the depth artifacts caused by the unwanted portion of cross-boundary scribbles.

Fig. 1

Depth estimation with cross-boundary user input (depth artifacts caused by cross-boundary scribbles are marked by yellow rectangles). a Input image with user scribbles (the cross-boundary scribble is marked by the yellow rectangle). b Groundtruth. c Hybrid GC and RW [18]. d Nonlocal RW [22]. e Optimization [24]. f Proposed. Please zoom in to see details

Semi-automatic image segmentation methods have addressed the problem of cross-boundary scribbles [29–31]. Although Subr et al. [29] and Bai et al. [30] can reduce artifacts caused by cross-boundary scribbles, they focus on foreground object segmentation and are difficult to apply to 2D-to-3D conversion. Oh et al. [31] used the occurrence and co-occurrence probability (OCP) of color values at labeled pixels to estimate the confidence of user input. This method can be used for 2D-to-3D conversion, but it may mistake expected scribbles for unwanted ones.

Surprisingly, few methods consider the impact of cross-boundary scribbles on 2D-to-3D conversion. To address this problem, we propose a robust method based on the residuals between the user-specified and estimated depth values during the iterative solution process. With the confidence of user scribbles measured by these residuals, experimental results show that the proposed method can remove depth artifacts caused by cross-boundary scribbles. The two works most relevant to ours are Wang et al. [24] and Hong et al. [32]. Unlike the optimization model of Wang et al. [24], the proposed method utilizes residuals to eliminate the depth artifacts caused by cross-boundary scribbles. The main difference from Hong et al. [32] is that they use residuals to determine the relative weight between data fidelity and regularization, whereas this paper leverages residuals to compute the confidence of user scribbles.

Recently, Ham et al. [33] proposed a static dynamic filter (SDF) to reduce artifacts caused by structural differences between the guidance and input signals. Although SDF [33] can handle differences in structure, it is not robust to outliers introduced by cross-boundary scribbles. Yuan et al. [34] proposed an ℓ1 optimization method to remove erroneous user scribbles. However, the ℓ1 norm assumes that the input image can be approximated by the sum of a piecewise-constant function and a smooth function [35]. Depth artifacts are introduced when this assumption does not hold.

The remainder of this paper is organized as follows. Section 2 describes the proposed method. Section 3 presents the experimental results. Finally, Section 4 concludes the paper.

2 Method

The workflow of 2D-to-3D image conversion based on the proposed method is shown in Fig. 2. First, the user specifies sparse depth on an input image, where scribbles indicate whether the labeled pixels are closer to or farther from the camera. Second, a sparse depth-map is extracted from the intensities of the user scribbles. Third, the confidence of the user scribbles is calculated based on the residuals between the estimated and user-specified depth values. Then, an energy function constrained by the confidence is designed and minimized to obtain the estimated dense depth-map. The procedure is repeated from the confidence computation step until a maximum number of iterations is reached. Finally, the stereoscopic 3D image is generated by depth image-based rendering (DIBR).

Fig. 2

A flowchart of the semi-automatic 2D-to-3D image conversion with the proposed method

2.1 Model

Let O be the set of pixels with user-specified depth values. The objective of this paper is to estimate an accurate dense depth-map d from the user input and the given image I, even when cross-boundary scribbles are present. This can be expressed as the energy minimization problem:

$$ \mathbf{d} \!= \! \mathop{\arg\min}_{\mathbf{d} \in {\mathbb R}^{n}} \underbrace{\sum\limits_{i \in \mathbf{O}} r_{i} (d_{i} \,-\, u_{i})^{2}}_{\text{data fidelity}} + \underbrace{\sum\limits_{i=1}^{n} \sum\limits_{j \in \mathcal{N}_{i}} w_{ij}(d_{i} \,-\, d_{j})^{2}}_{\text{regularization}}, $$
(1)

where di and ui denote the estimated and user-specified depth values at pixel i, respectively. n is the number of pixels in the input image I. \(\mathcal {N}_{i}\) represents the set of 8-connected neighbors of pixel i. wij is a weighting function that encourages pixels with similar colors to have similar depth values; it is defined as

$$ w_{ij} \,=\, \left\{\begin{array}{ll} \text{exp}\left(-{\beta} \left \| \mathbf{I}_{i} - \mathbf{I}_{j} \right \|^{2}\right) & \text{if }j \in \mathcal{N}_{i}, \\ 0 & \text{otherwise}, \end{array}\right. $$
(2)

where Ii and Ij are the color values of image I at pixels i and j, respectively. β in Formula (2) is a parameter controlling the strength of the weight wij.
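For concreteness, the 8-connected weights in Formula (2) can be assembled into a sparse symmetric matrix as in the following NumPy/SciPy sketch. This is an illustrative implementation, not the authors' code; the helper name neighbor_weights is hypothetical, and color values are assumed to be normalized to [0, 1].

```python
import numpy as np
import scipy.sparse as sp

def neighbor_weights(img, beta):
    """Sparse symmetric matrix of w_ij = exp(-beta * ||I_i - I_j||^2), Formula (2)."""
    if img.ndim == 2:                        # treat a grayscale image as one channel
        img = img[..., None]
    img = img.astype(np.float64)
    h, w = img.shape[:2]
    idx = np.arange(h * w).reshape(h, w)     # flattened pixel indices
    rows, cols, vals = [], [], []
    # four unique 8-connectivity offsets; the symmetric counterparts are added below
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        sy = slice(max(0, -dy), h - max(0, dy))
        sx = slice(max(0, -dx), w - max(0, dx))
        ty = slice(max(0, dy), h - max(0, -dy))
        tx = slice(max(0, dx), w - max(0, -dx))
        i, j = idx[sy, sx].ravel(), idx[ty, tx].ravel()
        sq = np.sum((img[sy, sx] - img[ty, tx]) ** 2, axis=-1)
        wgt = np.exp(-beta * sq).ravel()
        rows += [i, j]; cols += [j, i]; vals += [wgt, wgt]
    rows, cols, vals = map(np.concatenate, (rows, cols, vals))
    return sp.coo_matrix((vals, (rows, cols)), shape=(h * w, h * w)).tocsr()
```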

ri in Formula (1) is a confidence measure of the user-specified depth value at pixel i and is defined as

$$ r_{i} \,=\, \left\{\begin{array}{ll} \text{exp}\left(-{\eta} (d_{i} \,-\, u_{i})^{2}\right) & \text{if }i \in \mathbf{O}, \\ 0 & \text{otherwise}. \end{array}\right. $$
(3)

Here, η is a constant controlling how quickly the confidence decays as the estimated depth deviates from the user-specified value. In Formula (1), the data fidelity term enforces the estimated depth values in labeled regions to approximate the user-specified ones. Unlike Wang et al. [24], the proposed method maintains this consistency only when the user input is confident: the confidence ri is low when the residual (di − ui)2 is high. The regularization term penalizes the difference between the estimated depth value of each pixel and those of its neighbors.
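A direct transcription of Formula (3) might look as follows (a minimal sketch; the helper name confidence and the boolean labeled mask are assumptions of this illustration, not part of the paper):

```python
import numpy as np

def confidence(d, u, labeled, eta):
    """Formula (3): r_i = exp(-eta * (d_i - u_i)^2) on labeled pixels, 0 elsewhere."""
    r = np.zeros_like(d, dtype=np.float64)
    r[labeled] = np.exp(-eta * (d[labeled] - u[labeled]) ** 2)
    return r
```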

2.2 Solver

Formula (1) is nonlinear in d and thus constitutes an unconstrained nonlinear optimization problem. A fixed-point iteration strategy is adopted to solve it. Let \(\mathbf {d}^{k} =\left [d_{i}^{k}\right ]_{n \times 1}\) and u denote the vectors of estimated depth values in iteration k and of user-specified depth values, respectively. The i-th element of u is the user-specified depth value ui if i ∈ O and 0 otherwise. Then, in iteration k, the objective function to be minimized is expressed as

$$ E\left(\mathbf{d}^{k}\right) = \left(\mathbf{d}^{k} - \mathbf{u}\right)^{T}\mathbf{R}^{k-1}\left(\mathbf{d}^{k} - \mathbf{u}\right) + \lambda \mathbf{d}^{k,T}\mathbf{L}\mathbf{d}^{k}, $$
(4)

where Rk − 1 is an n×n diagonal matrix whose i-th diagonal element is \(r_{i}^{k\,-\,1}\). Here, \(r_{i}^{k\,-\,1} = \text {exp}\left (-{\eta } \left (d_{i}^{k\,-\,1} \,-\, u_{i}\right)^{2}\right)\) if i ∈ O and 0 otherwise. λ balances the data fidelity and regularization terms; Formula (1) corresponds to λ = 1. L is the n×n sparse Laplacian matrix with elements Lij = −wij (i ≠ j) and \(L_{ii} = \sum _{j \in \mathcal {N}_{i}} w_{ij}\). Taking the derivative of the energy function in Formula (4) with respect to dk gives Formula (5).

$$ \frac{\partial E\left(\mathbf{d}^{k}\right)}{\partial \mathbf{d}^{k}} = 2\mathbf{R}^{k-1}\left(\mathbf{d}^{k} - \mathbf{u}\right) + 2\lambda \mathbf{L}\mathbf{d}^{k}. $$
(5)

The energy function in Formula (4) is minimized by setting \(\frac {\partial E\left (\mathbf {d}^{k}\right)}{\partial \mathbf {d}^{k}}\) in Formula (5) to zero, which yields Formula (6).

$$ \left(\mathbf{R}^{k-1} + \lambda \mathbf{L}\right) \mathbf{d}^{k} = \mathbf{R}^{k-1} \mathbf{u}. $$
(6)

The linear system in Formula (6) is sparse and can therefore be solved with standard methods such as the preconditioned conjugate gradient method.
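The full iterative scheme of this section can be sketched as below, reusing the neighbor_weights and confidence helpers from the previous sketches. Depth values are assumed to be normalized to [0, 1]; the initialization from the sparse depth, the Jacobi preconditioner, and the default parameter values are illustrative choices rather than the authors' implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def estimate_depth(img, u, labeled, beta=100.0, eta=9000.0, lam=1.0, iters=5):
    """Iteratively solve (R^{k-1} + lam * L) d^k = R^{k-1} u, i.e., Formula (6)."""
    W = neighbor_weights(img, beta)                         # Formula (2), sparse n x n
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W     # graph Laplacian
    d = u.ravel().astype(np.float64).copy()                 # start from the sparse depth
    for _ in range(iters):
        r = confidence(d, u.ravel(), labeled.ravel(), eta)  # Formula (3)
        A = sp.diags(r) + lam * L                           # SPD if any pixel is labeled
        b = r * u.ravel()
        M = sp.diags(1.0 / A.diagonal())                    # Jacobi preconditioner
        d, _ = cg(A, b, x0=d, M=M)                          # preconditioned CG
    return d.reshape(u.shape)
```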

2.3 Analysis

It can be seen from Formula (4) that, in each iteration, user-specified depth values are preserved only if the residuals between the estimated and user-specified depth values are small.

Specifically, the unwanted input introduced by cross-boundary scribbles makes the depth values of the labeled pixels differ from those of their neighbors. Meanwhile, the regularization term enforces the estimate to be consistent with the neighbors and thus makes the estimated depth deviate from the user input. As a result, the residual between the estimated and user-specified depth values of an unwantedly labeled pixel increases, and the confidence computed from this residual via Formula (3) decays toward zero during the iterative solution process. Therefore, the proposed method can remove the unwanted input introduced by cross-boundary scribbles.

For expected user input, the specified values of the labeled pixels are consistent with their neighbors; the estimate therefore depends mainly on the data fidelity term, which enforces the estimated depth to approximate the user input. Consequently, the residuals at expectedly labeled pixels remain close to 0, and their confidence remains at 1 with a proper setting of η in Formula (3). For this reason, the proposed method preserves the expected user input.
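As a worked example of this behavior, assume depth values are normalized to [0, 1] and η = 9000 (the setting used in Section 3). A residual of 0.05 already drives the confidence to essentially zero, whereas a residual of 0.005 leaves it close to one:

$$ \text{exp}\left(-9000 \times 0.05^{2}\right) \approx 1.7 \times 10^{-10}, \qquad \text{exp}\left(-9000 \times 0.005^{2}\right) \approx 0.80. $$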

Figure 3 plots the evolution of the confidence of the user scribbles for an input image. The confidence of the unwanted input rapidly drops to 0, while that of the expected input remains at 1.

Fig. 3

Evolution of the confidence of user scribbles during the iterative solution process, where the blue and yellow curves correspond to the scribbles inside the blue and yellow rectangles, respectively

3 Experimental results and discussion

3.1 Experimental details

The RGBZ (red, green, blue plus z-axis depth) datasets [36], which include objects, human figures, and multi-person interactions, are used for comparison. Performance is also evaluated on four Middlebury stereo datasets: Tsukuba, Venus, Teddy, and Cones [37]. The source code and additional experimental results can be downloaded from https://github.com/tcyhx/rdopt.

In the proposed method, the bandwidth parameter η is empirically set to 9000. A maximum of five iterations is used to solve Formula (1). β is set to 100 for the RGBZ datasets and 50 for the Middlebury datasets. Results of the proposed method are compared to the state of the art: RW [17], hybrid GC and RW (HGR) [18], nonlocal RW (NRW) [22], optimization (OPT) [24], OCP [31], SDF [33], and ℓ1 [34]. Note that OCP was originally designed for interactive segmentation; this paper applies it to 2D-to-3D conversion by replacing the confidence in Formula (3) with the aggregation of the OCPs in a local neighborhood. Structural similarity (SSIM) [38] is used for performance evaluation since it predicts human perception of image quality. The standard deviation of the SSIM weighting window is set to 4 in the experiments so as to evaluate the similarity of semi-global structure [39].
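For reference, the SSIM configuration described above can be reproduced with scikit-image as sketched below. This is an assumption about tooling (the authors' evaluation code is not specified), and the data_range of 255 presumes 8-bit depth-maps.

```python
from skimage.metrics import structural_similarity

def depth_ssim(estimated, groundtruth):
    # Gaussian-weighted SSIM with a standard deviation of 4, as used for the
    # semi-global evaluation in this section (8-bit depth-maps assumed).
    return structural_similarity(estimated, groundtruth,
                                 gaussian_weights=True, sigma=4,
                                 use_sample_covariance=False,
                                 data_range=255)
```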

In the experiments, a trained user is asked to draw scribbles with a standard brush by referring to the groundtruth depth values, where higher intensities indicate that the labeled pixels are closer to the camera. Since depth propagation from user scribbles relies on color or intensity similarity between neighboring pixels, more scribbles are drawn in highly textured areas. To make the comparison as fair as possible, a sparse depth-map is extracted from the user scribbles, and each algorithm estimates a dense depth-map from this sparse depth-map, as sketched below.
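A minimal sketch of this extraction step, assuming the scribbles are provided as an intensity image plus a binary mask of drawn pixels (both assumptions of this illustration, not the paper's specification):

```python
import numpy as np

def sparse_depth_from_scribbles(scribble_intensity, scribble_mask):
    # Scribble intensities become the user-specified depth u; unlabeled pixels are 0.
    labeled = scribble_mask.astype(bool)
    u = np.where(labeled, scribble_intensity, 0).astype(np.float64)
    return u, labeled
```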

3.2 Experiments with cross-boundary user scribbles

In this section, a user is asked to assign the initial depth values manually by drawing scribbles, some of which cross object boundaries. Tables 1 and 2 show the SSIM values of the proposed algorithm in comparison with the other methods on the RGBZ and Middlebury datasets, respectively. As shown in Tables 1 and 2, the proposed method achieves the highest average SSIM among all competing methods on both datasets. Except for the comparisons with ℓ1 on RGBZ_05 and Teddy, the SSIM values of the proposed method are higher than those of the other methods.

Table 1 SSIM of estimated depth on RGBZ datasets when cross-boundary scribbles are present
Table 2 SSIM of estimated depth on Middlebury datasets when cross-boundary scribbles are present

For the RGBZ datasets, qualitative comparisons are shown in Figs. 4, 5, 6, 7, 8, 9, 10, 11 and 12. Qualitative comparisons on the Middlebury datasets are given in Figs. 13, 14, 15, and 16. The rendered images based on depth are shown only for the Middlebury datasets to keep the paper from becoming overly long. In each figure, the yellow rectangles on the depth-maps or synthesized views mark artifacts caused by cross-boundary scribbles, while the purple ones mark artifacts caused by other issues. The cross-boundary scribbles in the user-labeled images are marked by yellow rectangles (Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12b, 13, 14, 15, and 16a).

Fig. 4

Results of RGBZ_01 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 5

Results of RGBZ_02 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 6

Results of RGBZ_03 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 7

Results of RGBZ_04 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 8

Results of RGBZ_05 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 9

Results of RGBZ_06 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 10

Results of RGBZ_07 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 11

Results of RGBZ_08 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 12

Results of RGBZ_09 with cross-boundary input. a Input image. b User-labeled image. c Sparse depth. d Groundtruth depth. e Depth of RW. f Depth of HGR. g Depth of NRW. h Depth of OPT. i Depth of OCP. j Depth of SDF. k Depth of ℓ1. l Depth of the proposed method. Please zoom in to see details

Fig. 13

Results of Tsukuba with cross-boundary input. a User-labeled image. b Sparse depth. c Groundtruth depth. d Synthesized view using c. e Depth of RW. f Synthesized view using e. g Depth of HGR. h Synthesized view using g. i Depth of NRW. j Synthesized view using i. k Depth of OPT. l Synthesized view using k. m Depth of OCP. n Synthesized view using m. o Depth of SDF. p Synthesized view using o. q Depth of ℓ1. r Synthesized view using q. s Depth of the proposed method. t Synthesized view using s. Please zoom in to see details

Fig. 14

Results of Venus with cross-boundary input. a User-labeled image. b Sparse depth. c Groundtruth depth. d Synthesized view using c. e Depth of RW. f Synthesized view using e. g Depth of HGR. h Synthesized view using g. i Depth of NRW. j Synthesized view using i. k Depth of OPT. l Synthesized view using k. m Depth of OCP. n Synthesized view using m. o Depth of SDF. p Synthesized view using o. q Depth of ℓ1. r Synthesized view using q. s Depth of the proposed method. t Synthesized view using s. Please zoom in to see details

Fig. 15

Results of Teddy with cross-boundary input. a User-labeled image. b Sparse depth. c Groundtruth depth. d Synthesized view using c. e Depth of RW. f Synthesized view using e. g Depth of HGR. h Synthesized view using g. i Depth of NRW. j Synthesized view using i. k Depth of OPT. l Synthesized view using k. m Depth of OCP. n Synthesized view using m. o Depth of SDF. p Synthesized view using o. q Depth of ℓ1. r Synthesized view using q. s Depth of the proposed method. t Synthesized view using s. Please zoom in to see details

Fig. 16

Results of Cones with cross-boundary input. a User-labeled image. b Sparse depth. c Groundtruth depth. d Synthesized view using c. e Depth of RW. f Synthesized view using e. g Depth of HGR. h Synthesized view using g. i Depth of NRW. j Synthesized view using i. k Depth of OPT. l Synthesized view using k. m Depth of OCP. n Synthesized view using m. o Depth of SDF. p Synthesized view using o. q Depth of ℓ1. r Synthesized view using q. s Depth of the proposed method. t Synthesized view using s. Please zoom in to see details

RW [17] assumes that user scribbles do not cross object boundaries and thus generates depth artifacts around cross-boundary labeled regions (see Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16e). These artifacts cause distortions when a new view is synthesized from the depth, as shown in Figs. 13, 14, 15, and 16f. HGR [18] relies on GC to preserve depth boundaries. However, GC is sensitive to outliers, so the quality of the depth-maps produced by HGR degrades significantly when user scribbles cross object boundaries (see Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12f, 13, 14, 15, and 16g), which in turn degrades the quality of the synthesized views (see Figs. 13, 14, 15, and 16h). Although it introduces non-local constraints, NRW [22] has difficulty removing depth artifacts caused by cross-boundary user scribbles (see Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12g, 13, 14, 15, and 16i), which results in distortions in the synthesized views (see Figs. 13, 14, 15, and 16j). OPT [24] constrains the estimated depth values of labeled pixels to be consistent with the user input; thus, unwanted information propagates to the neighbors (see Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12h, 13, 14, 15, and 16k). Distortions in the synthesized views caused by input errors are shown in the yellow rectangles of Figs. 13, 14, 15, and 16l. OCP [31] can remove some depth artifacts caused by cross-boundary user input, but it fails when the cross-boundary-labeled pixels have similar color distributions; residual artifacts thus remain visible (see Figs. 4, 5, 6, 7i, 10, 11, 12i, 13, and 14m). OCP may also mistake expected scribbles for unwanted ones [31], which yields distortions as shown in the purple rectangles of Figs. 7, 8, 9i, 14, 15, and 16m. SDF [33] can reduce depth artifacts caused by structural differences between the color and depth images by using the Welsch function as a regularizer. However, SDF struggles to handle artifacts introduced by cross-boundary scribbles (see Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12j, 13, 14, 15, and 16o), which leads to distortions in the synthesized views, as shown in Figs. 13, 14, 15, and 16p. ℓ1 [34] tends to produce a nearly piecewise-constant depth-map with sparse structures. Therefore, it generates artifacts when depth discontinuities do not coincide with object boundaries (see the purple rectangles of Figs. 4, 5, 6, 7, 8, 9k, 14q, and 16q), which causes distortions in the synthesized views (see the purple rectangles of Figs. 14r and 16r). The proposed method successfully alleviates the influence of cross-boundary user scribbles and produces high-quality depth-maps (see Figs. 4, 5, 6, 7, 8, 9, 10, 11, and 12l, and 13, 14, 15 and 16s). It therefore reduces the distortions in synthesized views caused by cross-boundary input, as shown in Figs. 13, 14, 15 and 16t.

3.3 Experiments without cross-boundary user scribbles

In this section, the user draws carefully on the input image, ensuring that the scribbles do not cross object boundaries. In this case, unwanted scribbles usually lie inside objects where depth discontinuities occur. Tables 3 and 4 show the SSIM obtained by the different methods on the RGBZ and Middlebury datasets, respectively. Table 3 shows that the proposed method gives the highest average SSIM on the RGBZ datasets. As shown in Table 4, the proposed method and OPT [24] both obtain the highest average SSIM on the Middlebury datasets. Therefore, the proposed method performs comparably to the state-of-the-art methods when user scribbles do not cross object boundaries.

Table 3 SSIM of estimated depth on RGBZ datasets when cross-boundary scribbles are absent
Table 4 SSIM of estimated depth on Middlebury datasets when cross-boundary scribbles are absent

4 Conclusion

To remove unwanted input from cross-boundary scribbles in semi-automatic 2D-to-3D conversion, this paper proposes a residual-driven energy function for depth estimation from user input. The residual between the estimated and user-specified depth values is large at unwantedly labeled pixels, which are inconsistent with their neighbors, and small at expectedly labeled pixels, which are consistent with theirs. The residual can therefore differentiate unwanted scribbles from expected user input. The experimental results demonstrate that the proposed method effectively eliminates the depth artifacts caused by cross-boundary scribbles and outperforms existing methods when cross-boundary input is present.