
1 Introduction

Estimation of the foreground and background layers of an image is fundamental in image and video editing. In the computer vision literature, this problem is known as image matting or alpha matting. Mathematically, the process is modeled in [1] by considering the observed color of a pixel as a combination of a foreground color and a background color:

$$\begin{aligned} I_{z} = \alpha _{z}F_{z} + (1-\alpha _{z})B_{z} \end{aligned}$$
(1)

where \(F_{z}\) and \(B_{z}\) are the foreground and background colors of pixel z, and \(\alpha _{z}\) represents the opacity of a pixel and takes values in the range [0,1], with \(\alpha _{z} = 1\) for foreground pixels and \(\alpha _{z} = 0\) for background pixels. This is a highly ill-posed problem since we have to estimate seven unknowns from three composition equations for each pixel - one for each color channel. Typically, matting approaches rely on constraints such as assumptions on image statistics [2, 3] or user interactions such as a trimap to reduce the solution space. A trimap [4] partitions an image into three regions - known foreground, known background and unknown regions that consist of a mixture of foreground and background colors.

From the aspect of assumptions on image statistics, existing natural image matting methods fall into three categories: (1) propagation-based [2, 5–10]; (2) color sampling-based [11–18]; (3) combinations of sampling-based and propagation-based methods [19–22]. Propagation-based methods assume that neighboring pixels are correlated under some image statistics and use their affinities to propagate alpha values from known regions toward unknown ones. Sampling-based methods assume that the foreground and background colors of an unknown pixel can be explicitly estimated by examining nearby pixels. Thus, these methods collect sets of known foreground and background samples to estimate the alpha values of unknown pixels. Early parametric sampling-based methods fit parametric statistical models to known foreground and background samples and then estimate alpha values by considering the distances of unknown pixels to the known foreground and background distributions. However, they generate large fitting errors when the color distributions do not fit a statistical model well. More recent non-parametric sampling-based methods simply collect sets of known foreground and background samples and select the best (F, B) pairs via an objective function combining spatial, photometric and probabilistic characteristics of the image to estimate the alpha values of unknown pixels. Once the best (F, B) pair is selected, the alpha value is computed as

$$\begin{aligned} \alpha _{z} = \frac{{(I_{z}-B)}\cdot {(F-B)}}{\parallel {F-B}\parallel ^2} \end{aligned}$$
(2)
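Equation (2) simply projects the observed color onto the line spanned by the chosen pair. As a minimal numpy sketch (the function name, the small `eps` guard against F = B, and the clipping to [0, 1] are our additions, not part of the paper):

```python
import numpy as np

def alpha_from_pair(I, F, B, eps=1e-8):
    """Estimate alpha by projecting the observed color I onto the
    line segment between F and B, as in Eq. (2)."""
    I, F, B = (np.asarray(v, dtype=float) for v in (I, F, B))
    d = F - B
    a = np.dot(I - B, d) / (np.dot(d, d) + eps)   # eps guards against F == B
    return float(np.clip(a, 0.0, 1.0))            # opacity lies in [0, 1]
```

If the observed color lies exactly on the segment between F and B, the estimate recovers the true mixing ratio; otherwise it returns the least-squares projection.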

Combined methods [19–22] cast matting as an optimization problem and combine a color sampling component and an alpha propagation component in an energy function; solving for the alpha matte then becomes an energy minimization task. By this combination, more accurate and robust matting solutions can be expected. For a more comprehensive review of image matting methods, we refer the reader to [21, 23].

Fig. 1.

Sampling-based matting approaches. Top and middle row: (a) An original image with foreground and background boundaries marked as red and blue lines respectively, (e) the ground truth alpha matte, sampling strategies of (b) Robust [21], (c) Shared [13], (d) Global [14], (f) Comprehensive [16], (g) KL-Divergence [18] and (h) the proposed matting approach. Bottom row: Comparison of the alpha mattes estimated by the proposed approach with the Comprehensive sampling method [16] and the KL-Divergence based sparse sampling method [18] (from left to right: zoomed area and corresponding alpha mattes estimated by [16, 18] and the proposed method, respectively) (Color figure online)

The matting method proposed in this paper belongs to the group of sampling-based approaches. As we discuss in detail in the next section, these approaches suffer from the fact that the quality of the extracted matte depends highly on the selected samples, and performance degrades when the true foreground and background colors (true samples) of unknown pixels are not in the sample sets. Existing sampling-based methods sample foreground and background colors based only on their spatial closeness to the unknown pixels (sampling around the boundaries of the known regions [8, 13, 21] or expanding the sampling range for pixels farther from the foreground and background boundaries [16]), which leads to missing true samples, especially when the trimaps are coarse. To overcome this problem, we build a large set of representative samples that covers all the color clusters in the image to avoid the loss of true samples, and then select a set of candidate samples for each unknown pixel from these representative samples via an objective function that takes advantage of the spatial as well as color statistics of the samples. The samples selected by the proposed method are shown in Fig. 1(h).

The second disadvantage of current non-parametric sampling-based approaches is that they choose the best (F, B) pair from the candidate samples through optimization and use that pair to estimate \(\alpha _{z}\) via Eq. (2). This implies that \(\alpha _{z}\) is determined by a single (F, B) pair, and the goodness of that pair depends on how well the optimization is done. Thus, if the optimization process fails to find the best pair, the extracted matte will not be accurate. Inspired by sparse coding matting [17], a new objective function is proposed that estimates the alpha value of a given pixel directly from a set of candidate foreground and background samples instead of from a single best pair. This objective function contains measures of both chromatic distortion and spatial statistics of the image, which is the main difference from the original sparse coding matting [17], which uses chromatic distortion only.

This paper is organized as follows. We review sampling-based matting methods and their limitations in Sect. 2 followed by description of the proposed approach in Sect. 3. Experimental results are discussed in Sect. 4 and we conclude the paper in Sect. 5.

2 Related Work

Sampling-based image matting methods mainly differ from each other in (1) how they collect the candidate foreground and background samples for unknown pixels, and (2) how they estimate the alpha matte from the candidate samples.

Samples Collection: Early sampling-based methods simply collect foreground and background samples that are spatially close to the unknown pixel, either from a local window containing the unknown pixel [11] or along the boundaries of known regions [12, 21], based on local smoothness assumptions. This causes large fitting errors when the assumptions do not hold.

Shared sampling matting [13] shoots rays in different directions from each unknown pixel, dividing the image plane into disjoint sectors of equal planar angle, and collects samples along the rays. For each ray, it collects at most one background and at most one foreground sample - the ones closest to the unknown pixel along the ray, as shown in Fig. 1(c). Global sampling matting [14] proposes an approach that takes all available samples into consideration. Its foreground (background) sample set consists of all known foreground (background) pixels on the boundaries of unknown regions, as shown in Fig. 1(d).

The aforementioned sampling-based methods generally collect samples only around the boundaries of the known regions which may miss out true samples. Comprehensive sampling matting [16] builds a more comprehensive and representative set of known samples by expanding the sampling range farther from the foreground and background boundary and sampling from all color distributions in the sample regions as shown in Fig. 1(f). This approach gives better results than the previous sampling-based approaches. However, there is still a possibility of missing out true samples since the sampling strategy depends on spatial closeness. KL-Divergence sampling matting [18] formulates sampling as a row-sparsity regularized trace minimization problem and picks a small set of candidate samples that best explain the unknown pixels based on pairwise dissimilarities between known and unknown pixels as shown in Fig. 1(g). This method gathers a uniform sparse set of samples for all unknown pixels which might also miss out true samples. A visual comparison of the alpha mattes estimated by the proposed method with comprehensive [16] and KL-Divergence sampling methods [18] is shown in the last row of Fig. 1.

Alpha Matte Estimation: Classical parametric sampling-based image matting algorithms focus on how to model the relations between the samples and the alpha parameter. The Knockout method [12] adopts a weighted sum of candidate samples to estimate foreground and background colors of unknown pixels and uses them to estimate the alpha value in each channel. The final alpha value is estimated as a weighted sum of the values in all channels. Bayesian matting [11] models foreground and background colors as mixtures of Gaussians and the matting problem is formulated in a well-defined Bayesian framework, then the matte is solved with a maximum-likelihood criterion.

Because fitting statistical models in parametric sampling-based methods can estimate alpha values poorly, recent non-parametric sampling-based approaches focus on selecting the best foreground and background sample pair (F, B) from candidate samples and using that pair to estimate the alpha value via Eq. (2), as suggested in [13–16, 18, 21]. They use an objective function containing different image characteristics to find the best (F, B) pair; these methods differ from each other in which image characteristics they use.

In non-parametric sampling-based methods, the alpha values are determined by a single (F, B) pair; thus, when the designed objective function fails to find the best sample pair, an inaccurate alpha matte is generated. To overcome this limitation, sparse coding matting [17] casts image matting as a sparse coding problem and generates alpha values from a set of foreground and background samples instead of choosing a single best (F, B) pair. This approach gives a visually superior matte compared to previous non-parametric sampling-based approaches.

3 Proposed Method

In this section, we first describe our clustering-based sampling method, which collects a representative set of samples from all known pixels. Next, a simple objective function is proposed to select a set of candidate foreground and background samples for each unknown pixel from the previously collected representative set. Then, we elaborate an objective function, containing both chromatic distortion and spatial statistics, that estimates alpha values directly from a set of foreground and background samples. Finally, we describe the pre- and post-processing steps used to refine the matting performance.

3.1 Gathering Samples Using K-means Clustering

The goal of sampling is to gather a representative set of foreground and background samples that covers a large range of diverse color clusters in the image so as not to miss out true samples. This is accomplished by clustering the foreground and background pixels respectively via a two-level hierarchical k-means clustering framework considering the spatial statistics as well as the color statistics in the image. This is motivated by the observation that the foreground and background colors in an image could be represented by a sparse set of pixels.

For the foreground region defined by a trimap, we first cluster the pixels into K clusters. We define the feature vector q(z) at a given pixel z as a 5-D vector \([R_z\ G_z\ B_z\ x_z\ y_z]^T\) consisting of the concatenation of the RGB color and the spatial position in image coordinates. Then, we create a matrix Q such that each column corresponds to the feature vector of one known foreground pixel. Thus, we can treat Q as the data matrix in the k-means clustering algorithm [24]. After the first level of clustering, the same clustering process is applied to the pixels in each cluster, but with respect to color statistics only. The number of clusters in the second level is determined by the summed color variances of the three color channels in each first-level cluster. The mean color values of each second-level cluster constitute the representative set of foreground samples. Using exactly the same method, a representative set of background samples is obtained.
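The two-level scheme can be sketched as follows. The helper `kmeans`, the default cluster count `k1`, and the variance-driven rule for the second-level cluster count `k2` are illustrative assumptions; the paper does not specify its exact settings:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def representative_samples(rgb, xy, k1=3, var_scale=1.0):
    """Level 1: cluster 5-D [R G B x y] features; level 2: re-cluster each
    group on color only, with the sub-cluster count growing with the summed
    per-channel color variance. Returns the mean color of every sub-cluster."""
    rgb, xy = np.asarray(rgb, float), np.asarray(xy, float)
    labels, _ = kmeans(np.hstack([rgb, xy]), min(k1, len(rgb)))
    samples = []
    for j in np.unique(labels):
        colors = rgb[labels == j]
        # illustrative rule: more sub-clusters for higher color variance
        k2 = max(1, min(len(colors), 1 + int(var_scale * colors.var(axis=0).sum())))
        sub, _ = kmeans(colors, k2)
        for c in np.unique(sub):
            samples.append(colors[sub == c].mean(axis=0))
    return np.array(samples)
```

Running the same routine on the known background pixels yields the background sample set.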

3.2 Selecting Candidate Samples

In the k-means clustering sample gathering step, we collect two large sets of foreground and background samples that cover the various color clusters in the image for all the unknown pixels. To reduce the number of hypotheses to be tested when estimating the alpha matte, for each unknown pixel z we choose from the representative sample sets a set of candidate samples that better represent the true foreground and background colors of the pixel. Hence, a simple objective function \(O_{z}\) adopting previously suggested measures of chromatic distortion \(C_{z}\) and spatial statistics \(S_{z}\) [8, 15, 16, 18] is proposed:

$$\begin{aligned} O_{z}(F_{i}, B_{j}) = C_{z}(F_{i}, B_{j}) \times S_{z}(F_{i}, B_{j}) \end{aligned}$$
(3)

\(C_{z}\) quantifies how well the estimated alpha value \(\alpha _{z}\) of pixel z obtained using Eq. (2) from a sample pair \((F_{i}, B_{j})\) fits the linear model of composition Eq. (1), and is given by:

$$\begin{aligned} C_{z}(F_{i}, B_{j}) = exp(-\parallel {I_{z}-(\alpha _{z}F_{i} + (1-\alpha _{z})B_{j})}\parallel ) \end{aligned}$$
(4)

where \(I_{z}\) denotes the observed color of unknown pixel z. \(C_{z}\) has a high value for an (F, B) pair whose estimated alpha fits the linear composition equation well.

The term \(S_{z}\) quantifies the closeness between the unknown pixel z and the sample pair (F, B) in the spatial coordinate domain. It is formulated as:

$$\begin{aligned} S_{z}(F_{i}, B_{j}) = exp(-\frac{\parallel {z-F_{i}^{s}}\parallel }{Z^{F}}) \times exp(-\frac{\parallel {z-B_{j}^{s}}\parallel }{Z^{B}}) \end{aligned}$$
(5)

where \(F_{i}^{s}\) denotes the spatial coordinates of foreground sample \(F_{i}\) and \(B_{j}^{s}\) denotes the spatial coordinates of \(B_{j}\). \(Z^{F}=\frac{1}{|\mathcal {S}^{F}|}\sum _{F_{k}\in \mathcal {S}^{F}}\parallel {z-F_{k}^{s}}\parallel \) and \(Z^{B} = \frac{1}{|\mathcal {S}^{B}|}\sum _{B_{k}\in \mathcal {S}^{B}}\parallel {z-B_{k}^{s}}\parallel \) are scaling factors corresponding to the mean spatial distances from the unknown pixel z to all the foreground samples \(\mathcal {S}^{F}\) (with \(|\mathcal {S}^{F}|\) elements) and to all the background samples \(\mathcal {S}^{B}\) (with \(|\mathcal {S}^{B}|\) elements), respectively. Hence, this term tends to select samples that are spatially close to the unknown pixel.

Finally, for each pixel z, we select the N foreground and background pairs with the highest values of the objective function (3). The foreground and background samples of these N pairs constitute the foreground sample set \({\mathcal {S}_{z}^F}\) and the background sample set \({\mathcal {S}_{z}^B}\) of the unknown pixel z.
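The selection step can be sketched as a brute-force scoring of all pairs with Eqs. (3)-(5). The function names and the toy default `N=3` are our assumptions, and a real implementation would vectorize the double loop:

```python
import numpy as np

def alpha_from_pair(I, F, B, eps=1e-8):
    """Eq. (2): project I onto the F-B line segment."""
    d = F - B
    return float(np.clip(np.dot(I - B, d) / (np.dot(d, d) + eps), 0.0, 1.0))

def select_candidates(Iz, z, Fcols, Fpos, Bcols, Bpos, N=3):
    """Rank all (F, B) pairs by O_z = C_z * S_z (Eqs. 3-5), keep the top N.
    Returns (score, fg_index, bg_index) triples, best first."""
    ZF = np.mean([np.linalg.norm(z - p) for p in Fpos]) + 1e-8
    ZB = np.mean([np.linalg.norm(z - p) for p in Bpos]) + 1e-8
    scored = []
    for i, (F, fp) in enumerate(zip(Fcols, Fpos)):
        Sf = np.exp(-np.linalg.norm(z - fp) / ZF)
        for j, (B, bp) in enumerate(zip(Bcols, Bpos)):
            a = alpha_from_pair(Iz, F, B)
            C = np.exp(-np.linalg.norm(Iz - (a * F + (1 - a) * B)))  # Eq. (4)
            S = Sf * np.exp(-np.linalg.norm(z - bp) / ZB)            # Eq. (5)
            scored.append((C * S, i, j))
    scored.sort(key=lambda t: -t[0])
    return scored[:N]
```

A pair whose composite exactly reproduces the observed color and whose samples sit close to the pixel dominates the ranking, as intended by Eq. (3).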

Figure 2 shows the sampling process of the proposed method. The original image is shown in Fig. 2(a); its trimap consists of background, unknown and foreground regions labeled as black, gray and white respectively, as shown in Fig. 2(b). The foreground and background clusters obtained by the two-level k-means clustering framework are shown in Fig. 2(c), with clusters represented by different colors. Figure 2(d) shows the selected candidate samples (with red and blue points representing foreground and background samples, respectively) for pixel p (yellow point). As can be seen, the proposed sampling strategy selects foreground and background samples from the known regions for each pixel while avoiding missing true samples.

Fig. 2.

Cluster-based sampling and candidate samples choosing. (a) Original image. (b) Trimap. (c) Foreground and background clusters via two-level k-means clustering. (d) Candidate samples for pixel p. (e) The generated alpha matte (Color figure online)

3.3 Estimating \(\alpha \) via Sparse Coding

As mentioned in Sect. 2, previous non-parametric sampling-based methods generally select the best foreground and background pair (F, B) for each pixel from the candidate samples through an optimization process and use it to estimate the alpha value by Eq. (2). The main drawback of these methods is that the alpha value is determined by a single best pair, so they generate an incorrect alpha matte if the optimization fails to find the best pairs. To overcome this limitation, inspired by [17], the proposed method capitalizes on sparse coding to establish an objective function for generating alpha values directly from a set of candidate foreground and background samples.

In [17], the authors form a dictionary \(\mathcal {D}\) for each unknown pixel z using the collected foreground and background samples. The word vector used for constituting the dictionary is a 6-D vector \([R\ G\ B\ L\ a\ b]^T\) consisting of the concatenation of the RGB and Lab color spaces, normalized to unit length. \(\mathcal {D}\) is a matrix in which each column is the word vector of one candidate sample. Then, the alpha value of pixel z is determined by sparse coding as

$$\begin{aligned} \varvec{\beta } = \mathop {argmin}\limits _{\varvec{\beta }}\parallel {v_{z}-\mathcal {D}\varvec{\beta }}\parallel ^2\qquad s.t. \quad \parallel {\varvec{\beta }}\parallel _1 \le 1;\ \varvec{\beta } \ge 0 \end{aligned}$$
(6)

where \(v_{z}\) is the signal vector at pixel z composed of \((R_{z},G_{z},B_{z},L_{z},a_{z},b_{z})\). The sparse codes \(\varvec{\beta }\) corresponding to words in the dictionary that belong to the foreground sample set are added to form the alpha value for the unknown pixel.

$$\begin{aligned} \alpha _{z} = \sum _{p\in F_z} \beta _{p} \end{aligned}$$
(7)

where \(F_z\) is the set of foreground samples of pixel z. Since the non-zero values in \(\varvec{\beta }\) indicate the ratios of the corresponding sample colors in composing the color of unknown pixel, the sparse codes directly provide the alpha value.
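Equation (6) is a least-squares problem over the capped simplex \(\{\beta \ge 0, \|\beta\|_1 \le 1\}\). The paper does not specify a solver for it; the projected gradient scheme below, with a standard simplex-projection routine and a Lipschitz step size, is one common choice and is offered only as a sketch:

```python
import numpy as np

def project_capped_simplex(b):
    """Euclidean projection onto {beta >= 0, sum(beta) <= 1}."""
    y = np.maximum(b, 0.0)
    if y.sum() <= 1.0:
        return y
    # project onto the probability simplex (sum == 1)
    u = np.sort(b)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(b) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(b - theta, 0.0)

def sparse_code(v, D, iters=500):
    """Projected gradient descent for the constrained problem of Eq. (6)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-8)   # 1/L for the quadratic
    beta = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ beta - v)
        beta = project_capped_simplex(beta - step * grad)
    return beta

def alpha_from_codes(beta, fg_mask):
    """Eq. (7): alpha is the summed code mass on foreground words."""
    return float(beta[fg_mask].sum())
```

With an identity dictionary the codes recover the mixing weights exactly, so the foreground mass equals the true alpha.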

The proposed method also takes advantage of sparse coding to generate \(\alpha \) directly from a set of candidate samples. Moreover, we take extra characteristics of the samples into consideration during sparse coding: the spatial distances of the samples to the unknown pixel and the color variances of the clusters that generated the sample colors. The alpha value is determined by an objective function derived from a weighted sparse coding as

$$\begin{aligned} \begin{aligned} \varvec{\beta } =&\mathop {argmin}\limits _{\varvec{\beta }} \parallel {v_{z} - \mathcal {D} \varvec{\beta }} \parallel _{2}^{2} + \lambda \parallel {diag(\varvec{w})\varvec{\beta }}\parallel _{1} \\&s.t. \quad \parallel {\varvec{\beta }}\parallel _1 \le 1;\ \varvec{\beta } \ge 0\\ \end{aligned} \end{aligned}$$
(8)

where \(v_{z}\) and \(\mathcal {D}\) have the same meaning as in Eq. (6). \(\lambda \) is a weighting parameter balancing the chromatic distortion term against the weighted sparsity term that encodes the spatial statistics. \(diag(\varvec{w})\) is a diagonal matrix corresponding to the vector \(\varvec{w}\), which weights the words in the dictionary according to the characteristics of the corresponding samples, and is formulated as:

$$\begin{aligned} w_{p} = 1-T_{z}(Y_{p}) \times U_{z}(Y_{p}) \end{aligned}$$
(9)

where T represents the spatial statistics of the image and U indicates the color variances of the clusters.

The term \(T_{z}\) measures the spatial distance of the sample \(Y_{p}\) to the unknown pixel z and is given by:

$$\begin{aligned} T_{z}(Y_{p}) =\left\{ \begin{aligned}&exp(-\frac{\parallel {z-Y_p^s}\parallel }{Z_{z}^{F}}), \quad&Y_p\in \mathcal {S}_z^F\\&exp(-\frac{\parallel {z-Y_p^s}\parallel }{Z_{z}^{B}}),&Y_p\in \mathcal {S}_z^B\\ \end{aligned} \right. \quad p =1,2,\cdots ,P \end{aligned}$$
(10)

where P is the size of the dictionary \(\mathcal {D}\). \(Z_{z}^{F}\) and \(Z_{z}^{B}\) represent the mean spatial distances from the unknown pixel z to all the candidate foreground and background samples of that pixel, respectively. Hence, the sparse codes tend to have high values for the words in \(\mathcal {D}\) computed from samples spatially close to the unknown pixel.

The term \(U_{z}\) biases the sparse codes towards samples that come from clusters with low color variances and is formulated as:

$$\begin{aligned} U_{z}(Y_{p}) = \left\{ \begin{aligned}&exp(-(1+\frac{\log _{10}{Y_{p}^r}}{M_{F}})), \quad&Y_p\in \mathcal {S}_z^F\\&exp(- (1+\frac{\log _{10}{Y_{p}^r}}{M_{B}})),&Y_p\in \mathcal {S}_z^B\\ \end{aligned} \right. \quad p =1,2,\cdots ,P \end{aligned}$$
(11)

where \(Y_{p}^r\) is the summed variance of the three color channels of the cluster from which sample \(Y_{p}\) comes. The scalars \(M_{F} = max_{F_{k} \in \mathcal {S}_{z}^F}(\left| {\log _{{10}} F_{k}^{r} } \right| )\) and \(M_{B} = max_{B_{k} \in \mathcal {S}_{z}^B}(\left| {\log _{{10}} B_{k}^{r} }\right| )\) are scaling factors, corresponding to the maximum absolute logarithm of the summed three-channel color variance over the clusters forming the foreground sample set and the background sample set, respectively.
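Putting Eqs. (9)-(11) together, the per-word weights might be computed as in the following sketch; the argument layout and the small numerical guards are our assumptions:

```python
import numpy as np

def word_weights(z, pos, var, is_fg):
    """w_p = 1 - T_z(Y_p) * U_z(Y_p)  (Eqs. 9-11).
    pos: sample spatial coordinates; var: summed color variance of each
    sample's source cluster; is_fg: boolean mask marking foreground words."""
    z, pos = np.asarray(z, float), np.asarray(pos, float)
    var, is_fg = np.asarray(var, float), np.asarray(is_fg, bool)
    d = np.linalg.norm(pos - z, axis=1)
    T = np.empty(len(pos))
    U = np.empty(len(pos))
    # foreground and background words are scaled separately
    for mask in (is_fg, ~is_fg):
        if not mask.any():
            continue
        T[mask] = np.exp(-d[mask] / (d[mask].mean() + 1e-8))        # Eq. (10)
        M = np.abs(np.log10(var[mask])).max() + 1e-8
        U[mask] = np.exp(-(1.0 + np.log10(var[mask]) / M))          # Eq. (11)
    return 1.0 - T * U                                              # Eq. (9)
```

Since the weights multiply the L1 penalty in Eq. (8), a spatially close, low-variance sample gets a small weight and is penalized less, so its code is more likely to stay active.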

Fig. 3.

Effect of spatial statistics and color variances. (a) Original image. (b) Zoomed areas. Estimated matte with (c) \(\lambda = 0\) and (d) \(\lambda = 0.0025\). (e) Ground truth mattes (Color figure online)

The optimization of Eq. (8) can be solved as a quadratic programming problem. We use a variant of the active-set algorithm [25] that benefits from the sparsity of the solution [26]. Once the codes \(\varvec{\beta }\) are generated, the alpha value of pixel z is obtained using Eq. (7).

Figure 3 shows the effect of taking spatial statistics and color variances into consideration while estimating the alpha matte. The original image is shown in Fig. 3(a) with the corresponding foreground and background boundaries. Zoomed areas are shown in Fig. 3(b). Figure 3(c) shows the alpha mattes of the zoomed areas obtained with \(\lambda = 0\) and (d) with \(\lambda = 0.0025\) in Eq. (8). The ground truth mattes of the zoomed areas are shown in Fig. 3(e). As can be seen, combining chromatic distortion, spatial statistics and color variances in a weighted sparse coding framework provides a more accurate alpha matte than using chromatic distortion alone [17].

3.4 Pre- and Post-processing

Akin to recent sampling-based matting approaches [16–18], we adopt pre- and post-processing steps in the proposed method.

Expansion of Known Regions: To obtain a more refined trimap, the proposed method uses a pre-processing step to extrapolate the known foreground and background regions into the unknown regions based on chromatic and spatial thresholds. An unknown pixel z is considered foreground if there exists a pixel \(r\in F\) satisfying

$$\begin{aligned} (\parallel {z-r}\parallel < E_{thr})\wedge (\parallel {I_z-I_r}\parallel \le C_{thr}-\parallel {z-r}\parallel ) \end{aligned}$$
(12)

where \(E_{thr}\) and \(C_{thr}\) are thresholds in the spatial and color spaces, respectively. A similar formulation is applied to expand the background regions.
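A direct reading of Eq. (12) as code might look as follows; the threshold values below are placeholders, not the paper's settings. Note that the color tolerance shrinks as spatial distance grows, so only pixels that are both near and chromatically similar to a known foreground pixel are relabeled:

```python
import numpy as np

def expand_foreground(z, Iz, fg_pixels, E_thr=9.0, C_thr=0.25):
    """Return True if unknown pixel z can be relabeled foreground (Eq. 12).
    fg_pixels is an iterable of (position, color) for known fg pixels."""
    for r, Ir in fg_pixels:
        ds = np.linalg.norm(z - r)        # spatial distance ||z - r||
        dc = np.linalg.norm(Iz - Ir)      # color distance ||I_z - I_r||
        if ds < E_thr and dc <= C_thr - ds:
            return True
    return False
```

The symmetric routine with known background pixels expands the background region.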

Local Smoothing: As a post-processing, we perform local smoothing on the initial alpha matte estimated by weighted sparse coding to obtain a smooth matte using a modified version of the Laplacian matting model [2] adopted in [13]. Hence, the final alpha matte is optimized with a cost function consisting of the data term \(\hat{\varvec{\alpha }}\) and a confidence value f together with a local smoothness term expressed by matting Laplacian given by:

$$\begin{aligned} \varvec{\alpha }= \mathop {argmin}\limits _{\varvec{\alpha }} \varvec{\alpha }^T \varvec{L} \varvec{\alpha } + \varvec{(\alpha -\hat{\alpha })}^T (\varepsilon \varvec{\varSigma } +\gamma \varvec{\varGamma })\varvec{(\alpha -\hat{\alpha })} \end{aligned}$$
(13)

where \(\hat{\varvec{\alpha }}\) is the initial alpha matte generated using Eq. (7). \(\varvec{L}\) is the matting Laplacian defined in [2]. \(\varepsilon \) is a large weighting parameter penalizing divergence from the alpha values of the known pixels, and \(\gamma \) is a constant denoting the relative importance of the data and smoothness terms. The data term forces the final alpha matte to be close to the initial alpha matte \(\hat{\varvec{\alpha }}\), and the matting Laplacian enforces local smoothness. \(\varvec{\varSigma }\) is a diagonal matrix with values 1 for known foreground and background pixels and 0 for unknown pixels, while the diagonal matrix \(\varvec{\varGamma }\) has values 0 for known pixels and f for unknown pixels. The confidence value \(f_{z}\) at a given pixel z is computed by:

$$\begin{aligned} f_{z} = R_{z}\times {C_{z}} \end{aligned}$$
(14)
$$\begin{aligned} R_{z} = exp(-\parallel {v_{z}-\mathcal {D}\varvec{\beta }}\parallel ) \end{aligned}$$
(15)

where \(R_{z}\) measures the deviation in reconstructing the signal vector from the sparse coefficients, and \(C_{z}\) measures the distortion between the estimated and observed colors, as explained in Eq. (4).
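Setting the gradient of the quadratic cost in Eq. (13) to zero yields a linear system \((\varvec{L} + \varvec{C})\varvec{\alpha } = \varvec{C}\hat{\varvec{\alpha }}\) with \(\varvec{C} = \varepsilon \varvec{\varSigma } + \gamma \varvec{\varGamma }\). A dense sketch is below; a real implementation would use a sparse solver, the matting Laplacian is assumed to be given, and the \(\varepsilon\), \(\gamma\) defaults are illustrative:

```python
import numpy as np

def refine_alpha(L, alpha_hat, known, f, eps=100.0, gamma=0.1):
    """Closed-form minimizer of Eq. (13).
    L: matting Laplacian (n x n); alpha_hat: initial matte; known: boolean
    mask of known fg/bg pixels; f: per-pixel confidences (Eq. 14)."""
    # diagonal of C = eps*Sigma + gamma*Gamma
    c = np.where(np.asarray(known, bool), eps, gamma * np.asarray(f, float))
    A = np.asarray(L, float) + np.diag(c)
    return np.linalg.solve(A, c * np.asarray(alpha_hat, float))
```

Because \(\varepsilon\) is large, known pixels stay pinned near their trimap values, while unknown pixels trade off their initial estimate (weighted by confidence) against smoothness with their neighbors.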

4 Experimental Results

In this section, we first assess the effect of \(\lambda \) in Eq. (8). Then the performance of the proposed matting method is evaluated on a benchmark dataset [27]. It consists of 27 training images and 8 testing images. The training images have two types of trimaps: small and large, while the testing images have three types: small, large and user-defined, available at www.alphamatting.com. The ground-truth alpha mattes are publicly available for the training set but hidden for the testing images. An independent quantitative evaluation is provided in terms of the mean squared error (MSE), the sum of absolute differences (SAD), the gradient error and the connectivity error. Finally, we evaluate the effectiveness of the proposed sampling method in dealing with the problem of missing true samples.

Fig. 4.

Effect of \(\lambda \) on the performance. Plot shows average MSE values over all training images and all trimaps

4.1 Effect of Parameter \(\lambda \)

To quantitatively verify the effectiveness of our weighted sparse coding in generating alpha values, we evaluate the average MSE over all the training images and all trimaps of the benchmark dataset for different values of \(\lambda \), as shown in Fig. 4. When \(\lambda =0\), the objective function used to estimate the alpha matte becomes the same as that in [17]. As can be seen in Fig. 4, our objective function, which considers both chromatic distortion and spatial statistics, performs better than that of [17], which considers only chromatic distortion, when \(\lambda \) is set properly. In the experiments, \(\lambda \) is set to 0.0025 as it provides the minimum MSE on the training set.

Table 1. Evaluation of matting methods on the benchmark dataset [27] with three trimaps with respect to SAD, MSE and Gradient error metrics

4.2 Evaluation on Benchmark Dataset

Table 1 shows the quantitative evaluation of the proposed matting approach compared with current matting methods via the alpha matting website [27]. Only the ten best performing methods are shown in the table. We report the average rankings over the 8 testing images according to the SAD, MSE and gradient error metrics. “Average small/large/user ranks” represent the average ranks over the images for each of the three types of trimaps. The overall rank is the average over all the testing images and all types of trimaps. The proposed method ranks first with respect to SAD with an overall rank of 10.2. We achieve the best ranking among all the methods with respect to SAD and gradient error on the large trimap and rank second with respect to MSE (the first is LNSP [22]). The proposed method also ranks first on the user trimap with respect to SAD. This implies that the proposed method is more robust to the fineness of the trimap than previous sampling-based methods, since it weakens the spatial assumptions made while sampling for unknown pixels.

Fig. 5.

Visual Comparison of our approach with other matting methods. (a) Original image, (b) Zoomed areas. (c) LNSP [22], (d) Comprehensive sampling [16], (e) Sparse coding matting [17], (f) KL-Divergence based sparse sampling [18] and (g) proposed approach (Color figure online)

Figure 5 shows the visual comparison of our approach with the recent matting methods [16–18, 22] on the doll, plant and pineapple images from the benchmark dataset. Original images and zoomed areas are shown in Fig. 5(a) and (b), respectively. The mattes estimated for the zoomed areas by LNSP [22], Comprehensive sampling [16], Sparse coding matting [17], KL-Divergence based sparse sampling [18] and our method are shown in Fig. 5(c–g). The doll (first and second rows) is placed in front of a highly textured background, which makes it hard for sampling-based approaches to discriminate between foreground and background, as shown in Fig. 5(c,d,f). Sparse coding matting [17], which exploits sparse coding, produces a better matte, as shown in Fig. 5(e). The same problem occurs in the first zoomed area of plant (third row), where some characters in the background are considered foreground, as shown in Fig. 5(c,d,f). Sampling-based methods typically rely on certain spatial assumptions while collecting samples from known regions, which can lead to missing the true foreground and background colors of some unknown pixels, as in pineapple and the second zoomed area of plant (last three rows), shown in Fig. 5(d). Although the KL-Divergence sampling approach formulates sampling as a sparse subset selection problem, it collects the same set of samples for all the unknown pixels, which also leads to missing true samples, as seen in Fig. 5(f).

The proposed method builds a representative sample set for all unknown pixels to cover all true samples, and then selects a set of candidate samples for each unknown pixel via an objective function. Moreover, inspired by sparse coding matting [17], we use a weighted sparse coding to generate the alpha value directly from a set of foreground and background samples, which avoids the limitation that the quality of the alpha matte relies heavily on the goodness of a single sample pair. These two characteristics allow the proposed approach to extract a visually superior matte in these ambiguous areas, as shown in Fig. 5(g).

4.3 Missing True Samples

Previous sampling-based image matting methods typically rely on spatial closeness while collecting samples, and thus fail to generate an accurate alpha matte when the true samples are not spatially close to the unknown pixels; this problem is known as missing out true samples. Figure 6(a) shows two original images with their corresponding foreground and background boundaries from the benchmark dataset [27]. Zoomed areas and their ground truth alpha mattes are shown in Fig. 6(b) and (c), respectively.

Fig. 6.

Illustration of missing out true samples. (a) Original images. (b) Zoomed areas. (c) Ground truth mattes. Estimated mattes by (d) Proposed method, (e) KL-Divergence sampling [18] (f) Comprehensive sampling [16] and (g) Global Sampling [14] (Color figure online)

In the zoomed area of the doll girl image (first row), the true background colors of the gray pixels in the unknown region lie spatially far from those pixels, so the background sample sets collected by the comprehensive and global sampling methods do not contain the gray colors. These methods wrongly estimate the pixels as foreground, as shown in the first row of Fig. 6(f) and (g). The pumpkin image (second row) has a complex foreground, and the distribution of the true foreground samples does not satisfy the spatial closeness assumption; for some parts of the pumpkin they are therefore missed in the foreground sample sets collected by the comprehensive and global sampling methods, and those parts are mistakenly estimated as background, as shown in the second row of Fig. 6(f) and (g). KL-Divergence sampling selects a sparse set of foreground and background samples, which might also miss true samples, as shown in Fig. 6(e). The proposed method collects a relatively large and representative set of samples from all the known regions and selects a candidate set of samples for each unknown pixel based on both color and spatial statistics. The visual comparison between the ground truth mattes and those estimated by the proposed method, shown in Fig. 6(c) and (d), demonstrates that the proposed sampling strategy effectively addresses the problem of missing true samples.

5 Conclusions

A robust sampling-based image matting approach is proposed that applies a new sampling strategy to build a representative set of samples from the known regions. Rather than collecting samples according to spatial assumptions or selecting a uniform sample set for all unknown pixels, we select samples for each unknown pixel based on both color and spatial statistics to solve the problem of missing true samples. Moreover, based on weighted sparse coding, we adopt a new objective function that generates alpha values directly from a set of candidate samples, removing the restriction of a single (F, B) pair determining the alpha value. Finally, the quality of the estimated matte is refined using a local smoothness prior. Experimental results on a benchmark dataset show that the proposed method achieves more robust performance than state-of-the-art approaches.