1 Introduction

Many low-level computer-vision problems (e.g., stereo matching and optical flow estimation) are formulated as multi-labeling problems, where discrete labels (e.g., disparities and motion vectors) are assigned to pixels. In general, there are two approaches to solving these problems: global and local. The former models a labeling problem as a Markov random field (MRF), where global optimization techniques [7, 8, 14, 19, 30, 32, 35, 37] are used to minimize the energy function. Although such an approach is effective, inference becomes intractable when the image size or the label space is large. Rhemann et al. [28] presented a local approach called cost-volume filtering (CVF), which efficiently solves general multi-labeling problems by performing MRF optimization via fast local filtering of label costs instead of global smoothing. CVF is easy to implement and provides high-quality results; therefore, it has been widely used to solve various multi-labeling problems [10, 12, 18, 20, 40]. However, a limitation of CVF is that it does not scale to extremely large label sets (e.g., sub-pixel stereo matching and up-sampling of 16-bit depth maps captured by a Kinect sensor).

To overcome this limitation, Lu et al. [22] proposed the PatchMatch filter (PMF), which performs CVF iteratively on local superpixels with compact label subsets instead of performing it on the entire image coordinate space. In general, the average size of local label subsets is much smaller than the size of the entire label space; therefore, although PMF and CVF provide similar levels of accuracy, the efficiency of PMF is considerably higher. Nevertheless, PMF relies on global optimization based on the complex PatchMatch approach [3, 6] to estimate a label subset for each superpixel. Thus, the computational complexity of PMF increases with the number of superpixels, and therefore, PMF becomes less effective when an image is divided into many superpixels.

This paper presents an alternative coarse-to-fine strategy for efficiently estimating compact label subsets to solve the label space problem in cost-volume filtering. Based on the observation that true labels at the same coordinates in images of different scales are highly correlated, we propose that lower-scale labeling outputs be leveraged for estimating higher-scale local label subsets. Starting with a very low-resolution image, we iteratively truncate unimportant labels at each higher scale, and finally, we assign compact and approximately optimal label subsets to local regions at the original scale. The advantage of the proposed framework is its simple and efficient coarse-to-fine strategy, which does not require any global optimization as in [22]; moreover, its computational complexity is not significantly affected by the number of local regions. Extensive experiments described in Section 4 show that our algorithm achieves higher efficiency than PMF and CVF while providing a comparable or often superior level of accuracy.

The fundamental algorithm of our method and the experimental results for stereo matching were presented in our preliminary study [15]. In this paper, we provide detailed explanations and present the results of additional experiments on optical flow estimation. Note that we are not proposing a better algorithm for stereo matching or optical flow estimation; rather, we propose a coarse-to-fine method that drastically reduces the computational time of CVF while preserving its accuracy. As presented in [10, 12, 18, 20, 28], CVF can be used for a wide range of applications, and the stereo matching and optical flow estimation presented in this paper are just examples.

Our proposed algorithm can be applied directly not only to the original CVF [28] but also to several of its variants reviewed in Section 2. In addition, it can be implemented on a GPU, as the original CVF can. However, in this paper, we do not pursue these implementations and compare only with the original CVF, because our focus is on handling a large label space efficiently, not on improving accuracy or achieving real-time performance.

The remainder of this paper is organized as follows. Section 2 reviews related studies. Section 3 briefly reviews CVF [28] and describes the details of the proposed coarse-to-fine strategy. Section 4 presents the experimental results and describes their evaluation using the Middlebury benchmark [24, 25]. Finally, Section 5 summarizes our findings and concludes the paper.

2 Related work

In this section, we mainly focus on related work on stereo matching and optical flow estimation, because they are representative multi-labeling problems and many methods based on cost-volume filtering have been proposed for them. However, as mentioned in Section 1, the cost-volume filtering technique is not limited to these problems; it has also been applied to a wide range of multi-labeling problems such as image segmentation [20] and depth-map up-sampling [12].

2.1 Cost aggregation methods for labeling problems

First, we review cost aggregation methods for correspondence field estimation. Yoon and Kweon [44, 45] proposed a cost aggregation method using an adaptive weighted window, analogous to an edge-preserving bilateral filter [36]. This method is slow because it performs naive bilateral filtering iteratively, with the number of iterations equal to the number of disparity candidates. To address this problem, Richardt et al. [29] proposed an approximate bilateral filtering technique that reduces the computational complexity of the adaptive support weight calculation. However, this approach provides lower-quality results than state-of-the-art stereo matching methods. On the other hand, Yang [40, 41] proposed a tree-based non-local cost aggregation method using a minimum spanning tree. This method aggregates the cost values along a tree structure constructed from the input images, and the final disparity refinement is also performed on the basis of the tree structure. Bai et al. [2] proposed an algorithm based on loop-erased random walks to improve the support weighted window of [40] near depth discontinuities. As stated in Section 1, Rhemann et al. [28] proposed CVF for general multi-labeling problems. By using an O(1) edge-preserving filter called the guided filter (GF) [16] for cost aggregation, CVF can efficiently solve general multi-labeling problems and achieve high-quality results. Lu et al. [21] proposed a new edge-preserving filter called the cross-based local multipoint filter (CLMF), which is an extension of the GF. Whereas the local support region of the GF is a square, that of the CLMF can be adaptively derived from a reference image. Further, Lu et al. [21] showed that higher-quality stereo matching results can be achieved by applying the CLMF instead of the GF for cost aggregation. Zhang et al. [49] proposed a cross-scale cost aggregation algorithm based on CVF [28] for stereo matching.
They showed that higher-quality disparity maps can be obtained by adding a regularization term between the cost values of different scales, and that the computational time of cross-scale aggregation is not significantly greater than that of the original CVF [28]. This method [49] is similar to ours in terms of multi-scale cost-volume utilization, but its purpose is to improve the quality of the disparity maps, not to reduce the computational complexity. Recently, Zhan et al. [48] proposed several techniques to improve the accuracy of local stereo matching methods: mask filtering as pre-processing, an improved matching cost function, and multi-step disparity refinement as post-processing. Inspired by the great success of convolutional neural networks (CNNs) in image recognition tasks, CNNs have recently been used for computing the label costs (matching costs in stereo matching and optical flow estimation) instead of hand-crafted cost functions [11, 13, 23, 46, 47], which has led to significant improvements in accuracy. In MC-CNN [46, 47], the CNN directly outputs the matching cost of two input patches; cross-based cost aggregation and semi-global matching are then performed on the obtained cost volume to produce an accurate disparity map. To speed up the matching cost computation, Chen et al. [11] and Luo et al. [23] proposed similar ideas, where the matching cost is defined as the inner product of two features from a CNN. In FlowNet [13], the matching costs are defined as the correlation between two patches of feature maps, and the final flow map is obtained by an upconvolution operation. The computation of the correlations is implemented as a correlation layer, which is incorporated into the CNN.

Most local methods perform cost aggregation for all the labels (disparities) at every pixel; therefore, they do not scale to extremely large label sets. To overcome this problem for stereo matching, Min et al. [26, 27] proposed a technique to estimate a compact disparity subset for every pixel by considering the disparities at the local minima of the pre-filtered cost values. Although this method efficiently achieves high-quality results on the Middlebury stereo benchmark [25], it cannot be applied directly to general multi-labeling problems. Wang et al. [38] adapted the sequential probability ratio test to reduce the disparity search range with sufficient confidence in the stereo matching problem. Helala and Qureshi [17] proposed the Accelerated CVF, which uses an occlusion handling technique for the stereo matching problem. For general multi-labeling problems, Lu et al. [22] proposed PMF, which is based on CVF [28]. As mentioned in Section 1, PMF estimates a compact label subset for every superpixel using the PatchMatch [3, 6] strategy; therefore, it is usually much more efficient than CVF while maintaining a similar level of accuracy. However, because PMF relies on complex PatchMatch-based global optimization to estimate a label subset for each superpixel, it becomes less effective when an image is divided into many superpixels.

2.2 Coarse-to-fine strategy

Coarse-to-fine strategies have been employed in a variety of methods for labeling problems such as stereo matching and optical flow estimation. They can be classified into two types: strategies that merge the cost aggregation results from all resolutions to obtain more accurate results, e.g., [33, 42, 49], and strategies that propagate the results of lower resolutions to higher resolutions to reduce the label search range, e.g., [43, 50]. We focus on the latter because our method belongs to that group.

Brox et al. [9] employed a coarse-to-fine strategy in their global optimization framework for estimating an optical flow field. They obtain the output flow field as the solution of their energy minimization formulation by solving the Euler-Lagrange equations. They supplied a theoretical explanation that justifies their coarse-to-fine strategy by regarding it as part of two nested iterations for non-convex optimization, and argued that the strategy helps convergence to the global minimum by using the solution at a coarser scale as the initialization for the next finer scale. Similar to [9], Wedel et al. [39] employed a coarse-to-fine strategy in their optical flow estimation framework, where the flow field is obtained by solving a total variation (L1 norm) minimization problem using linear approximation and an alternating optimization scheme. They argued that their coarse-to-fine strategy has the advantage of avoiding poor local minima by propagating the solution from coarser to finer scales. Such coarse-to-fine strategies [9, 39] are tailored to global optimization techniques: these methods iteratively update a single solution for the entire image and propagate it to the next scale after a predetermined number of iterations. Therefore, their coarse-to-fine approaches cannot be used for CVF, which requires pixel-wise cost computation for all possible labels and obtains pixel-wise solutions by a winner-takes-all strategy. Yang et al. [43] proposed a coarse-to-fine technique for belief propagation (BP), which reduces the computational complexity in both the spatial and the depth domains. This method is tailored to BP and cannot be applied directly to CVF. Different from these approaches, we propose a coarse-to-fine strategy for the cost-volume filtering technique, which belongs to the class of local methods.

Next, we discuss the coarse-to-fine strategies employed in local cost aggregation methods, which are closest to our method. Zhao et al. [50] employed a coarse-to-fine strategy in their elegant GPGPU implementation for real-time stereo. They limit the search range to within ±2 pixels of the disparity value obtained at the lower resolution. The main difference between their method and ours is that the reduction of the search range is performed per pixel in their method, whereas it is performed per local region in ours. In addition, a comparison with their method is of limited value because their objective is efficient disparity estimation only in the foreground region, and their algorithm is optimized accordingly. Their experimental results on the Middlebury stereo datasets, under the assumption that the whole image area is foreground, show poor accuracy, especially around object boundaries (Disc. in Table 1 of [50]). Tao et al. [34] proposed a multi-scale local cost aggregation method for optical flow estimation called SimpleFlow. They upsample the flow field obtained at the coarser scale and skip the cost computation by interpolating the flow using simple bilinear interpolation in regions where the flow is smooth. Therefore, their method can obtain a flow field in sublinear time with respect to the size of the input images. Their coarse-to-fine strategy differs from ours because our method estimates a compact label set in each local region to handle the large label space. In addition, without refinement by global optimization [31], the accuracy of the flow fields obtained by SimpleFlow [34] is much lower than that of CVF [28]. Although SimpleFlow with refinement can obtain accuracy comparable to that of CVF, the running time increases drastically because the global optimization in the refinement process is computationally expensive (Table 4 in [4]).
On the other hand, our method obtains accuracy comparable to that of CVF [28] and is several times faster. Bao et al. [4, 5] proposed a fast edge-preserving PatchMatch for optical flow estimation. Their method estimates an approximate nearest neighbor field (NNF) using a PatchMatch search at the coarsest scale, and then repeatedly upsamples the NNF and refines it within a small search range (3 × 3 pixels) until the original resolution is reached. Their method is very fast and can achieve high-quality results for large-displacement optical flow. However, for datasets with small-displacement optical flow, their coarse-to-fine strategy yields less accurate results than without it (Table 4 in [4]), because their method is tailored to large displacements. In contrast, our coarse-to-fine strategy for general multi-labeling problems obtains results comparable to or more accurate than the original CVF whether the label space is small or large.

3 Coarse-to-fine strategy for efficient CVF

In this section, we present a coarse-to-fine strategy for CVF [28] in order to address multi-labeling problems with a large label space. Given a label set \({\mathcal {L}=\{l_{0},\cdots ,l_{L-1}\}}\), the objective of a multi-labeling problem is to assign a label \({l_{i}\in \mathcal {L}}\) to each pixel \({i\triangleq [x_{i},\;y_{i}]~(i=0,\dots ,M-1)}\) in the image coordinate space I such that it minimizes the label costs encoded in the energy function [28]. Here, L and M denote the number of labels and the number of pixels, respectively.

3.1 CVF

The outline of CVF [28] is shown in Fig. 1. CVF solves a multi-labeling problem in three steps. First, a 3-D cost volume C is constructed as a collection of costs C(i, l) for selecting label l at each pixel i on the basis of the data term in the energy function. Then, each slice of the cost volume is independently filtered by an edge-preserving filter [16, 21], which is substituted for the smoothness term in the energy function:

$$ C(i,l) \leftarrow \sum\limits_{i^{\prime}\in \omega_{i}} W_{ii^{\prime}}C(i^{\prime},l), $$
(1)

where \(\omega_{i}\) is the square window centered at pixel i. Finally, the label at pixel i is simply selected by the winner-takes-all (WTA) strategy:

$$ l_{i} = \underset{l\in\mathcal{L}}{\arg\min} C(i,l). $$
(2)

When an O(1) edge-preserving filter (e.g., guided filter [16]) is used, the computational complexity of filtering an entire cost volume is O(M L); thus, it is difficult to handle an extremely large label space.
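The three steps can be sketched as follows; this is a minimal illustration, not the paper's implementation: a plain box filter stands in for the guided filter [16], the data term is passed in as a callback, and all function names are ours.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def cost_volume_filtering(image, labels, cost_fn, radius=9):
    """Build, filter, and reduce a cost volume; returns a per-pixel label map."""
    # Step 1: build the 3-D cost volume C(i, l) from the data term.
    C = np.stack([cost_fn(image, l) for l in labels])          # (L, H, W)
    # Step 2: filter each slice independently, as in (1); the paper uses an
    # edge-preserving guided filter, a plain box filter stands in here.
    C = np.stack([uniform_filter(c, size=2 * radius + 1) for c in C])
    # Step 3: winner-takes-all over the label axis, as in (2).
    return np.asarray(labels)[np.argmin(C, axis=0)]
```

Note that step 2 touches every slice of the volume, which is exactly the O(ML) cost discussed above.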

Fig. 1 Framework of CVF [28]

One possible strategy for handling a large label space is to locally restrict the label space in order to reduce its size. Because the true label configuration is generally smooth in space (e.g., disparities are smooth except at object boundaries), the label space required for performing CVF on a local region should be smaller than the entire label space. As an example, we present a colored true disparity map of cones (see Fig. 2) that is divided into local regions by a regular rectangular grid. In addition, we show a histogram of the true disparities l in the entire image and in the local regions \({{S_{i}^{0}}}\) and \({{S_{j}^{0}}}\). We observe that far fewer distinct true labels appear in a local region than in the entire label space.

Fig. 2 Colored true disparity map of cones, and a histogram of the true disparities l in the entire image and in the local regions \({{S_{i}^{0}}}\) and \({{S_{j}^{0}}}\). The disparities l are rounded off to integer values

However, the problem is of course that we do not know a priori which labels are important for each local region, and thus, the estimation of local label subsets is required [22].

3.2 Problem statement

Here, we present a simple but efficient label subset estimation algorithm. Unlike Lu et al. [22], we do not rely on global optimization for estimating local label subsets; instead, we leverage a coarse-to-fine framework. An overview of the proposed method is shown in Fig. 3. Our algorithm mainly involves two steps: (i) in-scale cost-volume filtering and (ii) across-scale label propagation. The latter is the essential feature of our approach, whereby a local label subset is estimated from the CVF output at a lower resolution. Because the computational cost of CVF for a low-resolution image is negligibly small, we perform CVF with the full label space at a low resolution and truncate unimportant labels using the output.

Fig. 3 Framework of the proposed method

Let \(I^{k}\;(k = 0,\dots,n-1)\) denote a cascade of images of decreasing resolution ranging from the original scale (i.e., \({I^{k+1}=I^{k}_{\downarrow s}}\), where \(\downarrow s\) is a down-scaling operator with a scale factor \(s \in (1, \infty)\)),Footnote 1 and let \({\mathcal {L}^{k}}\) denote the set of all possible labels at the k-th scale. Then, we divide \(I^{0}\) (= I) into m non-overlapping local regions \({{S_{j}^{0}}}\) and partition \(I^{k}\;(k \geq 1)\) into local regions \({{S^{k}_{j}}\;(j=0,\dots ,m-1)}\) such that \({S_{j}^{k+1}={{S_{j}^{k}}}_{\downarrow s}}\). In addition, we represent a label subset for \({{S_{j}^{k}}}\) as \({\mathcal {L}_{j}^{k}}\) and its size as \({{L_{j}^{k}}}\). The total computational complexity of CVF from the lowest scale (k = n − 1) to the original scale (k = 0) is expressed as

$$ O\left( \sum\limits_{k=0}^{n-1}{\sum\limits_{j=0}^{m-1}{{M_{j}^{k}}{L_{j}^{k}}}}\right), $$
(3)

where \({{M^{k}_{j}}}\) is the number of pixels in \({{S^{k}_{j}}}\) (i.e., \({{M^{k}_{j}}=s^{-2k}{M^{0}_{j}}}\)). Therefore, our objective is to estimate compact label subsets \({\mathcal {L}_{j}^{k}}\) such that \({{\sum }_{k=0}^{n-1}{{\sum }_{j=0}^{m-1}{{M_{j}^{k}}{L_{j}^{k}}}}{\ll }ML}\) while maintaining the accuracy of CVF. The optimal m and n values will be discussed in Section 4.
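As a back-of-the-envelope illustration of (3), the following sketch computes the ratio of the coarse-to-fine cost to the flat O(ML) cost; the subset sizes used in the example are hypothetical numbers of our own, not measured values.

```python
# Ratio of the coarse-to-fine cost in (3) to the flat CVF cost O(ML),
# assuming every region at scale k has the same average subset size.
def cvf_cost_ratio(M, L, s, subset_sizes):
    """subset_sizes[k] is the average |L_j^k| at scale k (k = 0 is the
    original resolution); scale k has s^(-2k) * M pixels in total."""
    n = len(subset_sizes)
    total = sum(M * s ** (-2 * k) * subset_sizes[k] for k in range(n))
    return total / (M * L)

# e.g. s = 2, n = 4, the full label set (L = 60) at the coarsest scale only,
# and a hypothetical average of ~10 labels per region at the finer scales:
print(cvf_cost_ratio(M=450 * 375, L=60, s=2, subset_sizes=[10, 10, 10, 60]))
# → 0.234375
```

Under these assumed numbers the coarse-to-fine scheme would touch roughly a quarter of the cost entries of flat CVF.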

3.3 Across-scale label propagation

In this section, we present an algorithm for estimating compact label subsets \({{\mathcal {L}^{k}_{j}}}\) that sufficiently reduce the computational cost in (3) without truncating important labels. Our algorithm begins with the coarsest scale (i.e., k = n − 1). At this scale, we set \({{\forall }j\ \mathcal {L}^{n-1}_{j}\leftarrow \mathcal {L}^{n-1}}\) and simply perform CVF [28] to acquire the filtered cost volume \(C^{n-1}\) at the (n − 1)-th scale. Note that although we use the complete label set, the computational complexity of CVF at this scale is \(O(s^{-2(n-1)}ML)\), which is generally negligible (e.g., if we set s to 2 and n to 4, \(O(s^{-2(n-1)}ML)\approx O(10^{-2}ML)\)). Then, we initialize the label subset at the next higher resolution \({\tilde {\mathcal {L}}_{j}^{n-2}}\) by merging the labels having the smallest cost values in \(C^{n-1}\) at the corresponding local regions \({S^{n-1}_{j}}\). Strictly speaking, the initialization is expressed as

$$ \tilde{\mathcal{L}}^{n-2}_{j} = \bigcup\limits_{i\in S^{n-1}_{j}} f(l_{i}),\;\; l_{i} = \underset{l}{\arg\min} C^{n-1}(i,l), $$
(4)

where \(C^{n-1}(p, q)\) is the value of the cost volume at the (n − 1)-th scale for position p and label q, and f is a projection function that normalizes the label space if required. In general, the projection function is a multiplication by a constant scale factor, i.e., f(l) = sl. For instance, a disparity l at the k-th scale corresponds to sl at the (k − 1)-th scale in the stereo matching problem.Footnote 2 The initialization method based on across-scale label propagation is motivated by the reasonable observation that true labels at the same coordinates in images of different scales are highly correlated; in particular, they are very close when the difference in scales is small.

Although the initial estimation \({\tilde {\mathcal {L}}_{j}^{n-2}}\) is a good approximation of the optimal label subset \({\mathcal {L}_{j}^{n-2}}\), the problem is that \({\tilde {\mathcal {L}}_{j}^{n-2}}\) contains no labels outside \({f(\mathcal {L}^{n-1}_{j})}\), which results in aliasing artifacts when labels intermediate between those propagated from \({\mathcal {L}_{j}^{n-1}}\) should be included in \({\mathcal {L}_{j}^{n-2}}\) (the artifacts become more problematic as the scale difference increases). In addition, the filtered cost volume \(C^{n-1}\) often contains numerical errors due to occlusion boundaries or insufficient energy modeling. We adopt two strategies to overcome these difficulties. First, we down-sample images with a relatively small scale factor (e.g., s ≤ 2), such that the scale difference between two layers becomes sufficiently small. Second, we complete the initial label subset by adding the supporting labels within ±s/2. Note that our algorithm supports floating-point labels (e.g., sub-pixel disparity values). For instance, if the scale factor is 2 and the disparity unit is 0.5, the initial estimation \({\tilde {\mathcal {L}}_{j}^{n-2}=\{2,5\}}\) is extended to \({\mathcal {L}_{j}^{n-2}=\{1,1.5,2,2.5,3,4,4.5,5,5.5,6\}}\). Once a compact label subset \({\mathcal {L}_{j}^{n-2}}\) has been constructed, the target layer is shifted to the next higher scale (i.e., k ← n − 2). As at the coarsest scale, CVF is performed on \({S_{j}^{n-2}}\) with respect to \({\mathcal {L}_{j}^{n-2}}\). Cost-volume filtering with respect to \({\mathcal {L}_{j}^{k}}\) and the estimation of \({\mathcal {L}_{j}^{k-1}}\) from \(C^{k}\) are iterated n − 1 times until \({\mathcal {L}^{0}_{j}}\) is obtained. Then, the final label at each pixel in \({{S^{0}_{j}}}\) is selected by a simple WTA strategy, as in CVF [28].
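The ±s/2 completion of a projected label subset can be sketched as follows for a scalar label space (e.g., disparities); the function name and the flat-list interface are ours, and the example reproduces the {2, 5} case from the text.

```python
# Complete an initial label subset (already projected to the finer scale)
# by adding the supporting labels within +/- s/2 at the given label unit.
def complete_label_subset(initial_labels, s=2.0, unit=0.5):
    out = set()
    for l in initial_labels:
        v = l - s / 2.0
        while v <= l + s / 2.0 + 1e-9:
            out.add(round(v / unit) * unit)  # snap to the label unit
            v += unit
    return sorted(out)

# The example from the text: {2, 5} with s = 2 and disparity unit 0.5
print(complete_label_subset([2, 5]))
# → [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0]
```

The union over a region's winner-takes-all labels, as in (4), is obtained by passing all of them in one call.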

For the entirety of the coarse-to-fine process, we fix the radius of the edge-preserving filter used to smooth the cost volumes; in other words, the radius is not changed when the target scale is shifted to a higher scale. Therefore, the lower the scale, the more strongly the cost volume is smoothed. Thus, incorrect labels that accidentally have low costs are truncated during our coarse-to-fine process. In the original CVF [28], especially near object boundaries, the low costs of such incorrect labels are sometimes not sufficiently smoothed, and these incorrect labels are selected by the WTA strategy. Therefore, in such cases, our coarse-to-fine strategy can even increase the accuracy of the output at the finest scale compared with the original CVF. The results will be presented in Section 4.1.2.

It is possible to generate \({{S_{j}^{0}}}\) in various ways, e.g., using regular rectangular grids or superpixels [1], as shown in Fig. 4. The former is simple and suitable for edge-preserving filters using integral images, e.g., a guided filter [16]. In contrast, when \({{S_{j}^{0}}}\) are generated by superpixels, some additional computational time is required because we need to apply the edge-preserving filter to the bounding-box containing each region, as in the case of [22]. However, in such cases, it is easier to estimate the local label subsets because the local regions based on the superpixels are less likely to cross object boundaries than regular grids. For these reasons, we use both regular rectangular grids and superpixels for generating local regions \({{S_{j}^{0}}}\), as described in Section 4.1.2.

Fig. 4 Examples of local regions \({{S_{j}^{0}}}\)

The proposed algorithm is summarized as Algorithm 1.
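Our reading of the overall procedure can be sketched as follows. This is a toy sketch under several assumptions of our own, not the paper's implementation: a box filter stands in for the guided filter [16], the regions are a regular grid, down-scaling is s-fold block averaging, the labels are integers with unit steps, and every slice is filtered over the whole scale rather than per region.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def coarse_to_fine_cvf(image, labels, cost_fn, n=4, s=2, block=32, radius=9):
    """Toy coarse-to-fine CVF with per-region label subsets."""
    # Build the image pyramid I^0 .. I^{n-1}.
    pyr = [np.asarray(image, dtype=float)]
    for _ in range(n - 1):
        im = pyr[-1]
        H, W = (im.shape[0] // s) * s, (im.shape[1] // s) * s
        pyr.append(im[:H, :W].reshape(H // s, s, W // s, s).mean(axis=(1, 3)))

    lab = np.asarray(labels)
    subsets = None  # per-region label subsets; None means "use all labels"
    for k in range(n - 1, -1, -1):
        im, b = pyr[k], max(1, block // s ** k)
        C = np.full((len(labels),) + im.shape, np.inf)
        for li, l in enumerate(labels):
            # For clarity each slice is filtered over the whole scale; the
            # paper filters only within the regions whose subset contains l.
            slice_cost = uniform_filter(cost_fn(im, l), size=2 * radius + 1)
            if subsets is None:
                C[li] = slice_cost
            else:
                for (y, x), sub in subsets.items():
                    if l in sub:
                        ys = slice(y * b, (y + 1) * b)
                        xs = slice(x * b, (x + 1) * b)
                        C[li][ys, xs] = slice_cost[ys, xs]
        wta = np.argmin(C, axis=0)  # winner-takes-all label indices
        if k == 0:
            return lab[wta]
        # Across-scale propagation: collect the WTA labels of each
        # finer-scale region and complete them within +/- s//2.
        nb = max(1, block // s ** (k - 1))  # region size at scale k-1
        subsets = {}
        for y in range((pyr[k - 1].shape[0] + nb - 1) // nb):
            for x in range((pyr[k - 1].shape[1] + nb - 1) // nb):
                ys = slice(y * nb // s, (y * nb // s) + max(1, nb // s))
                xs = slice(x * nb // s, (x * nb // s) + max(1, nb // s))
                seen = np.unique(lab[wta[ys, xs]])
                subsets[(y, x)] = {int(l) + d for l in seen
                                   for d in range(-(s // 2), s // 2 + 1)}
```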


4 Results

In this paper, we demonstrate the validity of our coarse-to-fine approach for CVF by applying it to stereo matching and optical flow estimation. It is important to note that our technical contribution is computational efficiency relative to the original CVF algorithm, not improved accuracy. Moreover, the application of CVF is not limited to stereo matching and optical flow estimation.

4.1 Middlebury stereo

Experiments were conducted to evaluate the performance of our proposed method using the Middlebury stereo matching benchmark [25]. In stereo matching, the label l corresponds to the integer disparity between a pixel i in the target image I and its equivalent in the reference image I′ shifted by the disparity. Following [28], the cost function is defined as:

$$\begin{array}{@{}rcl@{}} C(i,l) &=& (1-\alpha)\min{[\|I^{\prime}_{i+l}-I_{i}\|,\tau_{1}]}\\ &&+\alpha\min{[\|\nabla_{x} I^{\prime}_{i+l}-\nabla_{x}I_{i}\|,\tau_{2}]}, \end{array} $$
(5)

where \(\nabla_{x}\) is the gradient in the x direction. The model parameters \(\alpha\), \(\tau_{1}\), and \(\tau_{2}\) are set to 0.89, 0.0027, and 0.0078, respectively.Footnote 3 We divide the eight test image pairs of the Middlebury stereo datasets [25] into two categories according to their size: small and large. The small category includes cones (450 × 375), teddy (450 × 375), tsukuba (384 × 288), and venus (434 × 383). The large category includes art (1390 × 1110), books (1390 × 1110), moebius (1390 × 1110), and reindeer (1342 × 1110). The label space size L is set to 60 for the small datasets and 240 for the large datasets. All experiments were implemented in C++ and performed on an Intel Core i7-2600 (3.4 GHz, single thread) machine with 16 GB of RAM. As in the original study of CVF [28], we use the guided filter [16] to smooth the cost volume (the radius of the filter is fixed at 9).
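As a concrete illustration, the cost in (5) might be computed as follows for grayscale images; the function name, the wrap-around border handling via shifting, and the grayscale simplification are our assumptions (the benchmark images are color).

```python
import numpy as np

ALPHA, TAU1, TAU2 = 0.89, 0.0027, 0.0078   # parameter values from the text

def matching_cost(I, I_ref, l):
    """C(i, l) of (5) for every pixel i: truncated absolute color difference
    plus truncated x-gradient difference against the reference shifted by l."""
    I_shift = np.roll(I_ref, -l, axis=1)   # I'_{i+l}; toy wrap-around borders
    color = np.minimum(np.abs(I_shift - I), TAU1)
    grad = np.minimum(np.abs(np.gradient(I_shift, axis=1)
                             - np.gradient(I, axis=1)), TAU2)
    return (1 - ALPHA) * color + ALPHA * grad
```

Evaluating this for every candidate disparity l yields the cost volume that is then filtered and reduced as described in Section 3.1.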

4.1.1 Evaluation of label selection

We begin by evaluating the efficiency of our coarse-to-fine strategy, as compared to that of CVF [28]. Here, we apply our method (n = 5, s = 2, m = 30) and CVF [28] to both small and large datasets; the results are averaged as shown in Fig. 5. We observe that overall, our coarse-to-fine strategy takes much less time than CVF [28]. As expected, the computational time for small scales (e.g., 1/16,1/8,1/4×) is negligible as compared to that for the original resolution (1/1×).

Fig. 5 Evaluation of the computational time. The results of eight Middlebury stereo datasets are averaged. Post indicates the total computational time after weighted median filtering for the final disparity-map refinement

Further, we present the average size of the local label subsets estimated in our coarse-to-fine process, as compared to the size of the entire label space (see Fig. 6). We observe that although the latter increases exponentially with the scale, there is no significant increase in the former, which is much smaller than the latter at the original scale. As a result, our method is much more efficient than CVF [28].

Fig. 6 Evaluation of label set size at each scale. The results of eight Middlebury stereo datasets are averaged

However, an important question arises, which directly addresses the accuracy of the final label selection: “Are the estimated label subsets of the original scale really correct?” To answer this question, we define two metrics for measuring the correctness of the final label subset:

$$ P(j) = \frac{|\mathcal{L}_{j}^{0} \cap \mathcal{L}_{j}|}{|\mathcal{L}_{j}^{0}|},\quad R(j) = \frac{|\mathcal{L}_{j}^{0} \cap \mathcal{L}_{j}|}{|\mathcal{L}_{j}|}, $$
(6)

where \({\mathcal {L}_{j}}\) is the subset of ground-truth labels at the original scale (i.e., a collection of the ground-truth disparity values that appear in the j-th region), and we recall that \({\mathcal {L}^{0}_{j}}\) is the subset of estimated labels at the original scale. These two metrics evaluate the estimated label subset from two different aspects: P(j) ∈ [0, 1] measures the precision of \({\mathcal {L}^{0}_{j}}\), i.e., how correctly unimportant labels are removed, and R(j) ∈ [0, 1] measures the recall of \({\mathcal {L}^{0}_{j}}\), i.e., how correctly important labels are retained. Note that the ideal situation of course occurs when \({{\forall }j\;\mathcal {L}^{0}_{j}=\mathcal {L}_{j}}\). For \({\mathcal {L}_{j}}\), we used the ground-truth disparity maps provided with the Middlebury stereo datasets [25].
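The metrics in (6) are plain set operations; a minimal sketch with our own naming:

```python
# P(j) and R(j) of (6) for one region's estimated and ground-truth subsets.
def precision_recall(estimated, ground_truth):
    inter = len(set(estimated) & set(ground_truth))
    return inter / len(set(estimated)), inter / len(set(ground_truth))

# e.g. an estimated subset that keeps all true labels plus two spurious ones:
print(precision_recall({1, 2, 3, 4}, {2, 3}))
# → (0.5, 1.0)
```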

Using these metrics, we evaluate our method with a varying scale factor s and number of layers n using only small datasets, as shown in Tables 1 and 2. Here, the results are averaged over all the datasets in this category.

Table 1 Evaluation of label subset estimation with fixed lowest scale and varying scale differences
Table 2 Evaluation of label subset estimation with fixed scale difference and varying number of layers

Table 1 shows the evaluation of the label subset estimation with a fixed lowest scale and varying scale differences. We observe that when the scale difference between two layers is small (down-scale factor s = 2), our algorithm successfully maintains around 90% of ground truth labels and truncates more than 50% of unnecessary labels, on average, whereas the original label subset contains 90% of unnecessary labels. When the scale difference is large (s = 16), our method maintains more than 70% of unnecessary labels, on average. Therefore, we select a small down-scale factor (s = 2) in the following.

Next, Table 2 shows the case of a fixed scale difference and varying number of layers. We observe that when the number of layers n is set to 4, the performance of our method is optimal, considering both the precision and the recall. In such cases, our algorithm maintains more than 90% of ground truth labels and truncates more than 50% of unnecessary labels, on average. Further, we observe that when the number of layers is small (n = 2), the precision is low (less than 50%).

In summary, the experimental results agree with our expectations: the improvement in precision is generally limited when the number of layers is too small or the scale difference between two layers is too large. With an appropriate number of layers (n = 4) and scale difference (s = 2), our method successfully retains important labels and removes unimportant labels using the coarse-to-fine strategy. Therefore, in the experiments described below, we fix n to 4 and s to 2.

4.1.2 Comparison with patchmatch filter

Here, we evaluate the performance of our method by comparing it with the PatchMatch filter (PMF) [22] using both the small and large datasets of the Middlebury stereo benchmark [25]. We did not compare our method with algorithms dedicated to stereo matching, because stereo matching is merely one application of our method to general multi-labeling problems. For a fair comparison, our method and PMF use the same superpixels (clustered by SLIC [1]), the same cost function, and the same post-processing based on left-right cross-checking and median filtering (for further details, see [28]). Further, we evaluate the performance of our method on a regular image grid with varying block sizes; note that the number of local regions is inversely proportional to the block size.

The results are presented in Tables 3 and 4. Here, the percentage disparity errors (with the threshold set to one pixel for the small datasets, and to one and four pixels for the large datasets) are averaged over all images within the same category. We observe that although our method, PMF [22], and CVF [28] provide nearly the same level of accuracy, our method is the most efficient in both categories. In particular, for the large datasets, our method is 6× faster than CVF [28] while providing similar (or higher) accuracy. We also observe that our method outperforms PMF when the number of local regions is large (e.g., superpixels with K = 200, 500) or when the image is divided into local regions on the basis of a simple image grid. This is because, unlike PMF [22], we do not consider any spatial smoothness of label subsets within a scale; instead, we consider the cross-scale smoothness of the local label subsets, which is independent of spatial coherence.

Table 3 Comparison with PMF using small datasets
Table 4 Comparison with PMF using large datasets

The estimated disparity maps of the teddy and art datasets are shown in Figs. 7 and 8, respectively. These are compared with those obtained by PMF [22] and CVF [28]. We observe that our method succeeds in estimating smoother and more reasonable disparity maps than CVF and PMF, especially in the case of the teddy dataset. Near object boundaries, CVF and PMF assign many incorrect labels, whereas our method does not. The reason is that our coarse-to-fine strategy successfully truncates incorrect labels that accidentally have low costs, as mentioned in Section 3.3.

Fig. 7
figure 7

Qualitative comparison with regard to estimated disparity maps of Teddy dataset

Fig. 8
figure 8

Qualitative comparison with regard to estimated disparity maps of Art dataset

Finally, we present the estimated disparity maps of small and large datasets in Figs. 9 and 10, respectively.

Fig. 9
figure 9

Qualitative results on the small datasets

Fig. 10
figure 10

Qualitative results on the large datasets

4.2 KITTI stereo 2015

We also conducted experiments on the KITTI stereo 2015 benchmark, which is more challenging than the Middlebury stereo dataset used in Section 4.1 in terms of disparity range and image resolution. All parameters and the cost function are exactly the same as those in Section 4.1. We used the 200 training images with ground truth disparity maps; the resolution of all images is 1241 × 376. For this dataset, we did not perform the post-processing (weighted median filtering) in order to compare the pure performance of each method. The disparity search range was set to 256 for all methods.

Table 5 compares the computational time and accuracy. Following the official evaluation rule of KITTI stereo 2015, we computed the percentage of erroneous pixels: a pixel is regarded as correctly estimated if its disparity error is less than 3 pixels or less than 5% of the ground-truth disparity. The results over the 200 images are averaged in Table 5. We observe that the accuracy of PMF [22] (K = 50) is worse than that of the original CVF [28], although PMF [22] is the fastest. On this dataset, our method with superpixel division is 5 to 6 times faster than the original CVF [28], and our method (K = 500) is considerably more accurate in both non-occluded and all regions. We observe that the PatchMatch search did not work effectively on this dataset, in contrast to our coarse-to-fine strategy. However, our method with regular grid division is inferior to that with superpixel division in terms of both efficiency and accuracy.
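The 3-pixel/5% error rule described above can be computed as follows. This is a minimal sketch; the masking of pixels without ground truth is our assumption:

```python
import numpy as np

def kitti_error_rate(d_est, d_gt, valid=None, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of erroneous pixels under the KITTI 2015 rule used in the
    text: a pixel is correct if its disparity error is below 3 px OR below
    5% of the ground-truth disparity. `valid` masks pixels with ground
    truth (here assumed to be pixels with d_gt > 0)."""
    if valid is None:
        valid = d_gt > 0
    err = np.abs(d_est - d_gt)
    ok = (err < abs_thresh) | (err < rel_thresh * d_gt)
    bad = valid & ~ok
    return 100.0 * bad.sum() / valid.sum()
```

For instance, with a ground-truth disparity of 100, an error of 4 pixels is still counted as correct (4 < 5% of 100), while an error of 10 pixels is not.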

Table 5 Comparison of computational time and accuracy using KITTI stereo 2015 datasets

The estimated disparity maps are shown in Fig. 11. Compared with CVF [28] and PMF [22], our method achieves smoother and more reasonable results by truncating unnecessary labels with the coarse-to-fine strategy. We observe that our method performs particularly well in regions with little or repetitive texture (e.g., on the road and in the sky).

Fig. 11
figure 11

Qualitative comparison with regard to estimated disparity maps of KITTI stereo 2015 dataset. Disparity maps (upper rows) and error maps (lower rows)

4.3 Middlebury optical flow

We also carried out experiments using the Middlebury optical flow benchmark [24]. In optical flow estimation, the label l corresponds to the 2-D motion vector (u, v) between the target image and the reference image, where u and v denote the displacements along the x and y directions, respectively, and take floating-point values. We use the same cost function as the original CVF [28]:

$$\begin{array}{@{}rcl@{}} C(i,l) &=& (1-\alpha)\min{[\|I^{\prime}_{i+l}-I_{i}\|,\tau_{1}]}\\ &&+\alpha\min{[\|\nabla_{x} I^{\prime}_{i+l}-\nabla_{x}I_{i}\|+\|\nabla_{y} I^{\prime}_{i+l}-\nabla_{y} I_{i}\|,\tau_{2}]}, \end{array} $$
(7)

where ∇_x and ∇_y are the gradients in the x and y directions, respectively. The parameters are set to the same values as in the stereo-matching experiments; only τ_2 is changed to 0.0156, in the same manner as in [28]. In all the datasets, the search ranges of both u and v are set to the interval from −10 to 10 pixels. To achieve sub-pixel accuracy, the step is set to 0.25 pixel (i.e., each of u and v takes values in {−10, −9.75, −9.5, …, 0, …, 9.5, 9.75, 10}). Therefore, the size of the entire label space is 81 × 81 = 6561.
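A minimal sketch of the matching cost (7) for a single label l = (u, v) is given below, assuming grayscale images in [0, 1] and integer displacements (the sub-pixel labels above would additionally require interpolating I′ when warping). The values of α and τ_1 are placeholders; only τ_2 = 0.0156 follows the text:

```python
import numpy as np

def label_cost(I, I_prime, u, v, alpha=0.9, tau1=0.028, tau2=0.0156):
    """Per-pixel matching cost of (7) for one label l = (u, v).

    I, I_prime: target and reference images as float arrays. alpha and
    tau1 are illustrative defaults, not the paper's stated values.
    """
    # warp I' by the candidate motion so that I'_{i+l} aligns with I_i
    W = np.roll(np.roll(I_prime, -v, axis=0), -u, axis=1)
    gx = lambda img: np.gradient(img, axis=1)  # horizontal gradient
    gy = lambda img: np.gradient(img, axis=0)  # vertical gradient
    # truncated color term and truncated gradient term, blended by alpha
    color = np.minimum(np.abs(W - I), tau1)
    grad = np.minimum(np.abs(gx(W) - gx(I)) + np.abs(gy(W) - gy(I)), tau2)
    return (1 - alpha) * color + alpha * grad
```

Evaluating this cost for every label in a subset yields the slice of the cost volume that is then smoothed by the guided filtering step of CVF.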

The results are listed in Table 6. Here, the average angular error (AAE) and the average end-point error (AEE) are used for evaluation. The per-pixel angular error (AE) and end-point error (EE) are defined, respectively, as

$$ AE = \cos^{-1}\left( \frac{1.0 + u\times u_{GT} + v\times v_{GT}}{\sqrt{1.0+u^{2}+v^{2}}\sqrt{1.0+u_{GT}^{2}+v_{GT}^{2}}}\right), $$
(8)
$$ EE = \sqrt{(u-u_{GT})^{2}+(v-v_{GT})^{2}}, $$
(9)

where (u, v) is the estimated flow and (u_{GT}, v_{GT}) is the ground-truth flow. We did not compare our method with algorithms dedicated to optical flow estimation, for the same reason as in the stereo-matching experiments. From Table 6, we observe that our method is superior to, and much faster than, the original CVF [28] in all cases. In particular, with superpixel division (K = 50), our method achieves the most accurate results and is much faster than CVF (10× or more). Further, with regular grid division, our method achieves a higher level of accuracy than CVF and is the most efficient.

Table 6 Comparison in optical flow estimation

The estimated flow maps of the Middlebury optical flow dataset are shown in Fig. 12. We observe that our method estimates the flow around boundaries more accurately than CVF. As in the case of our stereo matching results, this is because erroneous flow vectors, which yield minimum costs even though they are the wrong choices, are efficiently removed by our hierarchical approach.

Fig. 12
figure 12

Qualitative comparison with regard to estimated flow maps of Middlebury optical flow dataset

5 Conclusion

In this paper, we proposed a coarse-to-fine strategy to reduce a large label space for efficient cost-volume filtering. The proposed method truncates redundant labels in each local region by using the labeling output of lower scales. Our method demonstrated higher efficiency than CVF while maintaining a comparable level of accuracy in stereo matching and optical flow estimation. Compared with PMF, our method showed comparable performance; however, whereas PMF estimates compact label subsets through a complex PatchMatch search, our method does so with a simple coarse-to-fine strategy. Our method is therefore an alternative approach to optimizing label subsets for efficient cost-volume filtering, and it is much easier to implement than PMF. We will make our source code publicly available.

In future work, as the performance of our method depends on the shape and number of local regions, we intend to explore the optimal division of local regions. In addition, we plan to investigate the GPU implementation of our method for real-time applications.