
1 Introduction

Image segmentation is a central challenge in computer vision that has been studied extensively. After decades of research, a consensus has nonetheless emerged among researchers in the field that accurate segments, whether large regions or small superpixels, serve as an effective input representation for mid-level and high-level vision tasks, even though segmentation itself is intrinsically ambiguous. Typical tasks that have benefited greatly from building on good segmentations include object detection/recognition [24, 25], tracking [44], saliency estimation [10, 22], objectness proposal generation [2, 8, 43], and 3D inference [20]. The reasons are threefold: (i) extracted segments are meaningful units that carry informative features such as shape and texture [21, 26, 39]; (ii) the number of segments is often significantly lower than the number of pixels in the original image, resulting in a more compact representation with a great speed benefit [10]; (iii) the superpixel representation often offers better coherency and robustness than raw pixels [37].

Several seminal works have emerged as widely adopted state-of-the-art systems: the spectral clustering based normalized cuts approach [38]; the mean-shift algorithm, an efficient mode-seeking method in (color) feature space [13]; efficient graph-based image segmentation [17]; hierarchical region trees built from oriented watershed transform contours [4]; and the multi-scale normalized cuts algorithm [5]. Among these, the efficient graph-based image segmentation (EGB) [17] and SLIC [1] methods are particularly popular in computer vision and computer graphics [2, 7, 8, 10, 20, 22, 24, 25, 43, 44], due to their great speed advantage.

Fig. 1. Sample results from different steps of our method. The original image is from the BSDS500 [4] dataset. These segmentation results are generated at 50 fps on a GPU.

In this paper, we aim to develop a rapid image segmentation system that produces high-quality image segments for real-time computer vision tasks. We propose a hierarchical feature selection framework that learns how to combine features at each stage of a hierarchical structure. Our effort starts with a GPU version of the SLIC method [1, 34], which quickly obtains initial seed regions (superpixels) by oversegmentation. Image features are then extracted from the individual seed regions and combined using a distance metric learned from the training data. Note that to maintain efficiency, we only consider image features that are amenable to parallel computation on GPUs. A region merging step then uses the learned distance metric to produce a new set of regions for the next level in the hierarchy, and the system repeats this process for a few iterations.

The method developed in this paper is of practical importance to a variety of real-time applications, generating high-quality image segments (see also Fig. 1) at 50 fps. The performance of our method is quantitatively evaluated on the well-known BSDS500 [4] dataset (see also Sect. 4). As demonstrated by the evaluation results (see also Table 2), our method strikes a favorable balance between segmentation quality and computational efficiency when compared with alternative approaches [1, 4, 5, 17, 40]. We will open-source our system to make it publicly available.

2 Related Work

Image segmentation is a fundamental problem in computer vision [30]. We refer readers to the popular BSDS500 [4] benchmark and other recent studies [3, 5, 28, 42] for a comprehensive background discussion. Next, we highlight a few representative methods that are relevant and important to the method proposed here.

Considerable attention has been given to grouping algorithms that efficiently compute the normalized cuts criterion [38]. A multigrid eigenvector solver is designed in [28], enabling a substantial speed-up of the eigenvector computation. In [42], Taylor et al. reduce the size of the eigenproblem using a watershed oversegmentation, computing the eigenvectors in less than half a second. Pont-Tuset et al. [5] downsample the eigenvectors, solve them at a reduced size, and then upsample the solution to recover the structure of the image. Although satisfying segmentation results can be obtained by the above methods, computation time remains a bottleneck for these spectral clustering based approaches.

Along a different direction, SLIC [1] has emerged as one of the most celebrated methods, offering a good balance between accuracy and speed, and it has been adopted in many applications [6, 9, 23, 41, 49]. In [1], a k-means clustering approach initializes cluster centers by sampling pixels at regular grid steps, followed by a labeling procedure in which each pixel is assigned the index of the cluster center whose search region overlaps its location.
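To make the assignment step concrete, the following is a minimal sketch of one SLIC labeling pass (our illustration, not the authors' code); `lab` is an H×W×3 CIELAB image, `centers` holds (l, a, b, y, x) tuples sampled on a grid of step `S`, and `m` is the compactness parameter:

```python
import numpy as np

def slic_assign(lab, centers, S, m=10.0):
    """One SLIC assignment pass: each pixel receives the index of the
    nearest center (in combined color + spatial distance) among the
    centers whose 2S x 2S search window covers it."""
    H, W, _ = lab.shape
    labels = -np.ones((H, W), dtype=np.int32)
    best = np.full((H, W), np.inf)
    for k, (cl, ca, cb, cy, cx) in enumerate(centers):
        y0, y1 = max(0, int(cy - S)), min(H, int(cy + S) + 1)
        x0, x1 = max(0, int(cx - S)), min(W, int(cx + S) + 1)
        patch = lab[y0:y1, x0:x1]
        d_lab2 = ((patch - np.array([cl, ca, cb])) ** 2).sum(axis=2)
        ys, xs = np.mgrid[y0:y1, x0:x1]
        d_xy2 = (ys - cy) ** 2 + (xs - cx) ** 2
        D = d_lab2 + (m / S) ** 2 * d_xy2  # SLIC distance measure
        closer = D < best[y0:y1, x0:x1]
        best[y0:y1, x0:x1][closer] = D[closer]
        labels[y0:y1, x0:x1][closer] = k
    return labels
```

In practice SLIC alternates this assignment with a center-update step; the GPU version parallelizes the per-pixel distance tests.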

A graph-based clustering method presented by Felzenszwalb and Huttenlocher [17] has also been widely used. Given an undirected graph whose edges measure the dissimilarity between adjacent pixels, the goal of [17] is to perform a clustering such that each region is the minimum spanning tree of the pixels it contains. Since it merges directly from single pixels, which carry only weak color information, the algorithm of [17] is prone to noise. In contrast, we start our clustering from oversegmentations, which contain more informative features than single pixels. We discuss our procedure in detail in Sect. 3.
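For reference, the merging rule of [17] can be summarized in a short sketch (a simplified reimplementation under our own variable names, not the released code); `edges` is a list of (u, v, weight) tuples and `k` is the scale parameter that discourages small regions:

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # largest edge weight inside each component

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

def egb_segment(n_vertices, edges, k=300.0):
    """Felzenszwalb-Huttenlocher merging: process edges in increasing
    weight order and join two components when the edge weight does not
    exceed either component's internal variation plus k / size."""
    ds = DisjointSet(n_vertices)
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        a, b = ds.find(u), ds.find(v)
        if a == b:
            continue
        if w <= min(ds.internal[a] + k / ds.size[a],
                    ds.internal[b] + k / ds.size[b]):
            ds.parent[b] = a
            ds.size[a] += ds.size[b]
            ds.internal[a] = w  # valid because edges arrive in sorted order
    return ds
```

The same framework operates unchanged when the vertices are regions rather than pixels, which is how we use it in Sect. 3.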

Other popular methods for image segmentation include those based on feature learning [35]. These methods demonstrate good representational power by fusing features such as brightness, color, and texture using discriminative classifiers. Ren et al. [35] propose a hierarchical segmentation approach in which a cascade of boundary classifiers is applied to recursively combine regions starting from initial oversegmentations. In this spirit, our work bears a certain similarity to [35], where a cascade of classifiers is used for region grouping. Here, we stress the importance of real-time execution by carefully selecting regional features that are appropriate for GPU implementation when combined with [17]. The main contribution of our work is the development of a real-time image segmentation system of practical importance to many high-level computer vision tasks.

3 Our Method

In this section, we first introduce the problem formulation and our hierarchical merging algorithm. We then explain the parameter learning and feature extraction procedures, followed by a discussion of the design choices behind our method.

3.1 Problem Formulation

Given an image I, we partition it into L levels of segmentation \(\mathcal {S} = \{\mathcal {S}_1, \mathcal {S}_2, \cdots , \mathcal {S}_L\}\). Each segmentation \(\mathcal {S}_l\) is a decomposition of the image I into \(K_l\) regions

$$\begin{aligned} \mathcal {S}_l = \{R_1^{(l)}, R_2^{(l)}, \dots , R_{K_l}^{(l)}\}, \end{aligned}$$
(1)

where l denotes the level index in the hierarchy. We start with the finest segmentation \(\mathcal {S}_1\), consisting of a large number of regions, and gradually merge regions from level \(\mathcal {S}_{l}\) to a coarser level \(\mathcal {S}_{l+1}\). The coarsest-level segmentation is thus composed of the fewest regions.

We adopt a graph-based approach [17] to implement the region merging process \(\mathcal {S}_{l} \Rightarrow \mathcal {S}_{l+1}\) at each step. Let

$$\begin{aligned} G_l=(\mathcal {S}_l, \mathcal {A}_l) \end{aligned}$$
(2)

be an undirected graph, with vertices being the set of regions \(\mathcal {S}_l\) defined above, and edges \((R_i^{(l)}, R_j^{(l)}) \in \mathcal {A}_l\) corresponding to pairs of neighboring vertices. Each edge \((R_i^{(l)}, R_j^{(l)}) \in \mathcal {A}_l\) has a feature vector \(\mathbf {T}_{i,j}^{(l)}\) (see also Sect. 3.4) and a corresponding predicted score \(s_{i,j}^{(l)}\), a non-negative measure of the distance between regions \(R_i^{(l)}\) and \(R_j^{(l)}\).

Based on the above problem definition, our task is then to quickly merge regions to produce coherent segments that best match human annotations, such as those in the BSDS500 [4] benchmark.
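To make the graph concrete, a minimal sketch (ours, for illustration) of constructing the edge set \(\mathcal {A}_l\) from a region label map follows; regions are taken to be adjacent when they share at least one pair of 4-connected pixels:

```python
import numpy as np

def build_region_graph(labels):
    """Build the edge set A_l of the graph G_l = (S_l, A_l) in Eq. (2):
    vertices are region ids in the label map, and an undirected edge
    connects every pair of regions with 4-adjacent pixels."""
    edges = set()
    # horizontally adjacent pixel pairs with different labels
    a, b = labels[:, :-1], labels[:, 1:]
    diff = a != b
    edges.update(zip(np.minimum(a[diff], b[diff]).tolist(),
                     np.maximum(a[diff], b[diff]).tolist()))
    # vertically adjacent pixel pairs with different labels
    a, b = labels[:-1, :], labels[1:, :]
    diff = a != b
    edges.update(zip(np.minimum(a[diff], b[diff]).tolist(),
                     np.maximum(a[diff], b[diff]).tolist()))
    return sorted(edges)
```

Each edge (i, j) then receives the feature vector \(\mathbf {T}_{i,j}^{(l)}\) and the predicted distance \(s_{i,j}^{(l)}\) described above.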

3.2 Hierarchical Merging

To achieve high quality while retaining top efficiency, we propose to (i) iteratively learn how to combine features, updating the image features after region merging at each level; and (ii) use fast parallel superpixel generation methods [1, 34] to group image pixels into initial regions before further merging.

The pipeline of our method is shown in Fig. 2, with example results displayed in Fig. 1 and the procedure listed in Algorithm 1. In the first step, the GPU-SLIC method [1, 34] is used to over-segment the input image into superpixels, which serve as seed regions in the \(1^{st}\) level \(\mathcal {S}_1 = \{R_1^{(1)}, R_2^{(1)}, \dots , R_{K_1}^{(1)}\}\). In the subsequent steps, both internal and marginal features (see also Sect. 3.4) are extracted. Using support vector machine (SVM) regressors, we learn from training data (see also Sect. 3.3) how to map the feature vectors \(\mathbf {T}_{i,j}^{(l)}\) to a suitable distance measure between regions \(R_i^{(l)}\) and \(R_j^{(l)}\). We then progressively merge the regions in \(\mathcal {S}_{l}\) to arrive at a coarser segmentation \(\mathcal {S}_{l+1}\), following the efficient graph-based (EGB) image segmentation framework [17] with the graph defined in Eq. (2).

Fig. 2. Pipeline of our method.

Algorithm 1. The proposed hierarchical merging procedure.
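The overall loop can be sketched as follows (an illustrative outline of Algorithm 1, with hypothetical helper names `gpu_slic`, `extract_pair_features`, and `egb_merge` standing in for the components described in this section):

```python
def hfs_segment(image, svm_models, thresholds, L=3):
    """Sketch of the hierarchical pipeline: oversegment into seed
    regions, then alternately score adjacent region pairs with the
    per-level SVM regressor and merge them in the EGB framework."""
    labels = gpu_slic(image)                       # S_1: seed regions
    for level in range(1, L):
        pairs = build_region_graph(labels)         # A_l, as sketched above
        feats = [extract_pair_features(image, labels, i, j)
                 for (i, j) in pairs]
        scores = svm_models[level].predict(feats)  # learned distances s_ij
        weighted = [(i, j, s) for (i, j), s in zip(pairs, scores)]
        labels = egb_merge(labels, weighted, thresholds[level])
    return labels                                  # coarsest level S_L
```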

Our design principle is motivated by a recent trend of using discriminative learning to find proper feature combinations for various vision tasks [4, 22, 30]. A number of psychophysics studies [36] suggest that humans use multiple cues to separate objects in natural scenes. Compared with an ad-hoc design, extracting image features and letting the data speak for themselves has proven an appropriate way to learn how to combine different visual cues [4, 30]. Our system design is also motivated by the observation that image features play different roles at different scales when deciding whether two regions should be merged. At a fine scale, e.g. the pixel level, color similarity and spatial distance are important, as observed in many state-of-the-art image segmentation methods [1, 17]. As region merging/grouping progresses to coarser levels, texture similarity, edges between regions, and other cues become more important factors in judging whether two regions should be merged.

Instead of learning a single rule for cue/feature combination across all levels [4, 30], we experiment with an alternative approach that iteratively updates region features and their combination weights (see also Fig. 3). Our experiments show favorable results for this approach. Based on this observation, we design a hierarchical architecture with multiple levels, in which region merging [17] and feature updating are applied recursively.

It is worth noting that we only extract features that are simple and suitable for parallel computing on modern GPUs. More informative CNN features, e.g. from the end-to-end edge detection system HED [46], can be incorporated to improve the segmentation results.

Our experimental results indicate that the F-measure [4]

$$\begin{aligned} F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}, \end{aligned}$$
(3)

can be increased by \(6\,\%\) when HED is used. However, because of the overhead of HED (0.4 s per image, as opposed to 0.02 s for our vanilla version), we make the use of HED optional.

3.3 Parameter Learning

As described above, given a set of initial regions, we learn an edge weight \(\mathbf {w}^{(l)}_{i,j} \in \mathbf {w}^{(l)}\) for every region pair \((R_i^{(l)},R_j^{(l)}) \in \mathcal {A}_l\). Every region pair is associated with a feature vector \(\mathbf {T}_{i,j}^{(l)}\), so our next step is to provide a label for each region pair at level l. Since the initial regions at each level may have irregular shapes, we use the F-measure to determine the ground-truth label of each region pair in \(\mathcal {A}_l\).

We first calculate the F-measure of the initial segmentation at level l, denoted \(F_{init}^{(l)}\). Then, for each region pair \((R_i^{(l)},R_j^{(l)})\) in \(\mathcal {A}_l\), we recompute the F-measure after merging the pair. If the F-measure after merging \((R_i^{(l)},R_j^{(l)})\) is greater than \(F_{init}^{(l)}\), the corresponding label \(y_{i,j}^{(l)}\) is set to 0; otherwise, \(y_{i,j}^{(l)}\) is set to 1. We then adopt a support vector machine (SVM) regressor to learn the feature weights \(\mathbf {w}^{(l)}\).
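A minimal sketch of this step, assuming scikit-learn's LinearSVR as the regressor (the paper does not prescribe a particular SVM implementation or hyperparameters):

```python
import numpy as np
from sklearn.svm import LinearSVR  # assumption: any linear SVM regressor works

def make_labels(pairs, f_init, f_after_merge):
    """Label a region pair 0 if merging it raises the F-measure above
    the initial segmentation's score F_init, and 1 otherwise, so the
    label behaves like a distance."""
    return np.array([0.0 if f_after_merge[p] > f_init else 1.0
                     for p in pairs])

def learn_level_weights(features, labels):
    """Fit a linear SVM regressor on the pair features T_ij; its
    coefficients play the role of the feature weights w^(l)."""
    model = LinearSVR(C=1.0).fit(features, labels)
    return model.coef_, model
```

At test time, the regressor's prediction on \(\mathbf {T}_{i,j}^{(l)}\) gives the edge score \(s_{i,j}^{(l)}\) used by the merging step.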

3.4 Feature Extraction

Our system explores a group of simple features that can be efficiently calculated on modern GPUs. Both internal and marginal features are considered. Table 1 lists the features we have considered, and we discuss their details below.

Table 1. Features for adjacent regions

Brightness and Colors. The brightness and color cues in the CIELAB color space have proven very useful [4, 30]. We use the mean L*a*b* values to represent the color of a segment. To tolerate variations in the relative weight of brightness and colors, we use both the Euclidean distance (\(d_c\)) and the per-channel distances (\(d_l\), \(d_a\), \(d_b\)) between two adjacent segments.

Average gradient maximum along boundary. Previous works have shown that gradient information is an important cue in boundary detection. Instead of using gradients directly, we use gradients after non-maxima suppression. For adjacent segments \(R_i\) and \(R_j\), the computation places a small circular disc at each pixel \(p_k \in \varGamma \), where \(\varGamma \) denotes their shared boundary, and records the maximum gradient \(\delta '(p_k)\) within the disc. The average of \(\delta '(p_k)\) over the boundary is then taken as the gradient feature \(d_g(R_i,R_j)\).

\(\chi ^2\) distance between RGB histograms. To exploit more detailed color information, we employ a color histogram with \(8 \times 8 \times 8\) bins in the RGB color space. For the histograms of adjacent segments, we use the \(\chi ^2\) distance to measure their difference.
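For concreteness, a short sketch of this feature (our illustration; the bin counts follow the text above):

```python
import numpy as np

def rgb_histogram(pixels, bins=8):
    """512-bin joint RGB histogram of a segment's pixels (an N x 3
    uint8 array), normalized to sum to one."""
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist / max(pixels.shape[0], 1)

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    h1, h2 = h1.ravel(), h2.ravel()
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

The same \(\chi^2\) measure applies unchanged to the gradient histograms discussed next.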

\(\chi ^2\) distance between gradient histograms. The \(\chi ^2\) distance between the histograms of oriented gradients of two segments is also an attractive choice.

Variances. Variance is a good measure of the fluctuation of data. We compute variances in the RGB (\(s_r\), \(s_g\), \(s_b\)) and CIELAB (\(s_l'\), \(s_a'\), \(s_b'\)) color spaces over \(R_i \bigcup R_j\), where \(R_i\) and \(R_j\) are adjacent segments. The magnitude of these variances reflects the similarity between the two segments.

Average HED maximum along boundary. The HED feature is computed similarly to the gradient feature above. However, because of the extra overhead of HED, we make this choice optional.

The above features play different roles at different levels. The weights of the features learned at the first and second levels are compared in Fig. 3 (excluding the HED feature). To take computational complexity into account, we choose only a small set of features that are easy to calculate instead of using all of them. The top five features are \(d_l\), \(d_a\), \(d_b\), \(d_c\), and \(d_g\); all experimental results reported in this paper are based on these features.
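A sketch of the selected features for one adjacent pair (our illustration; the disc around each boundary pixel is approximated by a small square window, and `grad_nms` is a gradient map after non-maxima suppression):

```python
import numpy as np

def color_distances(mean_lab_i, mean_lab_j):
    """d_l, d_a, d_b: per-channel absolute differences of mean L*a*b*;
    d_c: their Euclidean distance."""
    diff = np.abs(np.asarray(mean_lab_i) - np.asarray(mean_lab_j))
    d_l, d_a, d_b = diff
    return d_l, d_a, d_b, np.linalg.norm(diff)

def gradient_feature(grad_nms, boundary_pixels, radius=2):
    """d_g: average over the shared boundary of the maximum suppressed
    gradient within a small neighborhood of each boundary pixel."""
    H, W = grad_nms.shape
    vals = []
    for (y, x) in boundary_pixels:
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        vals.append(grad_nms[y0:y1, x0:x1].max())
    return float(np.mean(vals)) if vals else 0.0
```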

3.5 Implementation Details

To build a practical system, we choose \(L=3\) as the default value in this paper. All experiments use a machine with an Intel Xeon CPU E5-2676 v3 @ 2.40 GHz and an NVIDIA GeForce GTX 980 Ti. All running times are reported without data parallelism, except for the objectness proposal application (Sect. 4.2), which is designed to adhere to general practice.

Fig. 3. Comparison of the feature weights learned at the \(1^{st}\) and \(2^{nd}\) levels.

4 Experiments

4.1 Evaluation

In this part, we evaluate our method on the BSDS500 [4] benchmark, which is widely used to evaluate segmentation and grouping methods. Two choices of scale are used: the optimal dataset scale (ODS), where a single scale is optimized over the entire dataset, and the optimal image scale (OIS), where the scale is optimized for each test image. For boundary assessment, we use the F-measure of precision and recall at the ODS. The region-based measures are:

  • Variation of Information (VI), measuring the distance between ground truth (GT) and the proposed segmentation;

  • Probabilistic Rand Index (PRI), measuring the pairwise compatibility of element assignment between GT and the proposed segmentation;

  • Segmentation Covering (Covering), measuring the average overlap between GT and the proposed segmentation.

See [4] for more details. Figure 3 shows the weight comparison of the selected features; we can clearly see that feature importance differs across levels. To make our results more convincing, we compare our method with the approaches of [1, 4, 13, 14, 17, 31, 40], as well as the MCG and SCG approaches of [5]. All experiments are carried out using publicly available source code.

Fig. 4. Experimental evaluation of boundaries on the BSDS500 [4] test set. The F-measure is computed from precision and recall at the Optimal Dataset Scale (ODS), and the execution time is measured without data parallelism.

Table 2. Region Benchmarks on the BSDS500 [4]
Fig. 5. Example results of EGB, SLIC, and our method. We compare only with these two algorithms because they are the only ones efficient enough to be used in applications. Left: image. Middle left: SLIC. Middle right: EGB. Right: ours. Regions are represented by their mean color, and all images are from the test set of BSDS500 [4].

We show the evaluation results on the boundary benchmark in Fig. 4, in which all execution times are measured without data parallelism. We can see that [5] achieves the best result compared with the others. However, even its simplified version SCG still needs about 2 s to process an image; for this reason, despite its ingenuity, it cannot be employed in today's applications. Similarly, the accuracy of [4] is very competitive with other methods, but the method is extremely slow, taking about 86 s per image. Our approach is hundreds of times faster than [4, 5], achieving 50 fps; when data parallelism is enabled, the speed reaches 200+ fps. Our approach can therefore be readily used in almost all current applications, including real-time systems. Compared with superpixel extraction methods, e.g. [1, 17, 31], Fig. 4 demonstrates that our method is much faster and, more importantly, significantly more accurate. Compared with the remaining three methods [13, 14, 40], our speed advantage is obvious, though our F-measure is only slightly higher. Moreover, the F-measure of our enhanced version is very close to the best performance, at a very high speed.

Table 2 presents the region benchmarks on the BSDS500 [4]. MCG performs best on all metrics, but it needs about 15 s per image. SLIC achieves the worst results, although its GPU version can be very fast. It is easy to see that our approach is close to the best performance on all criteria, especially in its enhanced version. We thus conclude that our approach achieves a better trade-off between efficiency and quality than the others. Figure 5 shows some example results of our method compared with the other two fast algorithms [1, 17].

The reasons for not obtaining the best results on every criterion are two-fold. First, the initial superpixels produced by SLIC are not ideal. For instance, when the step S is set to 8 pixels, about 2200 superpixels should intuitively be produced for each image; however, the boundary recall of SLIC, which measures the fraction of ground-truth contours that fall within the 8-neighborhood of a superpixel boundary, is only \(73\,\%\). This may significantly affect the first-level results of our merging strategy. Second, since [17] cannot control the compactness of the generated superpixels, our merging strategy cannot always obtain the desired regions. More specifically, [17] uses only a constant parameter to prevent each region from growing too large; this criterion is sometimes unreasonable given the diversity of input images.

Nevertheless, owing to our architecture, our results still outperform most existing segmentation approaches. The following parts describe the applications of our approach to saliency detection and objectness proposal generation, which demonstrate its practicality.

4.2 Objectness

Generic object proposal generation has been a hot topic in recent years. As a preprocessing step in applications such as object recognition and detection, it generates a number of bounding boxes that may contain objects. Algorithms of this type have been used in many existing object detection methods [43, 45], and such detection methods have been shown to outperform the classical sliding-window paradigm [15, 16, 18].

As metrics for objectness approaches, we adopt the mean average best overlap (MABO) across all classes [43] and computational efficiency. Cheng et al. [12] recently proposed a very fast method (BING) that generates box proposals at 300 fps, but it does not perform well on the MABO benchmark. Chen et al. [8] propose a post-processing approach (MTSE) that refines the bounding boxes produced by objectness methods; their algorithm uses [17] to generate regions. To show the advantages of our segmentation method, we take [8] as the post-processing step of [12] and replace [17] with our method.

Fig. 6. Trade-off between MABO and the number of proposals for different methods on the VOC2007 test set.

We extensively evaluate the new system on the challenging PASCAL VOC 2007 dataset [16]. To demonstrate the advantages of our system, we compare our results with several influential methods, including [2, 12, 43, 50]. From Fig. 6, one can see that our modified version performs better than the original [8]. With our segmentation method, the speed is significantly boosted: we obtain competitive boxes at over 100 fps, compared with the 0.25 s per image reported in [8]. Moreover, Fig. 6 indicates that the new system using our segmentation is among the best objectness methods in terms of quality. The new system thus achieves the best trade-off between efficiency and quality.

4.3 Saliency

In this part, we report the benefit of our method in another domain of computer vision. Visual saliency has long been a fundamental problem in neuroscience, psychology, neural systems, and computer vision. In computer vision, detecting and segmenting salient objects in natural scenes, also known as salient object detection, has attracted much focused research and has resulted in many applications. However, because most saliency detection methods are region-based, two issues have long been the bottleneck of salient object detection: segmentation quality [33] and computational efficiency. Recently, Jiang et al. [22] proposed a supervised learning method (DRFI) that predicts a saliency score for the regions produced by the popular segmentation method [17], achieving good performance on several popular datasets such as MSRA10K [10] and PASCAL-S [27]. Here, we replace [17] with our segmentation method, used as a single level.

Fig. 7. Mean absolute errors of the state-of-the-art methods on MSRA10K [10] and PASCAL-S [27]. DRFIs is the single-level version of DRFI; note that our method is likewise used as a single level. The proposed approach consistently achieves the lowest error rates on both datasets.

For a faithful comparison, we evaluate currently popular detection methods [10, 11, 19, 22, 29, 47, 48] on the datasets mentioned above using the mean absolute error (MAE) [32], which is introduced to reflect negative saliency assignments. It is defined between a saliency map S and the binary ground truth GT as:

$$\begin{aligned} MAE = \frac{1}{|I|} \sum _x |S(I_x)-GT(I_x)|, \end{aligned}$$
(4)

where \(|I|\) is the total number of pixels. The MAE results on these two datasets are shown in Fig. 7. Our method achieves the lowest MAE values on both datasets; specifically, it decreases the MAE by 0.57% and 1.43% relative to the second-best algorithms. This means that its predicted saliency maps are closest to the ground truth.
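Eq. (4) amounts to a per-pixel average of absolute differences; a two-line sketch (ours) for clarity:

```python
import numpy as np

def mean_absolute_error(saliency, gt):
    """Eq. (4): mean absolute difference between a saliency map in
    [0, 1] and a binary ground-truth mask of the same size."""
    return np.abs(saliency.astype(np.float64)
                  - gt.astype(np.float64)).mean()
```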

5 Discussion

In this paper, we have proposed a hierarchical method for image segmentation. Our hierarchical architecture enjoys the benefits of engaging different feature settings at different scale levels. In addition, we exploit the capability of modern GPUs to efficiently compute a set of simple but useful features. Our approach produces high-quality hierarchical regions with a substantial speed-up over previous state-of-the-art works. Evaluation results on a standard benchmark (BSDS500 [4]) show that our method achieves a favorable trade-off between efficiency and quality. When plugged into other computer vision tasks such as objectness proposal generation and saliency detection, our method improves their performance. To encourage future work, we make the source code of this work publicly available at http://mmcheng.net/hfs/.