1 Introduction

Salient object detection, which aims to locate the interesting and important regions in an image [1], has recently attracted considerable research interest. Saliency maps benefit numerous applications, such as object recognition, object tracking, image segmentation, image compression, image retrieval, and image quality assessment.

Based on the data processing mechanism, saliency detection methods can generally be categorized as either bottom-up [1,2,3,4] or top-down [5,6,7] schemes. The bottom-up model is a fast, unconscious, data-driven, open-loop visual attention mechanism based on the characteristics of the visual scene. In contrast, the top-down model is a slow, conscious, task-driven, closed-loop visual attention mechanism that relies on the observer’s expectations. Saliency detection methods can also be classified by task into salient region detection and eye fixation prediction. In this paper, we focus on the bottom-up salient object detection task.

Most bottom-up saliency detection methods are based on low-level features, such as color contrast, Euclidean distance and orientation. Itti et al. [1] proposed a conceptual model for saliency detection that performs multi-feature extraction and multi-scale decomposition of the input image and then fuses the feature maps linearly. Cheng et al. [3] presented a histogram contrast-based (HC) method, which considers the regional contrast with respect to the entire image and pixel-wise color separation to produce a saliency map. Zhai et al. [8] calculated the global luminance contrast (LC) of each pixel over the entire image to detect saliency. Hou et al. [9] established a spectral residual (SR) model of the image to obtain the saliency map. Achanta et al. [10] computed the saliency likelihood of each pixel by a frequency-tuned method based on luminance and color. By combining color uniqueness and spatial distribution, Perazzi et al. [11] applied a high-dimensional Gaussian filter to generate a pixel-wise saliency map. Zhou et al. [12] generated a pixel-wise saliency map by integrating diffusional compactness and local contrast (DCLC) cues.

However, methods based on these low-level features may ignore the intrinsic connections between pixels and regions in an image. Graph-based methods have been put forward to address this problem. Harel et al. [13] explored a graph-based visual saliency algorithm, which uses certain features to form an activation map and then highlights the area of interest by normalization. Gopalakrishnan et al. [2] detected seed nodes with a Markov random walk model carried out on a sparse k-regular graph and a complete graph, and then used the seed nodes to estimate the location of the most salient region in the image. Using graph-based manifold ranking (MR), Yang et al. [4] utilized the boundary regions as background labels to generate an initial saliency map and extracted foreground labels from the initial map to obtain the final saliency map. In [14], a co-transduction algorithm is devised to fuse boundary and objectness labels based on an inter propagation scheme (LPS). Zhang et al. [15] adopted a linear scheme to fuse a texture saliency map and a color saliency map (TC) by manifold ranking. Zhou et al. [16] detected salient regions via a diffusion process on a sparse graph (DSG) and calculated background seed vectors by a compactness measure. Yuan et al. [17] removed foreground labels from the background prior by reversion correction and built a regularized random walk ranking (RCRR) model to generate a pixel-wise saliency map.

Among the graph-based methods, the boundary-based models outperform most state-of-the-art saliency detection methods and are more computationally efficient. However, some drawbacks still prevent optimal performance. Firstly, most constructed graphs, such as those proposed in [4, 17], are fully connected over two layers: each node connects to its neighboring nodes as well as to the nodes that share common boundaries with those neighbors. If the nodes of a salient object are inhomogeneous or incoherent, such a graph may introduce errors and seldom detects the complete foreground. Secondly, background regions usually have a wider distribution over the entire image, yet [4, 17] treat only the four boundaries of the image as background labels. This is insufficient and may fail when foreground objects touch the boundary.

To overcome the above-mentioned problems, we propose a half-two layers graph and select accurate seed labels by clustering for saliency detection. Firstly, we construct a half-two layers graph model, which is generated by connecting each node to its neighboring nodes and to the most similar half of the nodes that share common boundaries with those neighbors. This construction effectively removes redundant nodes and fully uses the local spatial information. Then we apply K-means to cluster the image superpixels, and the clusters that contain boundary superpixels are regarded as background. Because foreground objects may touch the boundary, we employ the reversion correction method [17] to remove foreground regions from these background labels. The background saliency map is obtained from the background labels by manifold ranking. Finally, we binarize the background saliency map, use the complete clusters as foreground labels, and apply manifold ranking with these foreground labels to obtain the final saliency map.

The remainder of this paper is organized as follows. Section 2 presents the overall flow of our algorithm, including the construction of the graph model and the selection of background and foreground labels. Section 3 reports the experimental results on the ASD, CSSD, ECSSD and SOD datasets, and Section 4 concludes the paper.

2 The Proposed Method

The framework of our proposed algorithm is shown in Fig. 1.

Fig. 1. Principal steps of our method.

Firstly, we apply the SLIC algorithm [18] to generate superpixels and construct a half-two layers graph. Secondly, we employ K-means to cluster the superpixels. Thirdly, we select the clusters that contain boundary superpixels as background labels and remove the foreground labels from them. Finally, after applying an adaptive threshold, the complete clusters are regarded as foreground labels, and we apply manifold ranking [16] to obtain the final saliency map.

2.1 Graph Construction and Clustering

To improve the performance of salient object detection, we use the SLIC algorithm to divide the input image into homogeneous and compact superpixels using the color means. We then construct a graph \( {\text{G}} = \left( {{\text{V}},{\text{E}}} \right) \) based on the superpixels of the image, where each node in V denotes a superpixel produced by the SLIC algorithm and each edge in E denotes that \( V_{i} \) connects to \( V_{j} \). The node set V consists of the superpixels \( {\text{X}} = \left\{ {x_{1} , \ldots ,x_{q} , x_{q + 1} , \ldots , x_{n} } \right\} \in {\mathbb{R}}^{m} \). Some nodes are used as queries, and the remaining nodes need to be ranked according to their relevance to the queries. Let \( {\text{f}}:{\text{X}} \to {\mathbb{R}} \) denote a ranking function, which assigns a ranking value \( f_{i} \) to each node \( x_{i} \); f can be regarded as a vector \( {\text{f}} = \left[ {f_{1} , \ldots ,f_{n} } \right]^{T} \). Let \( \text{y} = \left[ {y_{1} , \ldots ,y_{n} } \right]^{T} \) denote an indication vector, where \( y_{i} = 1 \) if \( x_{i} \) is a query and \( y_{i} = 0 \) otherwise. We use manifold ranking [4] as the ranking function, which is written as:

$$ {\text{f}} = \left( {D - \alpha W} \right)^{ - 1} y $$
(1)

where α is a constant, \( W = \{ w_{ij} \}_{N \times N} \) is the affinity matrix, and \( D = diag\{ d_{11} ,d_{22} , \ldots ,d_{NN} \} \) is the degree matrix with \( d_{ii} = \sum\nolimits_{j} {w_{ij} } \). More details on manifold ranking can be found in [4, 19].
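As an illustration of Eq. (1), the following minimal sketch computes the ranking vector for a small graph. The function name `manifold_rank` and the toy weights are hypothetical and only serve the example; the linear system is solved directly instead of forming the inverse, which gives the same result.

```python
import numpy as np

def manifold_rank(W, y, alpha=0.99):
    """Closed-form ranking f = (D - alpha * W)^(-1) y, as in Eq. (1)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.diag(W.sum(axis=1))  # degree matrix, d_ii = sum_j w_ij
    # Solve (D - alpha * W) f = y rather than inverting the matrix explicitly.
    return np.linalg.solve(D - alpha * W, y)

# Toy 4-node chain graph with made-up weights; node 0 is the only query.
W = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])
f = manifold_rank(W, y)
```

In this toy chain the ranking value decreases with graph distance from the query node, which is the behaviour the saliency estimation relies on.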

We define the weight \( w_{ij} \) between two nodes as

$$ w_{ij} = e^{{\frac{{ - \left\| {c_{i} - c_{j} } \right\|}}{{\sigma^{2} }}}} $$
(2)

where \( c_{i} \) and \( c_{j} \) denote the mean colors of nodes \( V_{i} \) and \( V_{j} \) in the Lab color space, and σ is a constant that controls the strength of the weight.
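The weight in Eq. (2) can be sketched as follows. The helper `color_affinity` is a hypothetical name, and the sketch assumes that the mean Lab colors have been scaled to [0, 1] (the text does not state the scaling) and that the connectivity is supplied as a 0/1 adjacency matrix.

```python
import numpy as np

def color_affinity(lab_means, adjacency, sigma=0.1):
    """w_ij = exp(-||c_i - c_j|| / sigma^2) on connected pairs, 0 elsewhere."""
    c = np.asarray(lab_means, dtype=float)
    # Pairwise Euclidean distances between mean colours.
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
    W = np.exp(-dist / sigma ** 2)
    W = W * np.asarray(adjacency, dtype=float)  # keep only graph edges
    np.fill_diagonal(W, 0.0)                    # no self-loops
    return W

# Three nodes with made-up colours; node 0 is linked to nodes 1 and 2.
lab = [[0.50, 0.50, 0.50],
       [0.51, 0.50, 0.50],
       [0.60, 0.50, 0.50]]
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
W = color_affinity(lab, adj)
```

With σ = 0.1 the weight falls off quickly: a color difference of 0.01 gives a weight of about exp(-1), whereas a difference of 0.1 gives a weight close to zero.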

Generally, most graph-based methods construct a two-layer connection in which each node connects to its neighboring nodes \( D_{1} \left( {\text{j}} \right) \) as well as to the nodes that share common boundaries with those neighbors, \( D_{2} \left( {\text{j}} \right) \), which may capture erroneous local relations. Thus, in this paper, we propose a half-two layers graph for calculating saliency. As shown in Fig. 2, the half-two layers graph is generated by connecting each node to its neighboring nodes and to the most similar half of the nodes p that share common boundaries with those neighbors. The second layer contains useful local information, but it is mixed with redundant information. To reduce redundancy while retaining local information, we keep only the most similar half of the second-layer nodes, which is denoted as:

Fig. 2. The half-two layers graph model. (a) Input image. (b) Edge connection between nodes. A node (illustrated by a pink dot) connects to both its adjacent nodes (yellow dot) and the most similar half of the nodes (green dot) sharing common boundaries with its adjacent nodes. Each pair of boundary nodes is connected (red dot and connection). (Color figure online)

$$ {\text{D}}\left( {\text{p}} \right) = \left\{ {{\text{q}} \in D_{2} \left( {\text{j}} \right):w_{ij} > v} \right\} $$
(3)

where v is the mean weight of the second-layer nodes \( D_{2} \left( {\text{j}} \right) \), q is a node in \( D_{2} \left( {\text{j}} \right) \), and p is a retained node whose weight is larger than v.
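The selection rule of Eq. (3) can be sketched as below. This is one possible reading rather than the authors' implementation: it assumes that the weight compared with v is the one between the center node j and each second-layer candidate q, and that the second layer excludes j itself and its first-layer neighbors. The function name `half_two_layer_edges` is hypothetical, and the pairwise connection of boundary nodes is omitted.

```python
import numpy as np

def half_two_layer_edges(adjacent, weights):
    """Return a symmetric boolean edge matrix for the half-two layers graph.

    adjacent : (n, n) boolean matrix, True where two superpixels share a border.
    weights  : (n, n) matrix of pairwise weights from Eq. (2) for all pairs.
    """
    adjacent = np.asarray(adjacent, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    n = adjacent.shape[0]
    edges = adjacent.copy()
    for j in range(n):
        first = adjacent[j]                              # D_1(j)
        # D_2(j): neighbours of neighbours, excluding j and D_1(j) (assumed).
        second = adjacent[first].any(axis=0) & ~first
        second[j] = False
        idx = np.flatnonzero(second)
        if idx.size == 0:
            continue
        v = weights[j, idx].mean()                       # mean second-layer weight
        keep = idx[weights[j, idx] > v]                  # Eq. (3): weight above v
        edges[j, keep] = True
    edges |= edges.T                                     # undirected graph
    np.fill_diagonal(edges, False)
    return edges

# Toy layout: 0 touches 1 and 2; 1 touches 3; 2 touches 4 (made-up weights).
adjacent = np.zeros((5, 5), dtype=bool)
for a, b in [(0, 1), (0, 2), (1, 3), (2, 4)]:
    adjacent[a, b] = adjacent[b, a] = True
weights = np.full((5, 5), 0.5)
weights[0, 3] = weights[3, 0] = 0.8
weights[0, 4] = weights[4, 0] = 0.2
edges = half_two_layer_edges(adjacent, weights)
```

In the toy layout, node 0 has second-layer candidates 3 and 4 with mean weight 0.5, so only the more similar node 3 is linked to it.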

Moreover, the nodes on the four boundaries of the image are connected to each other in pairs, so the image is described as a closed-loop graph. The constructed graph model thus removes redundant nodes and fully uses the local spatial distribution, which is an advantage over other graph models.

We then employ the K-means algorithm to cluster the N superpixels of the image into K clusters. Because the Lab color space is closely related to human perception, we cluster on the three-dimensional Lab color feature.
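The clustering step can be illustrated with a plain Lloyd-style K-means on the mean Lab color of each superpixel. The sketch below is a generic implementation under assumed settings (random initialization from the data, a fixed iteration cap); the text does not specify these details, and the six feature vectors are made up.

```python
import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    x = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each superpixel to its nearest center.
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = x[labels == c]
            if len(members):  # keep the old center if a cluster empties
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Six made-up Lab feature vectors forming two well-separated groups.
lab_features = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                [10.0, 10.0, 10.0], [10.1, 10.0, 10.0], [10.0, 10.1, 10.0]]
labels, centers = kmeans(lab_features, k=2)
```

In the setting described here, `features` would hold the N superpixel color means and k would be set to K.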

2.2 Background-Based Saliency Estimation

Most background regions lie near the image boundary; compared with foreground regions, they are sparse and have a wider spatial distribution over the entire image. However, simply using the boundary superpixels as background labels is not adequate. We therefore extend the background labels by clustering the image: each cluster contains at least one superpixel, and the clusters that contain boundary background superpixels are regarded as background labels. With more background labels, the background prior becomes more effective for detecting the salient foreground object and for uniformly highlighting the entire salient region.
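The label extension described above, before any reverse correction is applied, can be sketched in a few lines. The function name `expand_boundary_labels` and the toy cluster assignment are hypothetical.

```python
def expand_boundary_labels(cluster_of, boundary_nodes):
    """Return all superpixels whose cluster contains a boundary superpixel."""
    boundary_clusters = {cluster_of[i] for i in boundary_nodes}
    return [i for i, c in enumerate(cluster_of) if c in boundary_clusters]

# Six superpixels in three clusters; superpixels 0 and 4 lie on one boundary.
cluster_of = [0, 0, 1, 2, 2, 1]
background = expand_boundary_labels(cluster_of, boundary_nodes=[0, 4])
```

Here superpixels 1 and 3 do not touch the boundary, but they become background labels because they share a cluster with a boundary superpixel.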

To select the background labels more accurately, we first calculate the initial saliency map using the boundary regions as in [4] and then remove the boundary-adjacent foreground regions from the boundary clusters by the reverse correction method [17]. The initial map is generated via the separation and combination (SC) scheme: we construct four background prior maps from the boundary labels and multiply them together to form the initial map. We then use the reverse correction method to mark the foreground regions with 1 and the background regions with 2. Specifically, for each boundary, the mean value of the cluster that contains boundary background is denoted \( L_{label} \). Given the pre-defined threshold Th1 = 1, if Th1 is smaller than \( L_{label} \), we consider that the cluster contains foreground regions among its background regions, and we remove those regions to acquire the exact background labels. Figure 3 shows examples of background labels. Compared with the general background labels (Fig. 3(b)) and the background labels without reverse correction (Fig. 3(c)), our background labels (Fig. 3(d)) are more precise.

Fig. 3. Examples of background labels. From left to right: (a) Input image. (b) General background labels. (c) Background labels without reverse correction. (d) Our background labels.

Next, we calculate the background saliency maps by manifold ranking. Taking the top boundary as an example, the queries are the exact background labels and the remaining regions are ranked. The indication vector y is thus obtained, and all nodes are ranked by Eq. (1) to give \( f_{b} \), which represents the relevance of each superpixel to the exact background labels. The background saliency \( S_{b} \) based on the top labels is calculated as:

$$ S_{b} \left( i \right) = 1 - f_{b} \left( i \right) $$
(4)

where \( f_{b} \left( i \right) \) denotes the normalized ranking vector, whose values lie between 0 and 1.

We generate the other three saliency maps in the same way, using the queries selected for the other three boundaries. The background-based saliency \( S_{B} \) is then obtained as:

$$ S_{B} \left( i \right) = \prod\nolimits_{b = 1}^{k} {S_{b} \left( i \right)} $$
(5)

where k denotes the number of image boundaries.
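Eqs. (4) and (5) can be illustrated together. The sketch assumes min-max scaling as the normalization of each ranking vector (the text only states that the values lie between 0 and 1); the function name `background_saliency` and the four toy ranking vectors are hypothetical.

```python
import numpy as np

def background_saliency(rankings):
    """Combine per-boundary ranking vectors into S_B via Eqs. (4) and (5)."""
    s_B = np.ones(len(rankings[0]))
    for f_b in rankings:
        f_b = np.asarray(f_b, dtype=float)
        span = f_b.max() - f_b.min()
        # Min-max normalisation to [0, 1] (assumed); constant vectors map to 0.
        f_norm = (f_b - f_b.min()) / span if span > 0 else np.zeros_like(f_b)
        s_B *= 1.0 - f_norm  # Eq. (4), accumulated as the product in Eq. (5)
    return s_B

# Made-up rankings of three superpixels against the four boundaries.
rankings = [[0.9, 0.5, 0.1],
            [0.8, 0.4, 0.0],
            [1.0, 0.6, 0.2],
            [0.7, 0.3, 0.1]]
s_B = background_saliency(rankings)
```

Because the four maps are multiplied, a superpixel must be dissimilar to all four sets of boundary queries to keep a high background-based saliency.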

2.3 Foreground-Based Saliency Estimation

Through the above steps, most salient regions are highlighted. However, some background regions may not be suppressed. An adaptive threshold can reduce this problem, but the picked foreground labels may then be mixed with some background labels, as shown in Fig. 4(b). To select the foreground labels more reasonably, we regard the extracted labels that belong to complete clusters as foreground labels.

Fig. 4. Example of foreground labels. From left to right: (a) Input image. (b) Adaptive threshold labels. (c) Adaptive threshold labels and the same cluster labels.

We binarize the background saliency map with an adaptive threshold Th2, defined as the mean saliency over the whole saliency map. If \( S_{B} \left( {\text{i}} \right) > {\text{Th}}2 \), superpixel i is treated as a foreground label. The K-means algorithm divides the image into three kinds of clusters: intra-object, intra-background and object-background. We therefore take the clusters that remain complete after adaptive thresholding as the final foreground labels, as shown in Fig. 4(c). We then calculate the saliency of each superpixel with these final queries using Eq. (1). The foreground-based saliency map \( S_{F} \) is defined as:

$$ S_{F} \left( i \right) = \bar{f}\left( i \right) $$
(6)

where \( \bar{f}\left( i \right) \) denotes the normalized ranking vector.
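The selection of foreground queries can be sketched as follows. The notion of a "complete" cluster is interpreted here as a cluster in which every superpixel exceeds the adaptive threshold Th2; this reading is an assumption, and the function name `foreground_queries` and the toy values are hypothetical. The returned vector plays the role of y in Eq. (1).

```python
import numpy as np

def foreground_queries(s_B, cluster_of):
    """Indication vector y marking superpixels in fully thresholded clusters."""
    s_B = np.asarray(s_B, dtype=float)
    cluster_of = np.asarray(cluster_of)
    passed = s_B > s_B.mean()  # adaptive threshold Th2 = mean saliency
    y = np.zeros(len(s_B))
    for c in np.unique(cluster_of):
        members = cluster_of == c
        if passed[members].all():  # "complete" cluster (assumed reading)
            y[members] = 1.0
    return y

# Six superpixels in three clusters with made-up background-based saliency.
s_B = [0.9, 0.8, 0.7, 0.1, 0.2, 0.1]
cluster_of = [0, 0, 1, 1, 2, 2]
y_fg = foreground_queries(s_B, cluster_of)
```

In this example superpixel 2 passes the threshold but is not used as a query, because its cluster also contains superpixel 3, which falls below the threshold.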

With the above method, the final saliency map is greatly improved. As shown in Fig. 5, our method highlights the foreground evenly and suppresses the background effectively.

Fig. 5. A saliency example produced by our method. (a) Input image. (b) Ground truth (GT). (c) Saliency map based on the half-two layers graph. (d) Saliency map based on background labels. (e) Saliency map based on foreground labels.

3 Experimental Results

3.1 Experimental Setup

We test the proposed method on four datasets. The first is the ASD dataset [10], which contains 1000 images. The second is the SOD dataset [20], which contains 300 images with multiple objects. The third is the CSSD dataset [21], which contains diversified patterns in both the foreground and background. The last is the ECSSD dataset [21], an extension of CSSD that covers more natural scenes.

Four parameters need to be set in the experiments. In all experiments, we empirically set the number of superpixel nodes to N = 200. The parameter σ controls the fall-off rate of the exponential function in the edge weight, and in the manifold ranking algorithm α balances the smoothness and fitting constraints. We empirically set σ = 0.1 and α = 0.99. The parameter K is the number of clusters in K-means; we set K = 70 based on experiments. As shown in Fig. 6, we varied K from 30 to 90 in steps of 10 on the ASD dataset to determine an appropriate value.

Fig. 6. Influence of K on the image.

To evaluate the performance of different methods, we use the average precision-recall curve and the F-measure as evaluation criteria. We vary the threshold from 0 to 255, binarize each saliency map at each threshold, and compute the precision and recall by comparing the binary mask with the ground truth. The resulting sequence of precision-recall pairs is used to plot the precision-recall curve. The F-measure is calculated as:

$$ F_{\beta } = \frac{{\left( {1 + \beta^{2} } \right)Precision \times Recall}}{{\beta^{2} Precision + Recall}} $$
(7)

Following [4], we set \( \beta^{2} = 0.3 \).
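The evaluation at a single threshold can be sketched as below. The function name `precision_recall_f` is hypothetical, the binarization uses "greater than or equal to" as an assumed convention, and the 2 × 2 saliency map is a made-up example; the F-measure follows Eq. (7) with \( \beta^{2} = 0.3 \).

```python
import numpy as np

def precision_recall_f(saliency, ground_truth, threshold, beta2=0.3):
    """Precision, recall and F-measure of a map binarised at `threshold`."""
    mask = np.asarray(saliency) >= threshold
    gt = np.asarray(ground_truth, dtype=bool)
    tp = np.logical_and(mask, gt).sum()  # correctly detected salient pixels
    precision = tp / mask.sum() if mask.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta2 * precision + recall
    f = (1 + beta2) * precision * recall / denom if denom else 0.0
    return precision, recall, f

# Made-up 2 x 2 saliency map (0-255) and binary ground truth.
saliency = np.array([[200, 180], [90, 10]])
gt = np.array([[1, 1], [0, 0]])
p100, r100, f100 = precision_recall_f(saliency, gt, threshold=100)
p50, r50, f50 = precision_recall_f(saliency, gt, threshold=50)
```

Calling the function for every threshold from 0 to 255 yields the sequence of precision-recall pairs for one image; averaging over a dataset gives the curve.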

3.2 Performance Comparison

We compare our method with eight state-of-the-art algorithms, namely HC [3], MR [4], LC [8], DCLC [12], LPS [14], TC [15], DSG [16], and RCRR [17]. As shown in Fig. 7, our method achieves better subjective performance: it uniformly highlights the salient foreground object and suppresses the background, even for complex natural images.

Fig. 7. Saliency detection results of different methods. The proposed algorithm consistently highlights the foreground and suppresses the background.

We calculate the P-R curve and F-measure on the four databases. The F-measure results are listed in Table 1. The P-R curves are shown in Fig. 8, and the precision, recall and F-measure values are shown in Fig. 9. Compared with the other representative methods, our method achieves a better F-measure on the CSSD, ECSSD and SOD databases. In terms of the P-R curves, our algorithm also performs well and is competitive with DCLC, MR, and RCRR. Although our P-R curve does not surpass those of the other algorithms by a large margin, our method produces subjectively better saliency maps.

Table 1. F-measure results on ASD, CSSD, ECSSD and SOD databases.

Fig. 8. Average precision-recall curves of the proposed method compared with 8 state-of-the-art methods. (a) The ASD database. (b) The CSSD database. (c) The ECSSD database. (d) The SOD database.

Fig. 9. F-measure of the proposed method compared with 8 state-of-the-art methods. (a) The ASD database. (b) The CSSD database. (c) The ECSSD database. (d) The SOD database.

3.3 Running Time

The running time is tested on a 64-bit PC with an Intel Core i5-3337U CPU @ 1.80 GHz and 4 GB RAM. The average running time is calculated on the ASD database. We compare our method with five recent methods, and the results are shown in Table 2. Our method is slightly slower than MR and DSG, but faster than LPS, LC and RCRR. Considering the overall evaluation, our method achieves a better trade-off between performance and complexity.

Table 2. Running time test results (seconds per image).

4 Conclusion

We propose a bottom-up method that extracts salient regions by calculating relevance using manifold ranking with refined background and foreground labels. The proposed half-two layers graph model alleviates limitations of prior graph models. In addition, we select more precise labels by clustering with the K-means algorithm. The refined background and foreground labels help to improve the performance of manifold ranking. Comparison with state-of-the-art saliency algorithms on four databases shows that our method achieves competitive P-R curves and a better F-measure on the CSSD, ECSSD and SOD databases, and that it can suppress background regions and highlight foreground regions accurately.