SaliencyRank: Two-stage manifold ranking for salient object detection

Salient object detection remains one of the most important and active research topics in computer vision, with wide-ranging applications in object recognition, scene understanding, image retrieval, context-aware image editing, image compression, etc. Most existing methods directly determine salient objects by exploring various salient object features. Here, we propose a novel graph based ranking method to detect and segment the most salient object in a scene according to its relationship to image border (background) regions, i.e., the background feature. Firstly, we use regions/super-pixels as graph nodes, which are fully connected to enable both long range and short range relations to be modeled. The relationship of each region to the image border (background) is evaluated in two stages: (i) ranking with hard background queries, and (ii) ranking with soft foreground queries. We experimentally show that this two-stage ranking based salient object detection method is complementary to traditional methods, and that the integrated results outperform both. Our method exploits intrinsic image structure to achieve high quality salient object determination using a quadratic optimization framework, with a closed form solution which can be easily computed. Extensive evaluation and comparison using three challenging saliency datasets demonstrate that our method consistently outperforms 10 state-of-the-art models by a large margin.


Introduction
Saliency detection has been an important problem in computer vision for more than two decades. Its goal is to locate the most salient or interesting region in an image that captures the viewers' visual attention [1,2]. Accurate and reliable saliency detection has been successfully applied in numerous computer vision tasks such as image compression [3], scene segmentation [4], classification [5], content aware image resizing [6,7], photo collage [8], webpage design [9], and visual tracking [10].
State-of-the-art saliency methods can be categorized as either bottom-up (data-driven) or top-down (task-driven), all of which are built upon low- or high-level visual features of images. Numerous novel techniques have been utilized in existing algorithms, such as low rank matrix recovery [11], manifold ranking [12], Bayesian frameworks [13], etc. However, despite a large number of reported models, it is still difficult to locate the most salient region in, and remove non-salient regions from, challenging images such as the one in Fig. 1.
In this paper, we present a graph based manifold ranking method for salient object detection which works by analyzing the properties of the intrinsic image structure. Firstly, we build a fully connected graph using super-pixels as graph nodes, in which color features, texture features, and spatial distances are modeled. Secondly, by exploiting a two-stage ranking strategy using background and foreground queries in turn, we effectively determine the relationship of each region to the background (i.e., the image border). Our proposed manifold ranking approach focuses on correlation with the background, while traditional methods pay more attention to the salient object, and these are complementary concerns. Thus in the last step, a Bayesian formula is used to infer the output by integrating traditional models with the proposed manifold ranking method.
To illustrate the effectiveness of our method, we present results on three challenging public datasets: (i) MSRA10K [14][15][16], (ii) ECSSD [17], and (iii) DUT-OMRON [12]. Extensive experiments demonstrate that our approach produces high-accuracy results, and that it outperforms state-of-the-art salient object detection approaches in terms of three evaluation metrics.

Related work
In this section, we briefly review related work on saliency detection. Readers can refer to Refs. [1,18] for an exhaustive review and comparison of state-of-the-art saliency models.
Many models have been proposed for saliency detection in recent years. The pioneering work by Itti et al. [19] constructs a bottom-up saliency model that estimates center-surround contrast based on multi-scale image features. This model inspired researchers to build more predictive models that could be tested against experimental data. Harel et al. [20] define a graph based visual saliency (GBVS) model based on random walks for fixation prediction. In Ref. [21], Hou and Zhang define image saliency by integration of the spectral residual in the frequency domain and a saliency map in the spatial domain. Similarly, Achanta et al. [14] introduce a frequency-tuned method that defines pixel saliency based on color differences. Liu et al. [16] construct a saliency model by using a conditional random field to combine a set of novel features. Zhang et al. [22] propose a saliency algorithm from the perspective of information theory. Rahtu et al. [23] measure the center-surround contrast of a sliding window within a Bayesian framework using the entire image to compute saliency. Goferman et al. [24] give a context-aware saliency algorithm to detect the most salient part of a scene based on four principles of human visual attention. Cheng et al. [15] consider histogram-based contrast and spatial relations to generate saliency maps. Shen and Wu [11] integrate low-level and high-level features using a low rank matrix recovery approach for saliency detection. Jiang et al. [25] further exploit the relationship between Markov random walks and saliency detection, and introduce an effective saliency algorithm using temporal properties in an absorbing Markov chain. Jiang et al. [26] integrate degree of focus, object-likeness, and uniqueness for saliency detection. Yan et al. [17] present a hierarchical framework by combining multilayer cues in saliency detection. In Ref. [27], a discriminative regional feature integration approach is introduced to estimate image saliency by regarding the problem as a regression task. Li et al. [13] formulate a visual saliency detection model via dense and sparse reconstruction error.
Recently, numerous novel techniques have been utilized in salient object detection models, e.g., hypergraph models [28], Boolean maps [29], high-dimensional color transforms [30], submodular approaches [31], PCA [32], partial differential equations (PDEs) [33], light fields [34], context modeling [35], co-saliency [36,37], etc. Three methods [12,38,39] have exploited the background as a prior to guide saliency detection and achieve favorable results. However, most of these methods focus on the salient object itself, and do not fully utilize important cues from image border/background regions. In this paper, we propose a novel salient object detection model based on a graph based ranking method to explore the underlying image structure, and use a Bayesian framework to integrate it with traditional models, with good results.

Preliminaries
In this section, in the context of image region labeling, we briefly describe the manifold ranking framework on which our method is built. In Ref. [40], Zhou et al. propose a ranking framework which exploits the intrinsic manifold structure of data for graph labeling, which is further extended by Yang et al. [12] for salient object detection.

Graph based manifold ranking
For an input image containing n regions or superpixels [41], we denote the feature vector for a region i as v_i ∈ R^m. Given the region feature vectors V = {v_1, · · · , v_n}, some regions are selected as queries and the rest are ranked according to their relevance to them. Let f : V → R denote a ranking function which assigns a ranking value f_i to each region i; we write the ranking as a vector f = [f_1, · · · , f_n]^T. Let L = [l_1, · · · , l_n]^T denote a label vector indicating the queries. We define a graph G = (V, E) over image regions, where nodes V are region features and the edges E are weighted by an affinity matrix W = [w_ij]_{n×n}. The degree matrix is defined as D = diag{d_11, · · · , d_nn}, where d_ii = Σ_j w_ij. The optimal ranking of queries is given by the following optimization:

f* = argmin_f (1/2) [ Σ_{i,j=1}^n w_ij ( f_i/√(d_ii) − f_j/√(d_jj) )² + μ Σ_{i=1}^n ( f_i − l_i )² ]    (1)

where the parameter µ controls the balance between the smoothness constraint (the first term) and the fitting constraint (the second term). Following the derivation in Refs. [12,40], the resulting ranking function is given by the closed form

f* = A L,  where A = (I − αS)^{−1},  S = D^{−1/2} W D^{−1/2},  α = 1/(1 + µ)    (2)
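As a concrete illustration, the closed-form ranking of Eq. (2) can be sketched in a few lines of numpy. The toy graph and the choice α = 0.99 below are illustrative assumptions for the sketch, not values taken from the paper:

```python
import numpy as np

def manifold_rank(W, L, alpha=0.99):
    """Closed-form manifold ranking f* = (I - alpha*S)^-1 L, with
    S = D^-1/2 W D^-1/2 the normalized affinity and alpha = 1/(1+mu)."""
    d = W.sum(axis=1)                                   # degrees d_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt                     # normalized affinity
    A = np.linalg.inv(np.eye(len(W)) - alpha * S)       # optimal affinity matrix A
    return A @ L                                        # ranking values f*

# toy graph: regions 0,1 strongly connected; regions 2,3 strongly connected
W = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.9],
              [0.1, 0.1, 0.9, 0.0]])
L = np.array([1.0, 0.0, 0.0, 0.0])                      # region 0 is the query
f = manifold_rank(W, L)
```

Regions strongly connected to the query receive higher ranking values than weakly connected ones, which is the behavior the smoothness term in Eq. (1) enforces.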

Methodology
Our saliency detection framework is based on a two-stage graph based manifold ranking process followed by a Bayesian integration process (see Fig.  2).

Feature extraction & graph construction
We first use the simple linear iterative clustering (SLIC) super-pixel segmentation algorithm [41] to over-segment the input image, generating n regions/super-pixels. To provide a rich feature description, we use a feature vector v_i ∈ R^m for each region, composed of: the average region color (L_i, a_i, b_i) in the CIE Lab color space; a spatial feature representing the contextual information (i.e., the center prior [42]), based on the distance of the region from the image center (x_0, y_0); and ε_i, the edge density of the region (we use the Canny operator [43] in the implementation). Note that in Ref. [12], only CIE Lab color features are used to describe each region, which deals less robustly with textured regions and ignores important contextual information.
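For illustration, per-region features of this kind can be assembled from a super-pixel label map as in the numpy sketch below. The exact feature composition and normalization are assumptions made for the sketch; in a real pipeline the label map would come from SLIC and the edge map from a Canny detector:

```python
import numpy as np

def region_features(lab, labels, edges):
    """Per-region feature sketch (assumed composition): mean Lab color,
    normalized centroid distance to the image center (center prior), and
    edge density from a binary edge map (e.g., Canny output)."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x0, y0 = w / 2.0, h / 2.0                          # image center (x_0, y_0)
    feats = []
    for r in np.unique(labels):
        m = labels == r
        mean_lab = lab[m].mean(axis=0)                 # average CIE Lab color
        center_d = np.hypot(xs[m].mean() - x0,
                            ys[m].mean() - y0) / np.hypot(x0, y0)
        feats.append(np.concatenate([mean_lab, [center_d, edges[m].mean()]]))
    return np.array(feats)

# toy 4x4 image: left half is region 0, right half region 1
labels = np.zeros((4, 4), dtype=int); labels[:, 2:] = 1
lab = np.zeros((4, 4, 3)); lab[:, :2, :] = 10.0; lab[:, 2:, :] = 50.0
edges = np.zeros((4, 4)); edges[0, 3] = 1.0            # one edge pixel in region 1
feats = region_features(lab, labels, edges)
```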
We construct a single layer fully connected graph G = (V, E) with nodes V = {v_1, · · · , v_n} and edges E which are weighted by an affinity matrix W = [w_ij]_{n×n} (see also Section 3.1). We define the affinity value between two image regions i and j as

w_ij = exp( −‖v_i − v_j‖² / σ² )

where σ controls the strength of the weight. Notice that this graph is fully connected, which allows long range connections [44] between image regions, and thus enables us to capture important global cues [15] for salient object detection.
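A minimal numpy sketch of this fully connected affinity construction follows; the feature vectors and the bandwidth σ = 0.5 are illustrative assumptions:

```python
import numpy as np

def build_affinity(features, sigma=0.5):
    """Fully connected graph: Gaussian affinity on pairwise feature
    distances, w_ij = exp(-||v_i - v_j||^2 / sigma^2)."""
    diff = features[:, None, :] - features[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)          # squared Euclidean distances
    W = np.exp(-dist2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)                  # no self-loops
    return W

# three toy region feature vectors: 0 and 1 similar, 2 dissimilar
V = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
W = build_affinity(V)
```

Because every pair of regions is connected, similar regions attract each other even when they are spatially far apart, which is how long range cues enter the ranking.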

Ranking with hard background queries
It is commonly observed that objects of interest in a photograph are usually placed such that they are rarely connected to the image boundaries [12,15,27,42]. We therefore use image boundary regions as query samples to rank the relevance of all other regions (see also Fig. 3, stage I). The label vector L is initialized so that l_i = 1 if region i is a query sample and l_i = 0 otherwise. Note that we automatically determine the initial boundary regions in the same way as in Ref. [12]; small errors here have little influence on the final results. The relevance of each region to the background can be calculated using Eq. (2), and the corresponding saliency value using the hard background query is

S_bq = 1 − (f*)^*    (7)

where S_bq is a vector in which element S_i represents the saliency of region i according to the background query, and (·)^* represents min-max normalization of the values into the range [0, 1]. Note that the fully connected graph topology and rich feature representation enable us to robustly rank image regions using a single query, instead of requiring 4 different boundary queries and their integration as in Ref. [12].
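Constructing the hard background query vector from a super-pixel label map can be sketched as below; the toy label map is an assumption for illustration, and a real implementation would use the SLIC output:

```python
import numpy as np

def boundary_queries(labels):
    """Build the hard background query vector L: l_i = 1 for every
    super-pixel that touches the image border, l_i = 0 otherwise."""
    border = np.concatenate([labels[0, :], labels[-1, :],
                             labels[:, 0], labels[:, -1]])
    n = labels.max() + 1                      # number of super-pixels
    L = np.zeros(n)
    L[np.unique(border)] = 1.0                # boundary regions are queries
    return L

# toy 4x4 label map: region 4 is interior, regions 0-3 touch the border
labels = np.array([[0, 0, 1, 1],
                   [0, 4, 4, 1],
                   [2, 4, 4, 3],
                   [2, 2, 3, 3]])
L = boundary_queries(labels)
```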

Ranking with soft foreground queries
The region saliency vector S_bq can be used as a new query to construct a saliency map that better explores the underlying intrinsic structure of the image data. Equation (2) essentially multiplies the optimal affinity matrix A by the query label vector L, which does not necessarily need to be binary. Thus we can directly feed S_bq into Eq. (2) as a soft foreground query, without making the hard decision of binarization [12], for which threshold selection could be difficult and potentially introduce artifacts. By substituting Eq. (7) into Eq. (2), we get the following soft foreground query saliency values:

S_fq = A S_bq    (8)

Figure 3 (stage II) shows an example of a soft foreground query, which successfully suppresses background noise and highlights salient object regions. Notice that Eq. (8) gives a closed form solution for our two-stage manifold ranking based salient object detection method, in which the matrix A ∈ R^{n×n} is small. This means that our algorithm can efficiently determine the salient object region (see Fig. 4).
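The complete two-stage ranking can be sketched end-to-end in numpy. The graph, the value α = 0.99, and the min-max normalization applied after each stage are assumptions made for this sketch:

```python
import numpy as np

def two_stage_saliency(W, boundary_idx, alpha=0.99):
    """Two-stage ranking sketch: stage I ranks against hard boundary
    queries and inverts the result; stage II feeds the soft saliency
    S_bq straight back into f* = A L as a foreground query, with no
    binarization step."""
    n = len(W)
    Dih = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = np.linalg.inv(np.eye(n) - alpha * Dih @ W @ Dih)   # optimal affinity

    L = np.zeros(n); L[boundary_idx] = 1.0    # stage I: hard background query
    f = A @ L                                 # relevance to the boundary
    f = (f - f.min()) / (f.max() - f.min() + 1e-12)
    S_bq = 1.0 - f                            # S_bq = 1 - (f*)^*

    g = A @ S_bq                              # stage II: soft foreground query
    return (g - g.min()) / (g.max() - g.min() + 1e-12)

# toy graph: regions 0,1 form the object; regions 2,3 lie on the boundary
W = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.9],
              [0.1, 0.1, 0.9, 0.0]])
S_fq = two_stage_saliency(W, boundary_idx=[2, 3])
```

On this toy graph, the interior regions end up with higher saliency than the boundary regions, mirroring the behavior shown in Fig. 3.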
Difference from GMR [12]. Our method differs from GMR [12] in several ways. Firstly, to capture both long range and short range connections, we use a fully connected graph topology instead of only considering local neighborhoods as in Ref. [12]. This design choice helps our method to better capture the underlying image structure for improved salient object detection. Secondly, a rich feature vector is used instead of simple Lab color. Thirdly, we use a single boundary query in the first stage and another foreground query in the second stage, avoiding querying each edge separately and the possible artifacts introduced by hard thresholding. Finally, we quantitatively demonstrate that modeling background information is complementary to traditional methods and significantly improves upon the prior state-of-the-art performance. In Fig. 6 and Section 5.1, we quantitatively demonstrate that both the fully connected graph topology and rich features significantly contribute to the high performance; the former contributes more.

Bayesian integration
Most existing salient object detection methods place more emphasis on salient object features, e.g., Refs. [15,17,25,27,30,32]. In contrast, our two-stage manifold ranking salient object detection method analyzes the input image according to background features (i.e., relationship to queries of border regions). Such complementary relations suggest that our two-stage manifold ranking results may potentially be integrated with traditional salient object detection results to obtain even better salient object predictions and segmentation accuracy. Following Refs. [13,45], we use a Bayesian method to integrate our two-stage manifold ranking results with traditional salient object detection results (e.g., DRFI [27] and RC [15]).
In Bayesian inference, both the prior and the likelihood are needed to compute the posterior probability, which we use as the final integration result. Firstly, we use the saliency map generated by a traditional method as the prior, denoted p(F1), while the two-stage manifold ranking result is used to generate a foreground mask from which the likelihood is estimated. In the following, F1 and F0 denote the foreground and background, respectively. We represent the input image by a color histogram in which each pixel z falls into a certain feature Q(z) in the color channels of the CIE Lab color space; each pixel z is represented by a vector u(z) = [l, a, b]^T in this color space. The likelihoods can then be computed as

p(Q(z)|F1) = Π_{u∈{l,a,b}} N_1(z_u) / N_{F1},    p(Q(z)|F0) = Π_{u∈{l,a,b}} N_0(z_u) / N_{F0}

where N_{F1} and N_{F0} denote the total numbers of pixels in the foreground F1 and background F0, respectively, and N_1(z_u) and N_0(z_u) are the numbers of pixels that fall into the bin containing feature Q(z) in F1 and F0, respectively. Bayes' formula then gives the posterior:

p(F1|Q(z)) = p(F1) p(Q(z)|F1) / [ p(F1) p(Q(z)|F1) + (1 − p(F1)) p(Q(z)|F0) ]

We denote the integration map that uses a traditional model [15,27] as the prior by p(F1_tr|Q(z)). Another fusion map, p(F1_fq|Q(z)), is constructed by using the proposed method as the prior while the traditional model is used to compute the likelihood. The final saliency map is formed in a straightforward manner as

p(F1_ours|Q(z)) = p(F1_tr|Q(z)) + p(F1_fq|Q(z))    (12)

We have conducted tests using RC [15] and DRFI [27] as the traditional method, and denote the corresponding integrated results OursR and OursD, respectively. Figure 5 provides a visual comparison of the different components of our method. In these examples, the final integration result successfully highlights the salient object region and suppresses background elements.
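The likelihood-and-posterior computation can be sketched as below. The per-channel binning scheme, the mean-thresholded foreground mask, and the toy image are all assumptions made for the sketch, not details specified by the paper:

```python
import numpy as np

def bayes_integrate(prior, lab, n_bins=8):
    """Bayesian integration sketch: `prior` in [0,1] plays p(F1); its
    mean-thresholded mask defines foreground F1 and background F0, whose
    per-channel color histograms give p(Q(z)|F1) and p(Q(z)|F0)."""
    mask = prior >= prior.mean()              # foreground mask from the prior
    like_f = np.ones(prior.shape)
    like_b = np.ones(prior.shape)
    for c in range(3):                        # independent l, a, b channels
        chan = lab[..., c]
        bins = np.linspace(chan.min(), chan.max() + 1e-6, n_bins + 1)
        idx = np.digitize(chan, bins) - 1     # bin index of each pixel
        hist_f = np.bincount(idx[mask], minlength=n_bins)
        hist_b = np.bincount(idx[~mask], minlength=n_bins)
        like_f = like_f * hist_f[idx] / max(mask.sum(), 1)      # p(Q(z)|F1)
        like_b = like_b * hist_b[idx] / max((~mask).sum(), 1)   # p(Q(z)|F0)
    # posterior p(F1|Q(z)) via Bayes' formula
    return prior * like_f / (prior * like_f + (1 - prior) * like_b + 1e-12)

# toy 4x4 image: bright left half with high prior saliency
lab = np.zeros((4, 4, 3)); lab[:, :2, :] = 100.0
prior = np.full((4, 4), 0.1); prior[:, :2] = 0.9
post = bayes_integrate(prior, lab)
```

Where prior and color evidence agree, the posterior is pushed toward 1 for the object and toward 0 for the background, which is the sharpening effect the integration exploits.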
Our quantitative experimental results (see Section 5.1) on three well-known benchmarks are consistently in agreement with the above observations, leading our method to significantly outperform the state-of-the-art methods.

Experimental evaluation
We have extensively evaluated two variants of our method (OursD and OursR) on three challenging benchmarks (MSRA10K [14][15][16], ECSSD [17], and DUT-OMRON [12]), and here compare the results against 10 state-of-the-art alternative methods (RC [15], PCA [32], GMR [12], HS [17], BMS [29], MC [25], DSR [13], DRFI [27], HDCT [30], and WCTR [39]) using three popular quantitative evaluation metrics: precision-recall curves, adaptive thresholding, and mean absolute error. For the other approaches we used publicly available source code from the authors. When tested on the ECSSD dataset (with a typical image resolution of 400 × 300), the average running time of our method is 7.79 s on a laptop with an Intel i3 2.4 GHz CPU and 8 GB RAM, using our unoptimized Matlab code. Most of the running time is taken by the traditional salient object detection component.

Effectiveness of the design and choices
We first consider the effectiveness of the design choices made in the proposed method, using the ECSSD dataset, and show the results in Fig. 6. This figure demonstrates that the two-stage manifold ranking based salient object detection (S_bq, S_fq) and the existing DRFI [27] approach each achieve good performance when applied alone. After applying the Bayesian integration model, the performance of the proposed method is clearly and significantly enhanced, exceeding that of its individual components. We use the best configuration (OursD) for performance evaluation in the following experiments.

Precision and recall
Following Refs. [14,15,46], we quantitatively evaluate the performance of our method in terms of precision and recall rates. Precision is defined as the percentage of pixels labeled salient that are correctly assigned, while recall corresponds to the percentage of ground truth salient pixels that are detected. In alignment with previous works, we binarize saliency maps at every threshold in the range [0, 255]. The resulting precision-recall curves in Fig. 7(a) clearly show that our algorithm consistently outperforms the other methods at almost every threshold, on every tested dataset.
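This evaluation protocol can be sketched directly; the toy saliency map and ground truth below are illustrative assumptions:

```python
import numpy as np

def pr_curve(saliency, gt):
    """Precision-recall sketch: binarize an 8-bit saliency map at every
    threshold in [0, 255] and score against the binary ground truth."""
    precisions, recalls = [], []
    for t in range(256):
        pred = saliency >= t                            # binarized map
        tp = np.logical_and(pred, gt).sum()             # true positives
        precisions.append(tp / max(pred.sum(), 1))      # fraction of pred correct
        recalls.append(tp / max(gt.sum(), 1))           # fraction of gt recovered
    return np.array(precisions), np.array(recalls)

# toy example: a perfect saliency map for a half-image object
gt = np.zeros((4, 4), dtype=bool); gt[:, :2] = True
sal = np.where(gt, 255, 0)
precisions, recalls = pr_curve(sal, gt)
```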
We also tested image-dependent adaptive thresholding as suggested by Ref. [14], where the binarization threshold is defined as twice the average saliency value over the image. The F-measure, the weighted harmonic mean of precision and recall, is another popular evaluation measure, calculated as follows:

F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

where β² is set to 0.3 to give more weight to precision than recall, as suggested in earlier works [14,15,46]. Figure 7 shows the performance of 12 saliency methods on all tested datasets. The experimental results show that our approach consistently achieves higher precision, recall, and F-measure than existing methods. Within these evaluations, the best method among the baselines is DRFI [27], which is complementary to our two-stage manifold ranking based results; integrating them outperforms either alone by a large margin (see also Section 5.1). In most cases, our approach highlights salient regions effectively and suppresses background elements robustly, thus producing more accurate results. A visual comparison of methods is provided in Fig. 8.
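The adaptive-threshold F-measure can be sketched as follows; the toy data is an illustrative assumption:

```python
import numpy as np

def adaptive_f_measure(saliency, gt, beta2=0.3):
    """Adaptive-threshold F-measure sketch: binarize at twice the mean
    saliency, then compute F = (1 + b^2) P R / (b^2 P + R) with b^2 = 0.3."""
    pred = saliency >= 2.0 * saliency.mean()  # image-dependent threshold
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

# toy example: a perfect saliency map scores F = 1
gt = np.zeros((4, 4), dtype=bool); gt[:, :2] = True
F = adaptive_f_measure(gt.astype(float), gt)
```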
We also evaluate the mean absolute error (MAE) between the continuous saliency map S and the binary ground truth T. The MAE is computed as

MAE = (1/(W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} | S(x, y) − T(x, y) |    (14)

where W and H denote the width and the height of the saliency map S and the ground truth T. As shown in Fig. 9, our method successfully reduces the MAE compared to state-of-the-art methods, and generates favorable results.
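Eq. (14) is a per-pixel average and translates directly to numpy; the toy maps below are illustrative assumptions:

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error between a (normalized) saliency map S and the
    binary ground truth T, averaged over all W x H pixels (Eq. 14)."""
    return np.abs(saliency.astype(float) - gt.astype(float)).mean()

# toy example: half the pixels disagree by 1, so MAE = 0.5
S = np.array([[1.0, 0.0], [0.0, 1.0]])
T = np.array([[1.0, 1.0], [0.0, 0.0]])
err = mae(S, T)
```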

Conclusions and future work
In this paper, we have presented an effective salient object detection approach based on the manifold ranking model. The proposed model exploits intrinsic structural details by estimating the relevance of each region to the salient object and to the background. One key aspect of our model which distinguishes it from the current literature is its greater emphasis on background features, rather than salient object features alone. Furthermore, thanks to the complementary effects of the proposed model and traditional models, we can apply a Bayesian formulation as an output interface for cue integration, leading to saliency detection performance which outperforms both components. We have evaluated the proposed method on three challenging salient object datasets and compared its performance to existing state-of-the-art models. Extensive experimental results show that our model achieves better results and can effectively handle challenging scenarios.
Our future work will focus on additional features to overcome a limitation of our model: accuracy degrades for images whose foreground objects lie against a background of similar texture. Another direction will be to detect and segment composite objects, as object components sometimes have quite different features (e.g., the head with respect to the rest of the body). In this regard, it would be interesting to know how humans choose the most salient object when dealing with composite objects. This may help us discover semantics that should be included in salient object detection models to reduce false negatives.
Huchuan Lu received his M.S. degree in signal and information processing and Ph.D. degree in system engineering from Dalian University of Technology (DUT), China, in 1998 and 2008, respectively. He joined DUT in 1998, as a faculty member, where he is currently a full professor with the School of Information and Communication Engineering. His research interests include visual tracking, saliency detection, and segmentation. He is a member of the Association for Computing Machinery, and an Associate Editor of IEEE Transactions on Cybernetics.
Lian-Fa Bai is a professor of Jiangsu Key Laboratory of Spectral Imaging and Intelligence Sense, Nanjing University of Science and Technology. He got his Ph.D. degree in Nanjing University of Science and Technology. His current research interests include computer vision and image detection.