Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground

Fan, Deng-Ping; Cheng, Ming-Ming; Liu, Jiang-Jiang; Gao, Shang-Hua; Hou, Qibin; Borji, Ali

doi:10.1007/978-3-030-01267-0_12

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11219)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We provide a comprehensive evaluation of salient object detection (SOD) models. Our analysis identifies a serious design bias of existing SOD datasets, which assume that each image contains at least one clearly outstanding salient object in low clutter. This design bias has led to saturated high performance for state-of-the-art SOD models when evaluated on existing datasets. The models, however, are still far from satisfactory when applied to real-world daily scenes. Based on our analyses, we first identify 7 crucial aspects that a comprehensive and balanced dataset should fulfill. Then, we propose a new high quality dataset and update the previous saliency benchmark. Specifically, our SOC (Salient Objects in Clutter) dataset includes images with salient and non-salient objects from daily object categories. Beyond object category annotations, each salient image is accompanied by attributes that reflect common challenges in real-world scenes. Finally, we report attribute-based performance assessment on our dataset.

1 Introduction

This paper considers the task of salient object detection (SOD). Visual saliency mimics the ability of the human visual system to select a certain subset of the visual scene. SOD aims to detect the most attention-grabbing objects in a scene and then extract pixel-accurate silhouettes of the objects. The merit of SOD lies in its applications in many other computer vision tasks, including visual tracking [4], image retrieval [14, 16], computer graphics [9], content aware image resizing [45], and weakly supervised semantic segmentation [18, 39, 40].

Our work is motivated by two observations. First, existing SOD datasets [2, 5, 10, 11, 23, 26, 29, 32, 43, 44] are flawed either in the data collection procedure or in the quality of the data. Specifically, most datasets assume that an image contains at least one salient object, and thus discard images that do not contain salient objects. We call this data selection bias. Moreover, existing datasets mostly contain images with a single object or several objects (often a person) in low clutter. These datasets do not adequately reflect the complexity of images in the real world, where scenes usually contain multiple objects amidst lots of clutter. As a result, all top performing models trained on the existing datasets achieve nearly saturated performance on them (e.g., > 0.9 $F\text {-}measure$ over most current datasets) but unsatisfactory performance on realistic scenes (e.g., < 0.45 $F\text {-}measure$ in Table 3). Because current models may be biased towards ideal conditions, their effectiveness may be impaired once they are applied to real-world scenes. To solve this problem, it is important to introduce a dataset that comes closer to realistic conditions.

Second, only the overall performance of the models can be analyzed over existing datasets. None of the datasets contains various attributes that reflect challenges in real-world scenes. Having attributes helps (1) gain a deeper insight into the SOD problem, (2) investigate the pros and cons of the SOD models, and (3) objectively assess the model performances over different perspectives, which might be diverse for different applications.

Considering the above two issues, we make two contributions. Our main contribution is the collection of a new high quality SOD dataset, named SOC (Salient Objects in Clutter). To date, SOC is the largest instance-level SOD dataset and contains 6,000 images from more than 80 common categories. It differs from existing datasets in three aspects: (1) salient objects have category annotations, which can be used for new research such as weakly supervised SOD tasks, (2) non-salient images are included, which makes this dataset closer to real-world scenes and more challenging than the existing ones, and (3) salient objects have attributes reflecting specific situations faced in the real world, such as motion blur, occlusion and cluttered background. As a consequence, our SOC dataset narrows the gap between existing datasets and real-world scenes and provides a more realistic benchmark (see Fig. 1).

In addition, we provide a comprehensive evaluation of several state-of-the-art convolutional neural networks (CNNs) based models [8, 15, 17, 23, 24, 28, 31, 36, 38, 48, 49, 50, 51]. To evaluate the models, we introduce three metrics that measure the region similarity of the detection, the pixel-wise accuracy of the segmentation, and the structure similarity of the result. Furthermore, we give an attribute-based performance evaluation. These attributes allow a deeper understanding of the models and point out promising directions for further research.

We believe that our dataset and benchmark can be very influential for future SOD research, in particular for application-oriented model development. The entire dataset and the analysis tools will be released freely to the public.

2 Related Works

In this section, we briefly discuss existing datasets designed for SOD tasks, especially in the aspects including annotation type, the number of salient objects per image, number of images, and image quality. We also review the CNNs based SOD models.

2.1 Datasets

Early datasets are either limited in the number of images or in their coarse annotation of salient objects. For example, the salient objects in datasets MSRA-A [29] and MSRA-B [29] are roughly annotated in the form of bounding boxes. ASD [1] and MSRA10K [11] mostly contain only one salient object in each image, while the SED2 [2] dataset contains two objects in a single image but contains only 100 images. To improve the quality of datasets, researchers in recent years started to collect datasets with multiple objects in relatively complex and cluttered backgrounds. These datasets include DUT-OMRON [44], ECSSD [43], Judd-A [5], and PASCAL-S [26]. These datasets have been improved in terms of annotation quality and the number of images, compared to their predecessors. Datasets HKU-IS [23], XPIE [41], and DUTS [37] resolved the shortcomings by collecting large amounts of pixel-wise labeled images (Fig. 2b) with more than one salient object in images. However, they ignored the non-salient objects and did not offer instance-level (Fig. 2c) salient objects annotation. Beyond these, researchers of [19] collected about 6k simple background images (most of them are pure texture images) to account for the non-salient scenes. This dataset is not sufficient to reflect real scenes as the real-world scenes are more complicated. The ILSO [22] dataset contains instance-level salient objects annotation but has boundaries roughly labeled as shown in Fig. 5a.

To sum up, existing datasets mostly focus on images with clear salient objects in simple backgrounds. Given these limitations, a more realistic dataset, which contains realistic scenes with non-salient objects, textures “in the wild”, and salient objects with attributes, is needed for future investigations in this field. Such a dataset can offer deep insights into the weaknesses and strengths of SOD models.

2.2 Models

We divide the state-of-the-art deep models for SOD based on the number of tasks.

Single-task models have the single goal of detecting the salient objects in images. In LEGS [36], local information and global contrast were separately captured by two different deep CNNs, and were then fused to generate a saliency map. In [51], Zhao et al. presented a multi-context deep learning framework (MC) for SOD. Li et al. [23] (MDF) proposed to use multi-scale features extracted from a deep CNN to derive a saliency map. Li et al. [24] presented a deep contrast network (DCL), which not only considered the pixel-wise information but also fused the segment-level guidance into the network. Lee et al. [15] (ELD) considered both high-level features extracted from CNNs and hand-crafted features. Liu et al. [28] (DHS) designed a two-stage network, in which a coarse downscaled prediction map was first produced and then refined and upsampled hierarchically and progressively by a second network. Long et al. [30] proposed a fully convolutional network (FCN) to make the dense pixel prediction problem feasible for end-to-end training. RFCN [38] used a recurrent FCN to incorporate the coarse predictions as saliency priors and refined the generated predictions in a stage-wise manner. The DISC [8] framework was proposed for fine-grained image saliency computing. Two stacked CNNs were utilized to obtain coarse-level and fine-grained saliency maps, respectively. IMC [48] integrated saliency cues at different levels through FCN. It could efficiently exploit both learned semantic cues and higher-order region statistics for edge-accurate SOD. Recently, a deep architecture [17] with short connections (DSS) was proposed. Hou et al. added connections from high-level features to low-level features based on the HED [42] architecture, achieving good performance. NLDF [31] integrated local and global features and added a boundary loss term into the standard cross entropy loss to train an end-to-end network. AMU [49] was a generic framework for aggregating multi-level convolutional features. It integrated coarse semantics and fine detailed feature maps into multiple resolutions. Then it adaptively learned to combine these feature maps at each resolution and predicted saliency maps with the combined features. UCF [50] was proposed to improve the robustness and accuracy of saliency detection. The authors introduced a reformulated dropout after specific convolutional layers to construct an uncertain ensemble of internal feature units. They also proposed an effective hybrid up-sampling method to reduce the checkerboard artifacts of deconvolution operators in the decoder network.

Table 1. CNNs based SOD models. We divided these models into single-task (S-T) and multi-task (M-T). Training Set: MB is the MSRA-B dataset [29]. MK is the MSRA-10K [11] dataset. ImageNet dataset refers to [34]. D is the DUT-OMRON [44] dataset. H is the HKU-IS [23] dataset. P is the PASCAL-S [26] dataset. P2010 is the PASCAL VOC 2010 semantic segmentation dataset [12]. Base Model: VGGNet, ResNet-101, AlexNet, GoogleNet are base models. FCN: whether model uses the fully convolutional network. Sp: whether model uses superpixels. Proposal: whether model uses the object proposal. Edge: whether model uses the edge or contour information

Multi-task models at present include three methods, DS, WSS, and MSR. The DS [25] model set up a multi-task learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation, which shared the information in FCN layers to generate effective features for object perception. Recently, Wang et al. [37] proposed a model named WSS which developed a weakly supervised learning method using image-level tags for saliency detection. First, they jointly trained Foreground Inference Net (FIN) and FCN for image categorization. Then, they used FIN fine-tuned with iterative CRF to enforce spatial label consistency to predict the saliency map. MSR [22] was designed for both salient region detection and salient object contour detection, integrated with multi-scale combinatorial grouping and a MAP-based [47] subset optimization framework. Using three refined VGG network streams with shared parameters and a learned attentional model for fusing results at different scales, the authors were able to achieve good results.

We benchmark a large set of the state-of-the-art CNNs based models (see Table 1) on our proposed dataset, highlighting the current issues and pointing out future research directions.

3 The Proposed Dataset

In this section, we present our new challenging SOC dataset designed to reflect the real-world scenes in detail. Sample images from SOC are shown in Fig. 1. Moreover, statistics regarding the categories and the attributes of SOC are shown in Figs. 4a and 6, respectively. Based on the strengths and weaknesses of the existing datasets, we identify seven crucial aspects that a comprehensive and balanced dataset should fulfill.

(1) Presence of Non-Salient Objects. Almost all of the existing SOD datasets make the assumption that an image contains at least one salient object and discard the images that do not contain salient objects. However, this assumption is an ideal setting which leads to data selection bias. In a realistic setting, images do not always contain salient objects. For example, some amorphous background images such as sky, grass and texture contain no salient objects at all [6]. The non-salient objects or background “stuff” may occupy the entire scene, and hence heavily constrain possible locations for a salient object. Xia et al. [41] proposed a state-of-the-art SOD model by judging what is or what is not a salient object, indicating that the non-salient object is crucial for reasoning about the salient object. This suggests that non-salient objects deserve as much attention as salient objects in SOD. Incorporating a number of images containing non-salient objects makes the dataset closer to real-world scenes, and also more challenging. Thus, we define the “non-salient objects” as images without salient objects or images with “stuff” categories. As suggested in [6, 41], the “stuff” categories include (a) densely distributed similar objects, (b) fuzzy shape, and (c) region without semantics, which are illustrated in Fig. 3a–c, respectively.

Based on the characteristics of non-salient objects, we collected 783 texture images from the DTD [21] dataset. To enrich the diversity, 2217 images including aurora, sky, crowds, store and many other kinds of realistic scenes were gathered from the Internet and other datasets [26, 27, 32, 35]. We believe that incorporating enough non-salient objects would open up a promising direction for future works.

(2) Number and Category of Images. A considerably large number of images is essential to capture the diversity and abundance of real-world scenes. Moreover, with large amounts of data, SOD models can avoid over-fitting and enhance generalization. To this end, we gathered 6,000 images from more than 80 categories, containing 3,000 images with salient objects and 3,000 images without salient objects. We divide our dataset into a training set, a validation set and a test set in the ratio of 6:2:2. To ensure fairness, the test set is not published; instead, on-line testing is provided on our website (Footnote 1). Figure 4a shows the number of salient objects for each category. It shows that the “person” category accounts for a large proportion, which is reasonable as people usually appear in daily scenes along with other objects.
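As an illustration of the 6:2:2 division, the following minimal Python sketch shuffles a list of image identifiers and cuts it into three parts. The function name `split_dataset`, the fixed seed and the plain random shuffle are assumptions made for this example; the text does not state how images were assigned to each subset.

```python
import random


def split_dataset(image_ids, ratios=(6, 2, 2), seed=0):
    """Shuffle ids and cut them into train/val/test parts by integer ratios.

    Hypothetical helper: the random shuffle is an assumption, not the
    documented SOC procedure.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder goes to the test part
    return train, val, test


train, val, test = split_dataset(range(6000))
print(len(train), len(val), len(test))  # 3600 1200 1200
```

With 6,000 images this yields 3,600 training, 1,200 validation and 1,200 test images.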

(3) Global/Local Color Contrast of Salient Objects. As described in [26], the term “salient” is related to the global/local contrast of the foreground and background. It is essential to check whether the salient objects are easy to detect. For each object, we compute RGB color histograms for foreground and background separately. Then, the $\chi ^2$ distance is utilized to measure the distance between the two histograms. The global and local color contrast distributions are shown in Fig. 4b and c, respectively. In comparison to ILSO, our SOC has a larger proportion of objects with low global and local color contrast.
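The contrast measure described above can be sketched in Python with NumPy as follows. This is only a sketch under stated assumptions: the bin count (8 per channel), the joint RGB histogram, and the particular $\chi ^2$ form $\frac{1}{2}\sum (h_1-h_2)^2/(h_1+h_2)$ are choices made for this example, and the helper names are hypothetical. Only the global variant is shown; a local variant would presumably restrict the background to a neighborhood of the object, which the text does not specify.

```python
import numpy as np


def rgb_histogram(pixels, bins=8):
    """Normalized joint RGB histogram of an (N, 3) uint8 pixel array."""
    hist, _ = np.histogramdd(
        pixels.astype(np.float64),
        bins=(bins,) * 3,
        range=((0, 256),) * 3,
    )
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)


def chi2_distance(h1, h2, eps=1e-12):
    """Chi-square distance 0.5 * sum((h1 - h2)^2 / (h1 + h2))."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))


def global_color_contrast(image, mask, bins=8):
    """Contrast between object pixels (mask True) and all other pixels.

    image: (H, W, 3) uint8 array, mask: (H, W) boolean array.
    """
    foreground = image[mask]
    background = image[~mask]
    return chi2_distance(rgb_histogram(foreground, bins),
                         rgb_histogram(background, bins))
```

With normalized histograms this distance lies in [0, 1]: values near 0 indicate an object whose colors resemble the background (hard to detect), and values near 1 indicate strongly contrasting colors.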

(4) Locations of Salient Objects. Center bias has been identified as one of the most significant biases of saliency detection datasets [3, 20, 26]. Figure 4d illustrates a set of images and their overlay map. As can be seen, although salient objects are located in different positions, the overlay map still shows that somehow this set of images is center biased. Previous benchmarks often adopt this incorrect way to analyze the location distribution of salient objects. To avoid this misleading phenomenon, we plot the statistics of two quantities $r_o$ and $r_m$ in Fig. 4e, where $r_o$ and $r_m$ denote how far an object center and the farthest (margin) point in an object are from the image center, respectively. Both $r_o$ and $r_m$ are divided by half image diagonal length for normalization so that $r_o, r_m \in [0, 1]$. From these statistics, we can observe that salient objects in our dataset do not suffer from center bias.
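The two location statistics can be computed from a binary object mask roughly as in the sketch below. It assumes that the "object center" is the centroid of the mask pixels and that distances are measured between pixel centers; both are assumptions of this example, as is the function name `center_distances`.

```python
import numpy as np


def center_distances(mask):
    """Return (r_o, r_m) for a boolean object mask of shape (H, W).

    r_o: distance of the object centroid from the image center.
    r_m: distance of the farthest object pixel from the image center.
    Both are divided by half the image diagonal, so they lie in [0, 1].
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no object pixels")
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0  # image center in pixel coordinates
    half_diag = np.hypot(h, w) / 2.0
    r_o = np.hypot(ys.mean() - cy, xs.mean() - cx) / half_diag
    r_m = np.hypot(ys - cy, xs - cx).max() / half_diag
    return float(r_o), float(r_m)
```

A dataset free of center bias should show $r_o$ values spread over the range rather than concentrated near 0.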

(5) Size of Salient Objects. The size of an instance-level salient object is defined as the proportion of pixels in the image [26]. As shown in Fig. 4g, the size of salient objects in our SOC varies in a broader range, compared with the only existing instance-level ILSO [22] dataset. Also, medium-sized objects in SOC have a higher proportion.

Table 2. The list of salient object image attributes and the corresponding description. By observing the characteristics of the existing datasets, we summarize these attributes. Some visual examples can be found in Figs. 1 and 4f. For more examples, please refer to the supplementary materials

(6) High-Quality Salient Object Labeling. As also noticed in [17], training on the ECSSD dataset (1,000 images) achieves better results than training on other datasets (e.g., MSRA10K, with 10,000 images). Besides the scale, dataset quality is also an important factor. To obtain a large number of high quality images, we randomly select images from the MSCOCO dataset [27], which is a large-scale real-world dataset whose objects are labeled with polygons (i.e., coarse labeling). High-quality labels also play a critical role in improving the accuracy of SOD models [1]. Toward this end, we relabel the dataset with pixel-wise annotations. Similar to well-known SOD benchmark datasets [1, 2, 11, 19, 22, 23, 29, 32, 37, 41, 43], we did not use an eye tracker device. We have taken a number of steps to ensure the high quality of the annotations. These steps comprise two stages. In the bounding boxes (bboxes) stage, (i) we ask 5 viewers to annotate with bboxes the objects that they think are salient in each image, and (ii) we keep the images in which the majority ($\ge 3$) of viewers annotated the same object (the IOU of the bboxes $>0.8$). After the first stage, we have 3,000 salient object images annotated with bboxes. In the second stage, we further manually label the accurate silhouettes of the salient objects according to the bboxes. Note that 10 volunteers were involved in all steps to cross-check the quality of annotations. In the end, we keep 3,000 images with high-quality, instance-level labeled salient objects. As shown in Fig. 5b, d, the boundaries of our object labels are precise, sharp and smooth. During the annotation process, we also add some new categories (e.g., computer monitor, hat, pillow) that are not labeled in the MSCOCO dataset [27].
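The agreement test of the bbox stage can be illustrated with the short Python sketch below. The thresholds (at least 3 of 5 viewers, IOU above 0.8) come from the text; how agreement was actually checked is not described, so the rule used here (some box overlaps at least three boxes, itself included) and the function names are assumptions of this example.

```python
def bbox_iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def has_majority_agreement(boxes, iou_thresh=0.8, min_votes=3):
    """True if some box is matched (IOU > iou_thresh) by at least
    min_votes boxes, counting the box itself."""
    for ref in boxes:
        votes = sum(1 for other in boxes if bbox_iou(ref, other) > iou_thresh)
        if votes >= min_votes:
            return True
    return False


# Five viewers: three roughly agree, two picked other regions.
viewer_boxes = [(0, 0, 100, 100), (2, 2, 100, 100), (0, 0, 98, 99),
                (200, 200, 250, 250), (300, 300, 320, 320)]
print(has_majority_agreement(viewer_boxes))  # True
```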

(7) Salient Objects with Attributes. Having attribute information regarding the images in a dataset helps objectively assess the performance of models over different types of parameters and variations. It also allows the inspection of model failures. To this end, we define a set of attributes to represent specific situations faced in real-world scenes, such as motion blur, occlusion and cluttered background (summarized in Table 2). Note that one image can be annotated with multiple attributes as these attributes are not exclusive.

Inspired by [33], we present the distribution of attributes over the dataset in Fig. 6 Left. Type SO has the largest proportion due to accurate instance-level (e.g., tennis racket in Fig. 2) annotation. Type HO accounts for a large proportion, because real-world scenes are composed of different constituent materials. Motion blur is more common in video frames than in still images, but it sometimes occurs in still images as well. Thus, type MB takes a relatively small proportion in our dataset. Since a realistic image usually contains multiple attributes, we show the dominant dependencies among attributes based on the frequency of occurrences in Fig. 6 Right. For example, a scene containing lots of heterogeneous objects is likely to have a large number of objects blocking each other and forming complex spatial structures. Thus, type HO has a strong dependency with types OC, OV, and SO.
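Attribute frequencies and pairwise co-occurrences of the kind described above can be counted as in the following Python sketch. The input layout (a dictionary from image identifier to a list of attribute codes) and the function name are hypothetical, and the toy annotations are invented purely for illustration; they are not SOC data.

```python
from collections import Counter
from itertools import combinations


def attribute_statistics(annotations):
    """Count how often each attribute and each attribute pair occurs.

    annotations: {image_id: iterable of attribute codes such as "HO", "OC"}.
    """
    singles, pairs = Counter(), Counter()
    for attrs in annotations.values():
        unique = sorted(set(attrs))  # sort so that each pair has one key
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return singles, pairs


# Toy annotations, invented for this example only.
toy = {"img1": ["HO", "OC", "SO"], "img2": ["HO", "OC"], "img3": ["MB"]}
singles, pairs = attribute_statistics(toy)
print(singles["HO"], pairs[("HO", "OC")])  # 2 2
```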

4 Benchmarking Models

In this section, we present the evaluation results of the sixteen SOD models on our SOC dataset. Nearly all representative CNNs based SOD models are evaluated. However, since the codes of some models are not publicly available, we do not consider them here. In addition, most models are not optimized for non-salient object detection. Thus, to be fair, we only use the test set of our SOC dataset to evaluate SOD models. We describe the evaluation metrics in Sect. 4.1. Overall model performance on the SOC dataset is presented in Sect. 4.2 and summarized in Table 3, while the attribute level performance (e.g., performance of the appearance changes) is discussed in Sect. 4.3 and summarized in Table 4. The evaluation scripts are publicly available, and an on-line evaluation test is provided on our website.

4.1 Evaluation Metrics

In a supervised evaluation framework, given a predicted map M generated by a SOD model and a ground truth mask G, the evaluation metrics are expected to tell which model generates the best result. Here, we use three different evaluation metrics to evaluate SOD models on our SOC dataset.

Pixel-wise Accuracy $\varepsilon $. The region similarity evaluation measure does not consider the true negative saliency assignments. As a remedy, we also compute the normalized ([0,1]) mean absolute error (MAE) between M and G, defined as:

$$\begin{aligned} \varepsilon = \frac{1}{W \times H} \sum _{x=1}^{W}\sum _{y=1}^{H} || M(x,y) - G(x,y)||, \end{aligned}$$

(1)

where W and H are the width and height of images, respectively.
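Eq. (1) translates directly into a few lines of NumPy. The sketch below assumes that both maps have already been scaled to [0, 1] and have the same shape; the function name is chosen for this example.

```python
import numpy as np


def mean_absolute_error(pred, gt):
    """MAE between a predicted saliency map and a ground-truth mask.

    Both inputs are expected to hold values in [0, 1] and share a shape.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    return float(np.mean(np.abs(pred - gt)))
```

Unlike the $F\text {-}measure$, this score remains informative when the ground truth is entirely black: an all-zero prediction then scores 0, and any falsely highlighted pixel increases the error.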

Region Similarity F. To measure how well the regions of the two maps match, we use the $F\text {-}measure$, defined as:

$$\begin{aligned} F = \frac{(1+\beta ^{2}) Precision \times Recall}{\beta ^{2} Precision + Recall}, \end{aligned}$$

(2)

where $\beta ^{2}=0.3$ is suggested by [1] to trade off recall and precision. However, the $F\text {-}measure$ is not well defined for a black (all-zero matrix) ground truth when calculating recall and precision. Under these circumstances, different foreground maps all get the same result of 0, which is clearly unreasonable. Thus, the $F\text {-}measure$ is not suitable for measuring the results of non-salient object detection.
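A minimal NumPy sketch of Eq. (2) is given below. The text does not say how the continuous map M is binarized before precision and recall are computed, so the fixed threshold of 0.5 is an assumption of this example (adaptive or swept thresholds are also common in practice); the function name is likewise hypothetical.

```python
import numpy as np


def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of a saliency map against a binary ground-truth mask.

    threshold is an assumed binarization rule, not taken from the paper.
    """
    binary = np.asarray(pred) >= threshold
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(binary, gt).sum()
    precision = tp / binary.sum() if binary.sum() > 0 else 0.0
    recall = tp / gt.sum() if gt.sum() > 0 else 0.0
    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / denom)


gt = np.array([[1, 1], [0, 0]])
pred = np.array([[0.9, 0.2], [0.8, 0.1]])
print(f_measure(pred, gt))  # 0.5

# With an all-zero ground truth every prediction scores 0.
empty = np.zeros((2, 2))
print(f_measure(np.ones((2, 2)), empty), f_measure(np.zeros((2, 2)), empty))  # 0.0 0.0
```

The second call illustrates the limitation noted above: a prediction that highlights the whole image and one that highlights nothing are indistinguishable when no salient object exists.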

However, both metrics of $\varepsilon $ and F are based on pixel-wise errors and often ignore the structural similarities. Behavioral vision studies have shown that the human visual system is highly sensitive to structures in scenes [13]. In many applications, it is desired that the results of the SOD model retain the structure of objects.

Structure Similarity S. The $S\text {-}measure$ proposed by Fan et al. [13] evaluates the structural similarity by considering both regions and objects. Therefore, we additionally use the $S\text {-}measure$ to evaluate the structural similarity between M and G. Note that the overall performance evaluated and analyzed in the following sections is based on the $S\text {-}measure$.

4.2 Metric Statistics

To obtain an overall result, we average the scores of the evaluation metrics $\eta $ ($\eta \in \{F, \varepsilon , S\}$), denoted by:

$$\begin{aligned} M_{\eta }(D)=\frac{1}{|D|}\sum _{I_{i}\in D}{\bar{\eta }(I_{i})}, \end{aligned}$$

(3)

where $\bar{\eta }(I_{i})$ is the evaluation score of the image $I_i$ within the image dataset D.

Table 3. The performance of SOD models under three metrics. F stands for region similarity, $\varepsilon $ is the mean absolute error, and S is the structure similarity. $\uparrow $ indicates that higher is better, and $\downarrow $ that lower is better. The evaluation results are calculated according to Eq. (3) over our SOC dataset. $S_{all}, F_{all}, \varepsilon _{all}$ indicate the overall performance using the metrics $S, F,~\varepsilon $, respectively. Bold for the best.

Single-task: For the single-task models, the best performing model on the entire SOC dataset ($S_{all}$ in Table 3) is NLDF [31] ($M_{S}=0.818$), followed by RFCN [38] ($M_{S}=0.814$). MDF [23] and AMU [49] use edge cues to promote the saliency map but fail to achieve the ideal goal. Aiming at using the local region information of images, MC [51], MDF [23], ELD [15], and DISC [8] try to use superpixel methods to segment images into regions and then extract features from these regions, which is complex and time-consuming. UCF [50], DSS [17], NLDF [31], and AMU [49] utilize the FCN to further improve the performance of SOD ($S_{sal}$ in Table 4). Some other methods such as DCL [24] and IMC [48] try to combine superpixels with FCN to build a powerful model. Furthermore, RFCN [38] combines two related cues, edges and superpixels, into FCN to obtain good performance ($M_{F}=0.435$, $M_{S}=0.814$) over the whole dataset.

Multi-task: Different from the models mentioned above, MSR [22] detects the instance-level salient objects using three closely related steps: estimating saliency maps, detecting salient object contours, and identifying salient object instances. It creates a multi-scale saliency refinement network that results in the highest performance ($S_{all}$). The other two multi-task models, DS [25] and WSS [37], utilize the segmentation and classification results simultaneously to generate the saliency maps, obtaining moderate performance. It is worth mentioning that although WSS is a weakly supervised multi-task model, it still achieves performance comparable to that of single-task, fully supervised models. Weakly supervised and multi-task models can therefore be promising future directions.

4.3 Attributes-Based Evaluation

We annotate the salient images with attributes as discussed in Sect. 3 and Table 2. Each attribute stands for a challenging problem faced in real-world scenes. The attributes allow us to identify groups of images with a dominant feature (e.g., presence of clutter), which is crucial to illustrate the performance of SOD models and to relate SOD to application-oriented tasks. For example, the Sketch2Photo application [7] prefers models with good performance on big objects, which can be identified by attribute-based performance evaluation.

Results. In Table 4, we show the performance on subsets of our dataset characterized by a particular attribute. Due to space limitation, in the following parts, we only select some representative attributes for further analysis. More details can be found in the supplementary material.
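The bookkeeping behind such an attribute-based table can be sketched as follows: per-image scores are averaged over all salient images and over each attribute subset, and each subset mean is compared with the overall mean. The relative change is computed here as a percentage of the overall mean, which is an assumption about how the reported gains and losses are defined; the function name, input layout and toy numbers are hypothetical as well.

```python
def attribute_report(scores, attributes):
    """Per-attribute mean score and its relative change versus the overall mean.

    scores: {image_id: float}, e.g. a per-image structure similarity.
    attributes: {image_id: iterable of attribute codes}.
    Returns (overall_mean, {attr: (subset_mean, percent_change)}).
    """
    if not scores:
        raise ValueError("no scores given")
    overall = sum(scores.values()) / len(scores)
    by_attr = {}
    for image_id, score in scores.items():
        for attr in set(attributes.get(image_id, ())):
            by_attr.setdefault(attr, []).append(score)
    report = {}
    for attr, values in by_attr.items():
        mean = sum(values) / len(values)
        change = 100.0 * (mean - overall) / overall if overall else 0.0
        report[attr] = (mean, change)
    return overall, report


# Toy numbers, invented for this example only.
toy_scores = {"a": 0.8, "b": 0.6, "c": 0.4}
toy_attrs = {"a": ["BO"], "b": ["BO", "SO"], "c": ["SO"]}
overall, report = attribute_report(toy_scores, toy_attrs)
```

In this toy case the BO subset averages 0.7 against an overall mean of 0.6, a relative gain of about 16.7%, while the SO subset shows the corresponding loss.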

Big Object (BO) scenes often occur when objects are close to the camera, in which case tiny text or patterns are clearly visible. In this case, models that prefer to focus on local information will be seriously misled, leading to a considerable loss of performance (e.g., 28.9% for DSS [17], 20.8% for MC [51], and 23.8% for RFCN [38]).

Table 4. Attributes-based performance on our SOC salient objects sub-dataset. For each model, the score corresponds to the average structure similarity $M_{S}$ (in Sect. 4.1) over the subset of images with that specific attribute (e.g., CL). The higher the score, the better the performance. Bold for the best. The average salient-object performance $S_{sal}$ is presented in the first row using the structure similarity S. The symbols $^{+}$ and $^{-}$ indicate an increase and a decrease compared to the average ($S_{sal}$) result, respectively

However, the performance of the IMC [48] model instead goes up by a slight margin of 3.2%. After taking a deeper look at the pipeline of this model, we came up with a reasonable explanation. IMC uses a coarse predicted map to express semantics and utilizes over-segmented images to supplement the structural information, achieving a satisfying result on type BO. However, over-segmented images cannot make up for the missing details, causing a 4.6% degradation of performance on type SO.

Small Object (SO) is tricky for all SOD models. All models encounter performance degradation (e.g., from $-0.3\%$ for DSS [17] to $-5.6\%$ for LEGS [36]), because small objects are easily ignored during the down-sampling of CNNs. DSS [17] is the only model with just a slight decrease in performance on type SO, while it has the biggest (28.9%) loss of performance on type BO. MDF [23] uses multi-scale superpixels as the input of the network, so it retains the details of small objects well. However, due to the limited size of superpixels, MDF cannot efficiently sense the global semantics, causing a big failure on type BO.

Occlusions (OC) scenes are those in which objects are partly obscured. They therefore require SOD models to capture global semantics to make up for the incomplete information of objects. To do so, DS [25] and AMU [49] made use of multi-scale features in the down-sampling process to generate a fused saliency map; UCF [50] proposed an uncertain learning mechanism to learn uncertain convolutional features. All these methods try to obtain saliency maps containing both global and local features. Unsurprisingly, these methods have achieved good results on type OC. Based on the above analyses, we also find that these three models perform very well on scenes requiring more semantic information, such as types AC, OV and CL.

Heterogeneous Object (HO) is a common attribute in natural scenes. All models improve on type HO relative to their average performance, with gains ranging from 3.9% to 9.7%. We suspect this is because type HO accounts for a significant proportion of all datasets, which makes models fit this attribute better. To some degree, this result confirms our statistics in Fig. 6.

5 Discussion and Conclusion

To the best of our knowledge, this work presents the largest-scale performance evaluation of CNNs based salient object detection models to date. Our analysis points out a serious data selection bias in existing SOD datasets. This design bias has led state-of-the-art SOD algorithms to achieve nearly saturated performance when evaluated on existing datasets, while they remain far from satisfactory when applied to real-world daily scenes. Based on our analysis, we first identify 7 important aspects that a comprehensive and balanced dataset should fulfill. We then introduce a high quality SOD dataset, SOC. It contains salient objects from daily life in their natural environments, which brings it closer to realistic settings. The SOC dataset will evolve and grow over time and will enable research possibilities in multiple directions, e.g., salient object subitizing [46], instance level salient object detection [22], weakly supervised salient object detection [37], etc. Next, a set of attributes (e.g., Appearance Change) is proposed in an attempt to obtain a deeper insight into the SOD problem, investigate the pros and cons of the SOD algorithms, and objectively assess the model performances over different perspectives/requirements. Finally, we report attribute-based performance assessment on our SOC dataset. The results open up promising future directions for model development and comparison.

Notes

1. http://dpfan.net/SOCBenchmark/

References

Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned Salient Region Detection. In: CVPR, pp. 1597–1604. IEEE (2009)
Alpert, S., Galun, M., Basri, R., Brandt, A.: Image segmentation by probabilistic bottom-up aggregation and cue integration. In: CVPR, pp. 1–8. IEEE (2007)
Borji, A., Cheng, M.M., Jiang, H., Li, J.: Salient object detection: a benchmark. IEEE Trans. Image Process. 24(12), 5706–5722 (2015)
Borji, A., Itti, L.: State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 185–207 (2013)
Borji, A., Sihite, D.N., Itti, L.: Salient object detection: a benchmark. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, pp. 414–429. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_30
Caesar, H., Uijlings, J., Ferrari, V.: COCO-stuff: thing and stuff classes in context. In: CVPR. IEEE (2018)
Chen, T., Cheng, M.M., Tan, P., Shamir, A., Hu, S.M.: Sketch2photo: internet image montage. ACM Trans. Graph. (TOG) 28(5), 124 (2009)
Chen, T., Lin, L., Liu, L., Luo, X., Li, X.: DISC: deep image saliency computing via progressive representation learning. IEEE Trans. Neural Netw. Learn. Syst. 27(6), 1135–1149 (2016)
Cheng, M.M., Hou, Q.B., Zhang, S.H., Rosin, P.L.: Intelligent visual media processing: when graphics meets vision. J. Comput. Sci. Technol. 32(1), 110–121 (2017)
Cheng, M.M., Mitra, N.J., Huang, X., Hu, S.M.: Salientshape: group saliency in image collections. Vis. Comput. 30(4), 443–453 (2014)
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H.S., Hu, S.M.: Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 569–582 (2015)
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2010 (VOC 2010) Results. https://www.researchgate.net/profile/Luc_Van_Gool/publication/277292831_The_2005_pascal_visual_object_classes_challenge/links/57224cf108aef9c00b7c7efb.pdf
Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: a new way to evaluate foreground maps. In: ICCV, pp. 4548–4557. IEEE (2017)
Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 698–704 (2018)
Gayoung, L., Yu-Wing, T., Junmo, K.: Deep saliency with encoded low level distance map and high level features. In: CVPR, IEEE (2016)
He, J. et al.: Mobile product search with bag of hash bits and boundary reranking. In: CVPR, pp. 3005–3012. IEEE (2012)
Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.: Deeply supervised salient object detection with short connections. IEEE TPAMI (2018). https://doi.org/10.1109/TPAMI.2018.2815688
Hou, Q., Dokania, P.K., Massiceti, D., Wei, Y., Cheng, M.M., Torr, P.H.S.: Bottom-up top-down cues for weakly supervised semantic segmentation. In: EMMCVPR. IEEE (2017)
Jiang, H., Cheng, M.M., Li, S.J., Borji, A., Wang, J.: Joint salient object detection and existence prediction. Front. Comput. Sci., 1–11 (2017)
Judd, T., Durand, F., Torralba, A.: A benchmark of computational models of saliency to predict human fixations. In: MIT Technical Report (2012)
Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE TPAMI 27(8), 1265–1278 (2005)
Li, G., Xie, Y., Lin, L., Yu, Y.: Instance-level salient object segmentation. In: CVPR, pp. 247–256. IEEE (2017)
Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: CVPR, pp. 5455–5463. IEEE (2015)
Li, G., Yu, Y.: Deep contrast learning for salient object detection. In: CVPR, pp. 478–487. IEEE (2016)
Li, X., Zhao, L., Wei, L., Yang, M.H., Wu, F., Zhuang, Y., Ling, H., Wang, J.: DeepSaliency: multi-task deep neural network model for salient object detection. IEEE TIP 25(8), 3919–3930 (2016)
Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: CVPR, pp. 280–287. IEEE (2014)
Lin, T.-Y. et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, N., Han, J.: DHSNet: deep hierarchical saliency network for salient object detection. In: CVPR, pp. 678–686. IEEE (2016)
Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. In: CVPR, pp. 1–8. IEEE (2007)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440. IEEE (2015)
Luo, Z., Mishra, A.K., Achkar, A., Eichel, J.A., Li, S., Jodoin, P.M.: Non-local deep features for salient object detection. In: CVPR, vol. 2, p. 7 (2017)
Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV, vol. 2, pp. 416–423. IEEE (2001)
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR, pp. 724–732. IEEE (2016)
Russakovsky, O. et al.: Imagenet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
Wang, J., Jiang, H., Yuan, Z., Cheng, M.M., Hu, X., Zheng, N.: Salient object detection: a discriminative regional feature integration approach. IJCV 123(2), 251–268 (2017)
Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation and global search. In: CVPR, pp. 3183–3192. IEEE (2015)
Wang, L. et al.: Learning to detect salient objects with image-level supervision. In: CVPR, pp. 136–145. IEEE (2017)
Wang, L., Wang, L., Lu, H., Zhang, P., Ruan, X.: Saliency detection with recurrent fully convolutional networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 825–841. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_50
Wei, Y., Feng, J., Liang, X., Cheng, M.M., Zhao, Y., Yan, S.: Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: CVPR. IEEE (2017)
Wei, Y.: STC: A simple to complex framework for weakly-supervised semantic segmentation. IEEE TPAMI 39(11), 2314–2320 (2017)
Xia, C., Li, J., Chen, X., Zheng, A., Zhang, Y.: What is and what is not a salient object? Learning salient object detector by ensembling linear exemplar regressors. In: CVPR. IEEE (2017)
Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV, pp. 1395–1403. IEEE (2015)
Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: CVPR, pp. 1155–1162. IEEE (2013)
Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: CVPR, pp. 3166–3173. IEEE (2013)
Zhang, G.X., Cheng, M.M., Hu, S.M., Martin, R.R.: A shape-preserving approach to image resizing. Comput. Graph. Forum 28(7), 1897–1906 (2009)
Zhang, J. et al.: Salient object subitizing. In: CVPR, pp. 4045–4054. IEEE (2015)
Zhang, J., Sclaroff, S., Lin, Z., Shen, X., Price, B., Mech, R.: Unconstrained salient object detection via proposal subset optimization. In: CVPR, pp. 5733–5742. IEEE (2016)
Zhang, J., Dai, Y., Porikli, F.: Deep salient object detection by integrating multi-level cues. In: Winter Conference on Applications of Computer Vision (WACV), pp. 1–10. IEEE (2017)
Zhang, P., Wang, D., Lu, H., Wang, H., Ruan, X.: Amulet: aggregating multi-level convolutional features for salient object detection. In: ICCV, pp. 202–211 (2017)
Zhang, P., Wang, D., Lu, H., Wang, H., Yin, B.: Learning uncertain convolutional features for accurate saliency detection. In: ICCV, pp. 212–221 (2017)
Zhao, R., Ouyang, W., Li, H., Wang, X.: Saliency detection by multi-context deep learning. In: CVPR, pp. 1265–1274. IEEE (2015)

Acknowledgments

This research was supported by NSFC (No. 61620106008, 61572264), the National Youth Talent Support Program, the Tianjin Natural Science Foundation for Distinguished Young Scholars (No. 17JCJQJC43700), and the Huawei Innovation Research Program.

Author information

Authors and Affiliations

College of Computer Science, Nankai University, Tianjin, China
Deng-Ping Fan, Ming-Ming Cheng, Jiang-Jiang Liu, Shang-Hua Gao & Qibin Hou
CRCV, University of Central Florida, Orlando, Florida, US
Ali Borji

Corresponding author

Correspondence to Ming-Ming Cheng.

Editor information

Editors and Affiliations

Google Research, Zurich, Switzerland
Vittorio Ferrari
Carnegie Mellon University, Pittsburgh, PA, USA
Martial Hebert
Google Research, Zurich, Switzerland
Cristian Sminchisescu
Hebrew University of Jerusalem, Jerusalem, Israel
Yair Weiss

Cite this paper

Fan, DP., Cheng, MM., Liu, JJ., Gao, SH., Hou, Q., Borji, A. (2018). Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol 11219. Springer, Cham. https://doi.org/10.1007/978-3-030-01267-0_12

DOI: https://doi.org/10.1007/978-3-030-01267-0_12
Published: 07 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01266-3
Online ISBN: 978-3-030-01267-0
eBook Packages: Computer Science, Computer Science (R0)
