1 Introduction

Accurate visual saliency models are fundamental for multiple disciplines, such as computer vision [5], neuroscience [12], and cognitive psychology [11]. In this paper we focus on salient object detection, which consists of segmenting the main foreground object from the background in a digital image. Salient object detection methods are commonly used in applications such as object-of-interest proposal, object recognition, adaptive image and video compression, content-aware image editing, image retrieval, and object-level image manipulation [6].

In the literature there is no universal agreement on the definition of foreground and background, as will become evident later in this paper when comparing the annotation criteria adopted across the different benchmark datasets. During the training of our Fully Convolutional Network we simultaneously exploit annotated data coming from different datasets. In this way we obtain a model that represents a good compromise among the datasets' different notions of foreground/background segmentation.

Many different approaches and solutions have been proposed in recent years for salient object detection. The method proposed in Discriminative Regional Feature Integration (DRFI) [13] builds a multi-level representation of the input image and creates a regression model mapping the regional feature vector of each level to the corresponding saliency score. These scores are finally fused to determine the complete saliency map. In Quantum Cut (QCUT) [3] the authors model salient object segmentation as an optimization problem. They then exploit the link between quantum mechanics and graph cuts to develop an object segmentation method based on the ground state solution of a modified Hamiltonian. The authors of Minimum Barrier Distance (MBD) [26] present an approximation of the MBD transform and combine it with an appearance-based backgroundness cue. The resulting method performs significantly better than other solutions with the same computational requirements. In Saliency Tree (ST) [18] the authors simplify the image into primitive regions, with associated saliency based on multiple handcrafted measures. They generate a saliency tree using region merging, and perform a systematic analysis of this tree to derive the final saliency map. Robust Background Detection (RBD) [28] introduces boundary connectivity, a background measure based on an intuitive geometrical interpretation. This measure is then used along with multiple low-level cues to produce saliency maps through a principled optimization framework.

In a recent work, Borji et al. [6] present an exhaustive review of state-of-the-art methods for salient object detection, comparing more than forty methods on a benchmark composed of seven different datasets. In this paper we investigate the use of a Fully Convolutional Network (FCN) for salient object detection, taking inspiration from the work of Long et al. [19], and evaluate it on the Borji et al. [6] benchmark. Unlike the compared solutions, we propose a data-driven model that leverages semantic cues as the basis for saliency estimation. Other approaches using deep learning also exist [7, 10, 15], although they do not adhere to the data and methods of the reference benchmark adopted here. The main contributions of this paper can be summarized as follows:

  • we propose a semantically-aware FCN to address the problem of salient object detection that is able to produce a binary pixel-level saliency map;

  • we systematically investigate the contribution of different kinds of synthetic data augmentation to train the FCN;

  • we evaluate the effectiveness of our proposal on a standard benchmark for salient object detection composed of seven different datasets [6]. The proposed method on average outperforms the state of the art according to multiple evaluation measures.

2 Proposed Method

We propose a Fully Convolutional Network to address the problem of salient object detection, taking inspiration from a work originally developed for semantic segmentation [19], that uses layers previously trained for the recognition of 1,000 object classes (Visual Geometry Group, or VGG [22]). This allows our network to be semantically-aware, and therefore capable of exploiting high-order concepts for separating foreground from background. Furthermore, the fully convolutional architecture is specifically designed to produce a per-pixel prediction, which perfectly fits the task of generating an input-sized foreground/background mask.

The main difference with respect to the semantic segmentation approach proposed in [19] is that in our case the salient object can belong to any object category: our network is able to segment salient objects from categories not restricted to the 20 classes of the original semantic segmentation task [19], nor to the 1,000 object classes used to train the VGG [22]. Finally, we adopt a different training procedure, as we find it advantageous to apply several kinds of data augmentation. The effects of such augmentation are analyzed and discussed in the experimental results section.

Fig. 1. Schematic view of the Fully Convolutional Network employed for salient object detection. Intermediate activations of a VGG-based processing pipeline are resized and combined in order to implement a multi-resolution analysis.

The network architecture is illustrated in Fig. 1, and adheres to the following logic:

  1. Build abstractions of gradually decreasing spatial resolution, using the VGG layers [22].

  2. Extract intermediate activations, and map their depth to the final problem size (two classes for our task), using convolution layers.

  3. Increase the spatial size of these activations, using convolution-transpose layers.

  4. Sum up the activations, which now have compatible sizes.

  5. Produce as output a binary pixel-level saliency map.

Thanks to this strategy, the network can see both the whole picture and small details at the same time, thus producing a globally-aware yet precise output.
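
For concreteness, the following PyTorch sketch illustrates the five-step logic above. It is our own reimplementation, not the authors' released code: the choice of intermediate taps (pool3/pool4/pool5 of VGG-16), the kernel sizes of the transpose convolutions, and the use of torchvision's pretrained weights are assumptions.

```python
# Minimal sketch of the multi-resolution FCN logic (assumptions noted above).
import torch
import torch.nn as nn
import torchvision

class SaliencyFCN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features  # downloads ImageNet weights
        # Step 1: VGG backbone, split so intermediate activations can be tapped.
        self.stage3 = vgg[:17]    # up to pool3: 1/8 resolution, 256 channels
        self.stage4 = vgg[17:24]  # up to pool4: 1/16 resolution, 512 channels
        self.stage5 = vgg[24:]    # up to pool5: 1/32 resolution, 512 channels
        # Step 2: map each activation depth to the problem size (2 classes).
        self.score3 = nn.Conv2d(256, num_classes, kernel_size=1)
        self.score4 = nn.Conv2d(512, num_classes, kernel_size=1)
        self.score5 = nn.Conv2d(512, num_classes, kernel_size=1)
        # Step 3: convolution-transpose layers to increase spatial size.
        self.up5 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up4 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up3 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, x):
        h3 = self.stage3(x)
        h4 = self.stage4(h3)
        h5 = self.stage5(h4)
        # Steps 3-4: upsample and sum activations of compatible size.
        s = self.up5(self.score5(h5)) + self.score4(h4)
        s = self.up4(s) + self.score3(h3)
        # Step 5: per-pixel class scores at input resolution (argmax -> binary map).
        return self.up3(s)

# Usage: logits = SaliencyFCN()(torch.randn(1, 3, 256, 256)); mask = logits.argmax(dim=1)
```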

2.1 Training

Layers inherited from VGG (which in principle only need fine-tuning) and new layers (trained from scratch) are all updated using the same learning rate. The task of calibrating the gradients for the two cases is implicitly left to the Adam optimizer [14].

Many methods in the state of the art generate a continuous-valued prediction [3, 13, 18, 26, 28] directly correlated with the saliency of the pixels in the image. Most of the available datasets, though, are published with a binary ground truth [5, 8, 17, 24, 25]. For this reason we approach the problem as a per-pixel binary classification task: all ground truth images are converted to binary data by setting to 1 all values greater than 0. The neural network is then trained with a softmax cross entropy loss, where the global loss of each batch is computed by averaging the losses of the individual pixels.
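
As a concrete illustration of this formulation, the following is a minimal PyTorch sketch of the loss (our own code; tensor shapes and variable names are assumptions):

```python
# Per-pixel binary classification loss: binarized ground truth + softmax cross entropy.
import torch
import torch.nn as nn

def saliency_loss(logits: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """logits: (N, 2, H, W) network output; gt: (N, H, W) ground-truth saliency maps."""
    # Binarize the ground truth: every value greater than 0 becomes class 1.
    target = (gt > 0).long()
    # Softmax cross entropy, averaged over all pixels of the batch.
    return nn.functional.cross_entropy(logits, target, reduction="mean")
```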

All training examples are processed by an online data augmentation procedure in order to provide additional information to the learning process. The following perturbations are considered (a minimal sketch is given after the list):

  • Random crop. We select a square sub-window whose side is randomly chosen between 256 pixels and the maximum allowed by the image size. The crop is then resized to the fixed training dimension, i.e. 256\(\times \)256 pixels.

  • Random horizontal flip.

  • Random gamma between 0.3 and \(\tfrac{1}{0.3}\).
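
The sketch below illustrates the three perturbations (our own code, assuming PIL and NumPy; details beyond the stated ranges, such as the uniform sampling of the gamma exponent, are assumptions):

```python
# Online augmentation sketch: random square crop + resize, horizontal flip, random gamma.
import random
import numpy as np
from PIL import Image, ImageOps

TRAIN_SIZE = 256

def augment(image: Image.Image, mask: Image.Image):
    w, h = image.size
    # Random crop: square side between 256 pixels and the image limits.
    side = random.randint(min(TRAIN_SIZE, min(w, h)), min(w, h))
    x0 = random.randint(0, w - side)
    y0 = random.randint(0, h - side)
    box = (x0, y0, x0 + side, y0 + side)
    image = image.crop(box).resize((TRAIN_SIZE, TRAIN_SIZE), Image.BILINEAR)
    mask = mask.crop(box).resize((TRAIN_SIZE, TRAIN_SIZE), Image.NEAREST)
    # Random horizontal flip, applied jointly to image and mask.
    if random.random() < 0.5:
        image = ImageOps.mirror(image)
        mask = ImageOps.mirror(mask)
    # Random gamma between 0.3 and 1/0.3, applied to the image only.
    gamma = random.uniform(0.3, 1.0 / 0.3)
    arr = np.asarray(image, dtype=np.float32) / 255.0
    image = Image.fromarray((255.0 * arr ** gamma).astype(np.uint8))
    return image, mask
```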

Each perturbation category was individually tested on a small subset of the benchmark data in order to assess its impact on performance. An analysis of these effects is reported in Sect. 3.2.

All models are trained with a learning rate of \(5\times 10^{-5}\) and a batch size of 15, for a total of 20 epochs.
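
Putting the pieces together, a minimal training-loop sketch under these settings could look as follows (reusing the SaliencyFCN and saliency_loss sketches above; train_dataset is a hypothetical placeholder for the LODO training set):

```python
# Training configuration sketch: single learning rate for all layers, Adam, 20 epochs.
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SaliencyFCN().to(device)  # from the architecture sketch above
# One learning rate for both the inherited VGG layers and the new layers;
# per-parameter calibration is left to Adam.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# train_dataset is a hypothetical placeholder yielding augmented (image, mask)
# pairs already resized to 256x256.
loader = DataLoader(train_dataset, batch_size=15, shuffle=True)

for epoch in range(20):
    for images, masks in loader:
        optimizer.zero_grad()
        loss = saliency_loss(model(images.to(device)), masks.to(device))
        loss.backward()
        optimizer.step()
```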

Table 1. Summary of tested datasets
Fig. 2. Image-annotation examples for each of the seven datasets used in the benchmark [6].

3 Experiments

3.1 Datasets

Experiments were performed according to the benchmark proposed in [6], adopting both its datasets and its evaluation protocol. The benchmark is composed of seven different datasets, presented in Table 1, each with different kinds of content and bias. Figure 2 shows an image-annotation pair for each dataset. The benchmark defines no official training/test split for the seven datasets, mainly because at the time of its original release few of the tested methods involved an explicit training phase. Since our approach requires a significant amount of training data, we adopted a Leave-One-Dataset-Out (LODO) solution, sketched below. This allows a fair comparison with the state of the art, as we test on the official datasets, while avoiding overfitting the model to the test data. However, since in each LODO split we train the FCN on images collected and annotated with potentially different criteria than those used on the test set, our results could be lower than those we would obtain on homogeneous data (e.g. a train/test split of the same dataset).
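
A minimal sketch of the LODO split generation follows (illustrative only, not the original tooling; dataset loading is left to the caller and only the dataset names explicitly mentioned in this paper are shown in the example):

```python
# Leave-One-Dataset-Out: train on the union of all datasets except the held-out one.
def lodo_splits(dataset_names):
    """Yield (training dataset names, held-out test dataset name) pairs."""
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, held_out

# Example with four of the seven benchmark datasets (names from the text).
for train_names, test_name in lodo_splits(["DUT-OMRON", "MSRA10K", "JuddDB", "THUR15K"]):
    print("train on:", train_names, "-> test on:", test_name)
```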

In order to ensure a totally fair evaluation procedure, we checked for near-duplicates among dataset pairs. Following [4], we computed the Structural Similarity measure (SSIM) [23] between all pairs of images, after scaling them to 64\(\times \)64 pixels and converting them to grayscale, and manually inspected those with similarity higher than 0.9. Out of more than 200 million pairs, only five duplicates were found. Although this number is probably too small to have any overfitting effect, these images were excluded from the training set whenever their counterparts were present in the test set. Table 2 lists the duplicate pairs found.
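
A minimal sketch of this near-duplicate check, assuming scikit-image for SSIM (our own code, not the original tooling; for the full benchmark one would stream images rather than keep all pairs in memory):

```python
# Near-duplicate detection: 64x64 grayscale thumbnails, flag pairs with SSIM > 0.9.
import itertools
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def load_small_gray(path, size=64):
    """Load an image as a 64x64 grayscale float array."""
    return np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=np.float64)

def candidate_duplicates(paths_a, paths_b, threshold=0.9):
    """Return cross-dataset image pairs whose SSIM exceeds the threshold."""
    thumbs_a = {p: load_small_gray(p) for p in paths_a}
    thumbs_b = {p: load_small_gray(p) for p in paths_b}
    flagged = []
    for pa, pb in itertools.product(paths_a, paths_b):
        score = ssim(thumbs_a[pa], thumbs_b[pb], data_range=255.0)
        if score > threshold:
            flagged.append((pa, pb, score))  # to be checked manually
    return flagged
```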

Table 2. Duplicates found among the seven analyzed datasets.

3.2 Data Augmentation

A preliminary investigation of the usefulness of the data augmentation described in Sect. 2.1 was performed on the DUT-OMRON Leave-One-Dataset-Out (LODO) setting. Figure 3 shows the loss values on both the training and test sets for three different setups: no data augmentation, the three perturbations applied separately, and the same three perturbations applied jointly. All the investigated perturbation strategies reduce the ability to fit the training data, while at the same time enhancing the model's predictive power on unseen data. Their joint application yields the largest improvement, thanks to the low correlation among the individual contributions, and is therefore used for training the FCN on all the datasets.

Fig. 3. Softmax cross entropy loss on the DUT-OMRON LODO setup under different kinds of data augmentation.

3.3 Evaluation Measures

Evaluation is performed under the following criteria, aimed at capturing different aspects of the quality of the predicted saliency region:

F-Measure (\(F_{\beta }\)) is the weighted harmonic mean of precision and recall:

$$\begin{aligned} F_{\beta } = \frac{(1+\beta ^2)Precision \times Recall}{\beta ^2 Precision + Recall} \end{aligned}$$
(1)

Following [6], the weight \(\beta ^2\) is set to 0.3 in order to emphasize precision, which is considered more important than recall for this specific task [1, 17]. Since precision and recall require a binary input, the benchmark adopts three different binarization strategies for methods that do not provide a binary prediction:

  1. Varying fixed threshold: precision and recall are computed at all integer thresholds between 0 and 255, and then averaged.

  2. Adaptive threshold [1]: the binarization threshold is set to twice the mean value of the prediction map (a minimal sketch is given after this list).

  3. Saliency Cut [9]: the threshold is set to a low value, thus granting a high recall rate. GrabCut [21] is then iteratively applied to the binarized prediction, typically producing a map with more precise edges.
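
As an illustration, a minimal NumPy sketch of the adaptive-threshold variant combined with Eq. (1) follows (our own code; prediction maps are assumed to be normalized, and the zero-division guards are our addition):

```python
# Adaptive-threshold binarization followed by the F-beta measure of Eq. (1), beta^2 = 0.3.
import numpy as np

def f_beta_adaptive(prediction: np.ndarray, ground_truth: np.ndarray, beta2: float = 0.3) -> float:
    # Adaptive threshold: twice the mean value of the prediction map.
    binary = prediction >= 2.0 * prediction.mean()
    gt = ground_truth > 0
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))
```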

Area Under Curve (AUC) is the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve is computed by varying the binarization threshold and plotting the True Positive Rate (TPR) against the False Positive Rate (FPR).

Mean Absolute Error (MAE) is computed directly on the prediction, without any binarization step, as:

$$\begin{aligned} MAE = \frac{1}{W\times H} \sum _{x=1}^W \sum _{y=1}^H |Prediction(x,y) - GroundTruth(x,y)| \end{aligned}$$
(2)

where W and H refer to image dimensions.
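
A corresponding NumPy sketch of Eq. (2), under the assumption that both maps are normalized to [0, 1]:

```python
# Mean Absolute Error between prediction and ground truth, averaged over all pixels.
import numpy as np

def mae(prediction: np.ndarray, ground_truth: np.ndarray) -> float:
    return float(np.mean(np.abs(prediction - ground_truth)))
```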

3.4 Results

We compare our solution with the top five methods from [6] on all seven datasets, using all the criteria described in the previous section. Results are shown in Table 3.

Table 3. Evaluation results for all measures on all datasets

The proposed method is superior by a large margin according to both \(F_\beta \) measures and MAE. The binary nature of our prediction, though, is penalized in terms of AUC due to the particular evaluation protocol of the benchmark [6]. On average, our method outperforms all compared solutions on five of the seven datasets. On JuddDB and MSRA10K, and to a lesser extent on THUR15K, we obtain lower performance than the state of the art. Images in the JuddDB dataset contain many different subjects, of which only one is annotated as the main salient object, based on fixations gathered from different observers. This particular set of conditions, radically different from those of the other datasets used for training in our Leave-One-Dataset-Out setup, could be the root cause of the sub-optimal performance of our method; we leave further analysis to future work. Figure 4 reports some example predictions from all datasets. False positives mostly correspond to actual objects that are missing from the ground truth due to the annotation guidelines (e.g. the flower in Fig. 4b and the fish in Fig. 4e), which could also contribute to the lower performance on MSRA10K and THUR15K. False negatives are often related to holes in our prediction (e.g. the window glasses in Fig. 4a), highlighting a current limitation of the solution. Finally, we observe that the edges of our predictions are in general smoother and less precise than the reference annotations.

Fig. 4. Example predictions on different datasets.

A direct comparison with other methods in terms of computational cost cannot be performed in a fair setup, as our solution is designed to run on a GPU, unlike the compared methods. On an NVIDIA TITAN X GPU our prediction takes on average 0.09 s per image of the MSRA10K dataset (typical image resolution \(400 \times 300\)). For reference, the fastest of the compared methods (RBD [28]) takes 0.269 s on a desktop machine with a Xeon E5645 2.4 GHz CPU [6].

4 Conclusions

In this work we exploited the semantic awareness of a Fully Convolutional Network to address the problem of salient object detection. We verified the effectiveness of this approach on a standard benchmark composed of seven datasets and more than forty methods (only the top five are reported here). Despite the challenging Leave-One-Dataset-Out setup, which naturally excludes the possibility of overfitting the model to the test data, we outperformed the state of the art on most datasets.

In the future we might switch from a binary foreground/background prediction to a multiclass one, in order to also consider the different levels of saliency defined in some of the datasets. Taking this further, we might treat the problem directly as a regression task, and study the effects of different training losses on the final performance.

Finally, we plan to extend the evaluation and comparison to other datasets [15, 20] and methods [15, 27], which were left out here because they are not part of the adopted benchmark, as well as for space constraints.