1 Introduction

Saliency detection can be generally divided into two subcategories: salient object segmentation [12, 16, 38] and eye-fixation detection [7, 26]. This paper focuses mainly on salient object segmentation, which aims to highlight the most conspicuous and eye-attracting object regions in images. It has been used as a pre-processing step to facilitate a wide range of vision applications and has received increasing interest from the community. Although much progress has been made, it remains very challenging to develop effective algorithms capable of handling real-world adverse scenarios.

Most existing methods address saliency detection with hand-crafted models and heuristic saliency priors. For instance, the contrast prior formulates saliency detection as center-surround contrast analysis and captures salient regions that are either globally rare or locally stand out from their neighbors. In addition, the boundary prior regards image boundary regions as background and detects foreground objects by propagating background information to the remaining image areas. Although these saliency priors have proved effective in some cases (Fig. 1, first row), they are not robust enough to discover salient objects in complex scenes (Fig. 1, second row). Furthermore, saliency-prior based methods mainly rely on low-level hand-crafted features, which are incapable of capturing the semantic concept of objects. As demonstrated in the third row of Fig. 1, high-level semantic information in some cases plays a central role in distinguishing foreground objects from background with similar appearance.

Fig. 1. Saliency detection results by different methods. From left to right: original image, ground-truth mask, our proposed RFCN, RC [2], MR [36].

Recently, deep convolutional neural networks (CNNs) have delivered record-breaking performance in many vision tasks, e.g., image classification [15, 28], object detection [5, 27], object tracking [32, 33], and semantic segmentation [21, 22]. Existing methods suggest that deep CNNs can also benefit saliency detection and are very effective at handling complex scenes by accurately identifying semantically salient objects (Fig. 1, third row). Although better performance has been achieved, prior CNN-based saliency detection methods still suffer from three major issues. Firstly, saliency priors, which have been shown to be effective in previous work, are completely discarded by most CNN-based methods. Secondly, CNNs predict the saliency label of a pixel by considering only a local image patch of limited size. They mostly fail to enforce spatial consistency and may inevitably make incorrect predictions; with feed-forward architectures, they can hardly refine these output predictions. Lastly, saliency detection is mainly formulated as a binary classification problem, i.e., foreground versus background. Compared with image classification tasks with thousands of categories, the supervision from binary labels is relatively weak for effectively training a deep CNN with a huge number of parameters.

To mitigate the above issues, we investigate recurrent fully convolutional networks (RFCNs) for saliency detection. In each time step, we feed both the input RGB image and a saliency prior map through the RFCN to obtain a predicted saliency map, which in turn serves as the saliency prior map in the next time step. The prior map in the first time step is initialized by incorporating saliency priors indicative of potential salient regions. Our RFCN architecture has two advantages over existing CNN-based methods: (a) saliency priors are exploited to make training deep models easier and to yield more accurate predictions; (b) in contrast to feed-forward networks, the output of our RFCN is provided as a feedback signal, so that the RFCN is capable of refining the saliency prediction by correcting its previous mistakes until it produces the final prediction in the last time step. To train the RFCN for saliency detection, a new pre-training strategy is developed, which leverages the rich attribute information of semantic segmentation data for supervision. Figure 2 shows the architecture overview of the proposed RFCN model.

In summary, the contributions of this work are threefold. Firstly, we propose a saliency detection method using a recurrent fully convolutional network that is able to refine its previous predictions. Secondly, saliency priors are incorporated into the network to facilitate training and inference. Thirdly, we design an RFCN pre-training method for saliency detection using semantic segmentation data that both leverages strong supervision from multiple object categories and captures the intrinsic representation of generic objects. The proposed saliency detection method yields more accurate saliency maps and outperforms state-of-the-art approaches by a considerable margin on four benchmark data sets.

2 Related Work

Existing saliency detection methods can be mainly classified into two categories, i.e., hand-crafted models and learning based approaches. Most hand-crafted methods can be traced back to the feature-integration theory [30], where important visual features are selected and combined to model visual attention. Later on, Itti et al. [8] propose to measure saliency by center-surround contrast of color, intensity and orientation features. Xie et al. [34] formulate saliency detection in a Bayesian framework and estimate visual saliency by a likelihood probability. In [3], a soft image abstraction is developed that considers both the appearance similarity and the spatial distribution of image pixels for saliency measurement. Meanwhile, the background prior is also commonly used by many hand-crafted models [6, 10, 36, 38], where the fundamental hypothesis is that image boundary regions are more likely to be background. Salient regions can then be recognized by label propagation using boundary regions as background seeds.

Fig. 2. Architecture overview of our RFCN model.

Hand-crafted saliency methods are efficient and effective; however, they are not robust in handling complex scenarios. Recently, learning based methods have received more attention from the community. These methods automatically learn to detect saliency by training detectors (e.g., random forests [12, 19], deep networks [17, 31, 37], etc.) on annotated image data. Among others, deep network based saliency models have shown very competitive performance. For instance, Wang et al. [31] propose to detect salient regions by training a DNN-L and a DNN-G network for local estimation and global search, respectively. In [16], a fully connected network is trained to regress the saliency degree of each superpixel by taking multi-scale CNN features of the surrounding region. Both methods conduct patch-by-patch scanning to obtain the saliency map of the input image, which is very computationally expensive. In addition, they directly train deep models on saliency detection data sets and ignore the problem of weak supervision from binary labels. To address the above issues, Li et al. [17] propose to detect saliency using a fully convolutional network (FCN) trained under a multi-task learning framework. Although it bears a similar spirit, our method differs significantly from [17] in three aspects. Firstly, saliency priors are leveraged for network training and inference, which are ignored in [17]. Secondly, instead of using the feed-forward architecture in [17], we design a recurrent architecture capable of refining the generated predictions. Thirdly, our pre-training method allows the deep network to learn both class-specific features and generic object representations using segmentation data. In contrast, [17] trains the network on segmentation data only for the task of distinguishing objects of different categories, which is essentially different from the task of salient object detection.

Recurrent neural networks (RNNs) have been applied to many vision tasks [20, 25]. The recurrent architecture in our method mainly serves as a refinement mechanism to correct previous errors. Compared to existing RNNs that rely strongly on the hidden units of the last step, the RFCN takes only the final output of the last step as a prior. Hence, it takes fewer steps to converge and is easier to train.

3 Saliency Prediction by Recurrent Networks

A conventional CNN used for image classification consists of convolutional layers followed by fully connected layers; it takes an image of fixed spatial size as input and produces a label vector indicating the category of the input image. For tasks requiring spatial labels, such as segmentation and depth prediction, some methods apply CNNs for dense predictions in a patch-by-patch scanning manner. However, the overlap between patches leads to redundant computations and thus significantly increases the computational overhead. Unlike existing methods, we adopt the fully convolutional network (FCN) architecture [22] for our recurrent model, which generates predictions of the same size as the input image. In Sect. 3.1, we formally introduce the FCN network for saliency detection. Section 3.2 presents our saliency method based on the RFCN network. Finally, we show how to train the RFCN network for saliency detection in Sect. 3.3.

3.1 Fully Convolutional Networks for Saliency Detection

Convolutional layers, as the building blocks of CNNs, are defined on a translation-invariance basis and share weights across different spatial locations. Both the input and the output of a convolutional layer are 3D tensors called feature maps, where the output feature map is obtained by convolving kernels with the input feature map as

$$\begin{aligned} f_s({\varvec{X}};{\varvec{W}},{\varvec{b}}) = {\varvec{W}}*_s{\varvec{X}} + {\varvec{b}}, \end{aligned}$$
(1)

where \({\varvec{X}}\) is the input feature map; \({\varvec{W}}\) and \({\varvec{b}}\) denote the kernel and bias, respectively; and \(*_s\) represents the convolution operation with stride s. As a result, the resolution of the output feature map \(f_s({\varvec{X}};{\varvec{W}},{\varvec{b}})\) is downsampled by a factor of s. Typically, convolutional layers are interleaved with max pooling layers and non-linear units (e.g., ReLUs) to further improve translation invariance and representation capability. The output feature map of the last convolutional layer can then be fed into a stack of fully connected layers, which discards the spatial coordinates of the input and generates a global label for the input image (see Fig. 3(a)).
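As a concrete illustration of Eq. (1), the following PyTorch snippet (our own minimal sketch, not part of the paper's implementation; layer sizes are chosen arbitrarily) shows how a convolution with stride \(s=2\) halves the spatial resolution of its input feature map.

```python
import torch
import torch.nn as nn

# Toy input feature map X: batch 1, 64 channels, 224 x 224 spatial resolution.
X = torch.randn(1, 64, 224, 224)

# Convolution with stride s = 2 (Eq. 1): the output resolution is downsampled by a factor of 2.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)
Y = conv(X)
print(Y.shape)  # torch.Size([1, 128, 112, 112])
```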

For efficient dense inference, [22] converts CNNs into fully convolutional networks (FCNs) (Fig. 3(b)) by casting fully connected layers as convolutional layers whose kernels cover their entire input regions. This allows the network to take input images of arbitrary size and generate spatial output in one forward pass. However, due to the strides of convolutional and pooling layers, the final output feature maps are still coarse, downsampled from the input image by a factor of the total stride of the network. To map the coarse feature map into a pixelwise prediction over the input image, the FCN upsamples the coarse map via a stack of deconvolution layers (Fig. 3(c))

$$\begin{aligned} {\hat{\varvec{Y}}}=U_S\left( F_S({\varvec{I}};{\varvec{\theta }});{\varvec{\psi }}\right) , \end{aligned}$$
(2)

where \({\varvec{I}}\) is the input image; \(F_S(\cdot ;{\varvec{\theta }})\) denotes the output feature map generated by the convolutional layers of the FCN with a total stride of S, parameterized by \({\varvec{\theta }}\); and \(U_S (\cdot ; {\varvec{\psi }})\) denotes the deconvolution layers of the FCN, parameterized by \({\varvec{\psi }}\), which upsample the input by a factor of S to ensure that the output prediction \({\hat{\varvec{Y}}}\) has the same spatial size as the input image \({\varvec{I}}\). Different from simple bilinear interpolation, the parameters \({\varvec{\psi }}\) of the deconvolution layers are jointly learned. To exploit the fine-scaled local appearance of the input image, the skip architecture of [22] can also be employed to combine the output feature maps of both lower convolutional layers and the final convolutional layer for more accurate inference.
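To make the structure of Eq. (2) concrete, the following PyTorch sketch builds a toy FCN in which the convolutional part \(F_S\) has a total stride of \(S=4\) and a learned deconvolution \(U_S\) restores the input resolution. It is only an illustration of the idea under our own (assumed) layer sizes; the model used in the paper is a much deeper network implemented in Caffe.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Illustrative FCN: conv layers with total stride S = 4,
    followed by a learned x4 upsampling (deconvolution)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(              # F_S(.; theta), total stride 4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.score = nn.Conv2d(64, num_classes, 1)  # 1x1 conv replacing the fully connected layers
        self.upsample = nn.ConvTranspose2d(         # U_S(.; psi), learned upsampling by 4
            num_classes, num_classes, kernel_size=4, stride=4)

    def forward(self, image):
        coarse = self.score(self.features(image))
        return self.upsample(coarse)                # same spatial size as the input image

fcn = TinyFCN()
pred = fcn(torch.randn(1, 3, 224, 224))             # -> 1 x 2 x 224 x 224
```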

In the context of saliency detection, we are interested in measuring the saliency degree of each pixel in an image. To this end, the FCN takes the RGB image \({\varvec{I}}\) of size \(h \times w \times 3\) as input and generates the output feature map \({\hat{\varvec{Y}}}=U_S\left( F_S({\varvec{I}};{\varvec{\theta }});{\varvec{\psi }}\right) \) of size \(h \times w \times 2\). We denote the two output channels of \({\hat{\varvec{Y}}}\) as the background map \({\hat{\varvec{B}}}\) and the salient foreground map \({\hat{\varvec{H}}}\), indicating the scores of each pixel being background and foreground, respectively. By applying the softmax function, these two scores are transformed into the foreground probability as

$$\begin{aligned} p(l_{i,j}={ fg }|{\varvec{\theta }},{\varvec{\psi }}) = \frac{\exp ({\hat{\varvec{H}}}_{i,j})}{\exp ({\hat{\varvec{H}}}_{i,j})+\exp ({\hat{\varvec{B}}}_{i,j})}, \end{aligned}$$
(3)

where \(l_{i,j} \in \{ {fg}, {bg}\}\) indicates the foreground/background label of the pixel indexed by (i, j). The background probability \(p(l_{i,j}={ bg }|{\varvec{\theta }},{\varvec{\psi }})\) can be computed in a similar way. Given the training set \(\{{\varvec{Z}}=({\varvec{I}},{\varvec{C}})\}_1^N\) containing training images \({\varvec{I}}\) and their pixelwise saliency annotations \({\varvec{C}}\), the FCN network can be trained end-to-end for saliency detection by minimizing the following loss

$$\begin{aligned} \arg \min _{{\varvec{\theta }},{\varvec{\psi }}} -\sum _{{\varvec{Z}}} \sum _{i,j} {\mathbf {1}} ( {\varvec{C}}_{i,j}&= fg ) \ln p(l_{i,j}= fg |{\varvec{\theta }},{\varvec{\psi }})\nonumber \\&+{\mathbf {1}}({\varvec{C}}_{i,j}= bg )\ln p(l_{i,j}={ bg }|{\varvec{\theta }},{\varvec{\psi }}), \end{aligned}$$
(4)

where \({\mathbf {1(\cdot )}}\) is the indicator function. The network parameters \({\varvec{\theta }}\) and \({\varvec{\psi }}\) can then be iteratively updated using the stochastic gradient descent (SGD) algorithm.
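The per-pixel softmax of Eq. (3) combined with the loss of Eq. (4) is the standard two-class pixelwise cross-entropy. The sketch below illustrates this in PyTorch under the assumption that channel 1 of the score map holds the foreground scores \({\hat{\varvec{H}}}\) and channel 0 the background scores \({\hat{\varvec{B}}}\); the tensor names and sizes are ours.

```python
import torch
import torch.nn.functional as F

# Toy network output Y_hat of shape (N, 2, h, w); channel 1 = foreground scores H_hat,
# channel 0 = background scores B_hat (ordering assumed for this sketch).
Y_hat = torch.randn(4, 2, 64, 64, requires_grad=True)

# Eq. (3): per-pixel foreground probability via a softmax over the two channels.
p_fg = torch.softmax(Y_hat, dim=1)[:, 1]          # shape (N, h, w)

# Eq. (4): pixelwise cross-entropy against binary annotations C (0 = bg, 1 = fg).
C = torch.randint(0, 2, (4, 64, 64))              # toy ground-truth saliency masks
loss = F.cross_entropy(Y_hat, C)                  # the two log-terms of Eq. (4), averaged over pixels
loss.backward()                                   # gradients for an SGD update
```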

Fig. 3. Comparison of different deep models. (a) Convolutional network. (b) Fully convolutional network. (c) Fully convolutional network with deconvolution layers. (d), (e) Recurrent fully convolutional networks with different recurrent architectures.

3.2 Recurrent Network for Saliency Detection

The above FCN network is trained to approximate a direct nonlinear mapping from raw pixels to saliency values and ignores the saliency priors that are widely used in existing methods. Although heuristic saliency priors have their limitations, they are easy to compute and have been shown to be very effective in a variety of cases. We therefore believe that leveraging saliency prior information can facilitate faster training and more accurate inference, which is verified by our experiments. We also note that the output prediction of the FCN may be noisy and lack label consistency. However, the feed-forward architecture of the FCN does not incorporate feedback information, which makes it impossible to correct prediction errors. Based on these observations, we make two improvements over the FCN network and design the RFCN by: (i) incorporating saliency priors into both training and inference; and (ii) recurrently refining the output prediction (Fig. 4).

Fig. 4. Saliency maps generated by our model. (a) Original images. (b) Ground truth. (c), (d) Saliency maps without and with prior maps, respectively.

Saliency Prior Maps. We encode prior knowledge into a saliency prior map, which serves as an additional input to the network. We first oversegment the input image into M superpixels, \(\{s_i\}_1^M\). The color contrast prior for \(s_i\) is calculated by

$$\begin{aligned} \mathcal {G}(s_i) = \frac{1}{\varGamma _i}\sum \limits _{j=1}^{M}{\Vert \mu _{s_i}-\mu _{s_j}\Vert ^2 \exp (-\frac{\Vert p_{s_i}-p_{s_j}\Vert ^2}{2\delta ^2})}, \end{aligned}$$
(5)

where \(\mu \) and p denote the mean RGB value and the center position of a superpixel, respectively; \(\varGamma _i\) is the normalization factor; and \(\delta \) is a scale parameter (fixed to 0.5). The intensity contrast \(\mathcal {I}(s_i)\) and orientation feature contrast \(\mathcal {O}(s_i)\) can be computed in a similar way by replacing the color values in (5) with corresponding feature values. The saliency prior map \({\varvec{P}}\) is obtained by

$$\begin{aligned} {\varvec{P}}(s_i) = \mathcal {U}(s_i)\times (\mathcal {G}(s_i)+ \mathcal {I}(s_i) +\mathcal {O}(s_i)), \end{aligned}$$
(6)

where \({\varvec{P}}(s_i)\) denotes the saliency prior value of superpixel \(s_i\), and the central prior [11] \(\mathcal {U}(s_i)\) penalizes the distance from superpixel \(s_i\) to the image center.
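For illustration, the contrast prior of Eq. (5) and the combination of Eq. (6) can be sketched in NumPy as follows. Superpixel features and centers are assumed to be given (e.g., from an off-the-shelf oversegmentation such as SLIC), the normalization \(\varGamma_i\) and the center prior \(\mathcal{U}\) are implemented with one simple choice among several possibilities, and all function names are ours.

```python
import numpy as np

def contrast_prior(feat, centers, delta=0.5):
    """Eq. (5): feature contrast of each superpixel, weighted by spatial proximity.
    feat:    (M, d) mean feature per superpixel (pass scalar features as (M, 1)).
    centers: (M, 2) superpixel center positions, normalized to [0, 1]."""
    feat_dist = np.sum((feat[:, None] - feat[None]) ** 2, axis=-1)       # ||mu_i - mu_j||^2
    pos_dist = np.sum((centers[:, None] - centers[None]) ** 2, axis=-1)  # ||p_i - p_j||^2
    weight = np.exp(-pos_dist / (2 * delta ** 2))
    return (feat_dist * weight).sum(axis=1) / (weight.sum(axis=1) + 1e-8)  # Gamma_i: one simple choice

def saliency_prior(rgb, intensity, orientation, centers):
    """Eq. (6): combine color, intensity and orientation contrasts with a center prior."""
    center_prior = np.exp(-np.sum((centers - 0.5) ** 2, axis=-1) / 0.1)  # simplified central prior U(s_i)
    return center_prior * (contrast_prior(rgb, centers)
                           + contrast_prior(intensity, centers)
                           + contrast_prior(orientation, centers))
```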

Recurrent Architecture. To incorporate the saliency prior maps into our approach, we consider two recurrent architectures for the RFCN network. As in Sect. 3.1, we divide the network into two parts, i.e., the convolution part \(F(\cdot ,{\varvec{\theta }})\) and the deconvolution part \(U(\cdot ,{\varvec{\psi }})\). Our first recurrent architecture (Fig. 3(d)) incorporates the saliency prior map \({\varvec{P}}\) into the convolution part by modifying the first convolution layer as

$$\begin{aligned} f({\varvec{I}}) = {\varvec{W}}_{{\varvec{I}}}*{\varvec{I}} + {\varvec{W}}_{{\varvec{P}}}*{\varvec{P}} + {\varvec{b}}, \end{aligned}$$
(7)

where \({\varvec{I}}\) and \({\varvec{P}}\) denote the input image and the saliency prior map, respectively; \({\varvec{W}}_{{\varvec{I}}}\) and \({\varvec{W}}_{{\varvec{P}}}\) represent the corresponding convolution kernels; and \({\varvec{b}}\) is the bias parameter. In the first time step, the RFCN network takes the input image and the saliency prior map as input and produces the final feature map \({\hat{\varvec{Y}}}^1=U\left( F({\varvec{I}},{\varvec{P}};{\varvec{\theta }});{\varvec{\psi }}\right) \) comprising both the foreground map \({\hat{\varvec{H}}}^1\) and the background map \({\hat{\varvec{B}}}^1\). In each of the following time steps, the foreground map \({\hat{\varvec{H}}}^{t-1}\) generated in the previous time step is fed back as the saliency prior map of the input. The RFCN then refines the saliency prediction by considering both the input image and the last prediction as

$$\begin{aligned} {\hat{\varvec{Y}}}^t=U\left( F({\varvec{I}},{\hat{\varvec{H}}}^{t-1};{\varvec{\theta }});{\varvec{\psi }}\right) . \end{aligned}$$
(8)
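The first recurrent architecture (Eqs. (7) and (8)) can be summarized by the following PyTorch sketch. It only illustrates the recurrence mechanism under assumed toy layer sizes: the first layer applies separate kernels to the image and the prior map, and the foreground channel of each prediction is fed back as the prior of the next time step.

```python
import torch
import torch.nn as nn

class TinyRFCN(nn.Module):
    """Toy illustration of Eqs. (7)-(8); the real RFCN is far deeper."""
    def __init__(self):
        super().__init__()
        self.conv_img = nn.Conv2d(3, 32, 3, padding=1)     # W_I * I
        self.conv_prior = nn.Conv2d(1, 32, 3, padding=1)   # W_P * P (the two biases together act as b)
        self.body = nn.Sequential(                         # stand-in for the remaining conv/deconv layers
            nn.ReLU(inplace=True), nn.Conv2d(32, 2, 3, padding=1))

    def step(self, image, prior):
        x = self.conv_img(image) + self.conv_prior(prior)  # Eq. (7): modified first convolution layer
        return self.body(x)                                # 2-channel prediction (bg, fg)

    def forward(self, image, prior, T=2):
        for _ in range(T):                                 # Eq. (8): feed the fg map back as the prior
            y = self.step(image, prior)
            prior = torch.softmax(y, dim=1)[:, 1:2]        # foreground map H_hat^{t-1} for the next step
        return y

rfcn = TinyRFCN()
out = rfcn(torch.randn(1, 3, 128, 128), torch.rand(1, 1, 128, 128))  # saliency prior map P in [0, 1]
```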

For the above recurrent architecture, forward propagation of the whole network is conducted in every time step, which is expensive in terms of both computation and memory. An alternative recurrent architecture is to incorporate the saliency prior maps into the deconvolution part (Fig. 3(e)). Specifically, in the first time step, we feed the input image \({\varvec{I}}\) into the convolution part to obtain the convolution feature map \(F({\varvec{I}};{\varvec{\theta }})\). The deconvolution part then takes the convolution feature map as well as the saliency prior map \({\varvec{P}}\) as input to infer the saliency prediction \({\hat{\varvec{Y}}}^1=U\left( F({\varvec{I}};{\varvec{\theta }}),{\varvec{P}};{\varvec{\psi }}\right) \). In the t-th time step, the foreground map \({\hat{\varvec{H}}}^{t-1}\) predicted in the previous time step serves as the saliency prior map. The deconvolution part takes the convolution feature map \(F({\varvec{I}};{\varvec{\theta }})\) as well as the foreground map \({\hat{\varvec{H}}}^{t-1}\) to refine the saliency prediction \({\hat{\varvec{Y}}}^t\):

$$\begin{aligned} {\hat{\varvec{Y}}}^t=U\left( F({\varvec{I}};{\varvec{\theta }}),{\hat{\varvec{H}}}^{t-1};{\varvec{\psi }}\right) . \end{aligned}$$
(9)

Note that, for each input image, forward propagation of the deconvolution part is repeatedly conducted in each time step, whereas the convolution part only needs to be fed forward once, in the first time step. Since the deconvolution part has approximately 10 times fewer parameters than the convolution part, this recurrent architecture can effectively reduce computational complexity and save memory. However, we find in our preliminary experiments that the second recurrent architecture only achieves performance similar to the FCN based approach (i.e., without recurrence). This may be attributed to the fact that the prior saliency map is severely downsampled to the same spatial size as the last convolution feature map \(F({\varvec{I}};{\varvec{\theta }})\) (downsampled by a factor of 32 from the input). With so little prior information left, the downsampled prior saliency map can hardly facilitate network inference. Therefore, we adopt the first recurrent architecture in this work. In our experiments, we observe that the accuracy of the saliency maps almost converges after the second time step (compare Fig. 5(d) and (e)). Therefore, we set the total number of time steps of the RFCN to \(T=2\).

Fig. 5. Saliency maps predicted by the proposed RFCN in different time steps. (a) Original images. (b) Ground truth. (c)-(e) Saliency maps predicted by the RFCN in the 1st-3rd time steps, respectively.

3.3 Training RFCN for Saliency Detection

Our RFCN training approach consists of two stages: pre-training and fine-tuning. Pre-training is conducted on the PASCAL VOC 2010 semantic segmentation data set. Saliency detection and semantic segmentation are highly correlated but essentially different in that saliency detection aims at separating generic salient objects from background, whereas semantic segmentation focuses on distinguishing objects of different categories. Our pre-training approach enjoys strong supervision from segmentation data and also enables the network to learn general representations of foreground objects. Specifically, for each training pair \({\varvec{Z}}=({\varvec{I}},{\varvec{S}})\) containing an image \({\varvec{I}}\) and its pixelwise semantic annotation \({\varvec{S}}\), we generate an object map \({\varvec{G}}\) labeling each pixel as either foreground (\( {fg}\)) or background (\( {bg}\)) as follows

$$\begin{aligned} {\varvec{G}}_{i,j}= {\left\{ \begin{array}{ll} {bg} &{}\text{ if } {\varvec{S}}_{i,j}=0 \\ {fg} &{}\text{ otherwise } \end{array}\right. }, \end{aligned}$$
(10)

where \({\varvec{S}}_{i,j} \in \{0,1,\ldots ,C\}\) denotes the semantic class label of pixel (i, j), and \({\varvec{S}}_{i,j}=0\) indicates that the pixel belongs to the background. In the pre-training stage, the final feature map \({\hat{\varvec{Y}}}^{t}\) (Sect. 3.1) generated by the RFCN consists of \(C+3\) channels, where the first \(C+1\) channels correspond to the class scores for semantic segmentation and the last 2 channels, i.e., \({\hat{\varvec{H}}}^t\) and \({\hat{\varvec{B}}}^t\) (Sect. 3.1), denote the foreground/background scores. By applying the softmax function, we obtain the conditional probabilities \(p(c_{i,j}|{\varvec{I}},{\hat{\varvec{H}}}^{t-1},{\varvec{\theta }},{\varvec{\psi }})\) and \(p(l_{i,j}|{\varvec{I}},{\hat{\varvec{H}}}^{t-1},{\varvec{\theta }},{\varvec{\psi }})\) predicted by the RFCN for segmentation and foreground detection, respectively. The loss function for pre-training across all time steps is defined as

$$\begin{aligned} L({\varvec{\theta }},{\varvec{\psi }}) = -\sum _{t=1}^{T} \sum _{{\varvec{Z}}} \sum _{i,j} \ln p(c_{i,j}={\varvec{S}}_{i,j}|{\varvec{I}}&,{\hat{\varvec{H}}}^{t-1},{\varvec{\theta }},{\varvec{\psi }}) \nonumber \\&+ \ln p(l_{i,j}={\varvec{G}}_{i,j}|{\varvec{I}},{\hat{\varvec{H}}}^{t-1},{\varvec{\theta }},{\varvec{\psi }}), \end{aligned}$$
(11)

where T is the total number of time steps and \({\hat{\varvec{H}}}^{0}\) is initialized by the saliency prior map \({\varvec{P}}\) (Sect. 3.2). Pre-training is conducted via back-propagation through time.
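The construction of the object map in Eq. (10) and the two-term loss of Eq. (11) at a single time step can be sketched as follows; the channel layout (first \(C+1\) channels for segmentation, last two for background/foreground) follows the text, while the tensor names, sizes and the ordering of the last two channels are our own assumptions.

```python
import torch
import torch.nn.functional as F

C = 20                                            # number of semantic classes (PASCAL VOC)

def pretrain_loss(Y_hat, S):
    """Y_hat: (N, C+3, h, w) RFCN output at one time step;
       channels [0, C] are semantic class scores, channels [C+1, C+2] are (bg, fg) scores.
       S: (N, h, w) semantic labels in {0, ..., C}, where 0 denotes background."""
    G = (S > 0).long()                            # Eq. (10): object map, fg wherever S_ij != 0
    seg_scores = Y_hat[:, :C + 1]                 # scores for semantic segmentation
    fgbg_scores = Y_hat[:, C + 1:]                # foreground/background scores
    # Eq. (11) at a single time step: sum of the segmentation and foreground cross-entropy terms.
    return F.cross_entropy(seg_scores, S) + F.cross_entropy(fgbg_scores, G)

# Toy usage: accumulate the loss over T = 2 time steps (back-propagation through time).
Y_hats = [torch.randn(2, C + 3, 64, 64, requires_grad=True) for _ in range(2)]
S = torch.randint(0, C + 1, (2, 64, 64))
loss = sum(pretrain_loss(Y, S) for Y in Y_hats)
```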

After pre-training, we modify the RFCN architecture by removing the first \(C+1\) channels of the last feature map and keeping only the last two channels, i.e., the predicted foreground and background maps. Finally, we fine-tune the RFCN network on a saliency detection data set as described in Sect. 3.2. As demonstrated in Fig. 6(c), the pre-trained model, supervised by semantic labels of multiple object categories, captures generic object features and can already discriminate foreground objects (of categories unseen during pre-training) from background. Fine-tuning on the saliency data set further improves the performance of the RFCN network (Fig. 6(d)).

Fig. 6. Saliency detection results at different stages. (a) Original images. (b) Ground truth. (c) Results of the pre-trained RFCN. (d) Results of the fine-tuned RFCN. (e) Results after post-processing.

3.4 Post-processing

The trained RFCN network is able to accurately identify salient objects. To delineate compact and boundary-preserving object regions more precisely, we adopt an efficient post-processing approach. Given the final saliency score map \({\hat{\varvec{H}}}^{T}\) predicted by the RFCN, we first segment the image into foreground and background regions by thresholding \({\hat{\varvec{H}}}^{T}\) with its mean saliency score. A spatial confidence \({\varvec{SC}}_{i,j}\) and a color confidence \({\varvec{CC}}_{i,j}\) are then computed for each pixel (i, j). The spatial confidence is defined based on the spatial distance of the pixel to the center of the foreground region

$$\begin{aligned} {\varvec{SC}}_{i,j} = \exp (-\frac{\Vert loc_{i,j}-loc_s\Vert _2}{\sigma }), \end{aligned}$$
(12)

where \(loc_{i,j}\) and \(loc_s\) denote the coordinates of pixel (i, j) and the center of the foreground region, respectively; \(\sigma \) is a scale parameter. The color confidence measures the similarity of the pixel to the foreground region in the RGB color space

$$\begin{aligned} {\varvec{CC}}_{i,j} = \frac{N_{i,j}}{N_s}, \end{aligned}$$
(13)

where \(N_{i,j}\) is the number of foreground pixels that have the same color feature as pixel (i, j), and \(N_s\) is the total number of foreground pixels.

We then weight the predicted saliency scores by the spatial and color confidences to dilate the foreground region

$$\begin{aligned} \tilde{{\varvec{H}}}_{i,j} = {\varvec{SC}}_{i,j} \times {\varvec{CC}}_{i,j} \times {\hat{\varvec{H}}}^T_{i,j}. \end{aligned}$$
(14)

After an edge-aware erosion procedure [4] on the dilated saliency score map \(\tilde{{\varvec{H}}}\), we obtain the final saliency map. As demonstrated in Fig. 6(e), the post-processing step improves the detection precision to a certain degree.
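The post-processing of Eqs. (12)-(14) can be sketched in NumPy as below. We make two simplifying assumptions that are not spelled out in the text: the "same color feature" test of Eq. (13) is implemented with quantized RGB bins, and the edge-aware erosion of [4] is omitted. All names and the bin count are ours.

```python
import numpy as np

def post_process(H, image, sigma=0.25, bins=12):
    """H: (h, w) predicted saliency scores in [0, 1]; image: (h, w, 3) RGB in [0, 1].
    Assumes the thresholded foreground region is non-empty."""
    fg = H > H.mean()                                          # threshold at the mean saliency score
    ys, xs = np.nonzero(fg)
    center = np.array([ys.mean(), xs.mean()])                  # center of the foreground region

    # Eq. (12): spatial confidence from the distance of each pixel to the foreground center.
    yy, xx = np.mgrid[:H.shape[0], :H.shape[1]]
    dist = np.sqrt((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / max(H.shape)
    SC = np.exp(-dist / sigma)

    # Eq. (13): color confidence via quantized RGB bins (one simple instantiation).
    q = (image * (bins - 1)).astype(int)
    codes = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    fg_hist = np.bincount(codes[fg], minlength=bins ** 3).astype(float)
    CC = fg_hist[codes] / fg.sum()                             # N_ij / N_s

    # Eq. (14): weight the predicted scores to dilate the foreground region.
    return SC * CC * H
```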

Fig. 7. Comparison of saliency maps. From top to bottom, each pair of rows shows images from the SOD, ECSSD, PASCAL-S and SED1 data sets, respectively. (a) Original images, (b) ground truth, (c) our RFCN method, (d) LEGS, (e) MDF, (f) DRFI, (g) wCtr, (h) HDCT, (i) DSR, (j) MR, (k) HS.

Fig. 8. Performance of the proposed algorithm compared with other state-of-the-art methods on the SOD, ECSSD, PASCAL-S and SED1 databases.

4 Experiments

4.1 Experimental Setup

The detailed architecture of the proposed RFCN can be found in the supplementary material. We pre-train the RFCN on the PASCAL VOC 2010 semantic segmentation data set with 10103 training images belonging to 20 object classes. Pre-training converges after 200k iterations of SGD. We then fine-tune the pre-trained model for saliency detection on the THUS10K [2] data set for 100k iterations. At test time, we apply the trained RFCN at three different scales and fuse all the results into the final saliency map [12]. Our method is implemented in MATLAB with the Caffe [9] wrapper and runs at 4.6 s per image on a PC with a 3.4 GHz CPU and a TITAN X GPU. The source code will be released.

We evaluate the proposed algorithm (RFCN) on five benchmark data sets: SOD [24], ECSSD [35], PASCAL-S [19], SED1 [1], and SED2 [1]. The evaluation results on SED2 and additional analysis on the impact of the recurrent time step are included in the supplementary material. Three metrics are utilized to measure performance: precision-recall (PR) curves, F-measure and area under ROC curve (AUC). Precision and recall are computed by thresholding the saliency map and comparing the binary map with the ground truth. The PR curves show the mean precision and recall of saliency maps at different thresholds. The F-measure is calculated as \(F_{\beta } = \frac{(1+\beta ^2)Precision\times Recall}{\beta ^2 \times Precision + Recall}\), where Precision and Recall are obtained using twice the mean saliency value of each saliency map as the threshold, and \(\beta ^2\) is set to 0.3.
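For reference, the adaptive-threshold F-measure described above (threshold equal to twice the mean saliency value, \(\beta^2 = 0.3\)) can be computed as in the short NumPy sketch below; it mirrors the stated protocol rather than any released evaluation code.

```python
import numpy as np

def f_measure(saliency, gt, beta2=0.3):
    """saliency: (h, w) map in [0, 1]; gt: (h, w) binary ground-truth mask."""
    binary = saliency >= min(2 * saliency.mean(), 1.0)   # adaptive threshold: twice the mean saliency
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```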

Table 1. F-measure and AUC (Area Under ROC Curve) on the SOD, ECSSD, PASCAL-S and SED1 data sets. The best two results are shown in italic and bold fonts respectively. The proposed methods rank first and second on the four data sets.

4.2 Performance Comparison with State-of-the-art

We compare the proposed algorithm (RFCN) with twelve state-of-the-art methods, including MTDS [17], LEGS [31], MDF [16], BL [29], DRFI [12], UFO [13], PCA [23], HS [35], wCtr [38], MR [36], DSR [18] and HDCT [14]. For fair comparison, we use either the implementations or the saliency maps provided by the authors. Note that MTDS, LEGS and MDF are deep learning based methods. Among them, MTDS exploits a fully convolutional network for saliency detection and leverages segmentation data for multi-task training. As demonstrated in Fig. 8 and Table 1, the proposed RFCN method consistently outperforms existing methods across almost all the data sets by a considerable margin in terms of PR curves, F-measure and AUC scores. Compared with other deep learning based methods, the three contributions of our method (i.e., the integration of saliency priors, the recurrent architecture and the pre-training approach) ensure more accurate saliency detection. Figure 7 shows that our saliency maps can reliably highlight the salient objects in various challenging scenarios.

4.3 Ablation Studies

To analyze the relative contributions of different components of our method, we evaluate four variants of the proposed RFCN method with the different settings listed in Table 2. Performance in terms of F-measure and AUC is reported in Table 3. The comparison between FCN and \(\mathrm{FCN}_{\mathrm{p}}\) suggests that saliency priors, ignored by existing deep learning based methods, can indeed benefit network training and inference. The comparison between \(\mathrm{FCN}_{\mathrm{p}}\) and RFCN-A indicates that the proposed recurrent architecture is capable of correcting previous errors and refining the output saliency maps. In addition, RFCN-B with the proposed pre-training strategy significantly outperforms RFCN-A, which is simply pre-trained for segmentation; this verifies that our pre-training method effectively leverages the strong supervision of segmentation data while enabling the network to capture generic feature representations of foreground objects. With the proposed post-processing step, our full RFCN method achieves considerable improvements over RFCN-B in terms of AUC scores with a slight degradation in F-measure.

Table 2. Different settings of the variants of the proposed RFCN method. "Pre-train" denotes that the network is pre-trained on segmentation data only for the semantic segmentation task. "Pre-train+" indicates the proposed pre-training method, which uses segmentation data for both semantic segmentation and foreground/background separation.
Table 3. Comparison of the variants of the proposed approach. The best results are shown in bold fonts.

5 Conclusions

In this paper, we propose a saliency detection method based on a recurrent fully convolutional network. Heuristic saliency priors are incorporated into the network to facilitate training and inference. The recurrent architecture enables our method to refine saliency maps based on previous outputs and yield more accurate predictions. A pre-training strategy is also developed to exploit the strong supervision of segmentation data sets and explicitly encourage the network to learn generic feature representations for saliency detection. Extensive evaluations verify that the above three contributions significantly improve saliency detection performance. The proposed method achieves state-of-the-art performance on five widely adopted data sets.