1 Introduction

Salient object detection refers to separating the objects that most attract human visual attention from the background of an image [1]. Recently, due to the rapid increase in the quantity and quality of image files, salient object detection has become increasingly important as a precondition for various image processing approaches. In the early stage, salient object detection was applied to image content editing [2], object recognition [3], image classification [43] and semantic segmentation [4]. In recent years, it has also played an important role in intelligent photography [5] and image retrieval [6]. It is worth noting an interesting application of saliency detection in emerging Internet video technology. Video site users, especially younger ones, like to post their own comments while watching a video, and these comments are displayed on the screen; we call this “bullet screen.” In addition, salient object detection is also applied to virtual background technology, which can protect the privacy of users in video conferences, especially during the COVID-19 pandemic. As shown in Fig. 1, our saliency detection technology can highlight the important people or objects in the scene so that they are not obscured by the bullet screen, and it allows the real background in a video conference to be replaced by a virtual one.

Fig. 1 Applications of salient object detection in video technology. These images are selected from the Chinese video site “Bilibili” and the video conferencing software “Zoom”

Early saliency detection techniques were mainly based on the extraction of certain hand-crafted features. Limited by prior knowledge, these methods sometimes fail to achieve good results in natural scenes. We instead focus on making full use of deep information at different levels and on modeling the image at multiple levels to mine this information.

Convolutional neural networks can effectively extract the features of an image. The low-level layers usually have smaller receptive fields and can focus on local details of the image, such as edge information. However, unlike traditional edge detection, we mainly focus on salient objects and ignore the cluttered lines in the background; as such, we use salient foreground contours as an auxiliary task for our salient object detection.

Most existing methods simply merge multi-channel feature maps, ignoring the variety of effects that different feature channels may have on the final saliency map. We model the feature channels explicitly, introduce a global pooling method with a large visual receptive field into the modeling of the feature channels and reweight each feature channel.

In general, our proposed 3MNet uses a U-shaped structure as the main structure, with a contour detection branch as an auxiliary task, and introduces channel reweighting modules into the network, so as to explicitly model and combine the multi-task, multi-level and multi-channel features of the image. Specifically, the contour detection task refines the edge details of salient objects. The multi-level network structure better aggregates the local and global feature information of the image. Multiple multi-channel feature maps are generated in the deep network, and modeling the channel features helps to mine the deep channel information in the image and enhance the weights of high-contribution channels. Our subsequent experiments also show that combining multiple image features can effectively improve detection accuracy.

The main contributions of this paper are as follows:

(1) The proposed 3MNet makes full use of the deep salient information in the image and combines multi-task, multi-level and multi-channel features to explicitly model the saliency detection task. We achieve good results on salient object detection, with object contour detection as an auxiliary task.

(2) Compared with traditional models and other deep detection models, our model has higher accuracy, and it leads the other methods on multiple evaluation metrics across the five most commonly used datasets. In addition, we conducted a series of ablation experiments to verify the effectiveness of our network structure.

(3) Our training process requires salient object contour information. We therefore provide salient object contour ground-truth maps for multiple training sets as a supplement, so that researchers can adopt additional auxiliary supervision for saliency detection.

The rest of this paper is organized as follows: Section 2 introduces related work on salient object detection. The structure of our proposed approach is described in Sect. 3. Section 4 presents and analyzes our experimental results. Section 5 concludes the paper.

2 Related works

Early salient object detection used data-driven, bottom-up approaches. In 1998, Itti et al. [7] proposed the classic saliency visual attention model. For a long time, hand-crafted features such as contrast, color and background priors dominated salient object detection.

Achanta et al. [8] introduced a frequency-tuned model to extract the global features of the image. Jiang et al. [9] used the absorbing Markov chain to calculate the absorption time; they considered solving the problem mathematically rather than imitating human vision. [42] introduced a bootstrap learning algorithm into the salient object detection task. Researchers also proposed preprocessing and post-processing methods such as super-pixel methods [10] and conditional random fields [11].

Fig. 2 Overall structure of our proposed network framework. The RWC module is the RWConv module. The upper part explicitly models the contour information and uses this information to help detect salient targets. The lower left part uses image multi-level features to fuse salient feature maps

Recently, salient object detection models based on deep learning have been widely studied and applied. Inspired by various network optimization methods, especially the emergence of convolutional neural network structures [24], more and more models designed for saliency detection are appearing and have achieved unprecedented performance under various evaluation criteria. Since the introduction of VGG [12] and residual networks [13], saliency detection models with these networks as the base structure have developed considerably. Researchers have achieved better results by appropriately increasing the depth and expanding the width of the network. [14] combine features of different levels in the deep network to predict salient regions. DHSNet [15] aggregates the characteristics of many different receptive fields to obtain performance gains. Ronneberger et al. [16] propose a U-shaped network structure for image segmentation. Liu et al. propose PoolNet [17] for saliency detection based on a similar structure and obtain accurate and fast detection performance. Hou et al. [18] ingeniously build short connections between multi-level feature maps to make full use of high-level features to guide detection. Li et al. [41] explore the channel characteristics with reference to the structure of SENet [23].

Apart from innovations in depth and breadth in the network structure, some researchers have also attempted multi-task-assisted saliency detection. Li et al. [19] combine the saliency detection task with the image semantic segmentation task. Through the collaborative feature learning of these two related tasks, the shared convolutional layer produces effective object perception features. Zhuge et al. [20] focus on using the boundary features of the objects in the image, utilizing edge truth labels to supervise and refine the details of the detection feature map. [44] make full use of the multi-temporal features and show the effectiveness of multiple features in improving detection performance. [21] apply saliency detection to dynamic video processing, greatly expanding the application space of saliency detection.

3 Proposed approach

Our model captures the features of the image to be detected from the following aspects: First, we set up two parallel network branches to perform salient object detection and salient object contour detection. Second, we use a U-shaped network structure [16] as the main structure of each branch to aggregate the salient features extracted from different levels. Finally, for the basic unit of each convolution module, we make full use of the channel characteristics: we use global pooling to obtain the corresponding global receptive field of each channel and learn how much each channel contributes to the salient features. According to the learning results, we then recalibrate the weights of the feature channels. The specific framework of the model is shown in Fig. 2.

3.1 Multi-channel characteristic response reweighting module

For common RGB three-channel images, each channel’s salient stimulation of the human eye may be different [22]. This suggests that different feature channels of salient feature maps may also contribute differently to saliency detection. We refer to the structure of SENet [23] and propose a similar multi-channel reweighted convolution module, RWConv, and a multi-channel reweighted fusion module, RWFusion. These two structures are shown in Fig. 3.

Fig. 3 Specific structure of the RWConv module (left) and the RWFusion module (right)

For each basic convolution unit RWConv, we use ResNet’s convolution layers [13] as its main structure. On this basis, we introduce a second branch between the residual and the accumulated sum \( x' \) as the weight storage area. For an input with c channels, width w and height h, we first use global pooling to convert the input to an output of size \(1 \times 1 \times c\). To some extent, these c real numbers can describe the global characteristics of the input. The calculation is shown in Eq. 1.

$$\begin{aligned} W_{k}=\frac{1}{w {\times } h} \sum _{i=1}^{w} \sum _{j=1}^{h} P_{k}(i, j), \quad k=1,2, \ldots , c, \end{aligned}$$
(1)

where \( P_{k}(i, j) \) is the feature value corresponding to the coordinate \( (i, j) \) in the kth channel of the given feature map.

In order to fully represent the relationships between channels, so that our model can focus on the channels that contribute more, we add two fully connected layers after global pooling. The number of nodes in each fully connected layer equals the number of channels in the preceding layer, and a ReLU layer is added to preserve the nonlinearity of the model. After obtaining the final channel weights \(W_{k}'\), we rescale the input channels with the c weight parameters to obtain the final output:

$$\begin{aligned} S_{k}=P_{k} \times W_{k}', \quad k=1,2, \ldots , c, \end{aligned}$$
(2)

where this operation corresponds to the scale module in the network.
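A minimal PyTorch sketch of the channel reweighting in RWConv, based on our reading of Fig. 3, is given below; the two 3×3 convolutions in the residual body, the sigmoid gate on the channel weights and the channel width are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class RWConvSketch(nn.Module):
    """Residual block whose residual is reweighted channel-wise (Eqs. 1-2)."""

    def __init__(self, channels: int):
        super().__init__()
        # Residual body: a basic ResNet-style pair of 3x3 convolutions (assumed).
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Weight branch: two fully connected layers with a ReLU in between,
        # one node per channel; the final sigmoid is an assumption.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.body(x)
        # Eq. 1: global average pooling gives one descriptor W_k per channel.
        w = residual.mean(dim=(2, 3))                 # (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)    # (N, C, 1, 1)
        # Eq. 2: rescale each channel (the "scale" module), then add the skip.
        return torch.relu(x + residual * w)


# Usage: out = RWConvSketch(64)(torch.randn(1, 64, 56, 56))
```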

The basic structure of the RWFusion part is roughly the same as that of the RWConv part, except that one of the addends, x, is replaced by the feature map of the same size from the other side of the U-shaped network. The input of the main branch is obtained by an upsampling operation.

The basic module of the contour detection part is the same as the above-mentioned RWConv and RWFusion. This fusion method takes the multi-level and multi-channel characteristics into account, makes full use of the detailed information in the image and enhances the representational ability of the network.
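Under the same reading, RWFusion can be sketched as follows: the deeper feature map is upsampled to the skip connection's resolution, reweighted channel-wise and added to the same-size feature map from the other side of the U-shaped network. Bilinear upsampling, the sigmoid gate and equal channel counts on both inputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RWFusionSketch(nn.Module):
    """Fuse a skip feature map with a reweighted, upsampled deeper feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, skip: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Bring the deeper map to the skip connection's spatial size.
        deep = F.interpolate(deep, size=skip.shape[2:], mode="bilinear",
                             align_corners=False)
        # Channel descriptors -> channel weights, as in RWConv.
        w = self.fc(deep.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
        return torch.relu(skip + deep * w)
```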

3.2 Salient object contour detection auxiliary module

Explicitly modeling contour features is undoubtedly helpful for refining the details of salient objects. However, high-level feature maps often have large receptive fields and cannot pay attention to the details of the target, whereas low-level feature maps can help us refine the contour details of objects [25]. As such, we take low-level features into consideration. In the main part of the network, we use a two-layer RWConv structure to extract the contour features of the object; then, after obtaining the salient contour feature maps \( E_{j} \), we fuse them with the same fusion method. The calculation is as follows:

$$\begin{aligned} \begin{array}{c} E_{f2}=up\left( E_{2}, 4\right) , \\ E_{f1}=up\left( RWF\left( E_{1}, E_{2}\right) , 2\right) , \end{array} \end{aligned}$$
(3)

where \( up(*,\theta ) \) means upsampling the feature map, \( \theta \) is the upsampling multiple and RWF is the multi-channel feature reweighted fusion operation.

We fuse the two salient contour feature maps according to the following combination strategy:

$$\begin{aligned} E_{fusion}=Conv\left( Con\left( E_{f 1}, E_{f 2}\right) , \omega _{i}\right) , \end{aligned}$$
(4)

where Con means that the feature maps are concatenated by channel and Conv means the convolution operation. The parameter \( \omega _{i} \) is trained through the convolutional layer.

In order to effectively obtain the salient contours of salient targets, we follow the prior knowledge used in traditional methods [26] and increase the contour weights of salient regions. Specifically, we use the high-level feature map \( S_{4} \) as a prior map to emphasize the importance of the saliency region and obtain the final fused contour saliency map \( E_{f} \).
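A sketch of the contour branch fusion in Eqs. 3 and 4 is shown below. It assumes that \( E_{1} \) and \( E_{2} \) sit at 1/2 and 1/4 of the input resolution, that both carry the same number of channels, that \( \omega _{i} \) is realized as a 1×1 convolution, and that the \( S_{4} \) prior enters as a single-channel multiplicative weighting at the output resolution; the paper does not specify these details, so they are illustrative choices. Here rwf stands for any module with the RWFusion interface of Sect. 3.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def up(x: torch.Tensor, factor: int) -> torch.Tensor:
    """Eq. 3's up(*, theta): upsample a feature map by an integer factor."""
    return F.interpolate(x, scale_factor=factor, mode="bilinear",
                         align_corners=False)


class ContourFusionSketch(nn.Module):
    def __init__(self, channels: int, rwf: nn.Module):
        super().__init__()
        self.rwf = rwf                                  # RWFusion-style module
        # Learned combination of the concatenated maps (Eq. 4); a 1x1
        # convolution standing in for omega_i is an assumption.
        self.combine = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, e1, e2, s4_prior):
        e_f2 = up(e2, 4)                                # E_f2 = up(E_2, 4)
        e_f1 = up(self.rwf(e1, e2), 2)                  # E_f1 = up(RWF(E_1, E_2), 2)
        e_fusion = self.combine(torch.cat([e_f1, e_f2], dim=1))   # Eq. 4
        # Emphasize salient regions with the high-level prior derived from S_4;
        # the exact combination rule is not specified in the text, so a simple
        # multiplicative boost is used here.
        return e_fusion * (1.0 + torch.sigmoid(s4_prior))
```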

Fig. 4 P–R curves of some mentioned datasets

Fig. 5 ROC of four widely used datasets

3.3 Multi-level continuous feature aggregation module

For the main part of the model framework, we adopt a design similar to a U-shaped network structure [16]. The basic unit of the convolution layers is the multi-channel feature response reweighting module (RWConv), which we introduced in detail in Sect. 3.1. First, the input image passes through four consecutive levels of RWConv layers to form four feature maps at corresponding levels. The feature fusion module at each level is RWFusion, which we also introduced in Sect. 3.1. We denote the feature map obtained at each level of the salient object detection branch as \( F_{i} \), and we fuse them according to Eq. 5:

$$\begin{aligned} \begin{array}{l} S_{4}=up\left( F_{4}, 16\right) , \\ S_{3}=up\left( RWF\left( F_{3}, F_{4}\right) , 8\right) , \quad i=3, \\ S_{i}=up\left( RWF\left( F_{i}, S_{i+1}\right) , 2^{i}\right) , \quad i=1,2, \\ S_{f}=Conv\left( Con\left( S_{1}, S_{2}, S_{3}, S_{4}\right) , \omega _{i}\right) , \end{array} \end{aligned}$$
(5)

where the operations in Eq. 5 are the same as the operations in Eq. 3 and Eq. 4.

The final result \( R_f \) of the multi-feature fusion is given by:

$$\begin{aligned} R_{f}=Conv\left( Con\left( S_{f}, E_{f}\right) , \omega _{i}\right) \end{aligned}$$
(6)
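The aggregation in Eqs. 5 and 6 can be sketched as below, assuming that the four encoder feature maps \( F_{1}, \ldots , F_{4} \) sit at strides 2, 4, 8 and 16, share one channel width, and that the learned combinations \( \omega _{i} \) are 1×1 convolutions; rwf again stands for any module with the RWFusion interface of Sect. 3.1, and \( E_{f} \) is taken to be a single-channel contour map. These choices are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def up(x: torch.Tensor, factor: int) -> torch.Tensor:
    return F.interpolate(x, scale_factor=factor, mode="bilinear",
                         align_corners=False)


class MultiLevelAggregationSketch(nn.Module):
    def __init__(self, channels: int, rwf: nn.Module):
        super().__init__()
        self.rwf = rwf                                  # RWFusion-style module
        self.fuse_s = nn.Conv2d(4 * channels, 1, 1)     # omega_i in Eq. 5
        self.fuse_r = nn.Conv2d(2, 1, 1)                # omega_i in Eq. 6

    def forward(self, f1, f2, f3, f4, e_f):
        s4 = up(f4, 16)                                 # S_4 = up(F_4, 16)
        s3 = up(self.rwf(f3, f4), 8)                    # S_3 = up(RWF(F_3, F_4), 8)
        s2 = up(self.rwf(f2, s3), 4)                    # S_2 = up(RWF(F_2, S_3), 4)
        s1 = up(self.rwf(f1, s2), 2)                    # S_1 = up(RWF(F_1, S_2), 2)
        s_f = self.fuse_s(torch.cat([s1, s2, s3, s4], dim=1))   # Eq. 5
        return self.fuse_r(torch.cat([s_f, e_f], dim=1))        # Eq. 6
```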

4 Experiment and analysis

4.1 Implementation details

In the training phase, we use the MSRA10K dataset [27] as our training set. The dataset contains 10,000 high-quality images with salient objects and is labeled at the pixel level. In addition, we randomly selected 5,000 images from the DUTS-TR [40] dataset to expand our training set. We do not use a validation set during training. Since our training uses salient object contour supervision in addition to the original ground-truth maps, we need to generate the corresponding contour annotations. We utilize the Laplacian operator in the OpenCV toolbox to perform edge detection on the targets in the ground-truth maps. In this way, we obtain a set of 10K images with pixel-level object contour annotations. Our implementation is based on the PyTorch deep learning framework. The training and testing processes are performed on a desktop with an NVIDIA GeForce RTX 2080Ti (with 11 GB of memory). On this desktop, our model runs at a relatively fast speed of 16 fps. The initial values of the main parameters of the first half of the U-shaped network are consistent with ResNet [13], and the other parameters are initialized randomly. We use the cross-entropy loss function to calculate the loss between the feature map and the ground-truth map. The softmax function and the cross-entropy loss are computed as follows:

$$\begin{aligned} \begin{aligned}&p_{i}=\frac{e^{\alpha _{i}}}{e^{\alpha _{1}}+e^{\alpha _{2}}}, \\&L(\omega )=-\sum _{i=1}^{C} y_{i} \log p_{i}, \end{aligned} \end{aligned}$$
(7)
Fig. 6 MAE histogram of the above detection methods. From left to right in each histogram is our method, Amulet [14], DHSNet [15], DSMT [19], MDF [33], NLDF [38], RFCN [39], DRFI [31] and RC [27]

Fig. 7 Qualitative comparison of our method with other methods in different application scenarios

where \( \alpha _i \) represents the ith component of the predicted C-dimensional vector and \( y_i \) represents the corresponding label value in the ground truth. We set C to 2 to distinguish the background from the foreground. \( \omega \) denotes the weight parameters.
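As a sketch, the two-class pixel-wise loss of Eq. 7 corresponds to PyTorch's built-in softmax cross-entropy, assuming the network's final map is produced as two-channel logits; the tensor shapes below are illustrative.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()            # applies softmax internally (Eq. 7)

logits = torch.randn(4, 2, 224, 224)         # (N, C=2, H, W): background/foreground scores
labels = torch.randint(0, 2, (4, 224, 224))  # ground truth: 0 = background, 1 = salient

loss = criterion(logits, labels)             # averaged over all pixels
```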

The model we propose is end to end and does not contain any preprocessing or post-processing operations. We train the network for 30 epochs.

During network training, we use the stochastic gradient descent optimizer with momentum 0.9 and weight decay 0.0005. The base learning rate is set to 1e-6 and is reduced by 50% every 10 epochs.
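For reference, the contour ground truth described above could be generated from the binary saliency masks roughly as follows; the Laplacian operator matches the paper's description, while the file names, kernel size and binarization threshold are hypothetical.

```python
import cv2
import numpy as np

# Read a binary ground-truth saliency mask (hypothetical path).
mask = cv2.imread("gt/0001.png", cv2.IMREAD_GRAYSCALE)

# Second-derivative response of the mask: non-zero only at object boundaries.
edges = cv2.Laplacian(mask, cv2.CV_64F, ksize=3)

# Binarize to a pixel-level contour annotation and save it.
contour = (np.abs(edges) > 0).astype(np.uint8) * 255
cv2.imwrite("contour_gt/0001.png", contour)
```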

4.2 Datasets

We qualitatively and quantitatively compare different methods and their performance on five commonly used benchmark datasets. The ECSSD [28] dataset contains 1,000 complex images, and the images contain salient objects of different sizes. The SOD [29] dataset is built on the basis of BSD [30], and pixel-level annotations were made by Jiang et al. [31]. It contains 300 high-quality and challenging images. The DUT-OMRON [32] dataset contains 5,168 high-quality and challenging images. The HKU-IS [33] dataset consists of 4,447 annotated high-quality images, and most of them contain multiple salient objects. The PASCAL-S [34] dataset contains 850 natural images which are derived from the PASCAL VOC dataset [35].

4.3 Evaluation metrics

We use five common evaluation metrics to assess our model’s performance: the precision–recall (P–R) curve [1], the F-measure [36], the receiver operating characteristic (ROC) curve [36], the area under the ROC curve (AUC) [36] and the mean absolute error (MAE) [36, 37]. We binarize the predicted saliency map at a given threshold and compare the resulting binary map with the ground truth to obtain precision and recall; the F-measure is a weighted harmonic mean of the two, calculated as:

$$\begin{aligned} F_{\beta }=\left( 1+\beta ^{2}\right) \times \frac{Precision \times Recall}{\beta ^{2} \times Precision + Recall} \ , \end{aligned}$$
(8)

where \( \beta ^2 \) is generally set to 0.3 in order to emphasize the importance of precision [1]. Each binarization threshold yields a different precision–recall pair and F-measure value. We plot these as curves and report the maximum F-measure over all thresholds.
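A sketch of this thresholded evaluation, with \( \beta ^2 = 0.3 \) and the maximum F-measure taken over 256 evenly spaced thresholds (the threshold grid is an assumption), could look like this:

```python
import numpy as np


def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """pred: saliency map in [0, 1]; gt: binary ground truth. Implements Eq. 8."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best
```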

Additionally, we can obtain the paired false positive rate (FPR) and true positive rate (TPR), from which we can get the ROC curve and calculate the AUC value.

$$\begin{aligned} TPR=\frac{|M \cap G|}{|G|}, \quad FPR=\frac{|M \cap {\bar{G}}|}{|{\bar{G}}|} \ , \end{aligned}$$
(9)

where M is the binarized salient feature map, G is the ground-truth map and \( {\bar{G}} \) is the complement of G.
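A sketch of the TPR/FPR pair in Eq. 9 for one binarized map is given below; sweeping the binarization threshold and collecting these pairs yields the ROC curve from which the AUC is computed.

```python
import numpy as np


def tpr_fpr(m: np.ndarray, g: np.ndarray) -> tuple:
    """m: binarized saliency map; g: binary ground truth (Eq. 9)."""
    m, g = m.astype(bool), g.astype(bool)
    tpr = np.logical_and(m, g).sum() / (g.sum() + 1e-8)
    fpr = np.logical_and(m, ~g).sum() / ((~g).sum() + 1e-8)
    return tpr, fpr
```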

MAE is the mean absolute error between the normalized saliency map S and the ground truth G. It is calculated as:

$$\begin{aligned} MAE=\frac{1}{W \times H} \sum _{x=1}^{W} \sum _{y=1}^{H}|S(x, y)-G(x, y)| \ , \end{aligned}$$
(10)

where W and H are the width and height of the image, respectively.
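Eq. 10 reduces to a mean of absolute per-pixel differences, for example:

```python
import numpy as np


def mae(s: np.ndarray, g: np.ndarray) -> float:
    """s: normalized saliency map in [0, 1]; g: ground truth in [0, 1] (Eq. 10)."""
    return float(np.mean(np.abs(s.astype(np.float64) - g.astype(np.float64))))
```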

4.4 Comparison with different methods

Our experiments quantitatively compare our model with eight other saliency detection algorithms (Amulet [14], DSMT [19], DHSNet [15], MDF [33], NLDF [38], RFCN [39], DRFI [31] and RC [27]). The P–R curves of some of the mentioned datasets are shown in Fig. 4, and the ROC is shown in Fig. 5. We compare the five performance indicators of the model on the five datasets mentioned above.

Quantitative Comparison: On the five commonly used datasets mentioned above, we quantitatively compare the P–R curve, the ROC curve and the MAE value, and the corresponding experimental results are shown below.

For the P–R curves, the quantitative result of interest is the F-measure; together with the AUC from the ROC curves, it is compared quantitatively in Table 1.

It can be seen from the table that our model’s two quantitative indicators, the F-measure and the AUC, are significantly better than those of the other methods on the five popular datasets. The bold entries in the table indicate the best performance on each dataset. In particular, the F-measure improves over the second-best method by 3.2%, 3.7%, 5.4%, 4.1% and 2.9% on the HKU-IS, ECSSD, DUT-OMRON, PASCAL-S and SOD datasets, respectively. Although DSMT scores higher on the AUC indicator on the PASCAL-S and SOD datasets, it is not as good as our method at refining the target contour and uniformly highlighting the salient target, as can be seen in the following qualitative visual comparison. The DRFI and RC methods are outstanding among the traditional methods; by comparison, models based on deep networks perform much better than traditional methods, as explained in [24].

Table 1 Quantitative indicators of various advanced detection methods. The best results are shown in bold

Figure 6 shows the MAE values of the nine methods mentioned above on four datasets. The histograms show that our model performs best on these datasets.

Qualitative Comparison: Fig. 7 compares the performance of our model with other detection methods in different scenarios. The images are selected from the aforementioned datasets. Through intuitive comparison, we can find that, due to the explicit modeling of the contour of the salient object, our method better refines the contour of the target to be detected; it also achieves good performance in the overall consistency of the salient target.

4.5 Ablation experiment

Our ablation experiments focus on the impact of the contour-aided detection and the multi-channel reweighting module on detection performance. Our baseline model is the network structure without these two parts. We take the ECSSD [28] dataset as an example and successively add the contour-assisted detection branch and the channel reweighting modules. The evaluation indicators \( F_\beta \) and MAE are shown in Table 2. After successively introducing the contour features and the channel features, the F-measure improves by 2.1% and 1.5%, while the MAE is reduced by 0.012 and 0.002, respectively. This shows that the contour features bring the more significant improvement in detection performance.

Table 2 Changes in quantitative indicators during the ablation experiments (on the ECSSD dataset)
Fig. 8 Comparison images before and after adding multi-channel features. a Input image; b ground truth; c feature map before adding multi-channel features; d feature map with multi-channel features

Fig. 9 Visual effect of adding the contour assistant detection module in various unmanned missions, including aerial photography, intelligent driving, traffic sign detection and underwater target detection. (a) Input images; (b) original detection feature maps; (c) contour auxiliary feature maps; (d) feature maps with contour information. After adding contour information, the detailed information of the object is more refined. For instance, the wings of the bird in the picture become clearer

The salient feature maps before and after adding the multi-feature cues are shown in Figs. 8 and 9. Qualitative observation shows that the saliency maps with the contour assist module have clearer boundaries, and adding the multi-channel reweighting module makes full use of the information in the feature channels to help highlight the target region uniformly.

5 Conclusion

This paper explores methods to make full use of multiple aspects of image information and proposes a saliency detection network that combines multi-level, multi-task and multi-channel features. The network explicitly models these three kinds of image features: multi-level features are modeled with the U-shaped network, multi-task features are modeled with the contour-assisted branch, and multi-channel features are modeled with the reweighting modules. The model is end to end, without any preprocessing or post-processing. Its multi-task and multi-channel modeling is relatively flexible and can be used to improve most existing models. Experiments show that our method is comparable to state-of-the-art deep learning methods on various datasets.