WGI-Net: A weighted group integration network for RGB-D salient object detection

Salient object detection is used as a pre-processing step in many computer vision tasks (such as salient object segmentation, video salient object detection, etc.). When performing salient object detection, depth information can provide clues to the location of target objects, so effective fusion of RGB and depth feature information is important. In this paper, we propose a new feature aggregation approach, weighted group integration (WGI), to effectively integrate RGB and depth feature information. We use a dual-branch structure to slice the input RGB image and depth map into groups and then merge corresponding groups by concatenation. As grouped features may lose global information about the target object, we also draw on the idea of residual learning, taking the features captured by the original fusion method as supplementary information to ensure both accuracy and completeness of the fused information. Experiments on five datasets show that our model performs better than typical existing approaches under four evaluation metrics.


Introduction
In recent years, salient object detection (SOD) has attracted widespread interest; it aims to distinguish the most visually obvious objects or regions in a given image, using computers to imitate human visual mechanisms. SOD has been applied in many fields, including content-based image editing [1][2][3][4], image and video compression [5], object segmentation and recognition [6][7][8][9][10], visual tracking [11][12][13], image retrieval [14,15], etc. Due to its powerful ability to extract information, SOD [16,17] and related tasks (e.g., video salient object detection [18,19], co-saliency detection [20,21], light field salient object detection [22][23][24], etc.) are often used as pre-processing steps in visual tasks. Most early SOD approaches considered a single RGB image or a set of them. As depth cameras (such as Kinect, RealSense, etc.) began to be applied in computer vision, combining depth information with salient object detection, namely RGB-D SOD, became a topic of interest.
Depth cues can supply additional information about appearance, so it is useful to fuse depth information into salient object detection. A model incorporating depth information is able to identify target objects in given images more quickly and accurately.
In recent years, more and more researchers have considered RGB-D SOD [25][26][27] as a way to improve salient object detection. Existing RGB-D SOD methods mostly fuse the depth input at one of three stages: early fusion [28][29][30][31], middle fusion [32][33][34][35], or late fusion [36][37][38]. Early-stage fusion directly combines the RGB and depth inputs into one channel to extract information. In Ref. [28], Peng et al. proposed a multistage RGB-D SOD algorithm that combines depth cues and appearance features in a coupled manner. However, because of the distribution gap between the two inputs, it is not easy to fit the data in one model. Middle-stage fusion methods first extract RGB features at each level and then combine them with depth features to generate saliency maps; for example, in Ref. [32], Feng et al. proposed a method that utilizes RGB-D saliency features to obtain angular spread directions. Late-stage fusion first determines salient RGB and depth information in two channels, and then uses pixel-wise summation or multiplication to fuse the RGB and depth saliencies; for example, Cheng et al. [38] exploited visual saliency cues in color and depth spaces to compute the saliency map.
Since depth information can help to locate salient objects in an image, in this article we present a weighting strategy to obtain more accurate depth feature cues. Furthermore, to exploit both RGB and depth information, we propose a novel feature integration method, weighted group integration (WGI), that makes good use of each category of information. See Fig. 1: the first row shows that our model accurately detects salient objects in complex scenes, while the second row shows that, although the depth map is noisy, the saliency map predicted by our method is still close to the ground truth. Extensive experiments demonstrate that the proposed method achieves results comparable to state-of-the-art models on five public benchmarks.
In summary, our main contributions are:
1. A novel feature fusion method, WGI, which can effectively integrate RGB and depth features to accurately distinguish salient objects in given images; it shows significant performance improvements over existing feature fusion modules such as DRB.
2. A series of experiments on five popular datasets to verify the effectiveness and efficiency of the proposed approach.

Related work

Traditional
Traditional RGB-D saliency models usually rely on hand-crafted features to distinguish salient objects in given images. Widely-used hand-crafted features include contrast [28,38,39], compactness [39,40], center-surround difference [41,42], center or boundary priors [43,44], background enclosure [32], and various fused saliency measures [29]. In Ref. [39], to reduce the influence of poor depth maps on saliency detection, Cong et al. turned the input into a graph and applied depth information to graph construction; they proposed a new method that utilizes RGB and depth features to compute a compactness saliency map. However, hand-crafted features have limitations, such as difficulty in providing high-level semantic information, slow and imprecise extraction of information, and poor generalizability to complex scenarios.

Deep learning based
To overcome the limitations of hand-crafted features, and to benefit from the powerful information extraction capability of deep learning, recent works have applied convolutional neural networks (CNNs) to RGB-D saliency detection, improving both the expressiveness of models and detection performance [25][46][47][48][49][50][51][52][53]. Shigematsu et al. [33] proposed a pioneering method, BED, which applies deep learning to RGB-D based SOD models. To obtain background enclosure features and depth contrast in given images, BED extracted ten hand-crafted depth features based on superpixels; these features were then fed into a CNN and fused with RGB features to give superpixel saliency values. In Ref. [30], Qu et al. designed a method that first generated RGB and depth feature vectors for superpixels or patches, then fused these vectors in a CNN to generate saliency values, ultimately utilizing a Laplacian function to obtain the predicted maps. More recently, Han et al. [47] designed an end-to-end model that extracted features from both RGB images and depth maps, and used a fully connected layer to obtain the final saliency map.

Our approach
In this paper, we propose an integration network, WGI-Net, which fuses RGB and depth feature information for RGB-D salient object detection. In this section, we describe the proposed model, explain the advantage of weighting depth information and how the weighting is performed in detail, and expound the proposed feature fusion module, WGI, which aggregates depth and RGB features to distinguish the salient objects in given images.

Overall network architecture
To explain our network for RGB-D based saliency detection, Fig. 2 depicts an example backbone with two branches (an RGB branch and a depth branch), each having a hierarchy of five levels.
The RGB branch is utilized to obtain the main feature information, including low-level features (color, location, texture, etc.), high-level features (semantic information), and contextual features. The depth branch is used to capture depth cues from the image to help accurately and completely detect the salient objects. To better fuse the depth information with the RGB information, we present a novel feature fusion module, WGI. We employ element-wise addition to integrate the outputs of the WGI modules, F^{fused}_i (i = 1, …, 5). Finally, we feed the summed values into the FRU (see Ref. [34]) to obtain more detailed and accurate saliency maps.

Weighted depth information
Both RGB and depth information are significant for salient object detection and related segmentation tasks. In particular, depth information can provide powerful cues for locating and distinguishing salient objects in an image: when the background is complex or the color contrast between foreground and background is low, it is difficult to accurately detect salient objects from appearance features alone.
To the best of our knowledge, most existing models use depth information directly, without weighting it. In this paper, we apply weights to the depth information to obtain more accurate saliency maps; Fig. 3 shows the process of weighting the depth information. To obtain the depth residual feature, we first feed f^{depth}_{i,j} into a 3 × 3 convolutional layer:

f^{res}_{i,j} = Conv_3(f^{depth}_{i,j}),

where Conv_3(·) represents a convolutional layer with a kernel size of 3. This depth residual feature can provide cues that would otherwise be ignored in the process of forwarding the extracted information. Then, we feed f^{depth}_{i,j} into two branches. In one branch, we pass f^{depth}_{i,j} through a series of weight layers, composed of a Pooling+Conv layer and a softmax layer, to capture more detailed and accurate information:

w_{i,j} = S(k ∗ P(f^{depth}_{i,j})),

where S(·) denotes the softmax function, P(·) denotes pooling, and ∗ represents the convolution operation with kernel k. In the other branch, we leave f^{depth}_{i,j} unchanged to retain complete information. The two branches and the residual feature are then combined to give the weighted depth feature:

f^{Depth}_{i,j} = f^{depth}_{i,j} × w_{i,j} + f^{res}_{i,j},

where × denotes element-wise multiplication. The weighted depth information can provide more accurate detailed information, i.e., f^{Depth}_{i,j} is more complementary to the RGB information.
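The weighting procedure above can be sketched in PyTorch as follows. The text does not specify the exact configuration of the Pooling+Conv layer or the axis over which the softmax is taken, so the choices below (average pooling followed by a 1 × 1 convolution, softmax over channels) and all class and variable names are assumptions:

```python
import torch
import torch.nn as nn


class WeightedDepth(nn.Module):
    """Sketch of the depth-weighting step: a residual 3x3 convolution branch
    plus a weight branch (Pooling+Conv then softmax), combined by
    element-wise multiplication and addition. Layer details are assumed."""

    def __init__(self, channels):
        super().__init__()
        # depth residual feature: f_res = Conv_3(f_depth)
        self.res_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # weight layers: w = softmax(conv(pool(f_depth))); configuration assumed
        self.weight_layers = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Softmax(dim=1),  # softmax over channels (assumption)
        )

    def forward(self, f_depth):
        f_res = self.res_conv(f_depth)   # residual cues
        w = self.weight_layers(f_depth)  # element-wise weights
        # f_Depth = f_depth * w + f_res
        return f_depth * w + f_res
```

The module returns f_depth × w + f_res, matching the element-wise combination of the weighted branch, the unchanged branch, and the depth residual feature.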

Weighted group integration module
In order to effectively utilize feature information from the image, we introduce a new feature fusion method, WGI. In this module, we divide the RGB information and the depth information into 8 parts each. Then, we use a concatenation operation to fuse the RGB feature and the depth feature of each part to obtain that part's fused map. Next, we again utilize a concatenation operation to integrate these maps and collect all useful information. Details of the WGI module are as follows.
Instead of fusing RGB and depth feature information using convolution layers, we seek alternative methods with powerful feature integration ability but a similar or lower computational load. In particular, we fuse RGB and depth feature information in smaller groups of feature blocks, while the previously fused information is connected in a residual style. As shown in the purple box in Fig. 2, the WGI module contains two branches, an RGB branch and a depth branch. In the RGB branch, we evenly divide the RGB feature at each level into 8 sub-blocks, f^{RGB}_{i,j} (i = 1, …, 5, j = 1, …, 8); each block has 1/8 of the channels of the input feature. In the depth branch, we perform the same operation, dividing the depth feature at each level into 8 parts, f^{depth}_{i,j}, and apply the weighting operation of Section 3.2 to the 8 blocks to obtain more instructive depth information, f^{Depth}_{i,j}. Then, we separately merge each RGB block with the corresponding depth block by concatenation along the channel dimension:

f^{fused}_{i,j} = Concat(f^{RGB}_{i,j}, f^{Depth}_{i,j}),

where Concat(·) represents concatenation along the channel dimension and f^{fused}_{i,j} denotes the fused feature of each block. We then concatenate the 8 fused blocks obtained in the previous step:

f^{cat}_i = Concat(f^{fused}_{i,1}, …, f^{fused}_{i,8}),

where f^{cat}_i is the prediction map generated from the 8 f^{fused}_{i,j} at this level. Since concatenation changes the number of channels in the result, we apply a 1 × 1 convolution to the obtained map to ensure it has the same size as the input at each level. Thus, the reshaped fused information can be written as

f^{fused}_i = Conv_1(f^{cat}_i),

where Conv_1(·) represents a convolution layer with kernel size 1.
In order to ensure the completeness and accuracy of the information, the WGI module utilizes the information obtained by the original fusion method as residual information to correct the predicted saliency maps, allowing it to distinguish the salient objects highly accurately:

F^{fused}_i = f^{fused}_i + f^{res}_i,

where F^{fused}_i (i = 1, …, 5) denotes the output of the WGI module, and f^{res}_i (i = 1, …, 5) denotes the original fused information at each level.
In this module, we thus use a split-and-fuse strategy: the depth and RGB information extracted at each layer is sliced into groups, which are fused separately. Together with the residual connection, this strategy preserves global information and fuses the two types of information more effectively.
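A minimal PyTorch sketch of this fusion step is given below. It implements only what the text states (the 8-way channel split, per-block concatenation, the 1 × 1 reshaping convolution, and the residual addition); the class name, the use of `torch.chunk`, and all other details are assumptions:

```python
import torch
import torch.nn as nn


class WGI(nn.Module):
    """Sketch of weighted group integration: split RGB and (weighted) depth
    features into 8 channel groups, concatenate matching groups, restore the
    channel count with a 1x1 convolution, and add the original fused
    features as a residual."""

    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0, "channels must split evenly into groups"
        self.groups = groups
        # concatenation doubles the channel count; 1x1 conv restores it
        self.reshape_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_depth_weighted, f_res):
        rgb_blocks = torch.chunk(f_rgb, self.groups, dim=1)
        dep_blocks = torch.chunk(f_depth_weighted, self.groups, dim=1)
        # fuse each RGB block with its depth block, then concatenate all blocks
        fused = [torch.cat([r, d], dim=1) for r, d in zip(rgb_blocks, dep_blocks)]
        f_cat = torch.cat(fused, dim=1)       # 2 * channels in total
        f_fused = self.reshape_conv(f_cat)    # back to `channels`
        return f_fused + f_res                # residual correction
```

Here `f_res` stands for the original fused information of the level, which the module adds back to correct the grouped prediction.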

Datasets
The following datasets were chosen for evaluation. Ju et al. [41] proposed the NJUD dataset for detecting salient objects or pixels in given images. It consists of 2000 images with mask labels; its stereo images were taken with a Fuji W3 camera and collected from the Internet, 3D movies, and photographs. Because of the labeling differences between 2D images and 3D environments, the labels were all produced with Nvidia 3D Vision to ensure accuracy of mask labeling. The RGBD135 dataset [38] comprises 135 indoor images with manually marked labels; the images were taken with a Kinect at a resolution of 640 × 480. To address the strong complementarity between RGB and depth, Peng et al. [28] proposed the NLPR benchmark for RGB-D salient object detection, containing 1000 natural images with manually matched ground truths. Zhang et al. [54] presented the LFSD dataset, based on general salient object segmentation and saliency detection on light fields; it comprises three parts: outdoor scenes, indoor scenes, and corresponding ground truths. DUT-RGBD [34], presented by Piao et al., is composed of 1200 pairs of images taken with a Lytro camera; most images have complex backgrounds, making it suitable for evaluating the effectiveness of our proposed model.

Evaluation metrics
In this paper, we utilize four common measures to evaluate the quality of predicted saliency maps against the ground truth: MAE [59], F-measure [60], S-measure [61], and E-measure [62]. MAE evaluates the mean absolute error between the saliency map S and the corresponding ground truth G over all image pixels:

MAE = (1 / (H × W)) Σ_{x=1}^{H} Σ_{y=1}^{W} |S(x, y) − G(x, y)|,

where H and W denote the height and the width of the image, respectively. F-measure computes the weighted harmonic mean of precision P and recall R of binarized saliency maps, defined as

F_β = ((1 + β²) P R) / (β² P + R),

where β² is generally set to 0.3 to emphasize precision; F_max is the maximum F-measure over all binarization thresholds. S-measure computes the structural similarity between the non-binary saliency map and the ground truth, combining an object-aware term S_o and a region-aware term S_r:

S = α S_o + (1 − α) S_r.
Following previous work [61], α is set to 0.5. E-measure utilizes both local and global pixels, combining local pixel matching information with image-level statistics. Unlike S-measure, E-measure evaluates binary maps:

E = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} φ_FM(x, y),

where φ_FM is the enhanced alignment matrix. We report the maximum E-measure, E_max.
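For illustration, the MAE and F-measure definitions above can be computed as follows. The function names and the fixed binarization threshold are our own; in practice F_max is obtained by taking the maximum over all thresholds:

```python
import numpy as np


def mae(saliency, gt):
    """Mean absolute error between a saliency map and its ground truth,
    both arrays with values in [0, 1]."""
    return np.abs(saliency - gt).mean()


def f_measure(saliency, gt, threshold=0.5, beta2=0.3):
    """F-measure of a saliency map binarized at a single (assumed) threshold;
    beta^2 = 0.3 emphasizes precision, as in the paper."""
    pred = saliency >= threshold
    mask = gt >= 0.5
    tp = np.logical_and(pred, mask).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(mask.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

A perfect prediction gives MAE 0 and F-measure 1, which is a quick sanity check for any metric implementation.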

Training details
We implemented our method using the PyTorch toolbox and utilized an NVIDIA 1080 Ti GPU for acceleration. For training we used the same dataset as DMRA [34], with input images resized to 256 × 256. The momentum and weight decay were set to 0.9 and 0.0005, respectively; the learning rate was 10^{-10}, and the batch size was 2.
To fully compare our proposed method, WGI-Net, with existing approaches, we re-evaluated those models using available source code or directly used the saliency maps provided by their authors.

Quantitative comparison
Detailed comparative results for the four metrics above are listed in Table 1. As can be seen, our framework achieves good performance, performing better than the other approaches across all four metrics on most datasets. Specifically, in terms of F_max and E_max, WGI-Net achieves the best performance on all datasets. On the NJUD dataset, our model achieves the best performance on all four evaluation metrics: compared with the second-ranked model, D3Net, our F_max, E_max, and S scores are higher by 0.006, 0.002, and 0.001 respectively, while our MAE is lower by 0.005. On the LFSD dataset, compared with the second-placed model, A2dele, our F_max, E_max, and S scores are higher by 0.022, 0.016, and 0.015, respectively.

Figure 4 provides sample saliency maps predicted by the proposed method and several other algorithms. It intuitively illustrates the outstanding ability of our method to highlight the correct salient object regions. As shown in the 1st and 2nd rows of Fig. 4, the saliency maps produced by our method are closer to the ground truth: our method detects object edges more completely and accurately, while the maps output by other models miss certain parts. For the LFSD dataset, for example, our results have no holes or extraneous parts. For the NJUD dataset, our saliency maps are more similar to the ground truth: e.g., ours clearly detects the flags on the car, while others detect only part of them or none at all. For the NLPR dataset, our method accurately distinguishes salient objects in the foreground, while other methods detect incomplete objects or treat extraneous objects as salient. The 1st and 2nd rows for the RGBD135 dataset show results for small and large objects; our method provides accurate results in both cases. In summary, our model handles various complex situations and provides highly accurate saliency maps.

Ablation study
To verify the effectiveness of our WGI module, an ablation experiment compared the backbone alone with the backbone plus WGI. Our backbone is DMRA [34], with an identical implementation setup. We conducted experiments on two datasets, RGBD135 and DUT-RGBD.

Table 1 Performance comparison with seven state-of-the-art architectures on five datasets. Maximum F-measure F_max, maximum E-measure E_max, S-measure S, and MAE are used to assess performance. ↑ and ↓ indicate that higher and lower scores are better, respectively. "*" indicates that the authors have not provided corresponding saliency maps.

Experimental results are listed in Table 2. Compared to the baseline without the WGI module, the performance of our approach is improved. Specifically, on the RGBD135 dataset, F_max, E_max, and S increase by 0.020, 0.009, and 0.023 respectively, while MAE decreases by 0.005. On the DUT-RGBD dataset, F_max, E_max, and S increase by 0.004, 0.010, and 0.024 respectively, while MAE decreases by 0.010.
As shown in Fig. 5, the saliency maps from our method are closer to the ground-truth. Unlike the backbone (DMRA), our method is able to eliminate background interference and accurately detect salient objects against complex backgrounds.

Conclusions
In this paper, we have proposed a simple but efficient fusion approach, WGI, to make effective use of RGB feature information and depth feature information.
The extracted RGB and depth features are sliced into 8 parts, and concatenation is then used to fuse the features of each block, integrating the two kinds of feature information more effectively. We also apply a series of weight layers to the depth information to obtain more accurate cues about the locations of the salient objects. Experiments on five datasets verify that our method performs better than current work under different evaluation metrics. Although our approach can accurately detect salient objects in complex environments through weighted group integration, it requires a large amount of computation time. Therefore, in future work we will focus on improving the fusion of information to make more effective use of feature information.