RGB-D salient object detection: A survey

Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.


Background
Salient object detection aims to locate the most visually prominent object(s) in a given scene [1].
Recently, RGB-D based salient object detection has gained increasing attention, and various methods have been developed [38,45]. Early RGB-D based salient object detection models tended to extract handcrafted features and then fuse the RGB image and depth map. For example, Lang et al. [46], in the first work on RGB-D based salient object detection, utilized Gaussian mixture models to model the distribution of depth-induced saliency. Ciptadi et al. [47] extracted 3D layout and shape features from depth measurements. Several methods [48][49][50] measure depth contrast using depth differences between different regions. In Ref. [51], a multi-contextual contrast model including local, global, and background contrast was developed to detect salient objects using depth maps. More importantly, however, this work also provided the first large-scale RGB-D dataset for salient object detection. Despite the effectiveness of traditional methods using handcrafted features, their low-level features tend
to limit generalization ability, and they lack the high-level reasoning required for complex scenes. To address these limitations, several deep learning-based RGB-D salient object detection methods [38] have been developed, with improved performance. DF [52] was the first model to introduce deep learning technology into the RGB-D based salient object detection task. More recently, various deep learning-based models [41][42][43][44][53][54][55] have focused on exploiting effective multi-modal correlations and multi-scale or multi-level information to boost salient object detection performance. To more clearly describe the progress in the RGB-D based salient object detection field, we provide a brief chronology in Fig. 2.
In this paper, we provide a comprehensive survey of RGB-D based salient object detection, aiming to thoroughly cover various aspects of models used for this task and to provide insightful discussions of the challenges and open directions for future work. We also review a related topic, light field salient object detection, as light fields can also provide additional information (including focal stacks, all-focus images, and depth maps) to boost the performance of salient object detection. Further, we provide a comprehensive comparative evaluation of existing RGB-D based salient object detection models and discuss their main advantages.

Related reviews and surveys
Several surveys consider salient object detection. For example, Borji et al. [59] provided a quantitative evaluation of 35 state-of-the-art non-deep-learning saliency detection methods. Cong et al. [60] reviewed several different saliency detection models, including RGB-D based salient object detection, co-saliency detection, and video salient object detection. Zhang et al. [61] provided an overview of co-saliency detection, reviewed its history, and summarized several benchmark algorithms in this field. Han et al. [62] reviewed recent progress in salient object detection, including models, benchmark datasets, and evaluation metrics, as well as discussing the underlying connection between general object detection, salient object detection, and category-specific object detection. Nguyen et al. [63] reviewed various works related to saliency applications and provided insightful discussions of the role of saliency in each. Borji et al. [64] provided a comprehensive review of recent progress in salient object detection and discussed related topics, including generic scene segmentation, saliency for fixation prediction, and object proposal generation. Fan et al. [1] provided a comprehensive evaluation of several state-of-the-art CNN-based salient object detection models, and proposed a high-quality salient object detection dataset, SOC (see: http://dpfan.net/socbenchmark/). Zhao et al. [65] reviewed various deep learning-based object detection models and algorithms in detail, as well as various specific tasks, including salient object detection. Wang et al. [66] focused on reviewing deep learning-based salient object detection models. Unlike previous salient object detection surveys, in this paper, we focus on reviewing RGB-D based salient object detection models and benchmark datasets.

Contributions and organization
Our contributions and organization are:
• the first systematic review of RGB-D based salient object detection models considering different perspectives. We classify existing RGB-D salient object detection models as traditional or deep methods, fusion-wise methods, single-stream or multi-stream methods, and attention-aware methods (Section 2);
• a review of nine RGB-D datasets commonly used in this field, giving details for each (Section 3). We also provide a comprehensive, attribute-based evaluation of several representative RGB-D based salient object detection models (Section 5);
• the first survey of light field salient object detection models and benchmark datasets (Section 4);
• a thorough investigation of challenges facing RGB-D based salient object detection, and the relationship between salient object detection and other topics, shedding light on potential directions for future research (Section 6).
Conclusions are drawn in Section 7.

Approach
Over the past few years, several RGB-D based salient object detection methods have been developed, providing promising performance. These models are summarized in Tables 1-4. Further information can be found at http://dpfan.net/d3netbenchmark/. To review these RGB-D based salient object detection models, we consider them from several perspectives in turn: traditional versus deep models, fusion strategies, single-stream versus multi-stream architectures, and attention-aware models.

Traditional models
Using depth cues, several useful attributes, such as boundaries, shape attributes, and surface normals, can be extracted and combined with handcrafted RGB features to estimate saliency.

Deep models
The above traditional methods suffer from unsatisfactory salient object detection performance due to the limited expressiveness of handcrafted features. To address this, several studies have turned to deep neural networks (DNNs) to fuse RGB-D data [39, 40, 42-44, 52-55, 83, 93, 94, 96, 102-106, 111-113, 117-119, 137]. These models can learn high-level representations to explore complex correlations between RGB images and depth cues, improving salient object detection performance. We next review some representative works.

DF [52] develops a novel convolutional neural network (CNN) to integrate different low-level saliency cues into hierarchical features, effectively locating salient regions in RGB-D images. This was the first CNN-based model for RGB-D salient object detection; however, it utilizes a shallow architecture to learn the saliency map.

PCF [92] presents a complementarity-aware fusion module to integrate cross-modal and cross-level feature representations. It can effectively exploit complementary information by explicitly using cross-modal and cross-level connections and modal- and level-wise supervision to decrease fusion ambiguity.
CTMF [58] employs a computational model to identify salient objects from RGB-D scenes, utilizing CNNs to learn high-level representations for RGB images and depth cues, while simultaneously exploiting the complementary relationships and joint representation. This model transfers the structure of the model from the source domain (RGB images) to the target domain (depth maps).
CPFP [53] proposes a contrast-enhanced network to produce an enhanced map, and presents a fluid pyramidal integration module to effectively fuse cross-modal information in a hierarchical manner. As depth cues tend to suffer from noise, a feature-enhanced module is used to learn enhanced depth cues to effectively boost salient object detection performance.
UC-Net [44] proposes a probabilistic RGB-D based salient object detection network via conditional variational autoencoders to model human annotation uncertainty. It generates multiple saliency maps for each input image by sampling the learned latent space. This was the first work to investigate uncertainty in RGB-D based salient object detection, and was inspired by the data labeling process. It leverages diverse saliency maps to improve the final salient object detection performance.

Fusion approach
For RGB-D based salient object detection models, it is important to effectively fuse RGB images and depth maps. Existing fusion strategies can be classified as using early fusion, multi-scale fusion, or late fusion, as we now explain; also see Fig. 3.

Early fusion
Early fusion-based methods work in one of two ways: (i) RGB images and depth maps are directly integrated to form a four-channel input [50,51,87,96], which we call input fusion, or (ii) RGB and depth images are first fed into separate networks and their low-level representations are combined to give a joint representation which is then fed into a subsequent network for further saliency map prediction [52]. We call this early feature fusion.
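As a concrete illustration, input fusion amounts to stacking the depth map as a fourth channel and feeding the result to a single network. This is a minimal sketch; the image size and value ranges are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of "input fusion": stack the depth map as a fourth
# channel alongside RGB, giving one H x W x 4 tensor for a single network.
rgb = np.random.rand(224, 224, 3)    # RGB image, values in [0, 1]
depth = np.random.rand(224, 224)     # single-channel depth map

rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)
print(rgbd.shape)  # (224, 224, 4)
```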

Late fusion
Late fusion-based methods can also be further divided into two families: (i) two parallel network streams are adopted to learn high-level features for RGB and depth data, respectively, which are concatenated and then used to generate the final saliency prediction [48,58,106]. We call this late feature fusion. (ii) Two parallel network streams are used to obtain independent saliency maps for RGB images and depth cues, and the two saliency maps are then combined to obtain a final prediction map [108]. This is called late result fusion.
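Late result fusion can be sketched as combining two independently predicted saliency maps at the very end. The element-wise average below is one illustrative combination rule (an assumption, not any specific paper's choice).

```python
import numpy as np

# Two independent streams each produce a full-resolution saliency map.
sal_rgb = np.random.rand(224, 224)    # saliency map from the RGB stream
sal_depth = np.random.rand(224, 224)  # saliency map from the depth stream

# Combine the two result maps into the final prediction.
sal_final = (sal_rgb + sal_depth) / 2.0
```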

Multi-scale fusion
To effectively explore the correlations between RGB images and depth maps, several methods propose a multi-scale fusion strategy [42,43,55,109,116,122,123,128]. These models can be divided into two categories. The first learns the cross-modal interactions and then fuses them into a feature learning network. For example, Chen et al. [55] developed a multi-scale, multi-path fusion network with cross-modal interactions (MMCI) to integrate RGB images and depth maps. This method introduces cross-modal interactions into multiple layers, which can provide additional gradients for enhancing learning of the depth stream, as well as enabling complementarity between low-level and high-level representations to be explored. The second category fuses features from RGB images and depth maps in different layers and then integrates them into a decoder network (e.g., via skip connections) to produce the final saliency detection map. Some representative works are now briefly discussed.
ICNet [42] proposes an information conversion module to interactively convert high-level features. In this model, a cross-modal depth-weighted combination (CDC) block is introduced to enhance RGB features with depth features at different levels.
DPANet [109] uses a gated multi-modality attention (GMA) module to exploit long-range dependencies. The GMA module can extract the most discriminatory features by utilizing a spatial attention mechanism. This model also controls the fusion rate of the cross-modal information using a gate function, which can reduce some effects caused by unreliable depth cues.
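The idea of a gate that down-weights unreliable depth can be sketched as follows. The `quality_score` variable and the exact gating form are illustrative assumptions, not DPANet's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical fused features from the two modalities at one layer.
rgb_feat = np.random.rand(28, 28, 8)
depth_feat = np.random.rand(28, 28, 8)

# Stand-in for a learned depth-quality logit; very negative means the
# depth map is judged unreliable.
quality_score = -2.0

g = sigmoid(quality_score)          # gate value in (0, 1)
fused = rgb_feat + g * depth_feat   # unreliable depth contributes little
```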
BiANet [116] employs a multi-scale bilateral attention module (MBAM) to capture better global information from multiple layers.
JL-DCF [43] treats a depth image as a special case of a color image and employs a shared CNN for both RGB and depth feature extraction. It also proposes a densely-cooperative fusion strategy to effectively combine the features learned from different modalities.
BBS-Net [128] uses a bifurcated backbone strategy (BBS) to split the multi-level feature representations into teacher and student features, and develops a depth-enhanced module (DEM) to explore informative parts in depth maps from the spatial and channel views.
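The decoder-side merging used by the second category above can be roughly sketched as upsampling deeper fused features and adding them to shallower ones via skip connections. Layer sizes and the nearest-neighbour upsampling are illustrative assumptions.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling for an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# Hypothetical fused RGB-D features from three backbone levels.
f1 = np.random.rand(56, 56, 8)
f2 = np.random.rand(28, 28, 8)
f3 = np.random.rand(14, 14, 8)

# Merge all levels at the finest resolution, as a decoder with skip
# connections would.
decoded = f1 + upsample_nearest(f2, 2) + upsample_nearest(f3, 4)
print(decoded.shape)  # (56, 56, 8)
```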

Single-stream models
Several RGB-D based salient object detection works [52,53,83,87,93,96,102] focus on a single-stream architecture to achieve saliency prediction. These models often fuse RGB images and depth information in the input channel or the feature learning part. For example, MDSF [87] employs a multi-scale discriminative saliency fusion framework as the salient object detection model, in which four types of features at three levels are computed and then fused to obtain the final saliency map. BED [83] utilizes a CNN architecture to integrate bottom-up and top-down information for salient object detection. It incorporates multiple features, including background enclosure distribution (BED) and low-level depth features (e.g., depth histogram distance and depth contrast), to boost salient object detection performance. PDNet [102] extracts depth-based features using a subsidiary network, making full use of depth information to assist the main-stream network.

Multi-stream models
Two-stream models [54,106,111] have two independent branches to process RGB images and depth cues, respectively; they often generate different high-level features or saliency maps, and then incorporate them in the middle stage or at the end of the two streams. Most recent deep learning-based models [40,42,45,55,92,104,109,112,114,117] utilize this two-stream architecture, with several models capturing the correlations between RGB images and depth cues across multiple layers. Moreover, some models utilize a multi-stream structure [38,103] and design different fusion modules to effectively fuse RGB and depth information in order to exploit their correlations.

Attention models
Existing RGB-D based salient object detection methods often treat all regions equally using the extracted features in the same way, while ignoring the fact that different regions can make different contributions to the final prediction map. These methods are easily affected by cluttered backgrounds. Furthermore, some methods either regard the RGB images and depth maps as having the same status or overly rely on depth information. This prevents them from considering the importance of different domains (RGB images or depth cues). To overcome such issues, several methods introduce attention mechanisms to weight the importance of different regions or domains.
ASIF-Net [117] captures complementary information from RGB images and depth cues using interwoven fusion, and weights saliency regions through a deeply supervised attention mechanism.
AttNet [111] introduces attention maps for differentiating between salient objects and background regions to reduce the negative influence of certain low-quality depth cues.
TANet [103] formulates a multi-modal fusion framework using RGB images and depth maps from bottom-up and top-down views. It then introduces a channel-wise attention module to effectively fuse the complementary information from different modalities and levels.
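A simplified sketch of channel-wise attention follows (squeeze-and-excitation style; not TANet's exact module). The softmax stands in for the small learned gating network that would normally produce the channel weights.

```python
import numpy as np

def channel_attention(feat):
    """Reweight each channel of an (H, W, C) feature map by a learned
    importance score (sketched here with a softmax over pooled means)."""
    # Squeeze: global average pool over space -> one value per channel.
    w = feat.mean(axis=(0, 1))
    # Excite: normalize into per-channel importance weights.
    w = np.exp(w) / np.exp(w).sum()
    # Scale: broadcast the weights back over the spatial dimensions.
    return feat * w

feat = np.random.rand(28, 28, 8)
out = channel_attention(feat)
print(out.shape)  # (28, 28, 8)
```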

Open-source implementations
Available open-source implementations of the RGB-D based salient object detection models reviewed in this survey are listed in Table 5. Further source code will be added as it becomes available.

RGB-D datasets
With the rapid development of RGB-D based salient object detection, various datasets have been constructed over the past several years. Table 6 summarizes nine popular RGB-D datasets, and Fig. 4 shows examples of images (including RGB images, depth maps, and annotations) from these datasets. We provide details for each dataset next.

STERE [139]. The authors collected 1250 stereoscopic images from Flickr (http://www.flickr.com/), NVIDIA 3D Vision Live (http://photos.3dvisionlive.com/), and the Stereoscopic Image Gallery (http://www.stereophotography.com/). The most salient objects in each image were annotated by three users. All annotated images were then sorted based on the overlapping salient regions, and the top 1000 images were selected to construct the final dataset. This was the first collection of stereoscopic images in this field.
GIT [47] consists of 80 color and depth images, collected using a mobile-manipulator robot in a realworld home environment. Each image is annotated based on pixel-level segmentation of its objects.
DES [49] consists of 135 indoor RGB-D images, taken by Kinect at a resolution of 640 × 480. When collecting this dataset, three users were asked to label the salient object in each image, and overlapping labeled areas were regarded as the ground truth.

NLPR [51] consists of 1000 RGB images and corresponding depth maps, obtained by a standard Microsoft Kinect. This dataset includes a series of outdoor and indoor locations, e.g., offices, supermarkets, campuses, and streets.
LFSD [140] includes 100 light fields collected using a Lytro light field camera, and consists of 60 indoor and 40 outdoor scenes. To label this dataset, three individuals were asked to manually segment salient regions; the segmented results were deemed ground truth when the overlap of the three results was over 90%.
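The consensus rule used to label LFSD can be expressed compactly in code. Intersection-over-union is an assumption about the exact overlap measure; the 0.9 threshold comes from the text.

```python
import numpy as np

def consensus_reached(masks, threshold=0.9):
    """Accept the annotation as ground truth only when the annotators'
    binary masks mutually overlap by at least `threshold` (IoU)."""
    inter = np.logical_and.reduce(masks).sum()
    union = np.logical_or.reduce(masks).sum()
    return union > 0 and inter / union >= threshold

m = np.zeros((10, 10), dtype=bool)
m[2:8, 2:8] = True
print(consensus_reached([m, m, m]))  # identical masks fully overlap
```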
NJUD [56] consists of 1985 stereo image pairs, collected from the Internet, 3D movies, and photographs taken by a Fuji W3 stereo camera.
SSD [85] was constructed using three stereo movies and includes indoor and outdoor scenes. It includes 80 samples; each image has a resolution of 960 × 1080.

DUT-RGBD [137] consists of 800 indoor and 400 outdoor scenes with corresponding depth images. This dataset provides several challenging factors: multiple and transparent objects, complex backgrounds, foregrounds similar to their backgrounds, and low-intensity environments.
SIP [38] consists of 929 annotated high-resolution images, with multiple salient persons in each image. In this dataset, depth maps were captured using a smart phone (Huawei Mate10). This dataset covers diverse scenes and various challenging factors, and is annotated with pixel-level ground truth.
A detailed dataset statistical analysis (including center bias, size of objects, background objects, object boundary conditions, and number of salient objects) can be found in Ref. [38].

Background
Salient object detection methods can be grouped into three categories according to the input data type: RGB, RGB-D, or light field [141]. We have already reviewed RGB-D based salient object detection models, in which depth maps provide geometric information to improve salient object detection performance to some extent. However, inaccurate or low-quality depth maps often decrease performance. To overcome this issue, light field salient object detection methods have been proposed to make use of the rich information captured by a light field. Specifically, light field data can provide an all-focus image, a focal stack, and a rough depth map [137]. A summary of light field salient object detection works is provided in Table 7; we now review them in more detail.

Traditional and deep models
Classic models for light field salient object detection often use superpixel-level handcrafted features [137, 140, 142-147, 149, 155]. Early work [140,147] showed that the unique refocusing capability of light fields can provide useful focus, depth, and object identity cues, leading to several salient object detection models using light field data. For example, Zhang et al. [143] utilized a set of focal slices to compute focusness-based saliency cues.

Refinement-based models
Several refinement strategies have been used to enforce neighborhood constraints or to reduce the homogeneity of multiple modalities for salient object detection. For example, in Ref. [142], the saliency dictionary was refined using an estimated saliency map. The MA method [145] employs a two-stage saliency refinement strategy to produce the final prediction map, so that adjacent superpixels obtain similar saliency values. LFNet [141] presents an effective refinement module to reduce the homogeneity between different modalities as well as to refine their dissimilarities.

Light field data
Five representative datasets are widely used in existing light field salient object detection methods, as we now describe.

LFSD [140] consists of 100 light fields of different scenes with 360 × 360 spatial resolution, captured using a Lytro light field camera. This dataset contains 60 indoor and 40 outdoor scenes, and most scenes include only one salient object. Three individuals were asked to manually segment salient regions in each image, and a segmentation was accepted as ground truth when all three results had an overlap of over 90%. (https://sites.duke.edu/nianyi/publication/saliencydetection-on-light-field/)

HFUT [145] consists of 255 light fields captured using a Lytro camera. Most scenes contain multiple objects at different locations and scales, with complex background clutter. (https://github.com/pencilzhang/HFUT-Lytro-dataset)

DUTLF-FS [151] consists of 1465 samples: 1000 for use as a training set and 465 for a test set. The resolution of each image is 600 × 400. This dataset contains several challenges, including low contrast between salient objects and cluttered backgrounds, multiple disconnected salient objects, and dark and bright lighting conditions.
PR. Given a saliency map S, we can convert it to a binary mask M and then compute precision P and recall R by comparing M with a ground-truth map G:

P = |M ∩ G| / |M|,   R = |M ∩ G| / |G|

A popular strategy is to binarize the saliency map S using a set of thresholds (from 0 to 255). For each threshold, we calculate a pair of precision and recall scores, and then combine them to obtain a PR curve that describes the performance of the model as the threshold varies.

F-measure (F β ). The F-measure takes into account both precision and recall in a single measure, using the weighted harmonic mean:

F β = ((1 + β 2 ) × P × R) / (β 2 × P + R)

where β 2 is set to 0.3 to emphasize precision [157]. We may again vary the threshold and compute the F-measure, yielding a set of F-measure values, from which we report the maximal or average F β .

MAE. This measures the average pixel-wise absolute error between a saliency map S and a ground-truth map G over all pixels:

MAE = (1 / (W × H)) Σ x=1..W Σ y=1..H |S(x, y) − G(x, y)|

where W and H denote the width and height of the map, respectively. MAE values are normalized to [0, 1].

S-measure (S α ). To capture the importance of the structural information in an image, S α [159] is used to assess the structural similarity between the regional perception (S r ) and object perception (S o ):

S α = α × S o + (1 − α) × S r

where α ∈ [0, 1] is a weight. We set α = 0.5 as the default, as suggested by Fan et al. [159].

E-measure (E φ ). E φ [160] was proposed based on cognitive vision studies to capture image-level statistics and local pixel matching information:

E φ = (1 / (W × H)) Σ x=1..W Σ y=1..H φ FM (x, y)

where φ FM denotes the enhanced-alignment matrix [160].
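The simpler metrics above translate directly into code. The sketch below implements MAE and the F-measure at a single fixed threshold (β² = 0.3, as in the text); the toy ground truth and the 0.5 threshold are illustrative choices.

```python
import numpy as np

def mae(S, G):
    """Mean absolute error between saliency map S and ground truth G."""
    return np.abs(S - G).mean()

def f_measure(S, G, thresh=0.5, beta2=0.3):
    """F-measure at one threshold: weighted harmonic mean of P and R."""
    M = S >= thresh                        # binarize the saliency map
    tp = np.logical_and(M, G).sum()        # true-positive pixel count
    P = tp / max(M.sum(), 1)               # precision
    R = tp / max(G.sum(), 1)               # recall
    return (1 + beta2) * P * R / max(beta2 * P + R, 1e-8)

G = np.zeros((4, 4)); G[1:3, 1:3] = 1      # toy ground-truth mask
S = G.copy()                               # a perfect prediction
print(mae(S, G), f_measure(S, G))          # 0.0 1.0
```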
To understand the best six models in depth, we discuss their main advantages below.

D 3 Net [38] consists of two key components, a three-stream feature learning module and a depth purifier unit. The three-stream feature learning module has three subnetworks: RgbNet, RgbdNet, and DepthNet. RgbNet and DepthNet are used to learn high-level feature representations for RGB and depth images, respectively, while RgbdNet is used to learn their fused representations. This three-stream feature learning module can capture modality-specific information as well as the correlation between modalities. Balancing the two aspects is very important for multi-modal learning and helps to improve salient object detection performance. The depth purifier unit acts as a gate to explicitly remove low-quality depth maps, whose effects other existing methods often do not consider. Because low-quality depth maps can hinder the fusion of RGB images and depth maps, the depth purifier unit can ensure effective multi-modal fusion to achieve robust salient object detection.
JL-DCF [43] consists of a joint learning (JL) module and a densely-cooperative fusion (DCF) module. Specifically, the JL module is used to learn robust saliency features, while the DCF module is used for complementary feature discovery. This method uses a middle-fusion strategy to extract deep hierarchical features from RGB images and depth maps, in which cross-modal complementarity is effectively exploited to achieve accurate prediction.
UC-Net [44], instead of producing a single saliency prediction, produces multiple predictions by modeling the distribution of the feature output space as a generative model conditioned on RGB-D images. Because each person has specific preferences in labeling a saliency map, the stochastic characteristic of saliency may not be captured when a single saliency map is produced for an image pair using a deterministic learning pipeline. The strategy in this model can take into account human uncertainty in saliency annotation. Moreover, depth maps can suffer from noise. Directly fusing RGB images and depth maps can cause the network to fit this noise. Therefore, a depth correction network, designed as an auxiliary component, is used to refine depth information with a semantic guided loss. All of these key components help to improve salient object detection performance.
In SSF [39], a complementary interaction module (CIM) is developed to explore discriminative cross-modal complementarity and to fuse cross-modal features, where region-wise attention is introduced to supplement rich boundary information for each modality. A compensation-aware loss is used to improve the network's confidence for hard samples in unreliable depth maps. These key components enable the proposed model to effectively explore and establish the complementarity of cross-modal feature representations, while at the same time reducing the negative effects of low-quality depth maps, boosting salient object detection performance.
ICNet [42] uses an information conversion module to interactively and adaptively explore correlations between high-level RGB and depth features. A cross-modal depth-weighted combination block is introduced to enhance the differences between the RGB and depth features at each level, ensuring that the features are treated differently. ICNet exploits the complementarity of cross-modal features, as well as exploring the continuity of cross-level features, both of which help to achieve accurate predictions.

S 2 MA [41] uses a self-mutual attention module (SAM) to fuse RGB and depth images, integrating self-attention and mutual attention to propagate context more accurately. The SAM can provide additional complementary information from multi-modal data to improve salient object detection performance, overcoming the limitation of self-attention alone, which uses only a single modality. To reduce the effects of low-quality depth cues (due to, e.g., noise), a selection mechanism is used to reweight the mutual attention. This can filter out unreliable information, resulting in more accurate saliency prediction.

Attribute-based evaluation
To investigate the influence of different factors, such as object scale, background clutter, number of salient objects, indoor or outdoor scene, background objects, and lighting conditions, we carried out diverse attribute-based evaluations on several representative RGB-D based salient object detection models.
Object scale. To characterize the scale of a salient object, we compute the ratio of the size of the salient area to that of the whole image. We define three object scales: small, when the ratio is less than 0.1; large, when the ratio is greater than 0.4; and medium, otherwise. For this evaluation, we built a hybrid dataset with 2464 images collected from STERE [139], NLPR [51], LFSD [140], DES [49], and SIP [38], in which 24%, 69.2%, and 6.8% of images have small, medium, and large salient objects, respectively. The constructed hybrid dataset can be found at https://github.com/taozh2017/RGBD-SODsurvey. Some sample images with objects of different scales are shown in Fig. 8. The results of the attribute-based comparison w.r.t. object scale are shown in Table 8. It can be observed that all methods perform best at detecting small salient objects and worst for large salient objects. The three most recent models, JL-DCF [43], UC-Net [44], and S 2 MA [41], achieve the best performance; D 3 Net [38], SSF [39], A2dele [40], and ICNet [42] also obtain promising performance.
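The object-scale rule above can be written directly as code, with the 0.1 and 0.4 thresholds taken from the text; the toy mask is an illustrative example.

```python
import numpy as np

def scale_category(gt_mask):
    """Categorize a salient object by the fraction of image pixels it
    covers: < 0.1 is small, > 0.4 is large, otherwise medium."""
    ratio = gt_mask.sum() / gt_mask.size
    if ratio < 0.1:
        return "small"
    if ratio > 0.4:
        return "large"
    return "medium"

gt = np.zeros((100, 100), dtype=bool)
gt[:20, :20] = True                   # object covers 4% of the image
print(scale_category(gt))             # small
```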
Background clutter. It is difficult to directly characterize background clutter.
Since classic salient object detection methods tend to use prior information or color contrast to locate salient objects, they often fail in the presence of complex backgrounds. Thus, in this evaluation, we utilized five traditional salient object detection methods: BSCA [161], CLC [162], MDC [163], MIL [164], and WFD [165], to first detect salient objects in various images, and then categorized these images as having simple or complex backgrounds according to the results. Specifically, we first constructed a hybrid dataset with 1400 images collected from three datasets (STERE [139], NLPR [51], and LFSD [140]). Then, we applied the five models to this dataset and obtained S α values for each image, which we used to characterize images as follows. If all S α values are higher than 0.9, the image is considered to have a simple background. If all S α values are lower than 0.6, the image is said to have a complex background. The remaining images are deemed uncertain. Some example images with these three types of background clutter are shown in Fig. 9. The constructed hybrid dataset can be found at https://github.com/taozh2017/RGBD-SODsurvey.
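The categorization rule above reduces to two threshold checks over the five models' S α scores (thresholds 0.9 and 0.6 from the text; the example score lists are illustrative).

```python
def background_category(s_alpha_scores):
    """Label an image by the S-measure scores of the five traditional
    models: all > 0.9 -> simple, all < 0.6 -> complex, else uncertain."""
    if all(s > 0.9 for s in s_alpha_scores):
        return "simple"
    if all(s < 0.6 for s in s_alpha_scores):
        return "complex"
    return "uncertain"

print(background_category([0.95, 0.93, 0.97, 0.92, 0.96]))  # simple
print(background_category([0.95, 0.40, 0.55, 0.50, 0.45]))  # uncertain
```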
The results of the attribute-based comparison w.r.t. background clutter are shown in Table 9. All models are worse at salient object detection for images with complex backgrounds than simple ones. Among the representative models, JL-DCF [43], UC-Net [44], and SSF [39] achieve the three best results. The four most recent models: D 3 Net [38], S 2 MA [41], A2dele [40], and ICNet [42], obtain better performance than the other models.

Single and multiple objects.
For this evaluation, we constructed a hybrid dataset with 1229 images from the NLPR [51] and SIP [38] datasets. Some example images with single and multiple salient objects are shown in Fig. 10. The comparison results are shown in Fig. 11. From the results, we can see that it is easier to detect a single salient object than multiple ones.

Indoors and outdoors.
We evaluated the performance of different RGB-D based salient object detection models on indoor and outdoor scenes. For this evaluation, we constructed a hybrid dataset collected from the DES [49], NLPR [51], and LFSD [140] datasets. The results are shown in Fig. 12. It can be seen that most models struggle more to detect salient objects in indoor scenes than in outdoor ones, possibly because indoor environments often have varying lighting conditions.

Background objects. We evaluated the performance of RGB-D based salient object detection models in the presence of different backgrounds. We used the SIP dataset [38], splitting it into eight categories: car, barrier, flower, grass, road, sign, tree, and other. The results of the comparison are shown in Table 10. All methods' performance varies with the background objects present. Among the 24 representative RGB-D based models, JL-DCF [43], UC-Net [44], and SSF [39] achieve the three best results. The four most recent models, i.e., D 3 Net [38], S 2 MA [41], A2dele [40], and ICNet [42], obtain better performance than the others.

Lighting conditions.
The performance of salient object detection methods can be affected by lighting conditions. To determine these effects on different RGB-D based salient object detection models, we conducted an evaluation on the SIP dataset [38], whose images we split into two categories: sunny and low-light. The results of the comparison are shown in Table 11.
Low light negatively impacts salient object detection performance. Among the models compared, UC-Net [44] obtained the best performance under sunny conditions, while JL-DCF [43] achieved the best result under low light.

Table 10
Attribute-based study w.r.t. background objects: car, barrier, flower, grass, road, sign, tree, and other. The methods compared include 24 representative RGB-D based salient object detection models (9 traditional and 15 deep learning-based) evaluated on the SIP dataset [38] in terms of MAE and Sα. The three best results are shown in red, blue, and green, respectively.

Effects of low-quality depth maps
Depth maps with detailed spatial information have proven beneficial in detecting salient objects against cluttered backgrounds, but depth quality directly affects salient object detection performance. The quality of depth maps varies tremendously across different scenarios due to the nature of depth sensors, posing a challenge when trying to reduce the effects of low-quality depth maps. However, most existing methods directly fuse RGB images with raw depth maps, without considering the effects of low-quality depth maps.
There are a few notable exceptions. For example, in Ref. [53], a contrast-enhanced network was proposed to learn enhanced depth maps with much higher contrast than the original depths. In Ref. [39], a compensation-aware loss was designed to pay more attention to hard samples containing unreliable depth information. D3Net [38] uses a depth purifier unit to classify depth maps as reasonable or low-quality; it also acts as a gate to filter out low-quality depth maps. However, such methods often employ a two-step strategy to achieve depth enhancement and multi-modal fusion [39,53], or an independent gate operation to remove poor depth maps, which can lead to suboptimal results. There is thus a need to develop an end-to-end framework that can achieve depth enhancement or adaptively assign low weights to poor depth maps during multi-modal fusion, which would be more helpful in reducing the effects of low-quality depth maps and boosting salient object detection performance.
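As a rough sketch of the adaptive-weighting idea (not any particular published model; the quality logit here stands in for what an end-to-end network would predict from the depth features themselves), fusion with a depth-quality gate might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(rgb_feat, depth_feat, quality_logit):
    """Fuse RGB and depth features, down-weighting unreliable depth.

    quality_logit: a learned scalar estimating depth reliability;
    sigmoid maps it to a gate in (0, 1).
    """
    w = sigmoid(quality_logit)          # ~1 for reliable depth, ~0 for poor
    return rgb_feat + w * depth_feat    # poor depth contributes little

rgb = np.ones((4, 4))
depth = np.full((4, 4), 0.5)
good = gated_fusion(rgb, depth, quality_logit=5.0)   # gate ≈ 0.99
bad = gated_fusion(rgb, depth, quality_logit=-5.0)   # gate ≈ 0.007
```

In a real network, the gate would be predicted by a small branch over the depth features and trained jointly with the fusion module, so that low-quality depth maps are suppressed without a separate filtering step.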

Incomplete depth maps
In RGB-D datasets, some low-quality depth maps are inevitable due to the limitations of the acquisition devices. As previously discussed, several depth enhancement algorithms have been used to improve the quality of depth maps. However, depth maps that suffer from severe noise or blurred edges are often discarded. In this case, we have complete RGB images but some samples lack depth maps, which is similar to the incomplete multi-view learning problem [166][167][168][169][170]. We may call this problem incomplete RGB-D based salient object detection. As current models only focus on salient object detection using complete RGB images and depth maps, we believe this could be a new direction for RGB-D salient object detection.

Depth estimation
Depth estimation provides an effective solution to recover high-quality depths and overcome the effects of low-quality depth maps. Various depth estimation approaches [171][172][173][174] have been developed, which could be introduced into the RGB-D based salient object detection task to improve performance.

Adversarial learning-based fusion
It is important to effectively fuse RGB images and depth maps for RGB-D based salient object detection. Existing models often employ different fusion strategies (early fusion, middle fusion, or late fusion) to exploit the correlations between RGB images and depth maps. Recently, generative adversarial networks (GANs) [175] have gained widespread attention for the saliency detection task [176,177]. In common GAN-based salient object detection models, a generator takes RGB images as inputs and generates the corresponding saliency maps, while a discriminator determines whether a given saliency map is synthetic or ground-truth. GAN-based models could easily be extended to RGB-D salient object detection, which could help to boost performance due to their superior feature learning ability. Moreover, GANs could also be used to learn common feature representations for RGB images and depth maps [114], which could help with feature or saliency map fusion and further boost salient object detection performance.
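To make the adversarial setup concrete, the two objectives can be written as standard binary cross-entropy losses (a generic sketch, not the loss of any specific RGB-D model; the function names and the adversarial weight are ours):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, averaged over all pixels/samples."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def discriminator_loss(d_real, d_fake):
    # d_real: scores on ground-truth saliency maps (should -> 1)
    # d_fake: scores on generated saliency maps   (should -> 0)
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake, pred_map, gt_map, adv_weight=0.01):
    # Pixel-wise saliency loss plus a term that rewards fooling
    # the discriminator (pushing d_fake toward 1).
    return bce(pred_map, gt_map) + adv_weight * bce(d_fake, np.ones_like(d_fake))
```

An RGB-D extension would simply feed the generator both modalities (or fused features); the adversarial objectives themselves are unchanged.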

Attention-induced fusion
Attention mechanisms have been widely applied to various deep learning-based tasks [178][179][180][181], allowing networks to selectively pay attention to a subset of regions for extracting powerful and discriminative features. Co-attention mechanisms have also been developed to explore the underlying correlations between multiple modalities. They are widely studied in visual question answering [182,183] and video object segmentation [184]. Thus, for the RGB-D based salient object detection task, we could also develop attention-based fusion algorithms to exploit correlations between RGB images and depth cues to improve the performance.
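The core of such a co-attention scheme is scaled dot-product attention across modalities; a minimal single-head sketch (the feature shapes and names are illustrative, and real models would add learned query/key/value projections) is:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(rgb_feat, depth_feat):
    """RGB features attend to depth features (scaled dot-product).

    rgb_feat:   (n, d) queries from the RGB stream
    depth_feat: (m, d) keys/values from the depth stream
    Returns an (n, d) depth-aware update for the RGB features.
    """
    d = rgb_feat.shape[-1]
    attn = softmax(rgb_feat @ depth_feat.T / np.sqrt(d))  # (n, m) affinities
    return attn @ depth_feat                              # weighted depth cues

rgb = np.random.randn(16, 8)    # 16 RGB positions, 8-dim features
depth = np.random.randn(16, 8)
out = cross_attention(rgb, depth)   # shape (16, 8)
```

A symmetric pass letting depth features attend to RGB features would give the bidirectional (co-attention) variant discussed above.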

Different supervision strategies
Existing RGB-D models often use a fully supervised strategy to learn saliency prediction models. However, annotating pixel-level saliency maps is a tedious and time-consuming procedure. To alleviate this issue, there has been increasing interest in weakly and semi-supervised learning, which have been applied to salient object detection [185][186][187][188][189]. Semi- and weak supervision could also be introduced into RGB-D salient object detection, by leveraging image-level tags [185] and pseudo pixel-wise annotations [188,190], to improve detection performance. Furthermore, several studies [191,192] have suggested that models pretrained using self-supervision can effectively be used to achieve better performance. Therefore, we could train saliency prediction models on large amounts of annotated RGB images in a self-supervised manner and then transfer the pre-trained models to the RGB-D salient object detection task.

Dataset size
Although there are nine public RGB-D datasets for salient object detection, their size is quite limited, with the largest, NJUD [56], containing about 2000 samples. When compared to other RGB-D datasets for generic object detection or action recognition [193,194], the RGB-D datasets for salient object detection are very small. Thus, it is essential to develop new large-scale RGB-D datasets to serve as baselines for future research.

Complex backgrounds & task-driven datasets
Most existing RGB-D datasets contain images with one salient object, or multiple objects but against a relatively clean background. However, real-world applications often involve much more complicated situations, e.g., occlusion, appearance change, and low illumination, which can reduce salient object detection performance. Thus, collecting images with complex backgrounds is critical to improving the generalizability of RGB-D salient object detection models. Moreover, for some tasks, images with specific salient object(s) must be collected. For example, road sign recognition is important in driver assistance systems, requiring images with road signs to be collected. Thus, it is essential to construct task-driven RGB-D datasets like SIP [38].

Model design for real-world scenarios
Some smart phones can capture depth maps (e.g., images in the SIP dataset were captured using a Huawei Mate10). Thus, it is feasible to perform salient object detection for real-world applications on smart devices. However, most existing methods rely on complicated, deep DNNs to increase model capacity and achieve better performance, preventing them from being directly applied to such platforms.
To overcome this, model compression [195,196] techniques could be used to learn compact RGB-D based salient object detection models with promising detection accuracy. Moreover, JL-DCF [43] utilizes a shared network to locate salient objects using RGB and depth views, which largely reduces the model parameters and makes real-world applications feasible.

Extension to RGB-T
In addition to RGB-D salient object detection, there are several other methods that fuse different modalities for better detection, such as RGB-T salient object detection, which integrates RGB and thermal infrared data. Thermal infrared cameras can capture the heat radiation emitted from any object, making thermal infrared images insensitive to illumination conditions [197]. Therefore, thermal images can provide supplementary information to improve salient object detection when images of salient objects suffer from varying light, glare, or shadows. Some RGB-T models [197][198][199][200][201][202][203][204][205] and datasets (VT821 [199], VT1000 [203], and VT5000 [205]) have already been proposed over the past few years. Like for RGB-D salient object detection, the key aim of RGB-T salient object detection is to fuse RGB and thermal infrared images and exploit the correlations between the two modalities. Thus, several advanced multi-modal fusion technologies in RGB-D salient object detection could be extended to the RGB-T salient object detection task.

Conclusions
This paper has presented the first comprehensive review of RGB-D based salient object detection models. We have reviewed the models from different perspectives, and summarized popular RGB-D salient object detection datasets as well as providing details of each. As light fields also provide depth information, we have also reviewed popular light field salient object detection models and related benchmark datasets. We have comprehensively evaluated 24 representative RGB-D based salient object detection models, as well as performing an attribute-based evaluation based on new datasets. Moreover, we have discussed several challenges and highlighted open directions for future research. In addition, we have briefly discussed the extension to RGB-T salient object detection to improve robustness to lighting conditions. Although RGB-D based salient object detection has made notable progress over the past several years, there is still significant room for improvement. We hope this survey will generate more interest in this field.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.