Salient object detection for RGBD video via spatial interaction and depth-based boundary refinement

Recently proposed state-of-the-art saliency detection models rely heavily on labeled datasets and rarely focus on effective RGBD feature fusion, which limits their generalization ability. In this paper, we propose a depth-based interaction and refinement network (DIR-Net) that fully leverages the depth information accompanying RGB images to generate and refine the corresponding saliency segmentation maps. Our framework comprises three modules. A depth-based refinement module (DRM) and an RGB module work in parallel while coordinating via interactive spatial guidance modules (ISGMs), which utilize spatial and channel attention computed from both depth and RGB features. In each layer, the features in both modules are refined and guided by the spatial information obtained from the other module through ISGMs. In the RGB module, before the depth-guided feature map is sent to the decoder, a convolutional gated recurrent unit (ConvGRU)-based block is introduced to handle temporal information. Because RGB features carry clearer motion information, this block also guides the temporal information in the DRM. By merging the results from the DRM and RGB modules, a segmentation map with distinct boundaries is generated. Given the lack of depth images in popular public datasets, we utilize a depth estimation network, combined with manual postprocessing-based correction, to generate depth images for the DAVIS and UVSD datasets. The state-of-the-art performance achieved on both the original and new datasets illustrates the advantage of our RGBD feature fusion strategy, at a real-time speed of 19 fps on a single GPU.


Introduction
Salient object detection (SOD) aims to achieve pixel-level binary classification for salient objects in image or video frames. This skill mimics the attention mechanism of human eyes. The human brain always finds the vital objects, rather than all objects, in a scene and then rapidly responds to them; it is hoped that computers can achieve the same ability. As an important preprocessing method, SOD is able to quickly distinguish the most visually distinctive objects for further computer vision tasks. SOD has been applied to a wide range of vision applications such as object recognition [1], video and image segmentation [2,3], and object tracking [4]. Thanks to the emergence of deep neural networks, most computer vision tasks, such as object detection [5,6] and semantic segmentation [7], have made great progress. Inspired by this, convolutional neural networks (CNNs) have been investigated for SOD [8,9]. Existing SOD methods mainly focus on 2D or 3D images and 2D videos, but methods for RGBD videos are still limited. However, depth information is critical in the 3D dynamic world. 2D SOD fails to obtain sufficient information and suffers from low prediction accuracy in low-contrast or complex scenes. In such cases, depth information can supply more spatial information and guide an RGB module to find salient objects. In this regard, we fully consider the characteristics of the depth channel in video SOD and design an efficient network structure to fuse multimodal spatial and temporal information.

Fig. 1 Comparison between previous and current depth images; the images in rows 1 and 2 are previous examples from database1 and database2, respectively; the images in rows 3 and 4 represent examples in eloesdataset and our new 3D-VSOD dataset; RGB images, depth images, and ground-truth images are located from right to left in a row
There are two limitations hindering the development of RGBD video SOD approaches. First, as shown in Fig. 1, limited by hardware, previous depth cameras could hardly obtain high-quality depth images [10,11], especially in terms of boundary details. It is difficult to improve detection with such rough data, and the opposite effect may even result. On the other hand, although deep neural networks perform excellently in computer vision, they require large-scale data. Unfortunately, no annotated RGBD video datasets are currently available in the SOD field because annotating video datasets is time-consuming. Therefore, we create the first pixelwise-annotated dataset, called 3D-VSOD, to train our model with real data. This is still insufficient for deep learning, so we employ Monodepth2 [12], a state-of-the-art depth estimation network, to generate depth images from existing 2D video SOD datasets (DAVIS and UVSD). However, most depth images obtained in this way are of poor quality, as shown in Fig. 1, so we refine them using the ground truth (GT) in each dataset and obtain more realistic results. We also propose the first deep neural network to utilize depth information in video SOD.
Many effective network structures have been proposed to extract spatiotemporal information [13,14], but depth information has not been fused in a valid manner. Considering that depth images have poor semantic information but clear boundaries, we design a depth-based refinement module (DRM), similar to U-Net [15], that extracts depth information while preserving boundary details as much as possible. The clear boundary information is further used to optimize the final predictions of the RGB module. A non-local recurrent enhancement module is introduced to improve their temporal coherence.
In summary, our major contributions are as follows; our code and dataset will be available on GitHub.

1. In the overall network, a DRM is designed to extract depth cues that are further used to guide potential salient regions and refine the boundaries in RGB feature maps. Furthermore, owing to the temporal consistency between the depth and RGB sequences, spatiotemporal connections in the RGB sequence are used to guide the temporal information of the depth sequence. In this way, we take full advantage of multimodal spatial and temporal information.

2. Starting from the conflicts among semantic features under multiple modalities, we leverage the consistency of the spatial distribution to devise an interactive spatial guidance module (ISGM). The ISGM excavates spatial saliency cues in the DRM and RGB modules, providing interactive guidance between the two modules as a potential saliency prior. The two modules thus utilize their respective properties to jointly boost spatial saliency.

3. As the first deep learning approach for RGBD video SOD, our method addresses the insufficient-training-data problem. First, we optimize the low-quality depth images generated by a depth estimation network via the GT. To ensure an adequate proportion of real data in the training set, we construct the first RGBD video dataset (3D-VSOD) with pixelwise annotations for saliency detection; it will be available at https://github.com/ELOESZHANG/3D-VSOD-dataset.

2D SOD
2D SOD was extensively researched before depth cameras were widely used. The task and the first model were proposed by Itti [16], who extracted multiscale features such as colors, intensities and orientations using center-surround mechanisms and decided whether pixels were salient according to these features. Since then, researchers have proposed many other hand-designed features suitable for image SOD, such as Fourier spectral residuals [17] and Markov chains [18]. However, hand-designed features are always low-level and may ignore global features. For 2D video, early features were also hand-designed. Compared with images, temporal information is distinctive and important in video SOD. The initial models incorporated only limited temporal information into spatial saliency, such as region space-time dynamic contrasts [19] and temporal consistency [20]. Optical flow is also an intuitive representation of temporal features and can usually obtain better representations; many researchers have designed models based on optical flow, such as local gradient flow [21] and geodesic distances [22]. However, optical flow is computationally expensive, which is not only time-consuming but also consumes many computational resources. In recent years, deep neural networks have brought great improvements to computer vision, including SOD. With the availability of large-scale annotated image datasets and the strong computing power of current GPUs, deep neural networks can extract ample semantic features, so they require less manual intervention yet obtain better results. Many classical models trained on ImageNet are available as feature extractors and are widely used as backbones. Later models can be fine-tuned from the pretrained parameters, thereby avoiding excessive training and the need for large training datasets.
Deep learning methods were first used for 2D images. Li et al. [23] used a pretrained CNN to extract multiscale features for each superpixel and fused different hierarchical saliency maps. Lee et al. [24] took traditional hand-crafted features into consideration: they fused high-level semantic features in a CNN and encoded hand-crafted features originating from superpixels into a feature vector for prediction.
Inspired by this, deep neural networks have also been employed for 2D videos. Classic CNNs are good at extracting spatial information but have difficulty extracting temporal information. Wang et al. [25] used a fully convolutional network (FCN) to process a single image for static saliency, and a similar structure was employed to process frame pairs for dynamic saliency. Due to their great performance on time-sequence problems, long short-term memory (LSTM) networks have also been used to process temporal information in videos. Wang et al. [26] determined static saliency through an attention mechanism and designed an LSTM-based network to encode temporal information. Song et al. [27] designed a bidirectional LSTM network to enhance temporal features.
In real systems, perturbations and errors also affect the network system. The literature [28,29] puts forward methods to handle various uncertainties arising in practice. Wei et al. [30] verified through three numerical examples that response diffusion facilitates the input-to-state stability of neural network systems. For salient object detection, Han et al. [31] developed a stacked denoising autoencoder with a deep architecture to model the background, exploring potential patterns and reducing errors. Zhou et al. [32] proposed an iterative semi-supervised learning framework that gradually converges the system to an optimized stable state.

3D SOD
3D spatial information supplies more usable features in RGBD images and videos, and the fusion of depth features into spatial saliency is a fundamental issue. In 3D image SOD, Fan et al. [33] predicted salient objects through depth contrasts and depth-weighted color contrasts. Li et al. [34] pruned foreground superpixel outliers to obtain foreground saliency information. Some deep learning methods have also been developed. Qu et al. [35] used a CNN to learn interaction mechanisms based on low-level color and depth features. Zhang et al. [36] designed a complementary interaction module to discriminatively select useful representations from RGB and depth images and effectively integrate cross-modal features. Zhao et al. [37] designed a feature-enhanced module to enhance the contrast between the foreground and background and fused it between the convolutional blocks of a VGGNet-based feature extraction network.
Compared with the 2D videos in the aforementioned studies, 3D videos have received less attention, and most studies are based on traditional hand-crafted features. Zhang et al. [38] calculated spatial saliency through superpixels and determined temporal saliency through optical flow and depth confidence regions; they then fused these features through depth confidence optimization. Kim et al. [39] enhanced the influence of depth information in motion and defined contrast disparity, motion strength and location strength factors. Fang et al. [40] proposed a stereoscopic video detection method based on gestalt theory and combined depth motion, temporal and spatial saliency maps. Lino et al. [41] proposed a center-bias weighting function to combine these three feature maps. Zhang et al. [42] emphasized the priority of closer objects in depth maps.
Deep learning frameworks always achieve best-in-class performance, and deep learning algorithms significantly outperform traditional methods in most cases. However, no deep learning-based 3D video SOD approaches have been proposed to date. In this paper, we take advantage of the power of deep neural networks for SOD to build a depth-based interaction and refinement network. Table 1 compares our DIR-Net with 7 previous representative SOD methods. It can be seen that there has been relatively little research on 3D VSOD (here, RGBD VSOD) in recent years. Most deep learning SOD methods do not focus on multi-source information, while traditional methods are clearly unable to deal with new datasets containing huge amounts of data. The proposed method is not only based on deep learning but can also effectively extract and fuse features from multi-source information, ensuring the accuracy of 3D VSOD while guaranteeing computational speed.

Proposed method
In this section, we devise an interactive spatial guidance approach to boost spatial features. The depth-based interaction and refinement network (DIR-Net) extracts spatial features from RGB and depth frames in the RGB module and the DRM, respectively. These features are regarded as prior information, and they guide each other between the two modules through an ISGM. Finally, the output of the DRM is used to optimize the output of the RGB module. In consideration of the consistency of the temporal features between the depth and RGB images, the temporal module improves the spatiotemporal coherence only in high-level feature maps to reduce the computational complexity of the algorithm. We also tackle the dataset issue, a fundamental challenge in deep learning: in addition to building a new dataset, we use a depth estimation network to generate depth maps from existing 2D video SOD datasets and optimize them to reduce the annotation workload. Details are given in the following subsections.

Network architecture
The proposed DIR-Net consists of two U-Net-like encoder-decoder modules that analyze RGB and depth frames, as shown in Fig. 2. The spatial feature extractor in the RGB module employs the ResNet50 structure, with one difference: the downsampling operation in the last block is replaced with atrous convolutions at a rate of 2 to reduce the loss of information and simultaneously enlarge the receptive field. After that, an atrous spatial pyramid pooling (ASPP) module [44] is connected to obtain information at multiple scales. A residual connection is established to retain the low-level information. The structure follows U-Net but adopts residual skip connections rather than simple channel concatenation to enhance the learning ability of the network. ISGMs are inserted between convolutional blocks in the extractor to exchange spatial information with the DRM; the details of this process are described in the next section.
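For concreteness, the sketch below shows one way such an encoder could be assembled with torchvision. The use of torchvision's ResNet50 and ASPP implementations, the atrous rates [6, 12, 18], and the 256 ASPP output channels are our assumptions for illustration; the paper only specifies the dilated last block and the ASPP module.

```python
# A minimal sketch of an RGB-module encoder under the assumptions above.
import torch
import torchvision
from torchvision.models.segmentation.deeplabv3 import ASPP

class RGBEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # replace_stride_with_dilation keeps the last block at stride 1 and
        # uses atrous (dilated) convolutions instead, as described above.
        backbone = torchvision.models.resnet50(
            weights="IMAGENET1K_V1",
            replace_stride_with_dilation=[False, False, True])
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.layers = torch.nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        # Multi-scale context on top of the deepest features (2048 channels).
        self.aspp = ASPP(in_channels=2048, atrous_rates=[6, 12, 18])

    def forward(self, x):
        x = self.stem(x)
        feats = []  # per-block features, later exchanged through the ISGMs
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats, self.aspp(x)
```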
Many networks for depth images in current studies adopt structures similar to RGB SOD models and are initialized with parameters pretrained on ImageNet. However, we argue that depth images generally do not possess rich semantic information, so a deep structure may be redundant and induce unnecessary training costs. Moreover, due to the incompatibility between the two modalities, applying parameters trained on RGB images directly to a depth network is questionable. Therefore, the DRM with random initialization is adopted. We add a bridge between the encoder and the corresponding decoder (as in U-Net) to reduce the loss of boundary details and refine the boundaries in the final predictions. The DRM has 5 convolutional blocks with 3 × 3 kernels, each of which outputs the same H and W as the corresponding block in the RGB module to enable spatial information interaction. To utilize the advantages of the two modalities and achieve better fusion, ISGMs are inserted between the two modules. Each ISGM has two inputs: the feature maps behind each convolutional block of the DRM and the corresponding feature maps in the RGB module. More weight is given to the most salient regions according to their features, and the weights are propagated to the other module to enhance its salient regions. In the fusion stage, the predictions of the RGB module and the DRM are simply added, which significantly improves the boundary details of the RGB module's predictions. This enables the DRM to simultaneously act as a prediction refinement structure.
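A minimal sketch of a DRM-like encoder is given below. The channel widths, strides, and 1 × 1 bridge convolutions are assumptions: the paper fixes only the five blocks, the 3 × 3 kernels, the random initialization, and the requirement that each block's H × W match the corresponding RGB-module block.

```python
# A sketch of a shallow, randomly initialized DRM encoder (assumed widths).
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

class DRM(nn.Module):
    def __init__(self, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        cins = (1,) + widths[:-1]   # single-channel depth input
        # Strides are placeholders; they should be chosen so that block i
        # matches the H x W of the corresponding RGB-module block.
        strides = (2, 2, 2, 2, 1)
        self.blocks = nn.ModuleList(
            [conv_block(ci, co, s) for ci, co, s in zip(cins, widths, strides)])
        # "Bridges": 1x1 convolutions carrying each encoder block's features
        # to the decoder to preserve boundary detail.
        self.bridges = nn.ModuleList([nn.Conv2d(w, w, 1) for w in widths])

    def forward(self, depth):
        feats = []
        x = depth
        for block, bridge in zip(self.blocks, self.bridges):
            x = block(x)
            feats.append(bridge(x))  # handed to the decoder and the ISGMs
        return feats
```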
For temporal features, considering that the temporal features of the depth and RGB images are consistent and that RGB images contain richer information, which is more conducive to temporal feature extraction, we use the temporal information of the RGB stream to guide that of the depth stream. This guidance addresses the lack of detailed information and the indistinct temporal features caused by noise in the depth stream. A non-local enhancement and recurrent (NER) module [14] is integrated to improve the spatiotemporal coherence (only in high-level RGB feature maps). The RGB module takes a video sequence as input, but the spatial feature extractor handles each frame's features $f_i^{feature}$ individually. $f_i^{feature}$ and the feature maps of its adjacent frames ($f_{i-1}^{feature}$, $f_{i+1}^{feature}$) behind the ASPP layer are collected as feature clips that are input into the NER module to extract spatiotemporal information. Specifically, an NER module is a combination of two non-local 3D blocks [45] and two convolutional gated recurrent units (ConvGRUs) [46]. A ConvGRU is composed of a reset gate and an update gate that can be trained to remember or forget information in a sequence, helping the network describe the sequential evolution of a video. [47] showed that two ConvGRU modules in opposite directions can strengthen spatiotemporal information exchange; thus, bidirectional ConvGRUs are used to extract bidirectional temporal features. Non-local blocks are placed symmetrically in front of and behind the ConvGRUs to calculate the global responses of the spatiotemporal features. After the second non-local block, high-level features with spatiotemporal coherence are decoded in the spatial decoder to obtain spatiotemporal and depth-enhanced predictions in the RGB module. To improve the temporal characteristics of the depth stream features, the temporal extraction module of the depth stream adopts one-way temporal extraction coordinated with RGB temporal feature guidance. During the guidance process, the temporal features of the RGB and depth streams are first stacked to introduce more informative temporal features into the depth stream; the temporal information is then integrated, transferring the RGB stream's temporal information to the depth stream features. The feature map of each frame, combined with the temporal information of its modality, is finally used as the input of the respective stream's pixel classifier for decoding.
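The sketch below illustrates a ConvGRU cell and a bidirectional pass over a three-frame clip. The zero initial hidden state and the additive combination of the two directions are our assumptions; the papers cited above ([46,47]) define the gating itself.

```python
# A minimal ConvGRU cell and bidirectional pass (assumed details as noted).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Update and reset gates computed from the input and hidden state.
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)            # update gate z, reset gate r
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde     # learned remember/forget blend

def bidirectional_convgru(cell_f, cell_b, clip):
    # clip: list of per-frame feature maps, e.g. ASPP outputs of
    # frames (i-1, i, i+1); returns per-frame spatiotemporal features.
    h = torch.zeros_like(clip[0])
    fwd = []
    for x in clip:                           # forward pass over the clip
        h = cell_f(x, h)
        fwd.append(h)
    h = torch.zeros_like(clip[0])
    bwd = []
    for x in reversed(clip):                 # backward pass over the clip
        h = cell_b(x, h)
        bwd.append(h)
    return [f + b for f, b in zip(fwd, reversed(bwd))]
```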

ISGMs
Feature fusion across multiple modalities is a challenging task. Simple concatenation may ignore the different properties of RGB and depth images. The pixel values in RGB channels represent colors and contain some semantic information, whereas the pixel values in depth images represent the distances between objects and the camera and thus contain more spatial information. As a result, it is laborious for a network to learn features merged simply through approaches such as concatenation under existing pretrained parameters. Nevertheless, the two modalities have high similarity in space, though not in semantics. In other words, we can fuse these features in the spatial dimension but not in the channel dimension. For this reason, we extract spatial features separately in the two modules and let them influence each other only in the spatial dimension. At the same time, the RGB module can process semantic information without channel interference.
As mentioned above, the feature maps of the two modules match each other well in the spatial dimension. Therefore, each module can tell its partner which regions should be given more attention spatially. The interaction serves two purposes: 1. Due to the lack of semantic information, the depth feature extractor in the DRM may make incorrect judgments when occlusions or other insignificant objects are present at similar depths. The abundant semantic information of the RGB modality can assign higher weight to potential salient regions and thereby guide the DRM. 2. On the other hand, the RGB module may struggle to detect complete salient objects and may even fail to find them in some complex scenes. Depth images are immune to cluttered environments in many scenes, so they can not only restrain background disturbances but also easily locate salient objects; moreover, their clear boundaries help refine the boundary details of RGB predictions. To prevent the DRM from interfering with the semantic information in the RGB module, inspired by the convolutional block attention module [48], we separate the spatial and semantic information through a spatial attention (SA) module and a channel attention (CA) module for further interactive guidance, obtaining an ISGM. The two attention modules aggregate spatial and semantic information through global max pooling and global average pooling. For the CA module, squeezing the feature maps in the spatial dimension aggregates the spatial information and focuses on determining the salient regions. After the pooling operations, an H × W × C feature map is transformed into two 1 × 1 × C channel descriptors. They are further transformed into feature vectors by a shared multilayer perceptron (MLP). Finally, elementwise summation and a sigmoid function merge the feature vectors.
Different from the CA module, the SA module focuses on where the meaningful regions are. Given a feature map, max pooling and average pooling are applied along the channel dimension, transforming an H × W × C feature map into two H × W × 1 2D feature maps. We then concatenate these maps along the channel dimension and forward them to a convolution layer with an empirically chosen 7 × 7 kernel, followed by a sigmoid function, to obtain presaliency distribution maps at multiple scales. These maps are used to strengthen the attention paid to the salient regions in the other module. In short, the CA and SA modules are computed as:

$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

$$M_S(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

where $\sigma$ denotes the sigmoid function and $f^{7\times7}$ represents a convolution operation with a 7 × 7 filter.
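Following CBAM [48], the two attention computations can be sketched as below; the reduction ratio of the shared MLP is an assumption, since the paper does not state it.

```python
# A sketch of the CA and SA modules in the style of CBAM [48].
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, f):
        # Squeeze H x W x C into two 1 x 1 x C descriptors, then merge.
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)                 # M_C(F)

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, f):
        # Pool along channels into two H x W x 1 maps, concatenate, convolve.
        avg = torch.mean(f, dim=1, keepdim=True)
        mx = torch.amax(f, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S(F)
```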
As mentioned above, the feature maps in the RGB module and the DRM have high spatial similarity but low semantic similarity. In addition, the number of channels in the RGB module does not match that in the shallow DRM. Thus, we only conduct SA interaction between the two modules, having them exchange spatial saliency information with each other; each then pays more attention to the potential salient regions found by the other and thereby achieves improved saliency. As illustrated in Fig. 2, we obtain the CA and SA simultaneously from the feature maps behind the blocks in the RGB module and the DRM. The CA is multiplied elementwise with the module's own feature maps to redistribute the weights between channels, and the SA is multiplied elementwise with the other module's feature maps to transmit spatial saliency information. The overall process can be expressed as follows:

$$F_i' = M_C^i(F_i) \otimes F_i$$

$$F_{rgb}'' = M_S^{depth}(F_{depth}') \otimes F_{rgb}', \qquad F_{depth}'' = M_S^{rgb}(F_{rgb}') \otimes F_{depth}'$$

where $\otimes$ denotes elementwise multiplication, $F_i$ ($i$ = depth/rgb) denotes the input feature maps of the DRM or RGB module, $F_i'$ denotes the intermediate quantity after the CA module, $F_i''$ denotes the result obtained after information interaction, and $M_j^k$ ($k$ = S/C; $j$ = depth/rgb) denotes the SA or CA of the DRM or RGB module. A residual connection is further conducted as:

$$F_i''' = F_i'' + F_i$$

Immediately after, temporal information is extracted from the RGB module's high-level features through the bidirectional ConvGRU-based temporal module; the features containing temporal information are obtained as:

$$F_T = T(F''')$$

where $T(\cdot)$ indicates the processing of the temporal module. Finally, after the corresponding operations and the decoding of $F_T$ in the RGB stream and the depth stream, the predictions of the two decoders are fused additively as the final prediction of the network:

$$P = P_{rgb} + P_{depth}$$

where $P_{rgb}$ and $P_{depth}$ denote the prediction maps of the RGB stream network and the depth stream network, respectively.
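Using the ChannelAttention and SpatialAttention modules from the sketch above, one layer of ISGM exchange could look like the following. This is an illustration of the equations above under our assumptions, not the authors' exact implementation.

```python
# One ISGM exchange at a single layer: self CA, crossed SA, then residual.
# ca_* and sa_* are ChannelAttention / SpatialAttention instances for the
# two modules; f_rgb and f_depth must share the same H x W.
def isgm_exchange(f_rgb, f_depth, ca_rgb, ca_depth, sa_rgb, sa_depth):
    # Channel attention reweights each module's own channels (F' = M_C(F) * F).
    f_rgb_p = ca_rgb(f_rgb) * f_rgb
    f_depth_p = ca_depth(f_depth) * f_depth
    # Spatial attention is crossed: each module's saliency prior guides the
    # potential salient regions of the other (F'' = M_S(other) * F').
    f_rgb_pp = sa_depth(f_depth_p) * f_rgb_p
    f_depth_pp = sa_rgb(f_rgb_p) * f_depth_p
    # Residual connections retain the original features (F''' = F'' + F).
    return f_rgb_pp + f_rgb, f_depth_pp + f_depth
```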

Dataset and generated depth maps
The generation of pseudo frames or labels is widely used in data augmentation [7,15], so we adopt a similar strategy for depth images. To expand our training data and reduce the annotation workload, we generate depth images from existing RGB video SOD datasets. Monodepth2 [12] is an outstanding depth estimation network, but it is still not sufficiently robust: limited training data cause weak generalization across scenes, and the initial depth estimates for images with complex scenes are of low quality, as shown in Fig. 4. We refine these estimates with the existing GTs to make them usable. Experiments are conducted on two datasets (DAVIS [49] and UVSD [50]) that have complete GTs for each frame. Generally, a salient object has close depth values overall and differs from the background.
To this end, the pixel values in the salient regions ($Depth_s$), which are determined by the GT, are set to the average value of the corresponding region in the initial depth estimate. A Gaussian filter is further used (sigma = 3 on DAVIS and 1 on UVSD) to smooth the boundaries, and a weight $\beta$ = 1.3 is applied to these pixels to enhance the depth contrast and make the images more realistic. In summary, letting $S$ denote the salient region given by the GT and $Depth_{init}$ the initial estimate, this process can be represented as:

$$Depth(x) = G_{\sigma}\!\left(\begin{cases} \beta \cdot \overline{Depth_{init}(S)}, & x \in S \\ Depth_{init}(x), & x \notin S \end{cases}\right)$$

The visualization results are shown in Fig. 4; the generated depth images are highly similar to the existing depth images in Fig. 1. Although sufficient training data can be generated in this way, real data are also necessary. To ensure an appropriate proportion of real data in the total training data, we use only these two common datasets and build a new RGBD video dataset (3D-VSOD) using an Eagle sensor from LANXIN TECHNOLOGY. It contains 20 new video clips (approximately 1K frames) and clips from our previous eloesdataset [38] (approximately 1K frames), which was used for traditional methods; the difference is that all the frames in eloesdataset are now annotated for training the network. The numbers of outdoor scenes and objects are increased to enrich the scenes. Figure 5 shows some samples from 3D-VSOD. The loss of some depth values in outdoor scenes is unavoidable when the distance is out of range, but we do not need to conduct much postprocessing.
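Under one reading of this procedure (the salient region flattened to its scaled mean depth, with the Gaussian filter applied last), the refinement can be sketched as follows; the function and argument names are ours.

```python
# A sketch of the depth-map refinement: salient pixels take the region's
# mean initial depth scaled by beta = 1.3, then a Gaussian filter smooths
# the boundary (sigma = 3 for DAVIS, 1 for UVSD).
import numpy as np
from scipy.ndimage import gaussian_filter

def refine_depth(depth_init, gt_mask, beta=1.3, sigma=3.0):
    # gt_mask: boolean ground-truth saliency mask for this frame.
    refined = depth_init.astype(np.float64).copy()
    mean_depth = refined[gt_mask].mean()   # average depth of the object
    refined[gt_mask] = beta * mean_depth   # flatten region, enhance contrast
    return gaussian_filter(refined, sigma=sigma)
```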

Datasets and evaluation metrics
In the field of video SOD, many researchers have demonstrated the feasibility of split training [14,25], in which the spatial feature module is first trained on image datasets and then the entire network is trained to learn temporal features, reducing the need for video data. Our network is trained in a similar manner. The spatial feature extractor is initialized with parameters trained on ImageNet and then trained on MSRA-B [51] and HKU-IS [52], together with the decoder, to learn spatial saliency.
After that, VOS [53], DAVIS [49], and FBMS [54] are used to learn temporal saliency. Considering the different modalities of RGB and depth images, we do not initialize the DRM with parameters pretrained on ImageNet, as many researchers do. The DRM's shallow structure is sufficient to excavate simple depth features, and it is easy to train even though it is randomly initialized with the Xavier method [55]. The DAVIS and UVSD datasets with generated depth frames (approximately 7K frames in total) serve as generated data, and eloesdataset [38] and 3D-VSOD (approximately 2K frames in total) serve as real data; both are used for training and validation. DAVIS and UVSD are trained and validated with their original splits, and eloesdataset and 3D-VSOD are split approximately 1:1.
In the field of saliency detection, there are many ways to measure the consistency between a model's predictions and the GT map. Since the essence of saliency detection is binary classification of pixels, its evaluation metrics are largely derived from general metrics for binary classifiers. We therefore adopt the five most commonly used objective evaluation metrics: the precision-recall (PR) curve, the receiver operating characteristic (ROC) curve, the F-measure, the S-measure and the mean absolute error (MAE). We also provide the precision, recall, F-measure and area under the curve (AUC) to intuitively represent our model's effect, and we verify the effectiveness of the model through these five metrics in the experimental section.
The PR curve, ROC curve and F-measure are all evaluation metrics based on the confusion matrix. The ROC curve trades off the true positive rate (TPR) and false positive rate (FPR) of the predictions at different probability thresholds, while the AUC is the area under the ROC curve. In an ROC plot, the FPR is on the x-axis and the TPR is on the y-axis.
From the definitions of the FPR and TPR, a higher TPR and a lower FPR obviously indicate a better model. Therefore, the closer the curve is to the upper left corner and the larger the AUC value, the better the model.
The PR curve is similar to the ROC curve. Under different thresholds, the PR curve is drawn with precision on the y-axis and recall on the x-axis. Higher precision and higher recall both indicate a more effective model.
In SOD, precision is more important than recall, so the F-measure is utilized to give a higher weight to precision:

$$F_\beta = \frac{(1+\beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$

Here, $\beta^2$ is set to 0.3, as in most models. The MAE denotes the average difference between the prediction $S$ and the ground truth $G$ over all pixels:

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x,y) - G(x,y) \right|$$

The S-measure is a measure recently proposed in [56]. It combines the region-aware ($S_r$) and object-aware ($S_o$) structural similarity between a saliency map and the ground truth:

$$S = \alpha \cdot S_o + (1 - \alpha) \cdot S_r$$

where $\alpha$ is a balance parameter that is set to 0.5.
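A minimal sketch of the thresholded F-measure and the MAE follows; sweeping the threshold over [0, 1] yields the points of the PR curve. The function names are ours.

```python
# F-measure (beta^2 = 0.3) and MAE for a single prediction/GT pair.
import numpy as np

def f_measure(pred, gt, threshold, beta2=0.3):
    binary = pred >= threshold                 # binarize the saliency map
    tp = np.logical_and(binary, gt).sum()      # true positives
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(pred, gt):
    # Mean absolute per-pixel difference between prediction and ground truth.
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()
```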

Implementation and experimental setup
The Adam optimizer [57] uses momentum and adaptive learning rates to speed up convergence and is widely used for model training. As in most experiments, we set the learning rate lr = 1e-3, betas = (0.9, 0.999), eps = 1e-8, and weight decay = 0 in Adam to train our network end to end. In the training process, following experimental experience, we select the default ResNet50 parameters for network training. All images are resized to 224 × 224, and three RGB and depth frames together compose a video clip. The binary cross-entropy (BCE) loss is used as the loss function, as in many SOD tasks:

$$L_{BCE} = -\sum_{x} \left[ G(x) \log S(x) + (1 - G(x)) \log(1 - S(x)) \right]$$

where $S(x)$ denotes the saliency prediction of the model and $G(x)$ denotes the ground truth of the input. The network is implemented in PyTorch, an open-source and widely used deep learning platform. We train and test our network on a single NVIDIA RTX 3080 GPU (with 10 GB of memory) and an Intel 3.6 GHz Core i9-10850K CPU. The model is trained for approximately 10K iterations on Ubuntu 20.04, and the parameters that provide the best S-measure on the validation datasets are chosen as the final results.
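The reported optimizer and loss configuration can be sketched as below; the one-layer stand-in network and the 4-channel RGBD input packing are ours, used only to make the snippet runnable.

```python
# A sketch of the training configuration: Adam with the stated
# hyperparameters and a BCE loss on sigmoid-activated predictions.
import torch

net = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)   # stand-in for DIR-Net
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
bce = torch.nn.BCELoss()   # binary cross-entropy, as in the equation above

def train_step(rgbd, gt):
    # rgbd: B x 4 x 224 x 224 (RGB + depth); gt: B x 1 x 224 x 224 floats.
    optimizer.zero_grad()
    pred = torch.sigmoid(net(rgbd))    # S(x): saliency prediction in [0, 1]
    loss = bce(pred, gt)
    loss.backward()
    optimizer.step()
    return loss.item()
```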

Ablation study
To investigate the effectiveness of the proposed methods, we carry out ablation experiments with the DRM and ISGM on DAVIS, UVSD and 3D-VSOD.

The effectiveness of the DRM
We perform experiments on the RGB module alone, without the ISGM and addition fusion, to demonstrate the effectiveness of depth information and the DRM. As shown in Fig. 6, the RGB module has difficulty achieving boundary detail detection and background suppression and even fails to identify salient objects in challenging scenes. Salient objects tend to differ from their surroundings in depth, so potential saliency regions can be obtained through depth information. The RGB module pays more attention to these regions and finds foreground salient objects through the prior information provided by the DRM.
Furthermore, we show visualizations of the output features obtained from the DRM through the sigmoid activation function. The effect of the DRM on the boundary can be illustrated through these visualized feature maps. As shown in Fig. 7, the DRM produces a strong suppression effect at the boundary of the salient object, which can be represented as $f_{boundary}^{sigmoid} \approx 0$. It follows that the feature values at the boundary satisfy $f_{boundary} \ll 0$ before activation. The pixelwise addition with the RGB module's prediction therefore reduces the output values near the boundary and thus produces an effective refinement.
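A tiny numeric illustration of this argument, with hypothetical logit values and assuming the addition happens before the final activation, as the pre-activation reasoning above suggests:

```python
# Strongly negative DRM boundary logits drag the fused output toward 0.
import torch

# Pre-activation logits at three pixels: inside the object, at the
# boundary, and in the background (hypothetical values).
rgb_logits = torch.tensor([3.0, 1.5, -2.0])
drm_logits = torch.tensor([2.0, -4.0, -3.0])   # strongly negative at boundary

print(torch.sigmoid(rgb_logits))               # tensor([0.9526, 0.8176, 0.1192])
print(torch.sigmoid(rgb_logits + drm_logits))  # tensor([0.9933, 0.0759, 0.0067])
# The boundary pixel drops from ~0.82 to ~0.08: the DRM sharpens the edge.
```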
Through the above analysis, we know that pixelwise addition with the RGB module's prediction plays a role in boundary optimization. Figure 6 and Table 2 compare the visualization outputs and objective evaluation metrics of the RGB module with and without addition fusion with the DRM. We can see that, without the participation of the ISGM, the role of the DRM is mainly reflected in boundary optimization.

The effectiveness of the ISGM
As discussed above, depth information is useless for suppressing background interference in the RGB module when there is insufficient information exchange. To further demonstrate the effectiveness of the ISGM, we show the visualizations produced by the RGB module with and without ISGM interaction in Fig. 6; no addition fusion is conducted here. The visualizations show that the ISGM has no significant effect on the boundary but is effective for background inhibition and potential saliency guidance. Moreover, the objective metrics demonstrate the effectiveness of the ISGM and support the boundary optimization role of pixelwise addition.
In conclusion, addition fusion mainly optimizes boundary details but is of little use for background suppression and saliency guidance; we therefore design the ISGM to exchange spatial information between the two modules. The rich spatial information in the DRM helps the RGB module find salient objects and suppress background interference. The final results and the objective evaluation metrics in Table 3 demonstrate the effectiveness of our two innovations.

Comparisons with state-of-the-art approaches
We compare our model with 9 state-of-the-art methods, including three RGBD image SOD models (A2dele [58], BBS (BBSNet) [59], and PGAR [60]) to contrast the influence of temporal information, three 2D video SOD models (GF [21], FCNS [25], and RCR (RCRNet) [14]) to contrast the influence of depth information, and three RGBD video SOD models (Lino [41], Zhang [42], and Eloes [38]) to prove the superiority of our model. Table 3 compares our method with the others in terms of F-measure, MAE, and S-measure; Fig. 8 shows intuitive comparisons of the PR curves, ROC curves and the column diagrams of detailed metrics; and Fig. 9 shows a subjective comparison of the visualization results. The comparisons show that the temporal and depth features significantly improve the detection performance of our model. In complex scenes, the 3D image algorithms BBSNet and PGAR, which lack temporal feature extractors, perform better than the 2D video algorithms, which lack depth feature extractors. This suggests that in complex scenes depth information can even exceed temporal information in importance. However, for relatively simple scenes, the RGB spatial features are sufficient to obtain salient areas, so the importance of depth information is reduced and that of temporal information is highlighted. The experiments show that both temporal and depth information are vital across scenes. In addition, the comparisons show that even with rich temporal and spatial information, traditional algorithms perform far worse than deep learning algorithms based on temporal or depth information alone. However, existing RGBD video SOD algorithms are still traditional, so more research on deep-learning-based RGBD video SOD is necessary.
Our proposed method utilizes a deep learning algorithm that integrates the advantages of depth and temporal information in saliency detection and obtains results that outperform the most advanced algorithms for RGBD images, RGB video, and RGBD video, which shows that deep learning-based saliency detection for RGBD video has great potential. Since existing RGBD video saliency detection algorithms remain confined to the traditional algorithm domain, it is necessary to extend RGBD video saliency detection to the deep learning domain.

Good cases and failure cases
In Fig. 10, we show some good and bad cases for the model. In the first case, our DIR-Net successfully separated objects of similar color from the scene's background, while in the second case, our model accurately identified small objects. This is due to the efficient merging of RGB and depth data by our two-stream network. However, DIR-Net does not perform perfectly in some complex scenarios. In the failure cases, the depth information encoding the spatial position of objects provided incorrect guidance, misidentifying the camel behind the true object as salient. In general, DIR-Net uses multi-level multimodal interaction to extract feature information, so complicated scenarios and rough depth information may degrade our model's predictions.
In conclusion, the proposed deep learning algorithm based on depth and temporal information outperforms most state-of-the-art methods on 3D images and 2D/3D videos. Notably, we achieve great improvements on DAVIS and UVSD. Both datasets are more challenging than 3D-VSOD, so the introduction of depth information is more valuable there. However, limited by the shooting range of an RGBD camera, we could not capture scenes as complex as those in DAVIS and UVSD. The scenes in 3D-VSOD are simple enough that depth or temporal information alone is sufficient to obtain good predictions. As a result, the improvement on 3D-VSOD is not as prominent as those achieved on DAVIS and UVSD. Nevertheless, the proposed network is still far better than existing traditional RGBD video SOD approaches on all datasets. Moreover, the insufficiency of real training data makes the network more inclined to fit the generated data. We believe that with the development of RGBD video SOD, more real data and more challenging datasets from radar will enhance the practicability of the network.

Conclusion
In this paper, we propose the first deep learning method for RGBD video SOD. First, we devise a DRM that supports the RGB module in exploring spatial saliency and optimizes boundaries through depth information. In particular, in view of the semantic feature conflicts among multiple modalities, the proposed method extracts features from the different modalities separately and lets them guide each other through their respective spatial distributions via an ISGM. Experiments demonstrate the effectiveness of depth images in terms of boundary details, object detection and interference suppression. In addition, we establish an RGBD video SOD dataset and propose a postprocessing method to refine generated depth images, obtaining sufficient training and validation data; this approach tackles a major difficulty in the deep learning domain. The experimental results show that the proposed approach achieves state-of-the-art performance and that the proposed innovations are effective. However, our deep learning algorithms are trained and tested on GPUs and depend heavily on the computing power of the device, whereas lightweight models are often needed in real-world applications. In the future, we will focus on better model compression and faster methods to improve operational efficiency.

Fig. 2
Fig. 2 The architecture of our proposed RGBD video salient object detection network (DIR-Net). The DRM extracts depth features, which are further used to guide potential salient regions through the ISGM and refine the boundaries in the final predictions. The ISGM excavates spatial saliency cues in the DRM and RGB modules

Fig. 4

Fig. 4 RGB images from DAVIS and UVSD, initially generated depth images from Monodepth2 and refined depth maps are located from right to left in a row

Fig. 5

Fig. 5 Samples from our 3D-VSOD dataset

Fig. 6
Fig. 6 Results of our ablation study: RGB images, depth images, GTs, final predictions, predictions from the RGB module without the DRM, predictions without the ISGM, and predictions with the ISGM but without addition fusion are shown from right to left in turn

Fig. 7
Fig. 7 Visualizations from the DRM; RGB images, depth images and visual results obtained through the sigmoid function are shown from right to left in turn

Fig. 8
Fig. 8 The PR curves, ROC curves, and column diagrams of detailed metrics (precision, recall, F-measure, and AUC) are shown from top to bottom, and the results obtained on the DAVIS, UVSD, and 3D-VSOD datasets are shown from left to right

Fig. 9

Fig. 9 Visual comparisons with state-of-the-art approaches, including 3D image (3DI), 2D video (2DV) and 3D video (3DV) SOD methods. All salient objects can be completely highlighted through depth and temporal information

Fig. 10
Fig. 10 Some good and failure examples of the proposed model (the 1st and 2nd rows are good cases; the 3rd and 4th rows are bad cases). From left to right: RGB images, depth images, GT and our results. The blue boxes indicate missed detections and the red boxes indicate false detections

Table 1
'T' refers to traditional methods and 'D' refers to deep learning methods

Table 2
Objective evaluation metrics of the ablation experiment