Full-Duplex Strategy for Video Object Segmentation

Previous video object segmentation approaches mainly focus on simplex solutions between appearance and motion, limiting feature collaboration efficiency among and across these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue, by considering a better mutual restraint scheme between motion and appearance in exploiting the cross-modal features from the fusion and decoding stage. Specifically, we introduce the relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update the inconsistent features from the spatial-temporal embeddings, we adopt the bidirectional purification module (BPM) after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur, occlusion) and achieves favourable performance against existing cutting-edge methods in both the video object segmentation and video salient object detection tasks. The project is publicly available at: https://dpfan.net/FSNet.

Fig. 1 Comparison between three strategies for embedding appearance and motion patterns before the fusion and decoding stage.
(a) Direction-independent strategy [44] without information transmission, (b) simplex strategy [141] with only unidirectional information transmission, e.g., using motion to guide appearance or vice versa, and (c) our full-duplex strategy with simultaneous bidirectional information transmission. This paper mainly focuses on discussing directional modelling (b & c) in the deep learning era.

Introduction
Over the past three years, social platforms have accumulated a large number of short videos. Analyzing these videos efficiently and intelligently has become a challenging issue today. Video object segmentation (VOS) [16,41,115,118] is a fundamental technique for addressing this issue, whose purpose is to delineate pixel-level moving object masks in each frame. Besides video analysis, many other applications also benefit from VOS, such as robotic manipulation [1], autonomous cars [70], video editing [43], action segmentation [103], optical flow estimation [24], medical diagnosis [45], interactive segmentation [14,19,37,72,131], URVOS [87], and video captioning [77].
Those conflicts are prone to accumulating inaccuracies during the propagation of spatial-temporal embeddings, causing the short-term feature drifting problem [129].
As shown in Fig. 1 (a), the direction-independent strategy [21,44,48,97,122] is the earliest solution, encoding the appearance and motion features individually and fusing them directly. However, this intuitive way implicitly causes feature conflicts, since the motion- and appearance-aware features are derived from two distinctive modalities and are extracted from separate branches. An alternative is to integrate them in a guided manner. As illustrated in Fig. 1 (b), several recent methods opt for the simplex strategy [39,61,65,74,80,100,141], which is either appearance-based or motion-guided. Although these two strategies have achieved promising results, they both fail to consider the mutual restraint between the appearance and motion features, which both guide human visual attention allocation during dynamic observation, according to previous studies in cognitive psychology [50,99,119] and computer vision [44,107].
Intuitively, the appearance and motion characteristics of the same object should be homogeneous to a certain degree within a short time. As seen in Fig. 2, the foreground regions of appearance and motion intrinsically share correlative perceptual patterns, including semantic structure and movement trends. Nevertheless, misguided knowledge in an individual modality, e.g., the static shadow under the chassis or the small car in the background, produces inaccuracies during feature propagation. Thus, it easily stains the result (see blue boxes).
To address these challenges, we introduce a novel modality transmission strategy (full-duplex [5]) between the spatial- and temporal-aware representations, instead of embedding them individually. The proposed strategy is a bidirectional attention scheme across motion and appearance cues, which explicitly incorporates the appearance and motion patterns in a unified framework, as depicted in Fig. 1 (c). Fig. 2 visually compares the two simplex variants (i.e., (a) appearance-refined motion and (b) motion-refined appearance) with our full-duplex strategy under our framework, showing that our method performs better than either simplex variant. In contrast, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of the full-duplex strategy, thus providing more accurate structural details and alleviating the short-term feature drifting issue [129].
To fully investigate the simplex and full-duplex strategies of our framework, we present the following contributions:
• We propose a unified framework, the Full-duplex Strategy Network (FSNet), for robust video object segmentation, which makes full use of spatial-temporal representations.
• We adopt a bidirectional interaction module, dubbed the relational cross-attention module (RCAM), to extract discriminative features from the appearance and motion branches, which ensures mutual restraint between the two. To improve model robustness, we introduce a bidirectional purification module (BPM), which is equipped with an interlaced decremental connection to automatically update inconsistent features between the spatial-temporal embeddings.
• We demonstrate that our FSNet achieves favourable performance on five mainstream benchmarks; in particular, our FSNet (N=4, CRF) outperforms the SOTA U-VOS model (i.e., MAT [141]) on the DAVIS 16 [82] leaderboard by a margin of 2.4% in terms of Mean-F score, with less training data (i.e., Ours-13K vs. MAT-16K).
As an extension of our ICCV-2021 version [46], we incorporate more details to provide a better understanding of our novel framework, as follows:
• To provide our community with a comprehensive study, we have made substantial efforts to improve the presentations (e.g., Fig. 1, Fig. 2, and Fig. 7) and discussions (see Sec.

Unsupervised VOS
Although there are many works addressing the VOS task in a semi-supervised manner, i.e., by supposing an object mask annotation is given in the first frame, other researchers have attempted to address the more challenging unsupervised VOS (U-VOS) problem. Early U-VOS models resort to low-level handcrafted features for heuristic segmentation inference, such as long sparse point trajectories [10,31,75,90,111], object proposals [58,59,69,83], saliency priors [27,106,108], optical flow [100], or superpixels [32,33,123]. These traditional models have limited generalizability and thus low accuracy in highly dynamic and complex scenarios, due to their lack of semantic information and high-level content understanding. Recently, RNN-based models [4,93,97,113,126,138] have become popular due to their better capability of capturing long-term dependencies and their use of deep learning. In this case, U-VOS is formulated as a recurrent modelling problem over time, where spatial features are jointly exploited with long-term temporal context.
How to combine motion cues with appearance features is a long-standing problem in this field. To this end, Tokmakov et al. [96] proposed to simply exploit the motion patterns extracted from the video. However, their method cannot accurately segment objects between two similar consecutive frames, since it relies heavily on the guidance of optical flow. To resolve this, several works [21,92,97] have integrated the spatial and temporal features from parallel networks, which can be viewed as plain feature fusion from independent spatial and temporal branches with an implicit modelling strategy. Li et al. [62] proposed a multi-stage processing method to tackle U-VOS, which first utilizes a fixed appearance-based network to generate objectness and then feeds this into a motion-based bilateral estimator to segment the objects.

Attention-based VOS
The attention-based VOS task is closely related to U-VOS since it extracts attention-aware object(s) from a video clip. Traditional methods [40,112,125,142] first compute the single-frame saliency based on various hand-crafted static and motion features and then conduct spatial-temporal optimization to preserve coherency across consecutive frames. Recent works [55,73,110] aim to learn a highly semantic representation and usually perform spatial-temporal detection end-to-end. Many schemes have been proposed to employ deep networks that consider temporal information, such as using ConvLSTM [30,60,93], taking optical flows/adjacent frames as input [61,110], applying 3D convolutions [55,73], or directly exploiting temporally concatenated deep features [56]. Besides, long-term influences are often taken into account and combined with deep learning. Li et al. [63] proposed a key-frame strategy to locate representative high-quality video frames of salient objects [7,139] and diffused their saliency to ill-detected non-key frames. Chen et al. [15] improved saliency detection by leveraging long-term spatial-temporal information, where high-quality "beyond-the-scope frames" are aligned with the current frames. Both types of information are fed to deep neural networks for classification. Besides considering how to better leverage temporal information, other researchers have attempted to address different problems in video salient object detection (V-SOD), such as reducing the data labelling requirements [127], developing semi-supervised approaches [94], or investigating relative saliency [116]. Fan et al. [30] recently introduced a V-SOD model equipped with a saliency shift-aware ConvLSTM, together with an attention-consistent V-SOD dataset with high-quality annotations. Zhao et al. [137] built a large-scale dataset with scribble annotations for weakly supervised video salient object detection. They propose an appearance-motion fusion module to aggregate the spatial-temporal features attentively.

Fig. 3 The architecture of our FSNet for video object segmentation. The relational cross-attention module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then, four bidirectional purification modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction.

Overview
Suppose that a video clip contains T consecutive frames {A_t}, t = 1:T. We first utilize an optical flow field generator H, i.e., FlowNet 2.0 [42], to generate T−1 optical flow maps {M_t}, t = 1:T−1, each computed from two adjacent frames (M_t = H[A_t, A_{t+1}]). To ensure the inputs match, we discard the last frame in the pipeline. Thus, the proposed pipeline takes both the appearance images {A_t} and the paired motion maps {M_t}, t = 1:T−1, as input. First, the M_t & A_t pair at frame t is fed to two independent ResNet-50 [36] branches (i.e., the motion and appearance blocks in Fig. 3). The appearance features {X_k} and motion features {Y_k}, k = 1:K, extracted from K layers, are then sent to the relational cross-attention modules (RCAMs), which allow the network to embed spatial-temporal cross-modal features. Next, we employ the bidirectional purification modules (BPMs) with N cascaded units. The BPMs focus on distilling representative carriers from the fused features {F_k^n} and motion-based features {G_k^n}, n = 1:N. Finally, the predictions (i.e., S_M^t and S_A^t) at frame t are generated from two decoder blocks.
Here, we omit the superscript "t" for convenience of expression.
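The frame/flow pairing described above can be sketched as follows. This is a minimal, hypothetical illustration of the input pipeline, not the authors' code: `flow_generator` stands in for FlowNet 2.0, and the last frame is discarded so that every appearance frame A_t has a paired motion map M_t = H[A_t, A_{t+1}].

```python
# Hedged sketch of the input pairing: a clip of T frames yields T-1
# (appearance, motion) pairs; `flow_generator` is a placeholder for
# an optical flow model such as FlowNet 2.0.

def build_input_pairs(frames, flow_generator):
    """Return (A_t, M_t) pairs for t = 1 .. T-1."""
    pairs = []
    for t in range(len(frames) - 1):  # the last frame is discarded
        m_t = flow_generator(frames[t], frames[t + 1])
        pairs.append((frames[t], m_t))
    return pairs

# Toy usage: a "flow generator" that just records which frames it was given.
frames = ["A1", "A2", "A3", "A4"]
pairs = build_input_pairs(frames, lambda a, b: f"flow({a},{b})")
```

With T = 4 frames this produces three pairs, matching the T−1 inputs consumed by the two encoder branches.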

Relational Cross-Attention Module
As discussed in Sec. 1, a single-modality (i.e., motion or appearance) guided stimulation may cause the model to make incorrect decisions. To alleviate this, we design a relational cross-attention module (RCAM) via the channel-wise attention mechanism, which focuses on distilling effectively squeezed cues from the two modalities and then modulating each other. As shown in Fig. 4 (c), the two inputs of the RCAM are the appearance features {X_k} and motion features {Y_k}, k = 1:K, which are obtained from the two different branches of the standard ResNet-50 [36]. Specifically, for each level k, we first perform global average pooling (GAP) to generate the channel-wise vectors:

V_k^X = GAP(X_k),  V_k^Y = GAP(Y_k).

Next, two 1×1 conv layers, i.e., φ(x; W_φ) and θ(x; W_θ) with learnable parameters W_φ and W_θ, generate two discriminative global descriptors. The sigmoid function σ[x] = e^x/(e^x + 1), x ∈ R, is then applied to convert the final descriptors into the interval [0, 1], i.e., into valid attention vectors for channel weighting. Then, we perform the outer product ⊗ between X_k and σ[θ(V_k^Y; W_θ)] to generate a candidate feature Q_k^X, and vice versa, as follows:

Q_k^X = X_k ⊗ σ[θ(V_k^Y; W_θ)],  Q_k^Y = Y_k ⊗ σ[φ(V_k^X; W_φ)].

Then, combining them with the element-wise addition operation ⊕, conducted in the corresponding k-th level block B_k[x] of the ResNet-50, we finally obtain the fused features Z_k that contain comprehensive spatial-temporal correlations:

Z_k = B_k[Z_{k−1} ⊕ Q_k^X ⊕ Q_k^Y],

where k ∈ {1 : K} denotes different feature hierarchies in the backbone. Note that Z_0 denotes the zero tensor.
In our implementation, we use the top four feature pyramids, i.e., K = 4, as suggested by [117,135].
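The cross-attention idea above can be sketched in a framework-agnostic way. This is a minimal NumPy sketch, not the authors' implementation: the 1×1 convs φ/θ acting on the pooled vectors are approximated by matrices W_phi/W_theta, and the block B_k and multi-level recursion are omitted. Each modality is re-weighted channel-wise by an attention vector squeezed from the *other* modality.

```python
import numpy as np

# NumPy sketch of one RCAM level: cross-modal channel attention.
# W_phi / W_theta stand in for the 1x1 convs phi / theta of the paper.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcam(x, y, w_phi, w_theta):
    """x, y: (C, H, W) appearance / motion features at one pyramid level."""
    v_x = x.mean(axis=(1, 2))            # GAP -> channel vector V^X
    v_y = y.mean(axis=(1, 2))            # GAP -> channel vector V^Y
    att_from_y = sigmoid(w_theta @ v_y)  # attention vector for the appearance path
    att_from_x = sigmoid(w_phi @ v_x)    # attention vector for the motion path
    q_x = x * att_from_y[:, None, None]  # motion re-weights appearance -> Q^X
    q_y = y * att_from_x[:, None, None]  # appearance re-weights motion -> Q^Y
    return q_x, q_y

C, H, W = 8, 4, 4
rng = np.random.default_rng(0)
x, y = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
q_x, q_y = rcam(x, y, np.eye(C), np.eye(C))
```

Because the attention vectors lie in (0, 1), each output channel is a damped copy of its input, scaled by how strongly the other modality activates that channel; in FSNet this runs at each of the K = 4 pyramid levels.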

Bidirectional Purification Module
In addition to the RCAM described above, which integrates common cross-modality features, we further introduce the bidirectional purification module (BPM) to improve the model robustness.
Following the standard practice in action recognition [89] and saliency detection [120], our bidirectional purification phase comprises N BPMs connected in a cascaded manner. As shown in Fig. 3, we first employ the feature allocator ψ^{F,G}(x; W_ψ^{F,G}) to unify the feature representations from the previous stage:

F_k^n = ψ^F(F_k^{n−1}; W_ψ^F),  G_k^n = ψ^G(G_k^{n−1}; W_ψ^G),

where k ∈ {1 : K} and n ∈ {1 : N} denote the feature hierarchy and the BPM index, respectively. To be specific, ψ^{F,G}(x; W_ψ^{F,G}) is composed of two 3×3 conv layers, each with 32 filters, to reduce the feature channels. Note that the allocator is conducive to reducing the computational burden as well as facilitating various element-wise operations.
Here, we consider a bidirectional attention scheme (see Fig. 5 (c)) that contains two simplex strategies (see Fig. 5 (a & b)) in the BPM. On the one hand, the motion features G_k^n contain temporal cues and can be used to enrich the fused features F_k^n via the concatenation operation. On the other hand, the distractors in the motion features G_k^n can be suppressed by multiplying in the fused features F_k^n. Besides acquiring robust feature representations, we introduce an efficient cross-modal fusion strategy in this scheme, which broadcasts high-level, semantically strong features to low-level, semantically weak features via an interlaced decremental connection (IDC) with a top-down pathway [66]. Specifically, as the first part, the spatial-temporal feature combination branch (see Fig. 5 (b)) is formulated as:

F_k^{n+1} = F_k^n ⊕ (ⓒ_{i=k:K} P(G_i^n)),

where P is an up-sampling operation followed by a 1×1 convolutional layer (conv) to reshape the candidate guidance to a size consistent with F_k^n. The symbols ⊕ and ⓒ respectively denote element-wise addition and concatenation with an IDC strategy, followed by a 1×1 conv with 32 filters. For the other part, we formulate the temporal feature re-calibration branch (see Fig. 5 (a)) as:

G_k^{n+1} = G_k^n ⊗ (⊙_{i=k:K} P(F_i^n)),

where ⊙ denotes element-wise multiplication with an IDC strategy, followed by a 1×1 conv with 32 filters.
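The two BPM branches can be illustrated with a simplified NumPy sketch. This is a hedged toy version, not the real module: the up-sampling P and the 1×1 convs are omitted, all levels share one shape, and the IDC is reduced to "level k only aggregates levels i ≥ k". It shows the asymmetry of the two directions: addition enriches the fused stream, multiplication suppresses motion distractors.

```python
import numpy as np

# Toy sketch of one BPM step over K pyramid levels (low -> high level).
# Real BPMs up-sample and convolve; here all levels share one shape.

def bpm_step(f, g):
    """f, g: lists of K same-shape feature maps (fused / motion)."""
    K = len(f)
    f_next, g_next = [], []
    for k in range(K):
        # interlaced decremental connection: level k only sees levels i >= k
        high_g = sum(g[i] for i in range(k, K))                 # combination branch
        high_f = np.prod([f[i] for i in range(k, K)], axis=0)   # re-calibration branch
        f_next.append(f[k] + high_g)   # motion cues enrich the fused features
        g_next.append(g[k] * high_f)   # fused features gate (purify) the motion features
    return f_next, g_next

K, shape = 4, (2, 3, 3)
f = [np.full(shape, 1.0) for _ in range(K)]
g = [np.full(shape, 2.0) for _ in range(K)]
f2, g2 = bpm_step(f, g)
```

Cascading N such steps (N = 4 in the paper) repeatedly exchanges information between the two streams, which is the "bidirectional purification" behaviour described above.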

Decoder
After feature aggregation and re-calibration with multi-pyramidal interaction, the last BPM unit produces two groups of discriminative features (i.e., {F_k^N} and {G_k^N}, k = 1:K) with a consistent channel number of 32. We integrate the pyramid pooling module (PPM) [136] into each skip connection of the U-Net [86] as our decoder, and only adopt the top four layers in our implementation (K = 4). Since the features are fused from high to low level, global information is well retained at the different scales of the designed decoder:

F̂_k^N = C[ PPM(F_k^N) ⓒ UP(F̂_{k+1}^N) ].

Here, UP indicates the upsampling operation after the pyramid pooling layer, while ⓒ is a concatenation operation between two features. Then, a conv C is used to reduce the channels from 64 to 32; an analogous formulation holds for Ĝ_k^N. Lastly, we use a 1×1 conv with a single filter after the upstream outputs (i.e., F̂_1^N & Ĝ_1^N), followed by a sigmoid activation function to generate the predictions (i.e., S_A^t & S_M^t) at frame t.

Learning Objective
Given a group of predictions S^t ∈ {S_A^t, S_M^t} and the corresponding ground-truth G^t at frame t, we employ the standard binary cross-entropy loss L_bce to measure the dissimilarity between output and target, which computes:

L_bce(S^t, G^t) = − Σ_{(x,y)} [ G^t(x,y) log S^t(x,y) + (1 − G^t(x,y)) log(1 − S^t(x,y)) ],

where (x, y) indicates a coordinate in the frame. The overall loss function is then formulated as:

L = L_bce(S_A^t, G^t) + L_bce(S_M^t, G^t).

For the final prediction, we use S_A^t, since our experiments show that it performs better when combining appearance and motion cues.
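The objective above can be sketched directly. This is a minimal NumPy version, assuming a summed (rather than averaged) per-pixel BCE as written in the formula; the `eps` clipping is a standard numerical guard against log(0) and is our addition, not part of the paper.

```python
import numpy as np

# Summed per-pixel binary cross-entropy, applied to both branch outputs.

def bce_loss(pred, gt, eps=1e-7):
    pred = np.clip(pred, eps, 1.0 - eps)  # numerical guard for log(0)
    return -np.sum(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))

def total_loss(s_a, s_m, gt):
    # overall objective: supervise appearance (S_A) and motion (S_M) branches
    return bce_loss(s_a, gt) + bce_loss(s_m, gt)

gt = np.array([[1.0, 0.0]])
```

A perfect prediction drives the loss towards zero, while predicting the inverted mask makes it large, as expected for BCE.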

Backbone Details
Without any modification, three standard ResNet-50 [36] backbones (removing the top three layers: average pooling, fully-connected, and softmax) are adopted for the appearance branch, the motion branch, and the merging branch. Each ResNet-50 backbone yields K = 4 hierarchies, inspired by previous work [117]. After removing the top fully-connected layers, the feature hierarchies ({X_k, Y_k, Z_k}, k ∈ {2 : 5}), from shallow to deep, are extracted from the conv2_3 (k = 2), conv3_4 (k = 3), conv4_6 (k = 4), and conv5_3 (k = 5) layers of the ResNet-50, respectively. Note that we have also tried a two-branch setting, namely removing the merging branch (see Fig. 3). Unfortunately, this leads to a 2.5% drop in performance in terms of S_α on the DAVIS 16 [82] dataset. This is because the third merging branch can sequentially enhance and promote the spatial-temporal features from the RCAMs, leading to better segmentation accuracy.

Training Settings
We implement our model in PyTorch [79], accelerated by an NVIDIA RTX TITAN GPU. All inputs are uniformly resized to 352×352. To enhance the stability and generalizability of our learning algorithm, we employ a multi-scale (i.e., {0.75, 1, 1.25}) training strategy [35] in the training phase. As can be seen from the experimental results in Tab. 5, the variant with N = 4 (the number of BPMs) achieves the best performance. We utilize the stochastic gradient descent (SGD) algorithm to optimize the entire network, with a momentum of 0.9, a learning rate of 2e-3, and a weight decay of 5e-4. The learning rate is decreased by 10% every 20 epochs.
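The schedule above can be written as a small step-decay function. Note this encodes one plausible reading of "decreased by 10% every 20 epochs" (multiply by 0.9 each step); the paper's exact schedule implementation is not given, so this is only an illustrative sketch.

```python
# Step-decay schedule sketch: lr is multiplied by `decay` once per `step` epochs.
# base_lr = 2e-3 and the 10%-per-20-epochs decay follow the settings above.

def learning_rate(epoch, base_lr=2e-3, decay=0.9, step=20):
    return base_lr * (decay ** (epoch // step))
```

For example, epochs 0-19 use 2e-3, epochs 20-39 use 1.8e-3, and so on.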

Testing Settings and Runtime
Given a frame along with its motion map, we resize them to 352×352 and feed them into the corresponding branches. Similar to [68,113,141], we employ the conditional random field (CRF) [52] post-processing technique. The inference time of our method is 0.08s per frame, excluding flow generation and CRF post-processing.

Datasets
We evaluate the proposed model on four widely used VOS datasets.DAVIS 16 [82] is the most popular of these, and consists of 50 (30 training and 20 validation) high-quality and densely annotated video sequences.MCL [49] contains 9 videos and is mainly used as testing data.FBMS [76] includes 59 natural videos, in which 29 sequences are used as the training set and 30 are for testing.SegTrack-V2 [59] is one of the earliest VOS datasets and consists of 13 clips.In addition, DAVSOD 19 [30] was specifically designed for the V-SOD task.It is the most challenging visual attention consistent V-SOD dataset with high-quality annotations and diverse attributes.

Training
Following a multi-task training setup similar to [61], we divide our training procedure into three steps: • We first adopt the well-known static saliency dataset DUTS [104] to train the spatial branch to avoid over-fitting, as in [30,93,110].

Testing
We follow the standard benchmarks [30,82] to test our model on the validation set (20 clips) of DAVIS 16 , the test set of FBMS (30 clips), the test set (Easy35 split) of DAVSOD 19 (35 clips), the whole of MCL (9 clips), and the whole of SegTrack-V2 (13 clips).

Evaluation Metrics
We define a prediction map at frame t as S t A and its corresponding ground-truth mask as G t .The formulations of the metrics are given as follows.

Metrics for U-VOS task
Following [129], we utilize two standard metrics to evaluate the performance of U-VOS models.Note that all prediction maps are ensured to be binary in the U-VOS task.
1. Mean Region Similarity: This metric, also called the Jaccard similarity coefficient, is defined as the intersection-over-union of the prediction map and the ground-truth mask:

J = |S_A^t ∩ G^t| / |S_A^t ∪ G^t|,

where | · | is the number of pixels in the area. In all of our experiments, we report the mean value (Mean-J), similar to [129].
2. Mean Contour Accuracy: The contour accuracy metric we use is also called the contour F-measure. We compute the contour-based precision P_c and recall R_c between the contour points of c(S_A^t) and c(G^t), where c(·) extracts the contour points of a mask. The formulation is defined as:

F = 2 P_c R_c / (P_c + R_c).

Similar to [129], we report the mean value (Mean-F) in all of our experiments.
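The two U-VOS metrics can be sketched as follows. This is a simplified NumPy illustration: region similarity J is plain IoU, while for contour accuracy the boundary extractor c(·) is approximated with a 4-neighbour interior check; the official benchmark uses a more careful boundary matching with a distance tolerance.

```python
import numpy as np

# J: intersection-over-union of binary masks.
def region_similarity(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Simplified c(.): a pixel is on the boundary if any 4-neighbour is background.
def boundary(mask):
    m = mask.astype(bool)
    pad = np.pad(m, 1, constant_values=False)
    interior = (pad[:-2, 1:-1] & pad[2:, 1:-1] &
                pad[1:-1, :-2] & pad[1:-1, 2:])
    return m & ~interior

# F: harmonic mean of contour precision and recall (exact-overlap version).
def contour_f(pred, gt):
    cp, cg = boundary(pred), boundary(gt)
    if cp.sum() == 0 or cg.sum() == 0:
        return 0.0
    precision = (cp & cg).sum() / cp.sum()
    recall = (cp & cg).sum() / cg.sum()
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

pred = np.zeros((6, 6), bool); pred[1:5, 1:5] = True
gt = pred.copy()
```

A perfect prediction scores 1.0 on both metrics, while a shifted mask loses region overlap, which is the behaviour Mean-J and Mean-F average over all frames.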

Metrics for V-SOD task
Unlike in the U-VOS task, the prediction map can be non-binary in V-SOD benchmarking. More details are given in Sec. 4.3.1.

1. Mean Absolute Error (MAE): This is a typical pixel-wise measure, defined as:

MAE = (1 / (W × H)) Σ_{x=1:W} Σ_{y=1:H} |S_A^t(x, y) − G^t(x, y)|,

where W and H are the width and height of the ground-truth G^t, and (x, y) are the coordinates of a pixel in G^t.
2. Precision-Recall (PR) Curve: Precision and recall [2,8,22] are defined as:

Precision = |S_A^t(T) ∩ G^t| / |S_A^t(T)|,  Recall = |S_A^t(T) ∩ G^t| / |G^t|,

where S_A^t(T) is the binary mask obtained by directly thresholding the prediction map S_A^t with the threshold T ∈ [0, 255], and | · | is the total area of the mask inside the map. By varying T, a precision-recall curve can be obtained.

Tab. 1 Video object segmentation (VOS) performance of our FSNet, compared with 14 SOTA unsupervised models and seven semi-supervised models on the DAVIS 16 [82] validation set. 'w/ Flow': the optical flow algorithm is used. 'w/ CRF': the conditional random field [52] is used for post-processing. The best scores are marked in bold.
3. Maximum F-measure: This is defined as:

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),

where β² is set to 0.3 to focus more on the precision value than the recall value, as recommended in [9]. We convert the non-binary prediction map into binary masks with threshold values from 0 to 255. In this paper, we report the maximum (i.e., F_β^max) of the series of F-measure values calculated from the precision-recall curve by iterating over all the thresholds.

4. Maximum Enhanced-Alignment Measure: As a recently proposed metric, E_ξ [3] is used to evaluate both the local and global similarity between two binary maps. The formulation is as follows:

E_ξ = (1 / (W × H)) Σ_{x=1:W} Σ_{y=1:H} φ(x, y),

where φ is the enhanced-alignment matrix. Similar to F_β^max, we report the maximum E_ξ value (E_ξ^max) computed over all the thresholds in all of our comparisons.
5. Structure Measure: Fan et al. [28] proposed a metric to measure the structural similarity between a non-binary saliency map and a ground-truth mask:

S_α = α · S_o + (1 − α) · S_r,

where α balances the object-aware similarity S_o and region-aware similarity S_r. We use the default setting (α = 0.5) suggested in [28].
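Two of the V-SOD measures above are simple enough to sketch directly. This is a hedged NumPy illustration (not the benchmark's official code): MAE on a [0, 1] non-binary map, and the maximum F-measure obtained by sweeping thresholds over the prediction, mirroring the 0-255 sweep in the text, with β² = 0.3.

```python
import numpy as np

# MAE: mean absolute difference between non-binary prediction and binary GT.
def mae(pred, gt):
    return np.abs(pred - gt).mean()

# Max F-measure: binarize at each threshold, keep the best F_beta score.
def max_f_measure(pred, gt, beta2=0.3, steps=256):
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps):
        b = pred >= t
        tp = np.logical_and(b, gt).sum()
        if b.sum() == 0 or tp == 0:
            continue
        p, r = tp / b.sum(), tp / gt.sum()
        best = max(best, (1 + beta2) * p * r / (beta2 * p + r))
    return best

gt = np.array([[1, 1, 0, 0]], dtype=bool)
pred = np.array([[0.9, 0.8, 0.2, 0.1]])
```

Here the prediction separates foreground from background cleanly, so some threshold yields perfect precision and recall (F_β^max = 1.0) even though the MAE is non-zero; this is exactly why the benchmark reports both.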

U-VOS and V-SOD tasks

Evaluation on DAVIS 16 dataset
As shown in Tab. 1, we compare our FSNet with 14 SOTA U-VOS models on the DAVIS 16 public leaderboard. We also compare it with seven recent semi-supervised approaches as a reference. We use a threshold of 0.5 to generate the final binary maps for a fair comparison, as recommended by [129]. Our FSNet outperforms the best model (AAAI'20-MAT [141]) by a margin of 2.4% in Mean-F and 1.0% in Mean-J, achieving new SOTA performance. Notably, the proposed U-VOS model also outperforms semi-supervised models (e.g., AGA [47]), even though the latter utilize the first-frame ground-truth mask as a reference for object location.
We also compare FSNet against 13 SOTA V-SOD models. The non-binary saliency maps are obtained from the standard benchmark [30]. As can be seen from Tab. 2, our method consistently outperforms all other models since 2018 on all metrics. In particular, for the S_α and F_β^max metrics, our method improves the performance by ∼2.0% compared with the best model, AAAI'20-PCSA [34].

Evaluation on MCL dataset
This dataset has fuzzy object boundaries in its low-resolution frames due to fast object movements. Therefore, the overall performance is lower than on DAVIS 16 . As shown in Tab. 2, our method still stands out in these extreme circumstances, with a 3.0∼8.0% increase in all metrics compared with ICCV'19-RCR [127] and CVPR'19-SSAV [30].

Evaluation on FBMS dataset
This is one of the most popular VOS datasets, with diverse attributes such as interacting objects, dynamic backgrounds, and no per-frame annotation. As shown in Tab. 2, our model achieves competitive performance in terms of M. Further, compared to the previous best-performing SSAV [30], it obtains improvements in the other metrics, including S_α (0.890 vs. SSAV = 0.879) and E_ξ^max (0.935 vs. SSAV = 0.926), making it more consistent with the human visual system (HVS), as mentioned in [28].

Tab. 2 Video salient object detection (V-SOD) performance of our FSNet, compared with 13 SOTA models on three popular V-SOD datasets, including DAVIS 16 [82], MCL [49], and FBMS [76]. '†' denotes that we generate non-binary saliency maps without CRF [52] for a fair comparison. 'N/A' means the results are not available.

Evaluation on DAVSOD 19 dataset
Recently published, DAVSOD 19 [30] is the most challenging visual attention consistent V-SOD dataset, with high-quality annotations and diverse attributes. It contains diversified challenging scenarios, as its video sequences contain shifts in attention. DAVSOD 19 is divided into three subsets according to difficulty: DAVSOD 19 -Easy35 (35 clips), DAVSOD 19 -Normal25 (25 clips), and DAVSOD 19 -Difficult20 (20 clips). Note that, in the saliency field, non-binary maps are required for evaluation; thus, we only report the results of FSNet without CRF post-processing when benchmarking the V-SOD task. In this document, we adopt the four metrics mentioned in Sec. 4.2.2, including S_α, E_ξ^max, F_β^max, and M. To show the robustness of FSNet, in Tab. 3, we also make the first effort to benchmark all 11 SOTA models since 2018 across the three difficulty levels:
• Easy35 subset: Most of the video sequences are similar to those in the DAVIS 16 dataset, and this subset likewise consists of a large number of single-object videos. FSNet outperforms all the reported algorithms across all metrics. As shown in Tab. 3, compared with the recent PCSA method, our model achieves a large improvement of 3.2% in terms of S_α.
• Normal25 subset: Different from the previous subset, this one includes multiple moving salient objects. Thus, it is more difficult than traditional V-SOD datasets due to the attention shift phenomena [30]. As expected, FSNet still obtains the best performance, with significant improvements, e.g., 6.4% in the F_β^max metric.
• Difficult20 subset: This is the most challenging subset among existing V-SOD datasets, since it contains a large number of attention-shift sequences in cluttered scenarios. As the results in Tab. 3 show, the performance of all the compared models decreases dramatically (e.g., F_β^max ≤ 0.5). Even though our framework is not specifically designed for the V-SOD task, we still obtain the best performance on two metrics (i.e., S_α and F_β^max). Different from the two best prior models, which utilize additional training data (i.e., RCR leverages pseudo-labels; SSAV utilizes the validation set), our model does not use any additional training data and still outperforms the SSAV model by 8.8% (F_β^max), achieving performance comparable to the second-best RCR (ICCV'19) model. These results are also supported by the recent conclusion that "human visual attention should be an underlying mechanism that drives U-VOS and V-SOD" (TPAMI'20 [107]).

PR Curve
As shown in Fig. 6, we further investigate the precision-recall curves of different models on six V-SOD datasets: DAVIS 16 [82], MCL [49], FBMS [76], and the three subsets of DAVSOD 19 [30] (i.e., Easy35, Normal25, and Difficult20). Note that the higher and further to the right a PR curve is, the more accurate the performance. Even though existing SOTA methods have achieved significant progress in the V-SOD task on the three typical benchmark datasets, we still obtain the best performance under all thresholds. Besides, on the recent and challenging three subsets of DAVSOD 19 [30], the overall performance of all models is relatively poor; however, our FSNet again achieves more satisfactory performance by large margins.

Qualitative Results
Some qualitative results on the five datasets are shown in Fig. 7, validating that our method achieves high-quality U-VOS and V-SOD results. As can be seen in the 1st row, the camel behind did not move, so it does not attract notice. Interestingly, since our full-duplex strategy model considers both appearance and motion bidirectionally, it can automatically predict the dominant camel in the centre of the video instead of the camel behind. A similar phenomenon is presented in the 5th row: our method successfully detects the dynamic skiers within the video clip rather than the static man in the background. Overall, for these challenging situations, e.g., dynamic background (1st & 5th rows), fast motion (4th row), out-of-view (6th & 7th rows), occlusion (7th row), and deformation (8th row), our model is able to infer the real target object(s) with fine-grained details. From this point of view, we demonstrate that FSNet is a general framework for both the U-VOS and V-SOD tasks.

Ablation Study
In this section, we conduct ablation studies to analyse our FSNet, including stimulus selection (Sec.4.

Stimulus Selection
We explore the influence of different stimuli (appearance only vs. motion only) in our framework. We use only video frames or only motion maps (generated using [42]) to train the ResNet-50 [36] backbone together with the proposed decoder block (see Sec. 3.4). As shown in Tab. 4, Mo. performs slightly better than App. in terms of S_α on DAVIS 16 , which suggests that the "optical flow" setting can learn more visual cues than "video frames". Nevertheless, App. outperforms Mo. in the M metric on MCL. This motivates us to explore how to effectively use appearance and motion cues simultaneously.

Effectiveness of RCAM
To validate the effectiveness of our RCAM (Rel.), we replace our fusion strategy with a vanilla fusion (Vanilla), i.e., a concatenation operation followed by a convolutional layer to fuse the two modalities. As expected (Tab. 4), the proposed Rel. performs consistently better than the vanilla fusion strategy on both the DAVIS 16 and MCL datasets. We would also like to point out an additional merit of our RCAM:

• It can alleviate error propagation within the network to an extent, owing to the mutual correction and bidirectional interaction.

Tab. 6 Ablation study of the simplex and full-duplex strategies on DAVIS 16 [82] and MCL [49]. We set N = 4 for the BPM.

Effectiveness of BPM
To illustrate the effectiveness of the BPM (with N = 4), we derive two different models, Rel. and FSNet, referring to the framework without and with the BPM, respectively. We observe that the model with the BPM gains 2.0∼3.0% over the one without it, according to the statistics in Tab. 4. We attribute this improvement to the BPM's interlaced decremental connection, which enables it to fuse the different signals effectively. Similarly, we remove the RCAM and derive another pair of settings (Vanilla & Bi-Purf.) to test the robustness of our BPM. The results show that even with the vanilla fusion strategy, the bidirectional purification (Bi-Purf.) can still enhance the stability and generalization of the model. This benefits from the purification in the forward process and the re-calibration in the backward process throughout the network.

Number of Cascaded BPMs
Naturally, more cascaded BPMs should boost performance. This is investigated in Tab. 5, where N = {0, 2, 4, 6, 8}. Note that N = 0 means that no BPM is used. As can be seen from Fig. 8 (red star), we compare four variants of our FSNet: N = 0 (Mean-J = 76.4, Mean-F = 76.8), N = 2 (Mean-J = 80.4, Mean-F = 81.4), N = 4 (Mean-J = 82.1, Mean-F = 83.3), and N = 4 with CRF (Mean-J = 83.4, Mean-F = 83.1). This demonstrates that more BPMs lead to better results, but the performance saturates after N = 4. Further, too many BPMs (i.e., N > 4) cause high model complexity and increase the risk of over-fitting. As a trade-off, we use N = 4 throughout our experiments.

Effectiveness of Full-Duplex Strategy
To investigate the effectiveness of the RCAM and BPM modules with the full-duplex strategy, we study two unidirectional (i.e., simplex strategy in Fig. 4 & Fig. 5) variants of our model. In Tab. 6, the symbols ⇒, ⇐, and ⇔ indicate the feature transmission directions in the designed RCAM or BPM. Specifically, App. ⇐ Mo. indicates that the attention vector in the optical-flow branch weights the features in the appearance branch, and vice versa. (App.+Mo.) ⇐ Mo. indicates that motion cues are used to guide the fused features extracted from both appearance and motion. The comparison results show that our elaborately designed modules (RCAM and BPM) jointly cooperate in a full-duplex fashion and outperform all simplex (unidirectional) settings.
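The difference between the simplex and full-duplex directions can be sketched in plain Python. This is a minimal, hypothetical stand-in (global-average-pooled sigmoid gates instead of the paper's actual attention), meant only to show the routing asymmetry:

```python
import math

def channel_attention(feat):
    """Squeeze a feature map (list of per-channel activations) into a
    per-channel gate: sigmoid of the channel mean. A simplified,
    hypothetical stand-in for the attention vector in RCAM."""
    return [1.0 / (1.0 + math.exp(-sum(ch) / len(ch))) for ch in feat]

def simplex_fuse(app, mo):
    """Simplex (App. <= Mo.): motion attention gates appearance only;
    the motion branch passes through untouched."""
    gate = channel_attention(mo)
    return [[g * v for v in ch] for g, ch in zip(gate, app)], mo

def full_duplex_fuse(app, mo):
    """Full-duplex: each modality's attention gates the other branch
    simultaneously, with both gates computed from the original
    (un-gated) features so the two directions stay symmetric."""
    gate_a, gate_m = channel_attention(app), channel_attention(mo)
    app_out = [[g * v for v in ch] for g, ch in zip(gate_m, app)]
    mo_out = [[g * v for v in ch] for g, ch in zip(gate_a, mo)]
    return app_out, mo_out
```

In the simplex case only one branch is ever modified, so errors in the guiding modality propagate unchecked; the full-duplex routing lets each branch correct the other.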

Prediction Selection
Which is the final prediction, S^t_A or S^t_M? As mentioned in Sec. 3.5, we choose S^t_A as our final segmentation result instead of S^t_M. The major reasons can be summarized as follows: • We employ auxiliary supervision for the motion-based branch to learn more motion patterns, inspired by [96]. • More informative appearance and motion cues are contained in the other branch at the phase of bidirectional purification. As shown in Tab. 7, three experiments are conducted to verify our assumption: (a) choosing S^t_M as the final result, (b) choosing (S^t_A + S^t_M)/2 as the final result, and (c) choosing S^t_A as the final result (Ours). All three choices achieve very similar results, while S^t_A performs slightly better than the other two. Besides, considering the reduction of unnecessary

Tab. 7 Ablation study (Sec. 4.5.1) for the choice of the final segmentation result on the DAVIS16 [82] and MCL [49] datasets.
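The three candidate outputs (a)–(c) amount to a one-line selection over the two branch predictions. A hypothetical helper over per-pixel scores, for illustration only:

```python
def select_prediction(s_a, s_m, mode="appearance"):
    """Pick the final per-pixel prediction from the two branch outputs
    S^t_A and S^t_M (hypothetical helper mirroring choices (a)-(c))."""
    if mode == "motion":       # (a) S^t_M
        return s_m
    if mode == "average":      # (b) (S^t_A + S^t_M) / 2
        return [(a + m) / 2 for a, m in zip(s_a, s_m)]
    return s_a                 # (c) S^t_A -- the choice used by FSNet
```

Since the averaged variant requires evaluating both branch heads at inference time for a near-identical score, defaulting to S^t_A is also the cheaper option.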
Besides, our bidirectional purification scheme (i.e., 'full-dup.' in Tab. 6) also achieves a very notable improvement (2.1% and 1.0% gains in S_α on DAVIS16 [82] and MCL [49], respectively) over the "self-purification" variant (i.e., 'self-purf.' in Tab. 6), which has a similarly complex structure, further validating the benefit of the bidirectional behavior claimed in this study.

Relation Between RCAM and BPM
The two introduced modules, i.e., RCAM and BPM, focus on exploiting appearance and motion features while ensuring the information flow between them. They work collaboratively under the mutual restraint of our full-duplex strategy, but they cannot substitute for one another. This is because the RCAM transmits features at each level in a point-to-point manner (e.g., X_1 → Y_1), and thus fits the progressive feature extraction in the encoder. The BPM, on the other hand, broadcasts high-level features to low-level features via an interlaced decremental connection in a set-to-point manner (e.g., {F^n_2, F^n_3, F^n_4} → G^n_2), which is more suitable for the multi-level feature interaction in the decoder.
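The two routing patterns can be contrasted in a few lines. This is a schematic sketch over abstract per-level features, not the paper's implementation; `fuse` and `combine` stand in for the actual attention and purification operations:

```python
def rcam_point_to_point(xs, ys, fuse):
    """Encoder-side RCAM routing: level i of one branch interacts only
    with level i of the other (X_i -> Y_i)."""
    return [fuse(x, y) for x, y in zip(xs, ys)]

def bpm_set_to_point(feats, combine):
    """Decoder-side BPM routing: each output level G_i aggregates its own
    level and all deeper ones ({F_i, ..., F_L} -> G_i) -- the interlaced
    decremental connection."""
    return [combine(feats[i:]) for i in range(len(feats))]
```

The suffix slicing in `bpm_set_to_point` is what makes the connection "decremental": shallower outputs receive progressively larger sets of deeper features.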

Conclusion
In this paper, we present a simple yet efficient framework, termed full-duplex strategy network (FSNet), that fully leverages the mutual constraints of appearance and motion cues to address the video object segmentation problem. It consists of two core modules: the relational cross-attention module (RCAM) in the encoding stage and the efficient bidirectional purification module (BPM) in the decoding stage. The former abstracts features from the dual modalities, while the latter re-calibrates inconsistent features step by step. We thoroughly validate the functional modules of our architecture via extensive experiments, leading to several interesting findings. Finally, FSNet acts as a unified solution that significantly advances SOTA models for both U-VOS and V-SOD tasks. In the future, we may extend our scheme to learn short-term and long-term information in an efficient Transformer-like framework [114,143] to further boost accuracy.
Fig. 2 Visual comparison between the simplex strategies (i.e., (a) appearance-refined motion and (b) motion-refined appearance) and our full-duplex strategy under our framework. In contrast, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of the full-duplex strategy, thus providing more accurate structural details and alleviating the short-term feature drifting issue [129].

(Sec. 4.5) of our manuscript. • We investigate the self-purification mode of BPM under our FSNet (see Fig. 9 and Sec. 4.5.4), the relation between RCAM and BPM (see Sec. 4.5.5), and the training effectiveness with less data (see Sec. 4.5.3).

Fig. 4 Illustration of our Relational Cross-Attention Module (RCAM) with a simplex (a & b) and full-duplex (c) strategy.

Fig. 5 Illustration of our Bidirectional Purification Module (BPM) with a simplex and full-duplex strategy.

Fig. 6 Precision-recall curves of SOTA V-SOD methods and the proposed FSNet across six datasets. Best viewed in color and zoomed in for details.
This step lasts for 50 epochs with a batch size of 8 under the same training settings mentioned in Sec. 3.6.2.
• We then train the temporal branch on the generated optical-flow maps. This step lasts for 50 epochs with a batch size of 8 under the same training settings mentioned in Sec. 3.6.2.
• We finally load the weights pre-trained on the two subtasks into the spatial and temporal branches, and the whole network is then trained end-to-end on the training sets of DAVIS16 (30 clips) and FBMS (29 clips). The last step takes about 4 hours and converges after 20 epochs with a batch size of 8 under the same training settings mentioned in Sec. 3.6.2.
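The three-stage schedule above can be written out as a small config sketch. The stage names are our own labels, and the epochs and batch sizes are taken from the text; optimizer settings follow Sec. 3.6.2 and are not repeated here:

```python
# Hypothetical config sketch of the three-stage training schedule.
SCHEDULE = [
    {"stage": "spatial branch (appearance)",    "epochs": 50, "batch_size": 8},
    {"stage": "temporal branch (optical flow)", "epochs": 50, "batch_size": 8},
    {"stage": "end-to-end (DAVIS16 + FBMS)",    "epochs": 20, "batch_size": 8},
]

# Total training budget across all three stages.
total_epochs = sum(s["epochs"] for s in SCHEDULE)
```

Pre-training each branch on its own modality before the short joint stage keeps the final end-to-end fine-tuning cheap (about 4 hours, per the text).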