1 Introduction

The world is in motion, and stateful dynamic perception that provides reactive information about the world to the perceiver is essential for interpreting visual motion across different time scales. The perception of moving objects is a hallmark of visual intelligence, since autonomous systems need to interact with the world, not just perceive it. Gibson’s ecological optics conceptualized vision as an active perceptual system in which space and motion perception are inseparable (Gibson, 1950). A stateful moving object detection stream focuses attention for downstream visual perceptual processes such as tracking, recognition, avoidance, comprehension, and interaction. General motion detection and segmentation is challenging because of background clutter, distracting surfaces, occlusions, sporadic object motion, and changing environments caused by camera motion, degraded imaging optics, weather, haze, fog, dust, smoke, dynamic backgrounds, illumination changes, specularities, shadows, repetitive textures, or camouflage effects. Many approaches and pipelines have been proposed for moving object detection to tackle these challenges (Barnich & Van Droogenbroeck, 2011; Bianco et al., 2017; Shervin et al., 2020). Earlier approaches typically consisted of hand-crafted solutions with limited adaptation to changing scenarios and often relied on a collection of special-case procedures to handle challenging conditions and video categories. Recently, deep learning architectures have been developed for supervised moving object change detection. Transfer learning with state-of-the-art CNN models such as VGG-16 and ResNet-18, pre-trained on large benchmark datasets, provides feature embeddings that can be adapted to new visual tasks with only minor modifications and limited training data. Autoencoders are a popular deep learning architecture for segmentation tasks: features extracted in the encoder module, using a series of convolution and pooling layers, are upsampled by the decoder module to recover the original spatial resolution of the input image.

Fig. 1 DeepFTSG generalization result on a frame from unseen SBI-2015 video HumanBody2 (Fr = 730). First row: original image, ground truth mask, mixture of Gaussians (MoG). Second row: Flux, MoG Union Flux, our DeepFTSG mask

However, many current deep learning networks proposed for moving object detection rely on a single image using spatial-only appearance cues within an encoder-decoder framework and ignore the rich temporal dimension (Lim & Keles, 2018, 2020).

In this paper, we propose a novel hybrid moving object detection system, Deep Flux Tensor with Split Gaussian (DeepFTSG), which integrates a learned neural appearance model with FTSG motion and change cues using single- and multi-encoder networks with a shared decoder for fusion, enabling robust moving object detection.

The proposed DeepFTSG networks extend our recent single-stream Motion U-Net (Rahmon et al., 2021) hybrid deep architecture for motion segmentation which augments deep appearance with shallow motion and change cues using early fusion. In our earlier work, the unsupervised Flux Tensor with Split Gaussian (FTSG) motion analysis algorithm (Wang et al., 2014a), which detects motion across multiple temporal scales, won the CVPR 2014 Change Detection Workshop challenge with an overall F-measure of 72.83% (Goyette et al., 2012; Wang et al., 2014b).

The proposed DeepFTSG networks with early- and middle-fusion architectures consist of single- and multi-stream encoder modules extended by squeeze-and-excitation blocks, followed by a shared decoder module after multiple bottleneck stages associated with each stream, which can be viewed as a joint topological fused feature representation prior to the decoding stream. The squeeze-and-excitation blocks allow the network to perform feature recalibration by emphasizing informative features and suppressing less useful ones. Figure 1 shows sample moving object detection results using the proposed DeepFTSG network. Figures 2 and 5 provide an overview of the proposed architectures and squeeze-and-excitation blocks, which are described in detail in later sections. Two versions of DeepFTSG were tested. DeepFTSG-1 consists of a single stream, where appearance-based and spatiotemporal features are fused early before being fed to the network. DeepFTSG-2 consists of two streams, where the first stream receives a three-channel RGB frame as input and extracts appearance-based, spatial-only features, while the second stream receives pixel-level motion and change cues for the corresponding video frame and encodes spatiotemporal features. The feature maps generated by these streams are then concatenated and processed through the network’s decoder, resulting in a robust, multi-cue moving object detection system. Pixel-level flux motion and background subtraction change cues are obtained using unsupervised hand-crafted approaches that do not require any training stage or labeled frames.

Robust multiscale object detection, image segmentation, and tracking tasks require both object-level and pixel-level cues. The proposed DeepFTSG integrates pixel-level motion and change cues, efficiently computed using hand-crafted methods, with learned pixel- and object-level appearance cues within a deep learning framework. The motion and change cues enable spatiotemporal reasoning, while the learned appearance features and feature fusion incorporate region- and object-level information and semantic reasoning, significantly improving performance.

The main contributions of this paper are: (1) a robust moving object detection approach that integrates complementary appearance, motion, and change cues for spatiotemporal reasoning; (2) a novel multi-stream deep autoencoder network for fusing appearance-based and spatiotemporal information; (3) a hybrid, decoupled processing pipeline that takes advantage of hand-crafted pixel-level cues for reduced network complexity and labeled training data; and (4) the generalization capability of the proposed DeepFTSG to unseen videos, scenes and object categories compared to other approaches. The proposed system has been tested and evaluated on the comprehensive Change Detection 2014 Challenge dataset (Wang et al., 2014b).

2 Background and Related Work

Classical moving object detection approaches can be categorized into three broad classes: optical flow methods, temporal differencing, and background subtraction. Comprehensive reviews of these classical moving object detection methods can be found in Radke et al. (2005); Benezeth et al. (2008); Brutzer et al. (2011). Optical flow methods can be used with non-stationary cameras. However, reliable motion field computation under real-world conditions is challenging and computationally expensive, and these methods cannot deal with stopped objects. Temporal differencing-based methods are simple, fast, and can quickly adapt to different changes, and are thus suitable for dynamic backgrounds, illumination changes, backgrounds uncovered by removed objects, etc. However, without an explicit background model, they cannot detect slow-moving or stopped objects, often resulting in foreground aperture problems and failing to detect parts of objects (particularly large objects with homogeneous interiors, resulting in holes). Background subtraction-based methods that rely on change from an explicit background model are among the most popular moving object detection methods since they can handle slow-moving or stopped objects and do not suffer from foreground aperture problems. Sparse recovery methods for background subtraction are widely studied in the literature (Candes et al., 2011; Zhou et al., 2012; Liu et al., 2017; Xin et al., 2015; Liu et al., 2015). These methods identify moving objects by extracting sparse components from surveillance video frames, while low-rank components represent a background of stationary objects. However, background subtraction methods are sensitive to dynamic scene changes due to illumination changes, background revealed by moving objects, etc. Methods combining these approaches, such as Wang et al. (2014a), have produced better results.

Fig. 2 The single-stream DeepFTSG-1 architecture with early fusion and SE-ResNet-50 backbone, where each residual block uses a final squeeze and excitation. (a) shows the general architecture with 3-channel input (Gray, BGS, Flux), (b) shows the detailed conv block in the decoder, (c) shows the detailed conv block in the encoder

The development of real-world computer vision systems has been revolutionized by the adoption of deep neural learning methods. Recent approaches for moving object detection explore deep learning architectures including convolutional neural networks (CNNs), generative adversarial networks (GANs), autoencoders (AEs), recurrent neural networks (RNNs), and multibranch networks trained with labeled data. DeepBS (Babaee et al., 2018) proposed a convolutional neural network trained using a combination of input frames and associated background images with a patch-based technique. The network is trained with randomly selected video frames (5% of the CDnet-2014 dataset) and associated ground truth masks. BSUV-net 2.0 (Tezcan et al., 2021) uses a fully convolutional neural network for background subtraction of unseen videos. The network input consists of a current frame and two background frames taken at different time points, along with their semantic segmentation results. A pre-trained DeepLabv3 is used to extract the semantic segmentation results. BSGAN (Wenbo et al., 2020) uses median filtering for background estimation and then trains a Bayesian GAN to classify each pixel, in order to handle slow and sudden illumination changes, non-stationary backgrounds, and ghosting. Deep CNNs are adopted to construct the generator and the discriminator of the Bayesian GAN. A 3D convolutional neural network with long short-term memory (LSTM) was proposed by Akilan et al. (2020) to incorporate temporal information in a deep learning framework for background subtraction. 3D convolutions manage the time-dependent video cues to capture short temporal motions, and LSTM modules handle longer-term temporal motions during the down-sampling and up-sampling stages. Cascade CNN (Wang et al., 2017) is based on multi-resolution CNNs with a cascaded architecture. The network is trained with hand-picked frames that are made publicly available by the authors. FgSegNet (Lim & Keles, 2018) uses two encoder-decoder networks that produce multi-scale feature encodings. In the first model, three scales of inputs are given to an encoder. In the second model, a feature pooling module is included to extract multi-scale features. Both models use transposed CNNs on the decoder side. For training, 50 to 200 informative frames with ground truth masks were manually selected from the CDnet-2014 dataset. The FgSegNet approach uses multiple networks that are optimized per video sequence. FC-Siam (Caye Daudt et al., 2018) uses an encoder-decoder network with single and multiple streams to detect change between two images from large-scale Earth observation systems such as Copernicus or Landsat. Since the two streams carry similar information, the authors share weights between them in the encoder part of the proposed fully convolutional siamese network.

Because many moving object detection benchmarks were established before the recent popularity of deep learning methods, no specific training and testing dataset partitions were defined in the benchmarks. This leads to different training and testing video frame partitioning schemes across papers, making fair comparison difficult. Consequently, as pointed out by Tezcan et al. (2021), most of the top-performing deep moving object detection systems have been video-frame- or video-group-optimized and have never been tested on unseen videos, making it hard to judge their generalization capabilities. We address this limitation by using CDnet-2014 for training and validation, and SBI-2015 (Maddalena & Petrosino, 2015) and LASIESTA (Carlos et al., 2016) as unseen testing videos.

Change detection can help us track and study the movement and behavior of arbitrary objects in a video sequence (Theau, 2008). Accurate video segmentation is, therefore, a crucial step in change detection. Moreover, video segmentation is an initial step of video object tracking. Hence, object-tracking datasets can also be used to evaluate the generalization capabilities of moving object detection methods. Many publicly available single- or multiple-object tracking datasets could be used for this purpose. However, such an evaluation is only approximate: a moving object detector reports all moving objects in the scene, whereas in single object tracking only the object of interest has ground truth, and other objects are ignored even if they are moving. With this caveat, we used several video sequences from the LaSOT (Fan et al., 2019) single object tracking dataset as unseen test videos to evaluate the generalization capability of the proposed methods.

3 Change Detection Deep Learning Networks

We have designed a novel hybrid system to robustly detect moving foreground objects. The proposed system combines unsupervised computer vision methods for motion and change detection with deep learning-based semantic segmentation and fusion frameworks. This approach reduces architecture complexity and the need for extensive labeled training datasets by taking advantage of available hand-crafted solutions that produce fast, reliable results. We built two deep networks to better analyze the contribution of motion and change cues to overall system performance. The first network, DeepFTSG-1 in Fig. 2, consists of a U-Net-like semantic segmentation architecture extended by squeeze-and-excitation blocks with a single input stream, where the appearance-based and spatiotemporal information are fused early before being fed to the network. Our second network, DeepFTSG-2 in Fig. 5, extends DeepFTSG-1 by decoupling appearance-based information from spatiotemporal information using multiple input streams. The first stream carries appearance information, and the second stream incorporates spatiotemporal reasoning through motion and change cues. Finally, the two streams are combined after the joint topological representation through middle fusion.

3.1 DeepFTSG-1: Single-Stream Early Fusion for Spatiotemporal Change Detection

Table 1 Detailed configuration and specifications of the proposed DeepFTSG-1

Single-frame object-of-interest detection and semantic segmentation tasks have been revolutionized by recent deep learning architectures (Girshick, 2015; He et al., 2017; Redmon et al., 2016; Chen et al., 2018). While effective, single-frame detection networks similar to FgSegNet_v2 that rely only on appearance cues for change detection suffer from three main limitations: (1) they cannot detect untrained new (moving) objects, (2) they fail when appearance cues are limited (e.g. small targets when objects are far from the camera), and (3) they cannot differentiate between moving and stationary object instances. This brittleness limits the generalization power of such networks, which likely will not scale well in real-world applications with unanticipated inputs (Yuille & Liu, 2020).

DeepFTSG augments appearance-based features with hand-engineered motion and change cues computed using fast unsupervised vision algorithms. The proposed single-stream moving object detection network, DeepFTSG-1, extends our previous network MU-Net2 (Rahmon et al., 2021) and is based on an SE-ResNet-50 (Hu et al., 2018) backbone instead of a plain U-Net encoder, where squeeze-and-excitation blocks are used after each residual block of the ResNet-50. The identity shortcut connections, which skip one or more layers to facilitate deeper information propagation, enable deeper layers without gradient degradation during network learning. In addition, the squeeze-and-excitation blocks allow the network to perform feature recalibration by emphasizing informative features and suppressing less useful ones. The proposed DeepFTSG-1 uses motion cues computed from flux motion using our fast tensor-based motion estimation (Bunyak et al., 2007) and change cues from an adaptive split-Gaussian multi-modal background subtraction model (Wang et al., 2014a; Zivkovic & van der Heijden, 2006; Zivkovic, 2004). DeepFTSG-1 uses a three-channel input stream, with the first (R) channel carrying the appearance (the three-channel RGB color input converted to grayscale). The motion and change cues corresponding to the current frame, computed using a background model built from past frames for slower temporal change and a temporal sliding window of frames for flux motion, are assigned to the second (G) and third (B) channels.
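As a concrete illustration of this early-fusion input, the three channels (grayscale appearance, BGS change mask, flux motion mask, following the Gray/BGS/Flux ordering of Fig. 2) can be stacked as in the following sketch; the function and variable names are illustrative and not taken from the DeepFTSG code.

```python
import cv2
import numpy as np

def build_early_fusion_input(frame_bgr, bgs_mask, flux_mask):
    """Stack appearance, change, and motion cues into a 3-channel tensor.

    frame_bgr : HxWx3 uint8 BGR frame (appearance)
    bgs_mask  : HxW   uint8 background-subtraction (change) mask
    flux_mask : HxW   uint8 flux-tensor (motion) mask
    Returns an HxWx3 float32 array in [0, 1], ordered (gray, BGS, flux),
    matching the R/G/B channel assignment described in the text.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    fused = np.dstack([gray, bgs_mask, flux_mask]).astype(np.float32) / 255.0
    return fused
```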

Figure 2 shows the overall architecture of the proposed DeepFTSG-1 single-stream moving object detection network with early fusion of motion and change cues. The overall network is similar to the U-Net architecture (Ronneberger et al., 2015), with skip connections after each block of SE-ResNet-50. The decoder part of the proposed network consists of four blocks; in each block the feature maps are upsampled and concatenated with the feature maps from the corresponding SE-ResNet-50 block. Finally, a \(1\times 1\) convolution layer decreases the number of feature maps, and a final sigmoid activation layer produces the class label probabilities. Thresholding these probabilities yields foreground/background segmentation masks. Table 1 provides the detailed configuration and specifications of the proposed DeepFTSG-1.
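As a hedged illustration of the building blocks just described, the following PyTorch sketch shows a squeeze-and-excitation recalibration module and a U-Net-style decoder block (upsample, concatenate the encoder skip features, convolve). Channel widths and layer counts are placeholders and do not reproduce the exact DeepFTSG-1 configuration in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        # Squeeze: global average pool to a per-channel descriptor.
        s = x.mean(dim=(2, 3))
        # Excitation: two FC layers produce per-channel weights in (0, 1).
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        return x * w.view(x.size(0), x.size(1), 1, 1)

class DecoderBlock(nn.Module):
    """Upsample, concatenate the encoder skip features, then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)
```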

We detail below how fast unsupervised scene-independent methods estimate change and motion cues.

3.1.1 Multi-modal Background Subtraction for Change Estimation

Change is estimated using a background subtraction (BGS) approach. There is extensive literature on estimating background subtraction models for identifying temporal change (Crivelli et al., 2011; Andrews & Antoine, 2014; Yizhe & Elgammal, 2017). To efficiently tackle multi-modal backgrounds, we use the adaptive mixture of Gaussians method described in Zivkovic and van der Heijden (2006); Zivkovic (2004) and implemented in the OpenCV library (BackgroundSubtractorMOG2). The method supports a variable number of Gaussian models per pixel. The OpenCV implementation also enables shadow detection by default. The only parameter that we set is the variance threshold for the pixel-model matching (setVarThreshold), empirically chosen as 16. Before being fed to the background subtraction module, the image sequence is smoothed using a \(5\times 5\) Gaussian filter. Foreground masks obtained from the background subtraction module are given to the DeepFTSG-1 and DeepFTSG-2 networks as input. This process returns information about longer-term change, corresponding to moving objects, objects that were once moving but have since stopped, and other long-term changes in the scene. Fig. 3 demonstrates the result of background subtraction for a single time-step.
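A minimal sketch of this change-estimation step using OpenCV's MOG2 background subtractor, with the variance threshold of 16 and the 5×5 Gaussian pre-smoothing described above; the final thresholding that discards shadow labels is our assumption, and all other parameters are left at their OpenCV defaults.

```python
import cv2

# Adaptive mixture-of-Gaussians background subtractor (BackgroundSubtractorMOG2).
bgs = cv2.createBackgroundSubtractorMOG2()
bgs.setVarThreshold(16)          # pixel-model matching threshold used in the paper

def change_mask(frame_bgr):
    """Return the BGS (change) mask for one frame."""
    smoothed = cv2.GaussianBlur(frame_bgr, (5, 5), 0)
    fg = bgs.apply(smoothed)     # 0 = background, 127 = shadow, 255 = foreground
    # Keep only confident foreground, dropping shadow labels (assumed post-step).
    _, fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)
    return fg
```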

3.1.2 Tensor-Based Motion Estimation

While background subtraction algorithms have advantages such as robustness to aperture problems and response to stopped objects, they are sensitive to dynamic background changes. Due to the recursive nature of these approaches, any error in background estimation also tends to persist for a long time, since the temporal dynamics are updated slowly. To account for fast motion or short-term change, while being robust to dynamic background changes and background estimation errors, we use an explicit motion detection module. For fast and robust motion estimation, we use our efficient tensor-based motion computation scheme, the flux tensor (Bunyak et al., 2007), and build upon our previous experience in optimizing FTSG (Wang et al., 2014a). Optical flow-based motion estimation using deep or traditional methods can alternatively be used (Dosovitskiy et al., 2015; Sun et al., 2018; Schuster et al., 2020). The flux tensor represents the temporal variation of the optical flow field within the local 3D spatiotemporal volume (Bunyak et al., 2007; Palaniappan et al., 2011; Wang et al., 2014a). In expanded matrix form, the flux tensor is defined as,

$$\begin{aligned} J_{F} = \begin{bmatrix} \int _\Omega \left( \frac{\mathrm{d}^2 I}{\mathrm{d}x\,\mathrm{d}t}\right) ^2 \mathrm{d}y &{} \int _\Omega \frac{\mathrm{d}^2 I}{\mathrm{d}x\,\mathrm{d}t}\, \frac{\mathrm{d}^2 I}{\mathrm{d}y\,\mathrm{d}t}\, \mathrm{d}y &{} \int _\Omega \frac{\mathrm{d}^2 I}{\mathrm{d}x\,\mathrm{d}t}\, \frac{\mathrm{d}^2 I}{\mathrm{d}t^2}\, \mathrm{d}y \\ \int _\Omega \frac{\mathrm{d}^2 I}{\mathrm{d}y\,\mathrm{d}t}\, \frac{\mathrm{d}^2 I}{\mathrm{d}x\,\mathrm{d}t}\, \mathrm{d}y &{} \int _\Omega \left( \frac{\mathrm{d}^2 I}{\mathrm{d}y\,\mathrm{d}t}\right) ^2 \mathrm{d}y &{} \int _\Omega \frac{\mathrm{d}^2 I}{\mathrm{d}y\,\mathrm{d}t}\, \frac{\mathrm{d}^2 I}{\mathrm{d}t^2}\, \mathrm{d}y \\ \int _\Omega \frac{\mathrm{d}^2 I}{\mathrm{d}t^2}\, \frac{\mathrm{d}^2 I}{\mathrm{d}x\,\mathrm{d}t}\, \mathrm{d}y &{} \int _\Omega \frac{\mathrm{d}^2 I}{\mathrm{d}t^2}\, \frac{\mathrm{d}^2 I}{\mathrm{d}y\,\mathrm{d}t}\, \mathrm{d}y &{} \int _\Omega \left( \frac{\mathrm{d}^2 I}{\mathrm{d}t^2}\right) ^2 \mathrm{d}y \end{bmatrix} \end{aligned}$$
(1)

where I is the spatiotemporal image volume, derivatives are calculated in x, y, t, and integration is performed over the local area \(\Omega \). The elements of the flux tensor incorporate information about spatiotemporal gradient changes. By analyzing the changes in the gradient of the image intensity over time, the flux tensor can identify regions of the image that correspond to moving objects. This information can be used to segment the image into moving and stationary regions, allowing for efficient discrimination between the two. Sequential and parallel computations of the flux tensor matrix are described in Palaniappan et al. (2011). The trace of the flux tensor matrix can be compactly written as,

$$\begin{aligned} \mathrm{trace}(J_{F}) = \int _\Omega \left\| \frac{\mathrm{d}}{\mathrm{d}t} \nabla I \right\| ^2 \mathrm{d}y \end{aligned}$$
(2)

and computed efficiently to classify moving and non-moving regions without expensive eigenvalue decompositions (Palaniappan et al., 2010; Dardo et al., 2016). Four hyper-parameters need to be set, and we set them empirically as follows: spatial filter size = 7, spatial averaging size = 5, temporal filter size = 7, and temporal averaging size = 5. The output of the flux tensor is given directly to the DeepFTSG-1 and DeepFTSG-2 networks as input. Figure 3 shows a sample result of the flux analysis for a single frame of processed video. Figure 4 shows the spatio-temporal volumes of a video sequence; panel (b) visualizes the flux motion through time.
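For illustration, a simplified NumPy/SciPy approximation of the flux trace in Eq. (2): the temporal derivatives of the spatiotemporal gradient are squared, summed, and spatially averaged over the local neighborhood, then thresholded into a motion mask. This sketch uses simple central differences rather than the optimized separable filters of Palaniappan et al. (2011), and the threshold value is a placeholder.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def flux_trace_mask(frames, avg_size=5, threshold=100.0):
    """Approximate trace of the flux tensor (Eq. 2) for the central frame.

    frames : (T, H, W) float32 array, a short temporal window of grayscale frames.
    Returns a binary motion mask (uint8, 0/255) for the middle frame.
    """
    # Spatial gradients of every frame in the window (d/dy, d/dx).
    gy, gx = np.gradient(frames, axis=(1, 2))
    # Temporal derivatives I_xt, I_yt, I_tt via differences along the time axis.
    gxt = np.gradient(gx, axis=0)
    gyt = np.gradient(gy, axis=0)
    gtt = np.gradient(np.gradient(frames, axis=0), axis=0)
    mid = frames.shape[0] // 2
    # ||d/dt grad I||^2 = I_xt^2 + I_yt^2 + I_tt^2 at the central frame.
    trace = gxt[mid] ** 2 + gyt[mid] ** 2 + gtt[mid] ** 2
    # Local spatial averaging plays the role of integration over the neighborhood Omega.
    trace = uniform_filter(trace, size=avg_size)
    return (trace > threshold).astype(np.uint8) * 255
```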

Fig. 3 BGS and flux result on a frame from CDnet-2014 video fountain01 (Fr = 730). First row: original image, ground truth. Second row: BGS, Flux

Fig. 4 Motion visualized as oriented energy fields. Spatio-temporal brightness volumes (\(x-y-t\)) of a video sequence fountain01 from CDnet-2014: a original input, b flux motion, c combination of motion and change (green channel is flux (fast) motion, red channel is multimodal (persistent) change). In c a maximum intensity volume rendering is shown without alpha transparency for the black voxels

3.2 DeepFTSG-2: Multi-Stream Middle Spatiotemporal Fusion

Fig. 5 The multi-stream DeepFTSG-2 USE-Net trellis architecture with middle fusion (or intermediate fusion), which includes appearance (RGB 3-channels) in the first stream, and multi-channel motion (flux) plus change cues (BGS 2-channels) in the second stream. The feature embedding backbones are fused in the decoder stage with shared forward connections

The proposed DeepFTSG-2 extends DeepFTSG-1 by decoupling appearance-based information from spatiotemporal information using multiple input streams. The first input stream receives the three-channel RGB color input from the current frame, and the second input stream receives motion and change cues corresponding to the current frame, computed using a temporal sliding window of frames for motion and a background model built from past frames for change. The two input streams go through two parallel feature extraction modules. The first processing stream (appearance encoder) extracts spatial appearance features using the SE-ResNet-50 backbone, and the second processing stream (motion encoder) extracts spatiotemporal motion- and change-based features using the ResNet-18 backbone. The feature maps generated by these two encoders are then fused and processed through the network’s decoder. The motion and change cues are stacked channel-wise, where the red channel (R) corresponds to the background subtraction mask, the green channel (G) corresponds to the motion mask, and the blue channel (B) is set to 0. A three-channel input is used to comply with the ResNet-18 input format. DeepFTSG-2 does not share weights between the two streams and uses intermediate fusion since the input streams capture different phenomena (appearance versus motion and change).

A deep encoder backbone is adopted for the appearance encoder since it takes the raw image as input. The deep architecture (SE-ResNet-50), equipped with squeeze-and-excitation blocks, allows for deep feature extraction. A shallower backbone (ResNet-18) is used for the motion encoder since its input already consists of higher-level feature maps. Both streams have five spatial resolutions, each of which feeds into the corresponding block of the decoder through skip connections. Fig. 5 illustrates the architecture of the multi-stream DeepFTSG-2 with middle fusion of the appearance and spatiotemporal streams. Table 2 provides the detailed configuration and specifications of the proposed DeepFTSG-2. We provide a comprehensive set of results in Tables 16 and 17 to show the effect of different architectures/backbones on the quality of change detection.
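A schematic sketch of the middle-fusion idea in PyTorch: two independent encoders process the appearance and motion/change streams, their bottleneck feature maps are concatenated, and a shared decoder produces the mask. Tiny convolutional stacks stand in for the SE-ResNet-50 and ResNet-18 backbones, and the five-resolution skip connections of the real DeepFTSG-2 are omitted for brevity.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True), nn.MaxPool2d(2))

class MiddleFusionNet(nn.Module):
    """Two-stream encoder with middle fusion and a shared decoder (schematic)."""
    def __init__(self):
        super().__init__()
        self.appearance_encoder = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.motion_encoder = nn.Sequential(conv_block(3, 16), conv_block(16, 32))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64 + 32, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(16, 1, kernel_size=1)       # 1x1 conv to mask logits

    def forward(self, rgb, motion_change):
        f_app = self.appearance_encoder(rgb)              # appearance features
        f_mot = self.motion_encoder(motion_change)        # motion + change features
        fused = torch.cat([f_app, f_mot], dim=1)          # middle (intermediate) fusion
        return torch.sigmoid(self.head(self.decoder(fused)))

# Example: a 320x480 frame and its 3-channel (BGS, flux, zero) cue image.
# net = MiddleFusionNet()
# mask = net(torch.rand(1, 3, 320, 480), torch.rand(1, 3, 320, 480))
```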

4 Experimental Results

In this section, we present details on the test datasets, evaluation metrics, and qualitative and quantitative results of the proposed DeepFTSG deep learning change detection architectures.

Table 2 Detailed configuration and specifications of the proposed DeepFTSG-2
Table 3 Distribution of major foreground object categories in each dataset collection indicating number of videos (vid) and total number of frames (fr)

4.1 Benchmark Evaluation Datasets

We used four benchmark datasets to evaluate the proposed method: the CDnet-2014 change detection challenge dataset (Wang et al., 2014b), the SBI-2015 scene background initialization dataset (Maddalena & Petrosino, 2015), the Labeled and Annotated Sequences for Integral Evaluation of SegmenTation Algorithms (LASIESTA) dataset (Carlos et al., 2016), and the Large-scale Single Object Tracking (LaSOT) dataset (Fan et al., 2019). CDnet-2014 was developed to enable objective and precise quantitative comparison and ranking of change detection algorithms. It consists of 159,278 frames, of which 118,173 are labeled, from 53 video sequences organized into 11 categories corresponding to realistic scenarios and challenging conditions, including illumination change, bad weather, dynamic background, night videos, PTZ, thermal, etc. Spatial resolutions of the videos in the dataset vary from \(320\times 240\) to \(720\times 576\), and videos may include multiple moving objects. CDnet-2014 is the most comprehensive dataset for change and moving object detection, with continuously updated evaluations posted on the Change Detection Workshop website. We followed the same approach as FgSegNet (Lim & Keles, 2018) and Wang et al. (2017) by selecting 200 frames from each video sequence within the labeled frames of the original CDnet-2014 dataset for training the proposed DeepFTSG networks. This split uses only 10,600 CDnet frames for training, corresponding to approximately 6.6% of the whole dataset, with the remainder of the labeled frames used for testing, including hidden frames.

The Scene Background Initialization (SBI) 2015 dataset contains 14 video sequences with ground-truth labels provided by Wang et al. (2017). We used 10 suitable video sequences from this dataset to evaluate the generalization capability of our video segmentation models trained only on the CDnet-2014 dataset. The video sequences “Foliage”, “PeopleAndFoliage”, “Snellen”, and “Toscana” were not used in the evaluation of our pre-trained model, because “Foliage”, “PeopleAndFoliage”, and “Snellen” mainly contain moving tree branches, which are not our objects of interest, and “Toscana” has only six frames in total.

The Labeled and Annotated Sequences for Integral Evaluation of SegmenTation Algorithms (LASIESTA) dataset is composed of many real indoor and outdoor sequences organized into different categories, each covering a specific challenge for moving object detection strategies. Moreover, it contains sequences recorded with static and moving cameras and provides information about moving objects that remain temporarily static. The LASIESTA dataset contains 26 indoor and 20 outdoor sequences (including 12 simulated-motion sequences across the indoor and outdoor sets), and it is fully annotated at both the pixel level and the object level. We used all video sequences from this dataset to evaluate the generalization capability of our video segmentation models trained only on the CDnet-2014 dataset.

There are few fully annotated video datasets for moving object detection with precise ground truth segmentation masks. However, there is an extensive collection of video datasets for object tracking with ground truth bounding boxes. Large-scale Single Object Tracking (LaSOT) is a large video collection for single object tracking. LaSOT consists of 1550 video sequences with more than 3.87 million high-quality, carefully inspected, manually annotated frames. We performed an initial motion segmentation test using five object categories from LaSOT (bicycle, car, dog, giraffe, and person), with two sample video sequences from each category, to evaluate the generalization capacity of our video segmentation models trained using only the CDnet-2014 dataset.

Table 3 shows the category distribution of each dataset used in this paper. The categories are person, vehicle (car, bus, truck, etc.), animal (dog, giraffe, etc.), synthetic (simulated motion), and other. For each dataset, we report the number of video sequences containing each category and the approximate number of frames.

4.2 DeepFTSG Training Details

Weights for the ResNet-18 and SE-ResNet-50 modules used in DeepFTSG-1 and DeepFTSG-2 are initialized with weights pre-trained on ImageNet. The input image size to the deep networks is \(320\times 480\). The Adam optimizer is used during training with an initial learning rate of \(10^{-4}\) that is reduced by a factor of 10 after every 20 epochs. The CDnet-2014 training data is shuffled and split into 90% for training and 10% for validation on a per-video basis, with 200\(\times \)53 = 10,600 total frames in the training set out of almost 160K frames in the whole dataset. Since there is an imbalance between the foreground and background classes (i.e. in some frames the foreground area constitutes less than 20% of the total image area), a combined loss function consisting of Dice loss (Eq. 3) (Sudre et al., 2017) and binary cross-entropy loss (Eq. 4) is used to train the network. The smoothed Dice loss is defined as,

$$\begin{aligned} L_{Dice} = 1 - \frac{2\sum _{i=1}^{N} p_{i} \, g_{i}}{\sum _{i=1}^{N} p_{i}^{2} + \sum _{i=1}^{N} g_{i}^{2} + \epsilon } \end{aligned}$$
(3)

in which \(p_{i}\) is the predicted foreground probability and \(g_{i}\) the corresponding ground-truth label for pixel i, N is the total number of pixels in the mini-batch set of images, and \(\epsilon \) is a small regularization value. The binary cross-entropy (BCE) loss is defined as,

$$\begin{aligned} L_{BCE} = -\frac{1}{N} \sum _{i=1}^N \left[ g_i \log p_i + (1 - g_i) \log (1-p_i) \right] \end{aligned}$$
(4)

where \(g_i\) is the true label and \(p_i\) is the predicted probability as in \(L_{Dice}\). The final combined loss to be minimized during training is given by,

$$\begin{aligned} Loss = \lambda \, L_{BCE} + (1 - \lambda )\, L_{Dice} \end{aligned}$$
(5)

where \(\lambda \) is the weight parameter, empirically chosen as 0.5. Using a \(\lambda \) that changes with the epoch number can lead to a small improvement in performance by emphasizing segmentation boundary accuracy (Dice loss). In addition, since the labels provided by the CDnet-2014 dataset contain ignored (masked out) regions, we updated the loss function to exclude these regions so that the network is not penalized if foreground is generated in these (don’t care) masked regions during training. Fig. 6 shows ground-truth images from the CDnet-2014 dataset with ignored regions specified in dark-gray.
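A hedged PyTorch sketch of the combined loss in Eqs. (3)-(5) with the don't care pixels excluded; here `valid` is 1 for evaluated pixels and 0 for the ignored CDnet-2014 regions, and the exact masking in the DeepFTSG implementation may differ.

```python
import torch

def combined_loss(pred, target, valid, lam=0.5, eps=1e-6):
    """BCE + Dice loss (Eq. 5), ignoring masked-out (don't care) pixels.

    pred   : predicted foreground probabilities in [0, 1]
    target : binary ground-truth labels
    valid  : 1 for pixels to evaluate, 0 for CDnet-2014 'ignore' regions
    """
    pred, target = pred * valid, target * valid
    # Binary cross-entropy averaged over valid pixels only (Eq. 4).
    bce = -(target * torch.log(pred + eps) + (1 - target) * torch.log(1 - pred + eps))
    bce = (bce * valid).sum() / valid.sum().clamp(min=1)
    # Smoothed Dice loss (Eq. 3).
    inter = (pred * target).sum()
    dice = 1 - 2 * inter / (pred.pow(2).sum() + target.pow(2).sum() + eps)
    return lam * bce + (1 - lam) * dice
```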

Fig. 6 Ground-truth images from CDnet-2014 dataset with ignored or don’t care regions shown in dark-gray color

The proposed DeepFTSG-1 and DeepFTSG-2 models were implemented in PyTorch and trained for 50 epochs with a mini-batch size of 16. For each epoch, the training and validation samples are reshuffled.
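The training configuration described above (Adam with an initial learning rate of 1e-4 reduced tenfold every 20 epochs, 50 epochs, mini-batches of 16 with reshuffling) maps directly onto standard PyTorch utilities, as in the sketch below; the model and data here are trivial stand-ins so the snippet runs, and the real training uses the combined BCE + Dice loss of Eq. (5).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the sketch runs: a trivial model and a tiny synthetic dataset.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 1), torch.nn.Sigmoid())
data = TensorDataset(torch.rand(32, 3, 64, 64),
                     torch.randint(0, 2, (32, 1, 64, 64)).float())
train_loader = DataLoader(data, batch_size=16, shuffle=True)   # reshuffled every epoch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
bce = torch.nn.BCELoss()

for epoch in range(50):
    for frames, masks in train_loader:
        optimizer.zero_grad()
        loss = bce(model(frames), masks)   # the paper uses the combined BCE + Dice loss
        loss.backward()
        optimizer.step()
    scheduler.step()                       # learning rate drops by 10x every 20 epochs
```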

Table 4 Detailed evaluation of DeepFTSG-1 and DeepFTSG-2 on CDnet-2014 eleven video categories compared to Flux Tensor Split-Gaussian (FTSG) unsupervised results (Wang et al., 2014a)

It took \(\approx 10\) hours (each epoch takes about 12 min) on an NVIDIA GeForce GTX 1080 Ti GPU and \(\approx 7.5\) hours (each epoch takes about 9 min) on an NVIDIA Tesla V100 GPU to finish the whole training process for the single-stream DeepFTSG-1. For multi-stream DeepFTSG-2, it took \(\approx 15\) hours (each epoch takes about 18 min) on an NVIDIA GeForce GTX 1080 Ti GPU and \(\approx 13.33\) hours (each epoch takes about 16 min) to finish the whole training process on an NVIDIA Tesla V100 GPU.

4.3 Evaluation Metrics

We evaluated the performance of the proposed DeepFTSG deep neural architectures on unseen frames in each video and compared it to the top-ranked methods listed on the Change Detection Workshop website using seven assessment metrics (Goyette et al., 2012). The seven metrics are recall (Re), specificity (Sp), false-positive rate (FPR), false-negative rate (FNR), precision (P), F-Measure (F), and percentage of wrong classifications (PWC), defined as,

$$\begin{aligned} \begin{aligned} { Re = \frac{TP}{(TP+FN)}; \quad \quad Sp = \frac{TN}{(TN+FP)} } \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} { FPR = \frac{FP}{(FP+TN)}; \quad \quad FNR = \frac{FN}{(TP+FN)} } \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} { P = \frac{TP}{(TP+FP)}; \quad \quad F = \frac{2 \times P \times Re}{(P+Re)} } \end{aligned} \end{aligned}$$
(8)
$$\begin{aligned} \begin{aligned} { PWC = \frac{100 \times (FN+FP)}{(TP+TN+FP+FN)} } \end{aligned} \end{aligned}$$
(9)

where TP (true positive) denotes the number of correctly labeled foreground pixels; TN (true negative) denotes the number of correctly labeled background pixels; FN (false negative) represents the number of wrongly classified foreground pixels, and FP (false positive) represents the number of wrongly classified background pixels. We computed these metrics using the standardized assessment tool provided by Goyette et al. (2012). Lower values indicate better performance for the PWC, FNR, and FPR metrics, while higher values indicate better performance for the Recall, Precision, and F-Measure metrics. Among these metrics, we primarily use the F-Measure (F), also known as the \(F_1\) score, which is the harmonic mean of precision and recall and is generally accepted as a good indicator of overall change detection performance, balancing precision and recall accuracy to reduce Type I (FP) and Type II (FN) errors.
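The seven metrics in Eqs. (6)-(9) can be computed directly from pixel counts; a small sketch assuming binary NumPy masks (degenerate cases with empty foreground would need additional guarding):

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """Compute the CDnet metrics (Eqs. 6-9) from binary prediction/ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    re = tp / (tp + fn)          # recall
    sp = tn / (tn + fp)          # specificity
    p = tp / (tp + fp)           # precision
    return {
        "Recall": re, "Specificity": sp,
        "FPR": fp / (fp + tn), "FNR": fn / (tp + fn),
        "Precision": p, "F-Measure": 2 * p * re / (p + re),
        "PWC": 100.0 * (fn + fp) / (tp + tn + fp + fn),
    }
```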

4.4 Experiments on CDnet-2014 Benchmark Videos

Using the CDnet-2014 dataset, we trained two networks, DeepFTSG-1 and DeepFTSG-2. Quantitative evaluation results for the proposed DeepFTSG-1 and DeepFTSG-2 are shown in Table 4. The proposed DeepFTSG-1 network produced an overall F-measure of 0.9652, while the DeepFTSG-2 network produced a slightly better overall F-measure of 0.97. The lowest performance for both networks is in the difficult Night Videos category, with an F-measure of 0.8481 for DeepFTSG-1 and 0.9023 for DeepFTSG-2. For reference, we include the results of our previous FTSG (Wang et al., 2014a) non-deep-learning unsupervised approach that won the original CDnet-2014 challenge. The significant improvement (around 25%) of DeepFTSG compared to FTSG demonstrates that incorporating object appearance and fusing change semantics results in better motion detection without necessarily requiring direct object recognition. DeepFTSG does not use an explicit object detection and classification network to learn object bounding box labels such as vehicle, person, animal, or bike, though this could be incorporated in the future.

Table 5 Comparison using CDnet-2014 of DeepFTSG-1 and DeepFTSG-2 to top performing deep learning methods, including methods FgSegNet_v2 (Lim & Keles, 2020), FgSegNet_S (Lim & Keles, 2018), FgSegNet (Lim & Keles, 2018), BSPVGAN (Wenbo et al., 2020), MU-Net2 (Rahmon et al., 2021), BSGAN (Wenbo et al., 2020), FTSG (Wang et al., 2014a)

We compare the performance of the proposed DeepFTSG-1 and DeepFTSG-2 methods with the top six state-of-the-art methods listed in Goyette et al. (2012), as shown in Table 5. Evaluations were done by uploading the results to the CDnet-2014 challenge website. Since the results are not published there yet, they can be reached through the links given in footnote 1 for DeepFTSG-1 and footnote 2 for DeepFTSG-2. It can be seen from Table 5 that our proposed DeepFTSG is competitive with current state-of-the-art supervised methods and outperforms BSPVGAN and our previous work MU-Net2.

Since the authors of the current top-ranked method FgSegNet_v2 made their code publicly available on GitHub, we could run the code and produce the same results as the ones supplied by the authors. Table 6 summarizes the main differences between FgSegNet_v2 and our proposed DeepFTSG architectures.

Table 6 Training duration and parameter sizes for FgSegNet_v2 and DeepFTSG using common hardware

One of the key differences is that FgSegNet_v2 trains a separate deep network for each video sequence, resulting in an ensemble of 53 distinctly parameterized networks for inference. Generating this ensemble of networks takes considerable training time (29 days). In contrast, our approach trains a single network, using only 200 frames per video, that is sufficient for inference across all 53 video sequences. Training takes a fraction of the time: 10 h for DeepFTSG-1 and 15 h for DeepFTSG-2. Compared to FgSegNet_v2, DeepFTSG requires less than 2.2% of the training time and its inference is more efficient, using less than 10% of the neural weights; it robustly fuses appearance and motion, has competitive accuracy on CDnet-2014, and, most importantly, shows high generalizability across all objects and scenes using a single network. DeepFTSG can readily distinguish between moving and stationary states of the same object type, whereas FgSegNet_v2 and other similar architectures cannot.

Initially, to make a fair comparison between our proposed networks and the top-performing network FgSegNet_v2 on the CDnet-2014 dataset, we trained the FgSegNet_v2 network for each category of the CDnet-2014 dataset instead of at the video level, resulting in 11 network models instead of 53. Those 11 models were used to generate binary masks through an average voting procedure for each CDnet-2014 video frame. Normally, FgSegNet_v2 requires one trained network per video, which does not allow assessing its generalization capacity. Because of that, we used a voting-based approach to generate FgSegNet_v2 detection masks. Using the 11 FgSegNet_v2 masks, one from each CDnet-2014 category model, we performed pixel-based averaging for each frame and applied a simple (majority) voting rule: if the pixel average is greater than 50% (i.e. the sum is greater than 5.5), then the pixel in the voting mask is set to true. We refer to the result of this experiment as M_FgSegNet_v2(50%). It can be seen from Table 5 that when the ensemble approach is used with 11 models instead of 53, the performance of the top-performing method decreases dramatically to an F-Measure of 34%, which is 63% lower than our proposed method DeepFTSG.
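The M_FgSegNet_v2(50%) voting rule described above amounts to a pixel-wise majority vote; a small sketch, assuming the 11 per-category binary masks are stacked along the first axis:

```python
import numpy as np

def majority_vote(masks):
    """Pixel-wise majority vote over an (11, H, W) stack of binary {0, 1} masks.

    A pixel is foreground when more than half of the models (sum > 5.5) vote for it.
    """
    masks = np.asarray(masks, dtype=np.float32)
    return (masks.sum(axis=0) > 0.5 * masks.shape[0]).astype(np.uint8) * 255
```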

In terms of the original training regimen, FgSegNet_v2 trains 53 distinctly parameterized networks for inference, while the recently proposed method by Kim and Ha (2020), which we refer to as KimHa, trains a single network using 61,593 images, almost six times the number of CDnet-2014 images used by FgSegNet and DeepFTSG, and excludes four categories (camera jitter, PTZ, thermal, and turbulence). For a fair comparison with FgSegNet_v2 and KimHa, we retrained both methods using the same configuration that we used to train the DeepFTSG models, that is, selecting 200 frames from each video sequence (200 \(\times \) 53 = 10,600 frames) within the labeled frames of the original CDnet-2014 dataset without excluding any categories. We refer to the resulting single-model versions trained using all of the training video frames as M_FgSegNet_v2 and M_KimHa, respectively. It can be observed from Table 5 that when a single model is trained for FgSegNet_v2 instead of 53 separate models, the performance of the top-performing method decreases significantly to an F-Measure of 90.78%, which is more than 6% lower than our proposed method DeepFTSG. Compared to M_FgSegNet_v2, DeepFTSG requires less than 2.4% of the training time. Fig. 7 shows a qualitative comparison of methods on sample frames from various challenging categories of the CDnet-2014 dataset. Conventional background subtraction techniques such as SuBSENSE produce either ghost artifacts or fragmented foregrounds in the region of interest, as shown in Fig. 7.

Fig. 7 Qualitative comparison of the proposed methods with top performing algorithms on sample frames from different categories of CDnet-2014. Column left to right: input images, ground truth, SuBSENSE, M_KimHa, M_FgSegNet_v2, MU-Net2, DeepFTSG-1, and DeepFTSG-2. Row top to bottom: cameraJitter/badminton, PTZ/continuousPan, intermittentObjectMotion/winterDriveway, badWeather/blizzard, nightVideos/tramStation, thermal/park, dynamicBackground/fountain01. The dark-gray areas represent pixels outside of CDnet-2014 regions of interest

4.5 Evaluation of Generalization Power

To assess the generalization or transfer learning capabilities of our proposed DeepFTSG and the contribution of the motion and change cues on unseen videos, we evaluated DeepFTSG-1 and DeepFTSG-2 trained only on CDnet-2014 and tested on the unseen SBI-2015 and LASIESTA videos. The DeepFTSG weights were frozen without additional training on any portion of the SBI-2015 or LASIESTA dataset.

We compare classical algorithms and recently proposed methods with our proposed DeepFTSG networks on unseen videos from SBI-2015, as shown in Table 7 and Table 8. The results of the other methods are taken from Mandal et al. (2021) and Kim and Ha (2020).

It can be observed from Table 7 that our proposed method provides better performance on the selected four video sequences (Candela, CAVIAR2, CaVignal, HighwayII) than the two classical methods, existing state-of-the-art methods, and recently proposed methods, even without retraining on the new scenes of the SBI-2015 dataset. More specifically, our proposed method achieves an overall 44% performance improvement over FgSegNet_v2 and a 24% improvement over the recently proposed 3DCD. Moreover, from Table 8 we can see that our method outperforms the two classical methods and is competitive, within 0.8%, of the recently proposed KimHa method (Kim & Ha, 2020). However, when comparing with KimHa, one must also consider that KimHa uses 61,593 images from CDnet-2014 for training while excluding the camera jitter, PTZ, thermal, and turbulence categories, whereas we use only 10,600 frames from CDnet-2014 for training (as stated in Sect. 4.2) without excluding any categories. To make a fair comparison with KimHa, we used the model referred to as M_KimHa (see Sect. 4.4), trained on the CDnet-2014 dataset, and evaluated it on the SBI-2015 dataset without any retraining. It can be observed from Table 8 that the performance of the KimHa method drops by around 18% (from KimHa (88%) to M_KimHa (70%)) when it is trained on the same CDnet-2014 video frames as DeepFTSG.

Table 7 Comparative F-Score performance on a subset of the SBI-2015 dataset, including PAWCS (St-Charles et al., 2016), SuBSENSE (St-Charles et al., 2015), FgSegNet-S (Lim & Keles, 2018), FgSegnet-M (Lim & Keles, 2018), FgSegnet_v2 (Lim & Keles, 2020), 3DCD (Mandal et al., 2021), KimHa (Kim & Ha, 2020), FTSG (Wang et al., 2014a)

To compare the performance of our proposed networks to the top-performing network FgSegNet_v2 on unseen data, we used the single-model FgSegNet_v2 network referred to as M_FgSegNet_v2 (described in Sect. 4.4), trained on the CDnet-2014 dataset, and evaluated it on the SBI-2015 dataset without any retraining. Table 8 shows that the top-performing method M_FgSegNet_v2 has the worst generalization performance on the unseen SBI-2015 data, with an F-measure of 54% for the single model and 35% when 11 models with majority voting are used. These results in Table 8 for both M_FgSegNet_v2 and DeepFTSG support the observation that transfer learning to new unseen videos is more effective, with better generalization capacity, when a single network is used, and improves further with DeepFTSG when multi-cue video streams capturing motion, change, and object appearance are fused in a unified network architecture. Fig. 8 shows a qualitative comparison of methods on sample frames from various categories of the SBI-2015 dataset.

Table 8 Comparative F-Score performance on the full SBI-2015 dataset, including methods PAWCS (St-Charles et al., 2016), SuBSENSE (St-Charles et al., 2015), KimHa (Kim & Ha, 2020), FTSG (Wang et al., 2014a); CAV1: CAVIAR1, CAV2: CAVIAR2, CaV: CaVignal, Cand: Candela, HAM: HallAndMonitor, HigI: HighwayI, HigII: HighwayII, HB2: HumanBody2, IBt2: IBMtest2
Fig. 8 Qualitative comparison of the proposed methods with top performing algorithms on sample frames from different categories of SBI-2015. Column left to right: input images, ground truth, SuBSENSE, M_KimHa, M_FgSegNet_v2, MU-Net2, DeepFTSG-1, and DeepFTSG-2. Row top to bottom: Candela_m1.10, CAVIAR1, HighwayI

Table 9 Comparative per-video F-Score performance on LASIESTA dataset, including methods Maddalena1 (Maddalena & Petrosino, 2008), Maddalena2 (Maddalena & Petrosino, 2012), Haines (Haines & Xiang, 2014), Cuevas (Daniel et al., 2018), FgSegNet-M (Lim & Keles, 2018), FgSegNet_v2 (Lim & Keles, 2020), 3DCD (Mandal et al., 2021), BSUV-Net2.0 (Tezcan et al., 2021), SuBSENSE (St-Charles et al., 2015), FTSG (Wang et al., 2014a)
Table 10 Comparative per-category F-Score performance on LASIESTA dataset with stationary camera video subset, including methods Maddalena1 (Maddalena & Petrosino, 2008), Maddalena2 (Maddalena & Petrosino, 2012), Haines (Haines & Xiang, 2014), Cuevas (Daniel et al., 2018), BSUV-Net2.0 (Tezcan et al., 2021), SuBSENSE (St-Charles et al., 2015), FTSG (Wang et al., 2014a)

To further compare the generalization power of the proposed method, we evaluate it on the unseen LASIESTA dataset. Table 9 (per-video) and Table 10 (per-category) compare DeepFTSG with the unseen-video performance of the algorithms reported in Mandal et al. (2021); Tezcan et al. (2021).

Table 11 F-Score comparison on LASIESTA dataset subset with moving camera and simulated motion sequences combined

It can be observed from both Table 9 and Table 10 that DeepFTSG achieves significantly better results than the state of the art on unseen videos from LASIESTA. To make a fair comparison with KimHa and FgSegNet_v2, we used the trained models referred to as M_KimHa and M_FgSegNet_v2 (stated in Sect. 4.4), which were trained on the CDnet-2014 dataset, and evaluated them on the LASIESTA dataset without any retraining. It can be observed from Table 9 that the performance of the KimHa method is comparable when it is trained on the same CDnet-2014 video sequences as DeepFTSG, but DeepFTSG-2 still outperforms it by 5%. The top-performing method M_FgSegNet_v2 has the worst generalization performance, with an F-measure of 51% on the unseen LASIESTA data. Moreover, DeepFTSG-2 outperforms BSUV-Net2.0 by 7%. In Table 9, BSUV-Net2.0*, DeepFTSG-1*, and DeepFTSG-2* show the results of using the scene-dependent assessment strategy explained in Sect. 4.6. From Table 10 it can be seen that DeepFTSG-2 outperforms BSUV-Net2.0 by 3% and M_KimHa by 9%. Many recently proposed methods, except BSUV-Net2.0, ignore the simulated-motion and moving-camera sequences for both indoor and outdoor settings. Since we run on every video of the LASIESTA dataset, we compare our performance on those videos with BSUV-Net2.0. The videos in the moving camera and simulated motion categories of the LASIESTA dataset were divided into four groups by the BSUV-Net2.0 authors (Tezcan et al., 2021), and performance was evaluated with three different versions of BSUV-Net2.0, as shown in Table 11. We also used the trained models referred to as M_KimHa and M_FgSegNet_v2 (stated in Sect. 4.4), which were trained on the CDnet-2014 dataset, and evaluated them on the LASIESTA dataset without any retraining. It can be seen from Table 11 that DeepFTSG outperforms all other methods in three video categories (Indoor pan & tilt, Indoor jitter, Outdoor jitter) and has a comparable result in the Outdoor pan & tilt category. DeepFTSG has lower results in the Outdoor pan & tilt category because in the Outdoor Moving Camera (OMC) sequences of the LASIESTA dataset the camera moves continually, so the spatiotemporal motion and change cues provide little useful information. The results of the generalization experiments show that the proposed DeepFTSG networks, which fuse appearance-based and spatiotemporal features using either early or middle fusion, are not specific to the CDnet-2014 dataset on which they were trained and can be very effective on other datasets as well. Figure 9 shows a qualitative comparison of methods on sample frames from various categories of the LASIESTA dataset.

Fig. 9 Qualitative comparison of the proposed methods with top performing algorithms on sample frames from different categories of LASIESTA. Column left to right: input images, ground truth, SuBSENSE, M_KimHa, M_FgSegNet_v2, DeepFTSG-1, and DeepFTSG-2. Row top to bottom: I_MC_02, I_OC_1, O_CL_01, O_MC_01

Fig. 10 Qualitative comparison on LaSOT (bounding box single object tracking as masks) of the proposed methods with top performing algorithms on sample frames from different video categories. Column left to right: input images, ground truth, SuBSENSE, M_KimHa, M_FgSegNet_v2, DeepFTSG-1, and DeepFTSG-2. Row top to bottom: bicycle/bicycle-12, car/car-11, dog/dog-8, giraffe/giraffe-16, person/person-10. The first row on each category represents the output mask from each method, and the second row is a bounding box obtained from the first-row mask

Our next generalization experiment uses the Large-scale Single Object Tracking (LaSOT) dataset (Fan et al., 2019). To make a fair comparison of our proposed and recent methods, including FgSegNet_v2, on the LaSOT dataset, we used models trained on the CDnet-2014 dataset and ran inference on five categories of the LaSOT dataset: bicycle, car, dog, giraffe, and person. We selected two video sequences from each category. These categories were chosen deliberately to see how the methods perform both on object types they are familiar with, such as car, person, and to some extent bicycle, and on ones they have never seen, such as dog and giraffe. Since the LaSOT ground truth consists of bounding boxes while the methods output segmentation masks, we convert the segmentation masks to bounding boxes and save both the LaSOT ground truth and the method outputs as binary images, where the foreground is the bounding box region and everything else is background (see Fig. 10). We use the same evaluation metrics stated in Sect. 4.3, in particular recall, precision, and F-Measure. It can be seen from Table 12 that DeepFTSG-2 outperforms all other methods with an F-Measure of 55%. We also observe that on the new categories, dog and giraffe, DeepFTSG outperforms all other methods, with a 74% F-Measure for dog and a 52% F-Measure for giraffe. Since all of these LaSOT video sequences involve a moving camera, many methods relying on background subtraction techniques, such as SuBSENSE and M_KimHa, fail to detect the objects properly; hence M_FgSegNet_v2 performs better than those methods, with an F-Measure of 53.6%. This evaluation protocol is only approximate, since converting the segmentation mask to a bounding box introduces many false-positive pixels. However, this strategy lets us test the proposed methods on a different dataset type to assess their generalization capabilities.
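A sketch of the mask-to-bounding-box conversion used in this LaSOT evaluation: each connected foreground component of the segmentation mask is replaced by its filled bounding box, producing a binary box image comparable to the box-style ground truth (assumes the OpenCV 4.x findContours signature).

```python
import cv2
import numpy as np

def mask_to_box_image(mask):
    """Convert a binary segmentation mask to a binary bounding-box image."""
    box_img = np.zeros_like(mask)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        box_img[y:y + h, x:x + w] = 255     # fill the box region as foreground
    return box_img
```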

Table 12 Detailed evaluation of DeepFTSG with other methods on LaSOT dataset using bounding boxes; Cat: Category, SuB: SuBSENSE, UNet: M_KimHa, FgSeg: M_FgSegNet_v2, DF-1: DeepFTSG-1, DF-2: DeepFTSG-2

The proposed DeepFTSG network uses a generalized multi-stream architecture that can be readily extended to support additional multimodal stream cues with varying fusion stages. To demonstrate this, we ran an additional experiment in which we extended DeepFTSG-2 with an additional stream carrying infrared information and named the result DeepFTSG-3. Instead of two streams, DeepFTSG-3 has three streams: the first stream input is an RGB frame (VIS); the second stream is the infrared (IR) information of that frame, for which we used an SE-ResNet-50 backbone; and the third stream is a combination of BGS and flux for both the RGB and infrared cues, giving the third stream four input channels (1st channel: BGS of the RGB frame, 2nd channel: flux of the RGB frame, 3rd channel: BGS of the infrared frame, 4th channel: flux of the infrared frame). Infrared imaging captures the non-visible heat radiation emitted or reflected by objects regardless of lighting conditions. Hence, infrared information provides a distinct advantage in challenging conditions such as low light, night-time, shadows, visual obstructions, degraded visual environments, and camouflaging foliage. Figure 11 illustrates the generalization of the DeepFTSG network architecture to support scalable multi-stream learning and inference. The 2- and 3-stream architectures correspond to DeepFTSG-2 and DeepFTSG-3. The 4-stream network is a future extension to support optical flow or stereo-based depth information as additional streams.
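For illustration, the four-channel input of the third DeepFTSG-3 stream can be assembled as below, assuming the per-frame BGS and flux masks have already been computed for both the visible and infrared frames; the function name is illustrative.

```python
import numpy as np

def build_motion_stream(bgs_rgb, flux_rgb, bgs_ir, flux_ir):
    """Stack the 4-channel motion/change input for the third DeepFTSG-3 stream.

    Channel order follows the text: (1) BGS of the RGB frame, (2) flux of the
    RGB frame, (3) BGS of the IR frame, (4) flux of the IR frame.
    """
    return np.stack([bgs_rgb, flux_rgb, bgs_ir, flux_ir], axis=0).astype(np.float32) / 255.0
```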

Fig. 11 Schematic view of how the DeepFTSG USE-Net trellis network architecture can be extended to support multi-stream architectures in a scalable way. The 2-stream network architecture is DeepFTSG-2 with RGB appearance and 2-channel motion streams. The middle diagram shows a sample 3-stream network DeepFTSG-3 incorporating RGB-Appearance (blue color), Infrared-Appearance (red color), and RGB+IR 4-channel motion (green color) encoder streams with a single decoder stream (gray color) for fusion (Color figure online)

Table 13 Comparison of FTSG, DeepFTSG-2 and DeepFTSG-3 on four unseen 4-channel (RGB+IR) video sequences from GTFD (2) and FPSS (2) dataset

For this experiment, we used the Grayscale-Thermal Foreground Detection (GTFD) dataset (Li et al., 2017), which includes 25 aligned grayscale-thermal video pairs with high diversity and provides segmentation masks as ground truth. We used 21 video sequences to train and 2 to test our proposed networks DeepFTSG-2 and DeepFTSG-3, following the same training strategy explained in Sect. 4.2, except that we used 852 frames for training and 46 frames for validation. For DeepFTSG-2, we used appearance and motion cues as stream inputs, and for DeepFTSG-3, we used appearance, infrared, and motion cues as stream inputs. We used two unseen video sequences (movingClouds, pedestrian7) from the GTFD dataset that were not included in the training to evaluate the models; note that GTFD uses subsets of 24 and 30 frames, respectively, of these two sequences, which originate from the OSU dataset. The same evaluation metrics described in Sect. 4.3 were used.

Fig. 12
figure 12

Qualitative results of DeepFTSG-2 and DeepFTSG-3 on two unseen video sequences of the GTFD (OSU videos) dataset

Fig. 13
figure 13

Spatio-temporal (\(x-y-t\)) volumes of a video sequence from the GTFD (OSU video) dataset for the VIS (Row 1) and IR (Row 2) channels. a original input, b flux motion, c combination of change and motion (green \(\leftarrow \) flux, red \(\leftarrow \) change) (Color figure online)

Table 13 shows the results of DeepFTSG-2 and DeepFTSG-3 on two unseen video sequences of the GTFD dataset. Adding infrared appearance along with motion cues computed from infrared improves the F-Measure from 62.4% (DeepFTSG-2) to 76.9% (DeepFTSG-3), an improvement of around 14.5 percentage points. Figure 12 shows qualitative results of DeepFTSG-2 and DeepFTSG-3 on the unseen GTFD sequences; note that in the ground-truth segmentation image only the people moving within the short video segment are marked. Figure 13 shows spatio-temporal volumes of a GTFD (OSU) video sequence to visualize motion and change through time in both the VIS and IR channels.
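For reference, the F-Measure figures above follow the standard pixel-level definitions; a minimal sketch is given below, assuming boolean foreground masks and an optional ignore mask. The exact evaluation scripts used for Table 13 may differ in implementation details.

```python
import numpy as np

def f_measure(pred, gt, ignore=None):
    """Pixel-level precision, recall, and F-Measure for binary masks.

    pred, gt: boolean arrays (True = foreground); ignore: optional boolean
    array of pixels excluded from scoring. Standard definitions only.
    """
    if ignore is not None:
        valid = ~ignore
        pred, gt = pred[valid], gt[valid]
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f
```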

Fig. 14
figure 14

Qualitative results of DeepFTSG-2 and DeepFTSG-3 on two new video sequences of the FPSS dataset

Table 14 Comparative F-Score performance using scene dependent assessment (SDA) training strategy on SBI-2015 dataset, including methods FgSegNet-S (Lim & Keles, 2018), FgSegNet-M (Lim & Keles, 2018), FgSegNet_v2 (Lim & Keles, 2020), 3DCD (Mandal et al., 2021); Cand: Candela, CAV1: CAVIAR1, CAV2: CAVIAR2, CaV: CaVignal, Fol: Foliage, HAM: HallAndMonitor, HigI: HighwayI, HigII: HighwayII, HB2: HumanBody2, IBt2: IBMtest2, PAF: PeopleAndFoliage, Snel: Snellen

To further test the generalization ability of DeepFTSG-3, we ran another experiment using the Force Protection Surveillance System (FPSS) dataset (Chan, 2009). The FPSS dataset consists of 53 pairs of color and FLIR video sequences collected at the Adelphi Laboratory Center (ALC) of ARL between Nov 2004 and Jan 2005. All video sequences consist of 640 \(\times \) 480 pixel images collected using a thermal vision sentry personnel observation device (POD) manufactured by FLIR Systems. The primary moving objects are people and vehicles. Classes and centroids of the moving objects in the video sequences are provided with the dataset. Using those centroids, we manually created bounding boxes for each moving object as ground truth for the first two sequences (rf20041120_161701fc, rf20041216_143701fc). To test generalization, we took DeepFTSG-2 and DeepFTSG-3 trained on the GTFD dataset and evaluated them, without any additional training, on these two FPSS sequences with our manually created bounding boxes.

Table 13 shows the results of DeepFTSG-2 and DeepFTSG-3 on the two new video sequences of the FPSS dataset. Adding infrared information along with motion cues from infrared improves the F-Measure from 51.8% (DeepFTSG-2) to 54.2% (DeepFTSG-3), an improvement of around 2.4 percentage points. Figure 14 shows qualitative results of DeepFTSG-2 and DeepFTSG-3 on the new FPSS sequences. This experiment demonstrates the extensibility of the proposed DeepFTSG network to support additional cue streams.

4.6 Scene Dependent Assessment

Recently proposed methods are also evaluated using a scene dependent assessment (SDA) strategy, where some frames from the test videos are also used to train (fine-tune or transfer-learn) the deep network. This is consistent with current approaches that train one network per video, one per class of similar videos, or one per video collection. However, such networks need to be trained for each real-world scenario and are often brittle to environmental changes. Moreover, the SDA approach is not ideal for evaluating the generalization capacity of deep learning models to detect and segment motion across different temporal scales. Nevertheless, to make a fair comparative analysis of our proposed networks with existing deep learning methods, we also performed experiments following the SDA approach. Using the SDA strategy, our proposed methods are trained in the same way as the 3DCD method (Mandal et al., 2021): for both the SBI-2015 and LASIESTA datasets, training is performed on 50% of the frames (2380 frames for SBI-2015, 4010 frames for LASIESTA) and evaluation is performed on the complete 100% of the frames.
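A minimal sketch of this SDA protocol for a single sequence is shown below; it assumes a random 50% sample of the frames, whereas the precise frame-selection rule of the 3DCD protocol (random sample versus a fixed half) is not reproduced here.

```python
import random

def sda_split(frame_ids, train_fraction=0.5, seed=0):
    """Scene-dependent assessment split for one sequence (illustrative sketch).

    A fraction of the sequence's frames is sampled for training/fine-tuning,
    while evaluation still uses every frame of the sequence.
    """
    rng = random.Random(seed)
    n_train = int(len(frame_ids) * train_fraction)
    train_ids = sorted(rng.sample(list(frame_ids), n_train))
    eval_ids = list(frame_ids)   # 100% of frames are used for evaluation
    return train_ids, eval_ids

# Example: a 100-frame sequence -> 50 training frames, 100 evaluation frames
# train_ids, eval_ids = sda_split(range(100))
```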

Table 15 Comparative F-Score performance of SDA strategy on LASIESTA dataset, including methods FgSegNet-S (Lim & Keles, 2018), FgSegNet-M (Lim & Keles, 2018), FgSegNet_v2 (Lim & Keles, 2020), 3DCD (Mandal et al., 2021)
Table 16 Ablation study using different inputs and losses for DeepFTSG-1 on different datasets
Table 17 Ablation study using different inputs, losses and backbones for DeepFTSG-2 on different datasets

The accuracy on the SBI-2015 dataset using the SDA training strategy is shown in Table 14. The overall F-Measure of the proposed methods is 97%, which is almost 30 percentage points higher than the recently proposed 3DCD method (68%) and 36 points higher than the final version of FgSegNet (61%).

The comparison of the proposed methods with existing methods in terms of average F-Measure for each video category of LASIESTA, except the simulated motion and moving camera sequences, is shown in Table 15. From the quantitative analysis in Table 15, it can be seen that the proposed DeepFTSG outperforms the other methods in all ten categories of the LASIESTA dataset. Moreover, the DeepFTSG performance is 63 percentage points higher than FgSegNet_v2 (36%) and 14 points higher than 3DCD (85%). From Table 9, it can be observed that DeepFTSG also outperforms BSUV-Net2.0 (92%) by 7 points in overall F-Measure. The results of the other methods are taken from the 3DCD paper (Mandal et al., 2021), which uses the same SDA strategy for training and evaluation as the proposed methods. These experiments demonstrate that the proposed DeepFTSG networks are not specific to the CDnet-2014 dataset and can be very effective on other datasets.

Fig. 15
figure 15

Sample results and comparative analysis of detection performance. Rows top to bottom: input images, ground truth masks, change mask, flux motion, flux motion (green channel) and change (red channel), DeepFTSG-1, and DeepFTSG-2 masks. Columns left to right: first three columns from CDnet-2014 (nightVideos/streetCornerAtNight, lowFramerate/turnpike, turbulence/turbulence3); the next three columns from SBI-2015 (CAVIAR1, CaVignal, HighwayI); the last three from LASIESTA (IndoorBootstrap (I_BS_01), IndoorMovingCamera (I_MC_01), OutdoorSimulatedMotion (O_SM_04)) (Color figure online)

4.7 Ablation Study

To understand the impact of fusing appearance-based features with spatiotemporal features in the proposed DeepFTSG for moving object detection, we performed an ablation study with different combinations of appearance-based and spatiotemporal features. Table 16 presents the ablation study for DeepFTSG-1, varying the inputs and whether the ignore region is excluded from the loss function. None of the experiments show a significant difference on the CDnet-2014 dataset, because all of them include CDnet-2014 sequences for training and validation of the network; to better assess how each piece of information improves the network, we therefore look at the results on the unseen SBI-2015 and LASIESTA datasets. From Experiment 1, it can be observed that with appearance-based information alone we obtain an F-Score of 67.6% for SBI-2015 and 54% for LASIESTA, even though the F-Score for CDnet-2014 is almost 96%. In Experiments 2 and 3, we fused spatiotemporal information with appearance-based information, separating the motion and change cues to observe their individual impact: in Experiment 2 only the change cue (BGS) was fused with appearance, and in Experiment 3 only the motion cue (flux). Fusing only the change cue increases the accuracy on the unseen SBI-2015 by almost 19 percentage points and on LASIESTA by 27 points. Fusing only the motion cue also increases accuracy, but not as much as the change cue. Fusing both motion and change cues with appearance-based information (Experiment 4) further improves accuracy by almost 1 point with respect to Experiment 2. Finally, not penalizing regions labeled as the ignore mask in the CDnet-2014 ground truth in the loss function slightly improves accuracy on unseen video, as demonstrated by Experiment 5 in Table 16 (2 points on LASIESTA).
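A minimal sketch of excluding the ignore regions from the loss is given below, using a masked binary cross-entropy as an example; the other losses listed in Table 18 (e.g., Tversky, Dice, or Focal variants) would be masked in the same way. The function name and reduction scheme are illustrative assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(logits, target, ignore_mask):
    """Binary cross-entropy that skips pixels inside the CDnet-2014 ignore regions.

    logits, target, ignore_mask: tensors of shape (N, 1, H, W);
    ignore_mask is 1 where the ground truth is labeled as ignore / non-ROI.
    """
    valid = 1.0 - ignore_mask.float()
    per_pixel = F.binary_cross_entropy_with_logits(
        logits, target.float(), reduction="none")
    # Average only over valid (non-ignored) pixels
    return (per_pixel * valid).sum() / valid.sum().clamp(min=1.0)
```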

The ablation study for DeepFTSG-2, varying the inputs, the backbone of the second stream, and whether the ignore region is excluded from the loss function, is shown in Table 17. The difference between Experiments 1 and 2 is the backbone used for the second stream: using ResNet-18 instead of SE-ResNet-50 for spatiotemporal feature extraction improved accuracy on CDnet-2014 by almost 2.5 percentage points, with no significant change on unseen videos. Conducting the same experiments but not penalizing the ignore-mask regions of CDnet-2014 in the loss function (Experiments 2 and 6) increased accuracy by nearly 1 point on CDnet-2014 and significantly more on unseen videos (5.8 and 3.4 points for SBI-2015 and LASIESTA, respectively). To understand how the motion and change cues affect accuracy, we ran two more experiments: in Experiment 4 only the change cue (BGS) is given to the second stream, and in Experiment 5 only the motion cue (flux). Although there was no significant change in accuracy on CDnet-2014, using the motion cue increases accuracy on the unseen SBI-2015 by more than 2 points and on LASIESTA by almost 14 points, compared with using the change cue (Experiment 4). Using both motion and change cues in the second stream (Experiment 6) improves accuracy on unseen videos by almost 5.5 points on SBI-2015 and almost 14 points on LASIESTA, compared with using only the motion cue (Experiment 5). The best accuracy on both seen and unseen datasets is achieved in Experiment 6, with appearance-based information in the first stream, motion and change cues in the second stream, a ResNet-18 backbone for spatiotemporal feature extraction, and no penalty on the ignore regions in the loss function during training and validation: 97% on CDnet-2014, 87.2% on the unseen SBI-2015, and 85.1% on the unseen LASIESTA. These experiments show that motion and change cues, when provided along with appearance-based features, significantly improve the performance of moving object detection on unseen videos, and that the best results are achieved when the cues are fused rather than used separately. Qualitative results for one frame in 9 videos across the three datasets are shown in Fig. 15, illustrating the improved accuracy of the proposed DeepFTSG networks. A video demo of the DeepFTSG-2 resultsFootnote 3 on CDnet-2014, SBI-2015, and LASIESTA is available, where SBI-2015 and LASIESTA are completely unseen datasets used to assess generalization.

5 Conclusions

We developed DeepFTSG, a deep convolutional neural network, to robustly detect moving objects in videos. DeepFTSG consists of a novel U-Net encoder-decoder structure that integrates object appearance cues with hand-crafted motion and change cues, using early or middle fusion of single or multiple streams. Unsupervised tensor-based motion estimation and unsupervised mixture-of-Gaussians background subtraction cues are used as part of the input stream to DeepFTSG, incorporating intrinsic temporal dynamics for accurate change detection. Decoupling pixel-level motion and change estimation from the network and assigning them to hand-crafted methods greatly reduces network complexity, training time, and, most importantly, the amount of training data. DeepFTSG can learn object-level modeling, spatiotemporal fusion, and semantic change analysis using just 200 frames per video from the CDnet-2014 collection of 53 video sequences. The performance difference between DeepFTSG and FTSG helps to quantify the benefit of appearance information and visual cue fusion: DeepFTSG, which incorporates learned object appearance models along with a supervised fusion of visual motion cues, improves accuracy by about 25% over FTSG, which uses only motion cues without supervised learning. Using motion and change cues together with appearance yields accurate detection on unseen video sequences when all cues are fused, rather than fusing appearance with motion or change cues separately. The decoupled structure of DeepFTSG improves the adaptability of the proposed system to new domains using transfer learning. Compared to the top-ranking FgSegNet_v2 and recently proposed methods such as 3DCD, KimHa, and BSUV-Net2.0, the DeepFTSG multi-cue, multi-stream network produces considerably more accurate detection on unseen video sequences. The proposed DeepFTSG network uses a generalized multi-stream architecture that can be readily extended to support additional cue streams with varying fusion stages. Like ventral visual stream processing in the human visual system, DeepFTSG is a multi-stream, multi-cue fusion framework that can be generalized to more than two streams, for example by incorporating infrared, depth, and optical flow streams.

Table 18 Ablation study to evaluate the contribution of each component in the proposed method; U-Net*: Kim & Ha method, Gr: RGB converted to grayscale, SuB: SuBSENSE background model, T: time, st: stream, ER: Early, MD: Middle, RN: ResNet, ICL: Input Cross Link, FPM: Feature Pooling Module, TWE: Tversky Loss, DICE: Dice Loss, BCE: Binary Cross Entropy Loss, FL: Focal Loss, CD: CDnet-2014, SBI: SBI-2015, LAS: LASIESTA, F: F-Measure