Global video object segmentation with spatial constraint module

We present a lightweight and efficient semi-supervised video object segmentation network based on the space-time memory framework. To some extent, our method solves the two difficulties encountered in traditional video object segmentation: one is that the single frame calculation time is too long, and the other is that the current frame's segmentation should use more information from past frames. The algorithm uses a global context (GC) module to achieve high-performance, real-time segmentation. The GC module can effectively integrate multi-frame image information without increased memory and can process each frame in real time. Moreover, the prediction mask of the previous frame is helpful for the segmentation of the current frame, so we input it into a spatial constraint module (SCM), which constrains the areas of segments in the current frame. The SCM effectively alleviates mismatching of similar targets yet consumes few additional resources. We added a refinement module to the decoder to improve boundary segmentation. Our model achieves state-of-the-art results on various datasets, scoring 80.1% on YouTube-VOS 2018 and a J&F score of 78.0% on DAVIS 2017, while taking 0.05 s per frame on the DAVIS 2016 validation dataset.


Introduction
Video object segmentation, which aims to draw a detailed object mask on video frames, is widely applicable to various fields such as autopilots, video editing, and video synthesis. It originates from video object tracking [1]. Approaches can be divided into unsupervised methods [2,3] that input only the video, and semi-supervised methods [4][5][6][7] that require a user to provide initial labels. In our work, we consider the second approach. The reason for doing so is that defining what constitutes an "interesting object" is often application-specific, and the same video could have multiple valid solutions. Thus, cues regarding which objects are of interest can be concretely indicated by labels specifying this on a few key frames.
Existing deep learning-based algorithms for semi-supervised video object segmentation can be classified as propagation-based methods, matching-based methods, hybrid methods, and space-time memory based methods. Propagation-based methods [4,8,9,10,11] utilize the target's temporal coherence, and rely on the mask from previous frames. For example, MaskTrack [11] combines the segmentation mask of the previous frame with the current frame to form the mask of the current frame. However, these methods suffer from occlusion problems and error drift. Matching-based methods [5,12,13,14] use the first frame of a given video as a reference frame and detect the segmented object independently in each frame. These methods are more robust and reduce the impact of occlusion, but do not take full advantage of spatiotemporal information. Hybrid methods [6,15,16,17] combine the two approaches, employing both the previous frame and the first frame to segment the current frame, to gain the advantages of each. Accordingly, some hybrid algorithms improve on the former two classes in both performance and accuracy.
Since hybrid methods can significantly improve video target segmentation, it is natural to ask whether we can use more frames to learn richer contextual information. A recent paper uses this idea in the design of a new Space-Time Memory (STM) network [7,18,19]. In order to use information from more frames, STM stores key-value pairs extracted from past frames into a memory pool and then matches information extracted from the current frame with the information in the memory pool, at the pixel level. This algorithm has better robustness and good segmentation performance, even in the case of occlusion and appearance variation.
Although STM-based methods achieve state-of-the-art precision, they suffer from excessive memory consumption, especially on long videos. When the STM module learns new information from a new frame, the module adds it to the memory. Over time, more and more memory is used, which may even result in memory exhaustion. To mitigate this problem, the authors reduce the number of frames read and update the memory only every five frames. However, memory use still increases linearly over time, and this solution does not make the best use of the information in each frame.
In our work, inspired by Ref. [7], we employ a global context module (see Fig. 1) that retrieves the segmentation information in a more efficient way. As the learned video frames advance, the module automatically updates the information. Unlike the linear memory growth of Ref. [18], the size of the global context module is fixed and does not increase over time. There is no chance of memory exhaustion, and we can learn variations in the object through time.
When similar objects enter the field of view, the model sometimes makes incorrect predictions. Furthermore, the model performs poorly when the shape of the object changes dramatically. For these problems, we employ a spatial constraint module, inspired by Ref. [20] (see Fig. 1). It uses a mask from the previous image as a rough constraint to guide the model in removing confusing instances of similar appearance. In addition, we use Atrous Spatial Pyramid Pooling (ASPP) [21] modules to handle scale changes in the video. Finally, we use a refinement module [22][23][24][25] after the decoder to further improve the segmentation results near the target boundary.
Our solution has three key steps: (i) Context extraction: extract each frame's information into a fixed-size updater (see Eqs. (2) and (3)). (ii) Context distribution: match the current frame's semantic information with that in the updater at pixel level (see Eq. (4)). (iii) Spatial constraint enforcement: the mask of the previous frame is input into the spatial constraint module (see Eq. (5)).

Detection-based methods
Detection-based methods rely on fine-tuning using the first-frame ground truth. They assume that a powerful frame-level target detector can be constructed, which can segment the video frame by frame. OSVOS [12] is a representative algorithm, which uses a pre-trained convolutional network for foreground-background segmentation, and first-frame ground truth for fine-tuning. OnAVOS [13] and OSVOS-S [5] introduce an online adaptation mechanism based on OSVOS. PML [26] proposes an embedding network with triplet loss and a nearest neighbour classifier. Most detection-based methods require online training, and the fine-tuning time makes them incapable of providing real-time results. A model fine-tuned on the first frame is more robust to occlusion and similar problems. However, because temporal information is discarded, such methods can fail if the shape of the target changes so drastically that the detector cannot recognize it.

Propagation-based methods
Propagation-based methods rely on the temporal coherence of the video, since most videos are smoothly varying. Thus we only need an adjustment to the mask of the previous frame to get the mask for the current frame. MaskTrack [11] is a typical propagation-based approach that inputs the mask of the previous frame and the current frame into the model, and outputs the mask of the current frame. Lucid [27] extends this method by introducing an elaborate data augmentation mechanism. Joint-task [2] and learning-correspondence [28] approaches first learn a visual representation, and then use KNN [29] to train the model to learn a feature mapping representation to perform cycle consistency tracking. The advantage of this approach is that it can overcome rapid large changes in appearance, but it cannot overcome occlusion, drift, and other problems.

STM
STM [18] is a semi-supervised video object segmentation method. Although traditional propagation-based methods and matching-based methods achieve good results, they still do not use as many frames as possible in the video sequence, so much semantic information is lost. Inspired by the non-local method of Ref. [30], STM uses a novel attention module, which allows multiple frames from the video to pass through the encoder module, stores the information in the memory module, and then matches the information in the current frame with the information in the memory module, at the pixel level, to determine whether each pixel belongs to the foreground object. It does not need to limit the number of frames.
As STM does not rely on the assumption of video smoothness when learning spatial semantic information between distant pixels, it is possible to train the network first with static pictures with masks. Previous work [7,19] has also used this strategy to generate three-frame composite video clips by applying random affine transformations to static pictures with different parameters. We use image datasets annotated with object masks to train our network, and by doing so, we can produce a model that is robust to a variety of object appearance and category transformations.
STM also has various drawbacks, such as incorrect matching to similar-looking objects, imprecise edge processing for target objects, and poor segmentation quality when the object appearance changes too much. There are thus many improvement schemes for STM. For example, to alleviate mismatching, KMN [19] improves STM's memory reading module by using a 2D Gaussian kernel. AFB-URR [31] reduces memory consumption. STCN [32] and LCM [33] target improved segmentation accuracy. RMNet [34] uses optical flow [35,36] to constrain the extent of the segmentation.
Because memory capacity is limited, continually adding information to the memory module will lead to memory exhaustion. STM addresses this by saving image information only every five frames, but this violates the original intent of matching all frames before the current frame, one by one. To alleviate the increasing memory consumption during STM usage, we employ a global context module. Every time new frames are read, the global context module automatically updates its content without increasing resource consumption.
To sum up, the advantage of STM-based methods is that their network model elegantly uses as many frames as possible, and learns more context information than traditional methods, thus accurately predicting the mask of the current frame.

Approach
In this section, we introduce a new efficient video object segmentation (VOS) framework based on STM methods. We first give an overview of our framework in Section 3.1. In Section 3.2, we describe the principle of operation of the global context module, and then we introduce the spatial constraint module in Section 3.3. Finally, we describe the boundary-aware refinement module in Section 3.4.
The encoders are used to generate features at H × W resolution with C channels. The GC module has two functions: context extraction and updating, and context distribution. First, we use the memory encoder to extract semantic information from previous frames and their masks, and put it into a fixed-size updater. Next, we use the query encoder to encode the current frame to get a local feature embedding. We match the local features of the current frame with those in the updater at pixel level, and then use an atrous spatial pyramid pooling module to get richer semantic information. The feature map is then spatially constrained to the target object through the spatial constraint module (SCM) to reduce errors due to similar objects. Finally, our prediction map is obtained through the decoder via a boundary-aware refinement module (BAM).

STM versus GCM
Many recent VOS methods use attention mechanisms, with encouraging results. As a formulation, we may define the query embedding of the current frame as $Q_r \in \mathbb{R}^{HW \times C}$, the key embedding of the memory frames as $K_y \in \mathbb{R}^{THW \times C}$, and the value embedding of the memory frames as $V_l \in \mathbb{R}^{THW \times C}$, where $H$, $W$, $C$, and $T$ denote height, width, number of channels, and temporal extent, respectively. Space-time memory propagation is formulated as
$$\mathrm{STM}(Q_r, K_y, V_l) = \mathrm{CorF}(Q_r, K_y)\, V_l, \qquad \mathrm{CorF}(Q_r, K_y) = \mathrm{softmax}\big(Q_r K_y^{\mathrm{tr}}\big) \tag{1}$$
where a distribution map is computed by the correlation function CorF. After multiplying $Q_r$ and $K_y^{\mathrm{tr}}$, softmax is applied to the resulting feature map, converting its values to the range [0, 1], and then the value embedding $V_l$ is propagated into each location of the current frame.
In the STM, the key-value pair vectors for each frame are stored in the memory module. As time advances, the number of video frames increases, and these vectors are concatenated, so K y and V l become larger and larger: computing STM requires more effort with greater video resolution or video duration.
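The memory read described above can be illustrated with a minimal pure-Python sketch. It assumes a plain row-wise softmax over the query-key products; the helper names (`softmax`, `matmul`, `stm_read`) and the toy matrices are hypothetical and are not taken from the STM implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def stm_read(query, keys, values):
    """query: HW x C, keys: THW x C, values: THW x C_v -> HW x C_v."""
    scores = matmul(query, transpose(keys))     # HW x THW correlation map
    weights = [softmax(row) for row in scores]  # each row lies in [0, 1] and sums to 1
    return matmul(weights, values)              # propagate values to each query pixel

# Toy sizes (assumed): HW = 2 query pixels, THW = 3 memory pixels, C = 2.
query = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
read = stm_read(query, keys, values)
```

Note that `keys` and `values` gain HW rows for every stored frame, so both the score matrix and the cost of the read grow with T.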
In order to overcome the problem of excessive consumption of system resources, we employ the global context module, which works differently from the STM. The global context module automatically updates the information, without increasing its size, while having almost the same representation ability as STM.

Context extraction and update
The global context module evolved from the STM module, so their architectures are very similar. We have two different encoders. The memory encoder encodes previous frames and their masks to generate the keys and values, of size $H \times W \times C_N$ and $H \times W \times C_M$ respectively, where $C_N$ and $C_M$ are the numbers of channels used. The query encoder encodes the current frame, generating queries and values.
STM keeps concatenating keys and values, so its memory pool gets ever bigger. The innovation of the GCM over STM is to combine keys and values generated by past frames into a fixed-size updater, and to update data automatically as new frames arrive. We call the step doing this the global summary step.
In this step, the STM method treats the key-value pair vectors generated by the encoder as $H \times W$ locations, where each location is a vector of $C_N$ ($C_M$) dimensions, while the GCM treats the key-value pairs as $C_N$ ($C_M$) one-channel feature maps and then considers them as several weight matrices related to the key-value pair vectors. The GCM first computes the context matrix of the current frame from the key-value pair vectors generated by the context extraction process, using
$$F_t = K_y(E_t)^{\mathrm{tr}}\, V_l(E_t) \tag{2}$$
where $E_t$ denotes the output of the encoder at time $t$, $F_t$ is the feature matrix of this frame, and $K_y$, $V_l$ are functions that generate keys and values respectively. We then include the resulting information in $F_t$ in the global context matrix. Since the matrix has fixed size, we do so without using additional resources. The update for the global context module is performed as Eq. (3):
$$U_t = \frac{t-1}{t}\, U_{t-1} + \frac{1}{t}\, F_t \tag{3}$$
where $U_t$ denotes the global context module. The weights ensure that each $F_p$ for $1 \le p \le t$ contributes equally to $U_t$.
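The extraction and update steps can be sketched as follows. This is a minimal illustration that assumes the per-frame context matrix is the product of the transposed keys with the values, and that the update is a running average, which is what the equal-contribution property requires. The function names and toy key/value data are hypothetical.

```python
def key_value_product(keys, values):
    """Return keys^T @ values for keys (HW x C_N) and values (HW x C_M)."""
    c_n, c_m = len(keys[0]), len(values[0])
    out = [[0.0] * c_m for _ in range(c_n)]
    for k_row, v_row in zip(keys, values):
        for i, k in enumerate(k_row):
            for j, v in enumerate(v_row):
                out[i][j] += k * v
    return out

def update_context(context, frame_matrix, t):
    """Running average over frames, with t counted from 1."""
    return [[(t - 1) / t * u + f / t for u, f in zip(u_row, f_row)]
            for u_row, f_row in zip(context, frame_matrix)]

# Three toy frames (assumed): HW = 2 pixels, C_N = 2 key channels, C_M = 3 value channels.
frames = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    ([[2.0, 1.0], [1.0, 0.0]], [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]),
    ([[0.0, 3.0], [1.0, 1.0]], [[2.0, 2.0, 2.0], [0.0, 0.0, 3.0]]),
]
context = [[0.0] * 3 for _ in range(2)]  # fixed C_N x C_M storage
frame_matrices = []
for t, (keys, values) in enumerate(frames, start=1):
    f_t = key_value_product(keys, values)
    frame_matrices.append(f_t)
    context = update_context(context, f_t, t)
```

After any number of frames, `context` keeps its $C_N \times C_M$ shape and equals the plain mean of the per-frame matrices, so no frame is weighted more heavily than another.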

Context distribution
We match the query and value information extracted from the current frame to the information stored in the global context module at the pixel level, which we call context distribution. In this process, we multiply the query of size $H \times W \times C_N$ by the global context matrix $U_t$, which has size $C_N \times C_M$, to get a matrix of size $H \times W \times C_M$, and then concatenate this matrix with the value produced by the current frame to get the output of the GC module. This may be written:
$$I_t = Q_r(E_t)\, U_t \tag{4}$$
where $I_t$ represents the distributed global features for frame $t$, and $Q_r$ is the function generating the queries.
The global context module summarizes the areas of semantic interest in the query position of the current frame for context features in past frames. The STM does this by first identifying such areas by query-key matching, then summarizing their values by weighted sum. The GCM achieves the same goal more effectively as the global context vector is already a global summary of all previously semantically similar regions in the framework. Query location only needs to determine the appropriate weight of the global context vector to generate a vector that summarizes all regions of interest.
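A small worked example may make the equivalence concrete. The sketch below assumes unnormalized matching (no softmax) and uses hypothetical integer data; it checks that multiplying the query by a pre-computed context matrix gives the same result as first matching the query against every key and then summing the values.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Toy memory (assumed): 3 pixels, C_N = 2 key channels, C_M = 2 value channels.
keys = [[1, 0], [0, 2], [1, 1]]
values = [[1, 2], [0, 1], [3, 0]]
# Toy current frame: 2 query pixels with their own values.
query = [[1, 1], [2, 0]]
current_values = [[5, 6], [7, 8]]

# Global summary: a fixed C_N x C_M matrix, independent of the number of memory pixels.
context = matmul(transpose(keys), values)

# Context distribution: each query pixel weights the rows of the context matrix.
distributed = matmul(query, context)

# Pairwise route for comparison: query-key matching followed by a weighted sum of values.
pairwise = matmul(matmul(query, transpose(keys)), values)

# Output of the module: distributed features concatenated with current-frame values.
output = [d + v for d, v in zip(distributed, current_values)]
```

Both routes give the same features by associativity of matrix multiplication, but the first never forms the pixel-by-pixel matching matrix.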

Spatial constraint module
We employ a spatial constraint module (SCM, see Fig. 4) to ensure spatial consistency between adjacent frames, and reduce error due to similarity of appearance, avoiding false predictions caused by similar instances of the same category. The prediction mask of the previous frame is a 0-1 mask of shape $H \times W$, which is concatenated with the current frame embedding ($H \times W \times C$) to obtain a feature map of shape $H \times W \times (C + 1)$. A convolution layer with a 3 × 3 kernel and a sigmoid function are used to generate a spatial prior, which is a gate map of shape $H \times W$. The prior is multiplied by the current frame embedding. The SCM can be expressed as
$$\mathrm{SCM}(E_T, P_{T-1}) = E_T \otimes \sigma\big(f_n(E_T \oplus P_{T-1})\big) \tag{5}$$
where $E_T$ represents the encoder feature map of frame $T$, $P_{T-1}$ represents the predicted mask of the previous frame, $f_n$ denotes the convolution function, $\sigma$ is the sigmoid function, and $\oplus$ and $\otimes$ represent concatenation and elementwise product, respectively. Example attention maps generated by the SCM are shown in Fig. 5.
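The gating operation can be sketched in a few lines of pure Python. This is only an illustration: the 3 × 3 kernel and bias below are set by hand (in the network they would be learned), zero padding at the borders is assumed, and the name `spatial_constraint` is hypothetical.

```python
import math

def spatial_constraint(embedding, prev_mask, kernel, bias=0.0):
    """Gate an H x W x C embedding with a prior built from the previous mask."""
    h, w = len(embedding), len(embedding[0])
    # Concatenate the mask as an extra channel: H x W x (C + 1).
    stacked = [[embedding[r][c] + [float(prev_mask[r][c])] for c in range(w)]
               for r in range(h)]
    gate = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = bias
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:  # zero padding at borders
                        taps = kernel[dr + 1][dc + 1]
                        acc += sum(k * x for k, x in zip(taps, stacked[rr][cc]))
            gate[r][c] = 1.0 / (1.0 + math.exp(-acc))  # sigmoid -> value in (0, 1)
    # Elementwise product of the H x W gate with every channel of the embedding.
    gated = [[[gate[r][c] * x for x in embedding[r][c]] for c in range(w)]
             for r in range(h)]
    return gate, gated

H, W, C = 4, 4, 2
embedding = [[[1.0, 2.0] for _ in range(W)] for _ in range(H)]
prev_mask = [[1 if (r, c) == (0, 0) else 0 for c in range(W)] for r in range(H)]
# Hand-set kernel (assumption): ignore feature channels, respond to the mask channel only.
kernel = [[[0.0, 0.0, 4.0] for _ in range(3)] for _ in range(3)]
gate, gated = spatial_constraint(embedding, prev_mask, kernel, bias=-2.0)
```

With these hand-set weights the gate is high next to the previous mask and low far from it, so features of a distant look-alike object are suppressed before decoding.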

Architecture
The spatial constraint module greatly reduces problems due to occlusion, but the target object may also change as the video progresses. The SCM alone is not sufficient to ensure high segmentation accuracy, so we use several further methods to improve the segmentation accuracy of our architecture. After the context distribution operation, we employ an atrous spatial pyramid pooling (ASPP) module to obtain semantic information at different scales.
To improve the segmentation boundary, we apply a refinement module before soft aggregation. Refinement modules are usually designed as encoder-decoder modules, as shown in Fig. 6, with residual connections to avoid loss of precision while learning deeper information about the frame:
$$S_{\mathrm{refined}} = S_{\mathrm{coarse}} + S_{\mathrm{residual}} \tag{6}$$
We employ a novel residual refinement module (RRM) to correct both region and boundary defects in coarse maps. As Fig. 6 shows, all of our convolution kernels are of size 3 × 3. Batch normalization, a ReLU activation function, and max pooling are applied after each convolution during the encoding phase. In the decoding phase, we use bilinear interpolation for up-sampling; each up-sampling step is followed by a 3 × 3 convolution, with skip connections between the encoder and decoder. Similarly, batch normalization and a ReLU activation function are applied after each convolution. The loss function of the RRM is a hybrid loss, which will be described later. The output of the RRM is used as input to soft aggregation [16], which merges the multi-object predictions; the loss function for soft aggregation is cross entropy loss.

Hybrid loss
Accuracy of boundaries is one of the difficulties in image segmentation. To address this issue, we employ a hybrid loss that combines three losses acting at three levels; $\ell^{(k)}$ denotes the loss of the $k$-th side output. The first loss is computed per pixel, where $r$ and $c$ represent pixel coordinates, $G$ is the ground-truth mask, and $S$ is the predicted value of the object.
The second loss is based on SSIM, the structural similarity index. SSIM was designed to assess picture quality; it captures structural information, so the loss learns structural relationships between the prediction and the ground truth. SSIM loss acts at patch level, and the key point is that it considers boundaries. It is defined in terms of $x$ and $y$, which represent areas of size $N \times N$ extracted from the predicted probability map $S$ and the ground truth; $\mu_x$, $\mu_y$ and $\sigma_x^2$, $\sigma_y^2$ are the means and variances of $x$ and $y$, respectively, and $\sigma_{xy}$ is the covariance of $x$ and $y$.
The third loss is the IoU loss, which acts at map level, with $r$, $c$, $G$, and $S$ as defined above.
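For reference, one plausible reading of the three terms is written out below. It assumes that the pixel-level term is binary cross-entropy and that the SSIM and IoU terms take their usual forms from boundary-aware refinement networks. The subscript names and the stabilizing constants $C_1$, $C_2$ do not appear in the text above and are assumptions.

```latex
% Hedged reconstruction: term names (bce, ssim, iou) and constants C_1, C_2 are assumptions.
\begin{align*}
\ell^{(k)} &= \ell^{(k)}_{\mathrm{bce}} + \ell^{(k)}_{\mathrm{ssim}} + \ell^{(k)}_{\mathrm{iou}} \\
\ell_{\mathrm{bce}} &= -\sum_{(r,c)} \Big[ G(r,c)\,\log S(r,c) + \big(1 - G(r,c)\big)\,\log\big(1 - S(r,c)\big) \Big] \\
\ell_{\mathrm{ssim}} &= 1 - \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \\
\ell_{\mathrm{iou}} &= 1 - \frac{\sum_{r}\sum_{c} S(r,c)\,G(r,c)}{\sum_{r}\sum_{c} \big[ S(r,c) + G(r,c) - S(r,c)\,G(r,c) \big]}
\end{align*}
```

Under this reading, the first term scores each pixel independently, the second compares local $N \times N$ statistics and so reacts strongly near boundaries, and the third measures the overlap of the whole map.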

Experiments
This section describes implementation details of our framework and experiments carried out on the DAVIS 2016 [37], DAVIS 2017 [38], and YouTube-VOS 2018 [39] datasets. Evaluation metrics used for object segmentation are the average region similarity (J mean), the average contour accuracy (F mean), and the average of the two (J&F mean). As Fig. 7 shows, our network model achieves a very good balance between speed and accuracy relative to other methods.
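As a reminder of what the region measure computes, the sketch below evaluates J as the intersection-over-union of two binary masks and averages it with a given F value. The convention of returning 1.0 when both masks are empty is an assumption, contour accuracy F is not implemented here, and the function names are hypothetical.

```python
def region_similarity(pred, gt):
    """J for one frame: intersection over union of two binary masks."""
    inter = union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def j_and_f(j_mean, f_mean):
    """J&F mean: the average of the two per-sequence means."""
    return (j_mean + f_mean) / 2.0

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_similarity(pred, gt)  # one shared pixel out of two covered pixels
```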

Datasets
The DAVIS 2016 & 2017 [37,38] datasets are intended to benchmark pixel-perfect labelling. Their goal is to provide realistic video scenes including camera jitter, background clutter, occlusion, and other complications. DAVIS 2016 [37] is a single-target dataset containing 50 video sequences, 30 of which are for training and 20 for validation. DAVIS 2017 [38] is a multi-target dataset. Each frame contains several different annotated targets. It includes 150 video sequences, 376 target instances, and 10,459 frames.
The YouTube-VOS 2018 dataset [39] is by far the largest video object segmentation dataset, comprising 4453 YouTube video clips and 94 object categories, which allows comprehensive evaluation and comparison of video object segmentation methods.

Implementation details
Our model is first pre-trained on the video clips simulated using an image dataset, and then trained on the video dataset.

Pre-training on image datasets
Training with a static image dataset compensates for the limited number of frames in the video datasets, and avoids over-fitting caused by a lack of training data. This method assumes no temporal relationship between images, and uses static image datasets to train video object segmentation models. Previous work used static images to train networks, and we took a similar approach. Specifically, we apply random affine transformations [11] to each image to generate a video sequence of three frames, which is used to train our network, making it more robust and easier to adapt to different segmentation targets. We pre-trained our model on the COCO dataset [46].
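The clip synthesis can be illustrated with the following sketch, which warps a small image and its mask with the same random rotation, scale, and shift to make each of three frames. The parameter ranges, the nearest-neighbour sampling, and all function names are hypothetical choices for illustration only.

```python
import math
import random

def random_affine(rng, max_rot_deg=15.0, max_shift=1.0, scale_range=(0.9, 1.1)):
    """Sample rotation, scale and shift; ranges here are illustrative assumptions."""
    angle = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = rng.uniform(*scale_range)
    a = scale * math.cos(angle)
    b = scale * math.sin(angle)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    # Forward map about the centre: x' = a*x - b*y + tx, y' = b*x + a*y + ty.
    return a, b, tx, ty

def warp(grid, params, fill=0):
    """Warp a 2D grid by inverse mapping with nearest-neighbour sampling."""
    a, b, tx, ty = params
    h, w = len(grid), len(grid[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    det = a * a + b * b
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = x - cx - tx, y - cy - ty
            sx = (a * u + b * v) / det + cx    # inverse of the forward map
            sy = (-b * u + a * v) / det + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = grid[iy][ix]
    return out

def make_clip(image, mask, rng, length=3):
    """Build a synthetic clip; image and mask share the transform in each frame."""
    clip = []
    for _ in range(length):
        params = random_affine(rng)
        clip.append((warp(image, params), warp(mask, params)))
    return clip

image = [[r * 5 + c for c in range(5)] for r in range(5)]
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
clip = make_clip(image, mask, random.Random(0))
```

Because each frame uses different parameters, the three frames show the same object under different poses, which mimics appearance change without any real temporal data.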

Main training on video datasets
We used real video data for the main training stage, using DAVIS 2016 [37], DAVIS 2017 [38], and YouTube-VOS 2018 [39] datasets according to different training objectives. We randomly used three frames in the correct temporal order from the same video sequence as training samples. In order to learn appearance changes in objects over a long period, we randomly skipped frames during the sampling process. As training progressed, the number of frames skipped increased from 0 to 25.

Other training details
We randomly cropped input frames to a size of 384×384. We used the Adam [47] optimizer with a fixed learning rate of $10^{-5}$. We froze the batch normalization layers during training. The mini-batch size was 4. Both pre- and main-training used random affine transformations, but the main training process was less random. The sampling intervals increased by 5 after every 20 epochs, both for DAVIS and YouTube-VOS.
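The sampling curriculum can be sketched as below, assuming the maximum skip starts at 0, grows by 5 every 20 epochs, and is capped at 25. The function names and the handling of very short videos are hypothetical.

```python
import random

def max_skip(epoch, step=5, every=20, cap=25):
    """Largest number of skipped frames allowed at a given epoch."""
    return min(cap, (epoch // every) * step)

def sample_triplet(num_frames, epoch, rng):
    """Pick three frame indices in temporal order with random gaps."""
    if num_frames < 3:
        raise ValueError("need at least three frames")
    # Shrink the limit for short videos so the triplet always fits (assumption).
    limit = min(max_skip(epoch), (num_frames - 3) // 2)
    gap1 = rng.randint(0, limit) + 1  # skipped frames + 1
    gap2 = rng.randint(0, limit) + 1
    first = rng.randint(0, num_frames - 1 - gap1 - gap2)
    return first, first + gap1, first + gap1 + gap2

rng = random.Random(0)
early = sample_triplet(80, epoch=0, rng=rng)    # consecutive frames only
late = sample_triplet(80, epoch=120, rng=rng)   # gaps of up to 26 frames
```

Early epochs therefore see nearly static triplets, while later epochs see larger appearance changes between the sampled frames.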

Ablation study
We performed ablation experiments using the DAVIS 2017 dataset to see how each module of our network contributes to the final results.

Pre-training and main training
An interesting result from our experiments is that a model that only undergoes pre-training segments video better than a model that only undergoes main training, which indicates that the size of the training set has a significant influence on the resulting network. When pre-training is omitted, the overall accuracy of the main-training-only model on the YouTube dataset decreases by 15% (see Table 1): the model over-fits severely. These experiments show that the rich static image resources used in pre-training help enhance our network's robustness, so we use both pre- and main-training to achieve the best results.

Global context module
The GCM uses a fixed-size updater, so model memory usage does not grow as the number of video frames increases, and the network can learn information from each frame. Results of a comparison to STM's update module using the DAVIS 2017 dataset are shown in Table 2, with STM using the same scheme of reading all frames as GC. It can be seen that GCM processes video significantly faster, while accuracy is not greatly affected. The J mean and F mean obtained by STM are 0.3% and 2.3% higher than by GCM, respectively. The difference is small, but GC runs three times faster than STM. Table 3 shows the memory consumption of the two methods. As t increases, STM's resource consumption increases linearly, while GCM's resource consumption remains at a very low level.
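The scaling behaviour can be checked with simple counting. The sketch below counts stored feature values only; the feature-map size and channel numbers are hypothetical placeholders, not measurements from Table 3.

```python
def stm_memory_entries(num_frames, h, w, c_key, c_val, save_every=1):
    """Values held by a concatenating memory after num_frames frames."""
    stored_frames = (num_frames + save_every - 1) // save_every
    return stored_frames * h * w * (c_key + c_val)

def gcm_memory_entries(c_key, c_val):
    """Values held by a fixed C_N x C_M context matrix, for any video length."""
    return c_key * c_val

# Hypothetical sizes: 24 x 24 feature map, 128 key channels, 512 value channels.
H, W, C_KEY, C_VAL = 24, 24, 128, 512
for frames in (1, 10, 100):
    stm_all = stm_memory_entries(frames, H, W, C_KEY, C_VAL)
    stm_fifth = stm_memory_entries(frames, H, W, C_KEY, C_VAL, save_every=5)
    gcm = gcm_memory_entries(C_KEY, C_VAL)
    print(frames, stm_all, stm_fifth, gcm)
```

Saving only every fifth frame lowers the slope of the growth but does not remove it, whereas the fixed-size context matrix is constant in the number of frames.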

Spatial constraint module
The spatial constraint module is used to reduce mismatching of target objects with similar appearance. A comparison was performed with and without the module using the DAVIS 2017 dataset. It shows that the module can significantly prevent mismatching yet has little effect on computational efficiency, as shown in Table 4. In a multi-object video set, the target is more susceptible to interference from similar objects, and the improvement provided by the SCM becomes very obvious: when SCM is used, J and F are improved by 5.8% and 3.8%, respectively, while SCM does not affect speed. As Fig. 5 shows, the SCM uses a mask from the previous frame to focus the current frame on the target object, greatly reducing mismatching.

DAVIS 2016 (single object)
The first comparison used the validation set from the DAVIS 2016 benchmark, with single-object videos.
We directly cite results for other representative works from the DAVIS 2016 benchmark website, including for the recent STM [18] and RANet [15]. Results are given in Table 5. We can see that methods using online learning obtain higher scores. Figure 7 shows a scatter plot of the various methods according to speed and accuracy. It can be seen that the accuracy of methods based on online learning is very high, but the online learning process is time-consuming and prolongs the calculation time. Offline learning methods have high calculation speed, but lower accuracy. Recent methods such as STM achieve a balance between accuracy and speed, running at 6.7 FPS. Our framework improves upon STM, and its speed reaches 25 FPS. It is noteworthy that the videos in DAVIS are very short, mostly not exceeding 100 frames. As the time taken by STM increases linearly with the number of frames, STM will become slower and slower as video length increases, while our framework can maintain high computing speed for any video length. In general, our method achieves the highest speed, and its J mean score is also among the best. As Fig. 9, columns 1, 3 show, even when the target object undergoes severe deformation, our method can segment the object accurately and is unaffected by occlusion.

DAVIS 2017 (multiple object)
DAVIS 2017 is a multi-object segmentation dataset, in which many objects interfere with and obscure each other. Multi-object scenarios are more challenging than single-target scenarios. In Table 4, we compare our framework with several existing mainstream frameworks and see that online learning-based methods perform equally well in multi-target scenarios. However, the computation time for online learning methods is prolonged. Among offline learning methods, our framework is more accurate and faster than STM.
The spatial constraint module gives our network model a distinct advantage in multi-target classification tasks. In Fig. 9, rows 2, 4, 5, our method correctly identifies different entities.

YouTube-VOS
One of the features of the YouTube-VOS dataset is that there are some unseen targets in the validation set. Table 6 compares different methods using this dataset. STM again achieved high scores in this test. Our framework significantly improves upon STM, achieving high scores on seen and unseen object segmentation.

Qualitative results
Figure 8 shows visual examples of the segmentation results of our framework and of other frameworks: RGMP [16], FEELVOS [6], and RANet [15]. Our spatial constraint module can effectively handle many challenging situations, such as object confusion, size changes, and appearance transformations. Our refinement module can help to segment the edges of the target object. In the first row, RGMP [16], FEELVOS [6], and RANet [15] all identify two dogs as the same entity, while our method accurately identifies two entities. In the second row, the RANet method again has a problem of misidentification. In the third row, the RGMP and FEELVOS methods do not recognize the target object. In the last row, all three methods have mismatching problems.
However, there is still room for improvement in our framework. As Fig. 10 shows, when an object is severely deformed, it may lead to inaccurate results (see row 1, columns 4, 5). When the target object does not appear in a long sequence of frames, this may cause segmentation to fail (see row 2, columns 3, 4). Thanks to the robustness of our network, the number of frames in which segmentation fails usually does not exceed two (see row 2, column 5). Mismatches can also occur when several segmented objects are very close together and there are interactions between them (see row 3, columns 2, 4). Since the spatial constraint module uses the mask from the previous frame's segmentation result, our network may also treat an occluding object as the target object if there is occlusion in the current frame (see row 4, columns 3, 4, 5).
To sum up, though some imperfect segmentation results still exist under extreme conditions, our framework generally provides very good segmentation results even with target occlusion, target confusion, and complex object appearance; our network also achieves a very good balance between accuracy and speed.

Conclusions
We have designed a new video object segmentation framework. Fast video frame information acquisition and updating are achieved through the GCM based on the STM approach; it captures object segmentation information in processed frames through a fixed-size updater. We also use a spatial constraint module, which helps our network to achieve outstanding results in multi-target problems. Finally, we use a refinement module to help our network provide a more refined segmentation boundary for the target object.
As the experiments on benchmark datasets show, our method outperforms STM, in terms of both accuracy and speed. Furthermore, because of the GCM, our network cannot run out of memory over time. Overall, our solution is efficient and compatible, and we hope it will set a strong baseline for other real-time video object segmentation solutions in the future.