1 Introduction

In the visual world, objects rarely occur in isolation. Psychophysical and computational studies (Hegdé et al., 2008; Nakayama et al., 1989) have demonstrated that the human visual system can perceive heavily occluded objects through contextual reasoning and association. The question then becomes: can our video understanding systems perceive objects that are severely obscured?

Fig. 1 Sample video clips from OVIS. Click them to watch the animations (best viewed with Acrobat/Foxit Reader). The hairs and whiskers of animals are all exhaustively annotated

Fig. 2 Different occlusion levels in OVIS. Unoccluded objects are colored green, slightly occluded objects yellow, and severely occluded objects red

Our work aims to explore this matter in the context of video instance segmentation, a popular task proposed in Yang et al. (2019) that targets a comprehensive understanding of objects in videos. To this end, we explore a new and challenging scenario called Occluded Video Instance Segmentation (OVIS), which requires a model to simultaneously detect, segment, and track object instances in occluded scenes.

As the major contribution of this work, we collect a large-scale dataset called OVIS, designed specifically for video instance segmentation in occluded scenes. As the second video instance segmentation dataset after YouTube-VIS (Yang et al., 2019), OVIS consists of 296k high-quality instance masks across 25 commonly seen semantic categories. Some example clips are given in Fig. 1. The most distinctive property of the OVIS dataset is that most objects are under severe occlusions. The occlusion level of each object is also labeled (as shown in Fig. 2), and we present an AP (average precision) based metric to measure performance under different occlusion degrees. OVIS is therefore a useful testbed for evaluating how well video instance segmentation models deal with heavy object occlusions.

To dissect the OVIS dataset, we conduct a thorough evaluation of 9 state-of-the-art algorithms whose code is publicly available, including FEELVOS (Voigtlaender et al., 2019a), IoUTracker+ (Yang et al., 2019), MaskTrack R-CNN (Yang et al., 2019), SipMask (Cao et al., 2020), STEm-Seg (Athar et al., 2020), STMask (Li et al., 2021), TraDeS (Wu et al., 2021), CrossVIS (Yang et al., 2021), and QueryVIS (Fang et al., 2021). The experimental results suggest that current video understanding systems fall behind human capability in occlusion perception: the highest AP is only 16.3, achieved by Yang et al. (2021), and the highest AP on the heavily occluded group is only 6.3, achieved by Li et al. (2021). In this sense, we are still far from deploying these techniques in practical applications, especially considering the complexity and diversity of scenes in the real visual world.

To alleviate the occlusion issue, we also present a plug-and-play module called temporal feature calibration. For a given query frame in a video, we resort to a reference frame to complement its missing object cues. Specifically, the proposed module learns a calibration offset for the reference frame under the guidance of the query frame, and the offset is then used to adjust the feature embedding of the reference frame via deformable convolution (Dai et al., 2017). The refined reference embedding is in turn used to assist object recognition in the query frame. Our module is a highly flexible plug-in. When applied to MaskTrack R-CNN (Yang et al., 2019) and SipMask (Cao et al., 2020) respectively, we obtain APs of 15.4 and 14.3, significantly outperforming the corresponding baselines by 4.6 and 4.1 AP.

To summarize, our contributions are three-fold:

  • We advance occlusion handling and video instance segmentation by releasing a new benchmark dataset named OVIS (short for Occluded Video Instance Segmentation). OVIS is designed with the philosophy of perceiving object occlusions in videos, which could reveal the complexity and the diversity of real-world scenes.

  • We streamline the research over the OVIS dataset by conducting a comprehensive evaluation of 9 state-of-the-art video instance segmentation algorithms, which could be a baseline reference for future research on OVIS.

  • As a minor contribution, we present a plug-and-play module called Temporal Feature Calibration to alleviate the occlusion issue. Using MaskTrack R-CNN (Yang et al., 2019) and SipMask (Cao et al., 2020) as baselines, the proposed module obtains remarkable improvements on both OVIS and YouTube-VIS. More importantly, its “plug-and-play” nature makes it widely applicable to future endeavors on OVIS.

Compared with our conference version (Qi et al., 2021), which briefly describes the OVIS dataset and the challenge held in 2021, the improvements are summarized as follows: (1) more thorough experiments (e.g., oracle experiments, error analysis, per-class result analysis) are conducted to dissect the OVIS dataset and the occlusion problem; (2) we comprehensively evaluate the effect of leveraging temporal context and adaptively adjusting the NMS threshold on occlusion handling; (3) more baseline results (e.g., results of training with augmented image sequences, and results obtained with a larger backbone or larger input resolutions) are provided, serving as a better reference for future work; (4) we further summarize remaining difficulties and future directions that deserve attention in OVIS.

2 Related Work

2.1 Video Instance Segmentation

Our work focuses on video instance segmentation in occluded scenes. The most relevant work to ours is Yang et al. (2019), which formally defines the concept of video instance segmentation and releases the first dataset, YouTube-VIS. Built upon the large-scale video object segmentation dataset YouTube-VOS (Xu et al., 2018), the 2019 version of YouTube-VIS contains a total of 2883 videos, 4883 instances, and 131k masks in 40 categories. Its latest 2021 version contains a total of 3859 videos, 8171 instances, and 232k masks. Whereas YouTube-VIS was not designed to study the occluded video understanding problem, most objects in the OVIS dataset are under severe occlusions. Our experimental results show that OVIS is much more challenging.

Since the release of the YouTube-VIS dataset, video instance segmentation has attracted great attention in the computer vision community, giving rise to a series of algorithms. MaskTrack R-CNN (Yang et al., 2019) is the first unified model for video instance segmentation. It fulfills video instance segmentation by adding a tracking branch to the popular image instance segmentation method Mask R-CNN (He et al., 2017). Lin et al. (2020) propose a modified variational auto-encoder architecture built on top of Mask R-CNN. MaskProp (Bertasius & Torresani, 2020) is also a video extension of Mask R-CNN, which adds a mask propagation branch to track instances by the propagated masks. SipMask (Cao et al., 2020) extends single-stage image instance segmentation to the video level by adding a fully-convolutional branch for tracking instances. STMask (Li et al., 2021) improves feature representation by spatial feature calibration and temporal feature fusion. Different from those top-down methods, STEm-Seg (Athar et al., 2020) proposes a bottom-up method, which performs video instance segmentation by clustering the pixels of the same instance. Built upon Transformers, VisTR (Wang et al., 2020) supervises and segments instances at the sequence level as a whole. IFC (Hwang et al., 2021) further reduces the computation of full space-time transformers by only executing attention between memory tokens. QueryVIS (Fang et al., 2021) follows a multi-stage paradigm and leverages the intrinsic one-to-one correspondence of queries across different stages. Based on FCOS (Tian et al., 2019), SGNet (Liu et al., 2021) dynamically divides instances into sub-regions and performs segmentation on each region. CrossVIS (Yang et al., 2021) uses the instance feature in the current frame to localize the same instance in other frames. Different from the tracking-by-detection paradigm, TraDeS (Wu et al., 2021) integrates tracking cues to assist detection.

2.2 Other Related Tasks

Our work is also relevant to several other tasks, including:

Video Object Segmentation Video object segmentation (VOS) is a popular task in video analysis. Depending on whether the mask of the first frame is provided, VOS is divided into semi-supervised and unsupervised scenarios. Semi-supervised VOS (Hu et al., 2018; Johnander et al., 2019; Khoreva et al., 2017; Li & Loy, 2018; Li et al., 2020b; Oh et al., 2018, 2019; Voigtlaender and Leibe, 2017; Wang et al., 2021a) aims to track and segment an object specified by its first-frame mask. Many semi-supervised VOS methods (Khoreva et al., 2017; Li & Loy, 2018; Voigtlaender and Leibe, 2017) adopt an online learning manner, fine-tuning the network on the first-frame mask during inference. Recently, other works (Hu et al., 2018; Johnander et al., 2019; Li et al., 2020b; Oh et al., 2018, 2019; Wang et al., 2021a) avoid online learning for the sake of faster inference. Unsupervised VOS methods (Li et al., 2018; Tokmakov et al., 2017; Wang et al., 2019) aim to segment the primary objects in a video without first-frame annotations.

As the first video object segmentation dataset, DAVIS (Caelles et al., 2019; Perazzi et al., 2016) contains 150 videos and 376 densely annotated objects. Xu et al. (2018) further propose the larger YouTube-VOS dataset, with 4453 video clips and 7755 objects, based on the large-scale YouTube-8M (Abu-El-Haija et al., 2016) dataset. Different from video instance segmentation, which needs to classify objects, neither unsupervised nor semi-supervised VOS distinguishes semantic categories. In addition, only one or several salient objects are annotated in these VOS datasets, while we annotate all objects belonging to the pre-defined category set.

Video Semantic Segmentation Video semantic segmentation requires semantic segmentation of each frame in a video. Popular video semantic segmentation datasets include Cityscapes (Cordts et al., 2016) and CamVid (Brostow et al., 2009). There are 5000 video clips in the Cityscapes (Cordts et al., 2016) dataset; each clip consists of 30 frames and only the 20th frame is annotated. The CamVid (Brostow et al., 2009) dataset contains 4 videos, with one frame annotated every 30 frames, yielding 800 annotated frames in total. LSTMs (Fayyaz et al., 2016), GRUs (Nilsson & Sminchisescu, 2018), and optical flow (Zhu et al., 2017) have been introduced to leverage temporal contextual information for more accurate or faster video semantic segmentation. Video semantic segmentation requires neither distinguishing instances nor tracking objects across frames.

Video Panoptic Segmentation Kim et al. (2020) define a video extension of panoptic segmentation (Kirillov et al., 2019), which requires generating consistent panoptic segmentation while associating instances across frames. They further reformatted the VIPER dataset into 124 videos and proposed the Cityscapes-VPS dataset, which contains 500 videos.

Open-World Video Object Segmentation Different from the aforementioned tasks, open-world video object segmentation (Wang et al., 2021b) is taxonomy-free and requires segmenting and tracking all the objects class-agnostically. The proposed UVO dataset (Wang et al., 2021b) contains 1200 videos and all the videos are densely annotated.

Multi-Object Tracking Multi-object tracking (MOT) (Smeulders et al., 2013) aims to detect the bounding boxes of objects and track them in a given video. Some popular datasets focus on tracking pedestrians and cars in street scenes, such as MOT16 (Milan et al., 2016) and KITTI (Geiger et al., 2012). Meanwhile, UA-DETRAC (Wen et al., 2020) features vehicle tracking only.

Multi-Object Tracking and Segmentation Multi-object tracking and segmentation (MOTS) (Voigtlaender et al., 2019b) extends multi-object tracking (MOT) (Smeulders et al., 2013) from the bounding box level to the pixel level. Voigtlaender et al. (2019b) release the KITTI MOTS and MOTSChallenge datasets, and propose Track R-CNN, which extends Mask R-CNN with 3D convolutions to incorporate temporal context and with an extra tracking branch for object tracking. Xu et al. (2020) release the ApolloScape dataset, which provides more crowded scenes, and propose a new track-by-points paradigm. The task definition of MOTS is similar to video instance segmentation: an algorithm needs to simultaneously detect, segment, and track objects. While MOTS mainly focuses on pedestrians and cars in streets, VIS targets more diverse scenes and more general objects in daily life, such as animals.

Video Object Detection Video object detection (VOD) is a direct extension of image-level object detection. Compared with multi-object tracking, the video object detection task does not require tracking objects. The most commonly used dataset is ImageNet-VID (Russakovsky et al., 2015), which contains 3862 snippets for training, 555 snippets for validation, and 937 snippets for evaluation.

Our work is of course relevant to some image-level recognition tasks, such as semantic segmentation (Chen et al., 2017, 2018; Long et al., 2015), instance segmentation (He et al., 2017; Huang et al., 2019; Kirillov et al., 2020), panoptic segmentation (Kirillov et al., 2019; Li et al., 2020a; Xiong et al., 2019), large vocabulary instance segmentation (Gupta et al., 2019; Wu et al., 2020a), etc.

2.3 Occlusion Understanding

There are also some works focusing on occlusion understanding and handling. BCNet (Ke et al., 2021) adds a new branch to infer the occluders and utilizes the obtained occluder features to enhance the features of occludees. OCFusion (Lazarow et al., 2020) introduces an occlusion head to indicate the occlusion relation between each pair of mask proposals. Zhan et al. (2020) propose a self-supervised method that can recover the occlusion ordering and complete the invisible parts of occluded objects. Different from the full-DNN paradigm described above, some methods (Kortylewski et al., 2020a, b, 2021) integrate compositional models and deep convolutional neural networks into a unified model that is more robust to partial occlusions. For pedestrian detection in crowded scenes, Wang et al. (2018b) and Zhang et al. (2018) propose new loss functions that enforce predicted boxes to locate compactly around the corresponding ground-truth objects while staying far from other objects. Zhou and Yuan (2018) regress two bounding boxes for each object to localize the full body and the visible part of a pedestrian, respectively. Liu et al. (2019a) introduce adaptive-NMS, which adaptively increases the NMS threshold in crowded scenes. Wu et al. (2020b) aggregate temporal context to enhance the feature representations. Chu et al. (2020) predict multiple instances in one proposal. In multi-object tracking, Chu et al. (2017) and Zhu et al. (2018) utilize attention modules to attend to the visible parts of objects. Liu et al. (2020) and Xu et al. (2019) exploit the topology between objects to track occluded objects. In our experiments, to test the effect of temporal aggregation on occlusion handling, we present a temporal feature calibration module, in which calibrated features from neighboring frames are fused with the current frame to reason about occluded objects and improve per-frame recognition.

3 OVIS Dataset

Given an input video, video instance segmentation requires detecting, segmenting, and tracking object instances simultaneously from a predefined set of object categories. An algorithm is supposed to output the class label, confidence score, and a sequence of binary masks of each instance.

The focus of this work is on collecting a large-scale benchmark dataset for video instance segmentation with severe object occlusions. In this section, we mainly review the data collection process, the annotation process, and the dataset statistics.

Fig. 3 Number of instances per category in the OVIS dataset

3.1 Video Collection

We begin by selecting 25 semantic categories, including Person, Bird, Cat, Dog, Horse, Sheep, Cow, Elephant, Bear, Zebra, Giraffe, Poultry, Giant panda, Lizard, Parrot, Monkey, Rabbit, Tiger, Fish, Turtle, Bicycle, Motorcycle, Airplane, Boat, and Vehicle. The categories are carefully chosen for three main reasons: (1) most of them are animals, among which object occlusions happen extensively; (2) they are commonly seen in daily life; (3) they highly overlap with the categories of popular large-scale image instance segmentation datasets (Gupta et al., 2019; Lin et al., 2014), so models trained on those datasets are easier to transfer. The number of instances per category is given in Fig. 3.

Since the dataset is meant to study the capability of video understanding systems to perceive occlusions, we ask the annotation team to (1) exclude videos where only a single object stands in the foreground; (2) exclude videos with a clean background; and (3) exclude videos where the complete contour of objects is visible all the time. Other objective rules include: (1) video length is generally between 5 and 60 s, and (2) video resolution is generally \(1920\times 1080\).

After applying the objective rules, the annotation team delivers 8644 candidate videos, of which our research team accepts only 901 challenging videos after a careful re-check. Due to the stringent video collection standard, the pass rate is as low as 10%.

3.2 Annotation

Given an accepted video, the annotation team is asked to exhaustively annotate all objects belonging to the pre-defined category set. Each object is given an instance identity and a class label. In addition to some common rules (e.g., no ID switch, mask fitness \(\le 1\) pixel), the annotation team is trained with several criteria particularly about occlusions: (1) if an existing object disappears because of full occlusion and then re-appears, its instance identity should remain the same; (2) if a new instance appears in an in-between frame, a new instance identity is assigned; and (3) annotators must be able to distinguish re-appearing objects from new instances after watching the contextual frames. All videos are annotated every 5 frames, and the final annotation granularity of most videos is 5 or 6 fps.

To deeply analyze the influence of occlusion levels on model performance, OVIS provides an occlusion level annotation for every object in each frame. Three occlusion levels are defined: no occlusion, slight occlusion, and severe occlusion. As illustrated in Fig. 2, no occlusion means the object is fully visible, slight occlusion means that more than 50% of the object is visible, and severe occlusion means that more than 50% of the object area is occluded. After the frame-level occlusion degree is annotated, we quantify the occlusion degree of each instance over the whole video by aggregating its occlusion levels across frames. Specifically, we first map the three occlusion levels to numeric scores: no occlusion, slight occlusion, and severe occlusion are mapped to 0, 0.25, and 0.75, respectively. Then, given an instance that appears in multiple frames, we average the occlusion scores of the top 50% of frames with the highest scores to represent the occlusion degree of the instance.
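The scoring scheme above can be sketched as follows (a minimal illustration; the level names, the function name, and the rounding behavior of "top 50% of frames" are our own assumptions):

```python
# Map annotated occlusion levels to numeric scores (from the text above).
LEVEL_SCORE = {"no": 0.0, "slight": 0.25, "severe": 0.75}

def instance_occlusion_score(frame_levels):
    """Occlusion degree of one instance: average of the top 50% of
    per-frame scores (assumed to round down, keeping at least one frame)."""
    scores = sorted((LEVEL_SCORE[lvl] for lvl in frame_levels), reverse=True)
    k = max(1, len(scores) // 2)
    return sum(scores[:k]) / k
```

For example, an instance annotated severe/no/slight/no over four frames would score (0.75 + 0.25) / 2 = 0.5.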

Table 1 Comparing OVIS with YouTube-VIS in terms of statistics
Fig. 4 Comparison of OVIS with YouTube-VIS, including the distribution of instance duration (a), BOR (b), the number of instances per video (c), and the number of objects per frame (d)

Each video is handled by one annotator to produce the initial annotation, which is then passed to another annotator to check and correct if necessary. The final annotations are examined by our research team and sent back for revision if deemed below the required quality.

While OVIS is designed for video instance segmentation, it is also suitable for evaluating video object segmentation (in either a semi-supervised or unsupervised fashion) and object tracking, since bounding-box annotations are also provided. The relevant experimental settings will be explored in future work.

3.3 Dataset Statistics

As YouTube-VIS (Yang et al., 2019) is currently the only other dataset specifically designed for video instance segmentation, we analyze the statistics of OVIS with YouTube-VIS as a reference in Table 1. We compare OVIS with two versions of YouTube-VIS: YouTube-VIS 2019 and YouTube-VIS 2021. Note that some statistics of YouTube-VIS, marked with \(\star \), are calculated from the training set only, because only the training set annotations are publicly available. Nevertheless, since the training set occupies 78% of the whole dataset, those statistics still roughly reflect the properties of YouTube-VIS.

In terms of basic statistics, OVIS contains 296k masks and 5223 instances. The number of masks in OVIS is larger than in YouTube-VIS 2019 and YouTube-VIS 2021, which have 131k and 232k masks, respectively. The number of instances in OVIS is larger than in YouTube-VIS 2019 (4883 instances) and smaller than in YouTube-VIS 2021 (8171 instances). Note that OVIS has fewer categories, so its mean instance count per category is larger than that of YouTube-VIS 2021. Nonetheless, OVIS has fewer videos than YouTube-VIS because our design philosophy favors long videos and instances so as to preserve enough motion and occlusion scenarios.

As shown, the average video duration and average instance duration of OVIS are 12.77 s and 10.05 s, respectively. Figure 4a presents the distribution of instance duration, which shows that all instances in YouTube-VIS last less than 10 s. Long videos and instances increase the difficulty of tracking and demand long-term tracking ability.

Fig. 5 Visualization of occlusions with different BOR values

As for occlusion levels, the proportions of objects with no occlusion, slight occlusion, and severe occlusion in OVIS are 18.2%, 55.5%, and 26.3%, respectively. 80.2% of instances are severely occluded in at least one frame, and only 2% of instances are never occluded in any frame. This supports the focus of our work: exploring the ability of video instance segmentation models to handle occluded scenes.

In order to compare the occlusion degree with other datasets, we define a metric named Bounding-box Occlusion Rate (BOR) to approximate the degree of occlusion. Given a video frame with N objects denoted by bounding boxes \(\{\mathbf{B }_1,\mathbf{B} _2,\dots ,\mathbf{B} _N\}\), we compute the BOR for this frame as

$$\begin{aligned} \text {BOR}=\frac{\left| \bigcup _{1\le i<j\le N} \mathbf{B}_i\cap \mathbf{B}_j\right| }{\left| \bigcup _{1\le i\le N}\mathbf{B}_i\right| }, \end{aligned}$$

where the numerator is the area of the union of pairwise intersections between bounding boxes, and the denominator is the area of the union of all bounding boxes. An illustration is given in Fig. 5: the larger the BOR value, the heavier the occlusion.
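For concreteness, the BOR of a single frame can be computed by rasterizing the boxes and counting per-pixel coverage, as in this illustrative sketch (the function name, box format, and pixel-grid approximation are our own choices):

```python
import numpy as np

def bounding_box_occlusion_rate(boxes, height, width):
    """BOR of one frame: area covered by two or more boxes divided by
    the area covered by at least one box (pixel-grid approximation).

    boxes: iterable of (x1, y1, x2, y2) in integer pixel coordinates.
    """
    cover = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        cover[y1:y2, x1:x2] += 1  # count how many boxes cover each pixel
    union = int((cover >= 1).sum())    # |union of all boxes|
    overlap = int((cover >= 2).sum())  # |union of pairwise intersections|
    return overlap / union if union > 0 else 0.0
```

For instance, two 2×2 boxes that overlap in a single pixel yield BOR = 1/7.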

We then use mBOR, the average BOR over all frames in a dataset (frames that contain no objects are ignored), to characterize the dataset in terms of occlusion. As shown in Table 1, the mBOR of OVIS is 0.22, much higher than that of YouTube-VIS 2019 and YouTube-VIS 2021 (0.07 and 0.06, respectively). The BOR distributions are further compared in Fig. 4b. Most frames in YouTube-VIS lie in the region where \(\text {BOR}\le 0.1\), whereas about half of the frames in OVIS have a BOR of no less than 0.2. This supports that occlusions in OVIS are more severe than in YouTube-VIS. It should be noted, however, that BOR only roughly captures the occlusion between objects, so mBOR serves as an effective yet partial indicator of occlusion degree.

In addition to long videos and instances and severe occlusions, OVIS features crowded scenes, a natural consequence of heavy occlusions. OVIS has 5.80 instances per video and 4.72 objects per frame, while the corresponding values for YouTube-VIS 2021 are 2.10 and 1.95. The two distributions are further depicted in Fig. 4c, d.

3.4 Evaluation Metrics

Following previous methods (Yang et al., 2019), we use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as the evaluation metrics. The mean AP over these IoU thresholds is also reported.

In addition, thanks to the occlusion level annotations in OVIS, we can analyze performance under different occlusion levels. We divide all instances into three groups, slightly occluded, moderately occluded, and heavily occluded, whose occlusion scores lie in the ranges [0, 0.25], [0.25, 0.5], and [0.5, 0.75], respectively. The proportions of the three groups are 23%, 44%, and 49%, respectively. We then obtain the AP of each group (denoted by \(\text {AP}_{{\textit{SO}}}\), \(\text {AP}_{{\textit{MO}}}\), and \(\text {AP}_{{\textit{HO}}}\), respectively) by ignoring the instances of the other groups.
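The grouping can be expressed as a simple threshold rule; since the published ranges share endpoints, the boundary assignment below (upper bound inclusive) is our own assumption:

```python
def occlusion_group(score):
    """Assign an instance occlusion score to a group for AP_SO/AP_MO/AP_HO."""
    if score <= 0.25:
        return "SO"  # slightly occluded
    if score <= 0.5:
        return "MO"  # moderately occluded
    return "HO"      # heavily occluded
```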

4 Experiments

In this section, we comprehensively study the newly collected OVIS dataset by conducting experiments on 9 existing video instance segmentation algorithms, and we propose a new baseline method.

4.1 Implementation Details

Datasets The newly collected OVIS dataset is divided into 607 training videos, 140 validation videos, and 154 test videos. The split proportions of different categories are approximately the same, and there are at least 4 videos per category in the validation and test sets. This split is fixed as the official split. Unless otherwise specified, experiments are conducted on the OVIS validation set.

Table 2 Overall results of state-of-the-art methods on the OVIS dataset

A Temporal Feature Calibration Plug-in One key to tackling occlusion is to complement the missing object cues. In a video, a mild assumption is that object cues missing in the current frame may have appeared in adjacent frames. Hence, it is natural to leverage adjacent frames to alleviate occlusions. However, due to motion, the features of different frames are not spatially aligned, and severe occlusions make this misalignment even worse. To solve this issue, following Bertasius et al. (2018) and Dosovitskiy et al. (2015), we present a simple plug-in called temporal feature calibration, as illustrated in Fig. 6.

Denote by \(\mathbf{F}_\mathbf{q} \in \mathbb {R}^{H\times W\times C}\) and \(\mathbf{F}_\mathbf{r} \in \mathbb {R}^{H\times W\times C}\) the feature tensors of the query frame (called the target or current frame in some literature) and a reference frame, respectively. The feature calibration first computes the spatial correlation (Dosovitskiy et al., 2015) between \(\mathbf{F}_\mathbf{q}\) and \(\mathbf{F}_\mathbf{r}\). Given a location \(\mathbf{x}_\mathbf{q}\) in \(\mathbf{F}_\mathbf{q}\) and \(\mathbf{x}_\mathbf{r}\) in \(\mathbf{F}_\mathbf{r}\), we compute

$$\begin{aligned} \mathbf{c}(\mathbf{x}_\mathbf{q},\mathbf{x}_\mathbf{r}) = \sum _{o\in [-k,k]\times [-k,k]}\mathbf{F}_\mathbf{q}(\mathbf{x}_\mathbf{q}+o)\,\mathbf{F}_\mathbf{r}(\mathbf{x}_\mathbf{r}+o)^\mathrm {T}\text {.} \end{aligned}$$

The above operation traverses the \(d\times d\) area centered at \(\mathbf{x}_\mathbf{q}\) and outputs a \(d^2\)-dimensional vector.
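To make the operation concrete, here is a minimal NumPy sketch of the correlation volume, simplified to \(k=0\) (per-position channel-wise dot products) with the reference location restricted to a \(d\times d\) neighborhood of the query location; the function name, the zero-padding at frame borders, and the channel ordering are our own assumptions:

```python
import numpy as np

def local_correlation(fq, fr, max_disp=1):
    """Correlation volume between query and reference feature maps.

    fq, fr: (H, W, C) arrays. Returns (H, W, d*d) with d = 2*max_disp + 1;
    each channel holds the dot product over C between fq at a position
    and fr shifted by one displacement in the d x d neighborhood.
    """
    H, W, C = fq.shape
    d = 2 * max_disp + 1
    out = np.zeros((H, W, d * d), dtype=fq.dtype)
    # Zero-pad the reference so out-of-frame displacements contribute 0.
    pad = np.pad(fr, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    idx = 0
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = pad[max_disp + dy:max_disp + dy + H,
                          max_disp + dx:max_disp + dx + W]
            out[:, :, idx] = (fq * shifted).sum(axis=-1)
            idx += 1
    return out
```

Identical all-ones features give a correlation of C at zero displacement, and zero at displacements that fall outside the frame.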

Fig. 6 The pipeline of temporal feature calibration, which can be inserted into different video instance segmentation models by changing the subsequent prediction head

After enumerating all positions in \(\mathbf{F}_\mathbf{q}\), we obtain \(\mathbf{C} \in \mathbb {R}^{H\times W\times d^2}\) and feed it into multiple stacked convolution layers to get the spatial calibration offset \(\mathbf{D} \in \mathbb {R}^{H\times W\times 18}\). We then obtain a calibrated version of \(\mathbf{F}_\mathbf{r}\), denoted \(\overline{\mathbf{F}}_\mathbf{r}\), by applying deformable convolution with \(\mathbf{D}\) as the spatial offset. Finally, we fuse the calibrated reference feature \(\overline{\mathbf{F}}_\mathbf{r}\) with the query feature \(\mathbf{F}_\mathbf{q}\) by element-wise addition for the subsequent localization, classification, and segmentation of the current frame.

During training, for each query frame \(\mathbf{F}_\mathbf{q}\), we randomly sample a reference frame \(\mathbf{F}_\mathbf{r}\) from the same video. Compared with the short videos in YouTube-VIS (whose longest video contains only 36 frames), the first and last frames of a long video in OVIS (whose longest video contains 500 frames) may look totally different. To ensure that the reference frame has a strong spatial correspondence with the query frame, sampling is done locally within \(\varepsilon _{\text {train}}=5\) frames. Since temporal feature calibration is differentiable, it can be trained end-to-end with the original detection and segmentation losses. During inference, all frames adjacent to the query frame within the range \(\varepsilon _{\text {test}}=5\) are taken as reference frames. We linearly fuse the classification confidences, bounding box coordinates, and segmentation masks obtained from each reference frame to produce the final results for the query frame.
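The local reference sampling during training can be sketched as follows (the function name and the boundary handling at the start and end of a video are illustrative assumptions):

```python
import random

def sample_reference_index(q, num_frames, eps=5):
    """Randomly pick a reference frame within eps frames of query index q,
    clipped to the video boundaries and excluding q itself."""
    lo, hi = max(0, q - eps), min(num_frames - 1, q + eps)
    candidates = [i for i in range(lo, hi + 1) if i != q]
    return random.choice(candidates)
```

At inference time, the analogous candidate list (all indices in the window except q) would be used in full rather than sampled.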

In the experiments, we denote the new methods as CMaskTrack R-CNN and CSipMask, when Calibrating MaskTrack R-CNN (Yang et al., 2019) models and Calibrating SipMask (Cao et al., 2020) models, respectively.

Experimental Setup For all experiments, we adopt ResNet-50-FPN (He et al., 2016) as the backbone. The models are initialized with Mask R-CNN weights pre-trained on MS-COCO (Lin et al., 2014). All frames are resized to \(640 \times 360\) during both training and inference for fair comparison with previous works (Yang et al., 2019; Cao et al., 2020; Athar et al., 2020). For our new baselines (CMaskTrack R-CNN and CSipMask), the temporal feature calibration module uses three convolution layers of kernel size \(3\times 3\). Models are trained for 12 epochs with an initial learning rate of 0.005, decayed by a factor of 10 at epochs 8 and 11.

Table 3 Quantitative comparison between the new methods and their corresponding baselines on the OVIS dataset and the YouTube-VIS dataset

4.2 Main Results

On the OVIS dataset, we first report the performance of several state-of-the-art algorithms whose code is publicly available, including the mask propagation method FEELVOS (Voigtlaender et al., 2019a), the track-by-detect method IoUTracker+ (Yang et al., 2019), and recently proposed end-to-end methods: MaskTrack R-CNN (Yang et al., 2019), SipMask (Cao et al., 2020), STEm-Seg (Athar et al., 2020), STMask (Li et al., 2021), TraDeS (Wu et al., 2021), CrossVIS (Yang et al., 2021), and QueryVIS (Fang et al., 2021). The standard deviation of the reported results below is about 0.5.

As presented in Table 2, although most of these methods obtain more than 30 AP on YouTube-VIS, all of them suffer a performance degradation of at least 50% on OVIS relative to YouTube-VIS. In the heavily occluded instance group in particular, all methods suffer a significant performance drop of more than 80%. For example, SipMask (Cao et al., 2020), which achieves an AP of 32.5 on YouTube-VIS, obtains an AP of only 2.2 on the heavily occluded group of the OVIS validation set. This firmly suggests that severe occlusion greatly increases the difficulty of video instance segmentation, and that further attention should be paid to video instance segmentation in the real world, where occlusions happen extensively. Benefiting from feature calibration and temporal fusion, STMask (Li et al., 2021) obtains an \(\text {AP}_{{\textit{HO}}}\) of 5.1 on the validation set and 6.3 on the test set, surpassing all other methods in the heavily occluded group.

It is worth noting that, as the only bottom-up video instance segmentation method, STEm-Seg achieves an \(\text {AP}_{{\textit{SO}}}\) similar to MaskTrack R-CNN and TraDeS, but a much higher \(\text {AP}_{{\textit{HO}}}\) (3.9 vs. 2.7 and 3.0). This suggests that a bottom-up paradigm like STEm-Seg may handle occlusion better than the prevailing top-down paradigm. Our interpretation is that the bottom-up architecture avoids the detection step, which is difficult in occluded scenes.

Fig. 7
figure 7

Evaluation examples on OVIS. Each row presents the results of 5 frames in a video sequence. a–e are successful cases and f–h are failure cases

Table 4 Oracle results on OVIS and YouTube-VIS

In addition, as shown in Table 3, the temporal feature calibration (TFC) module significantly improves the performance on OVIS. CMaskTrack R-CNN yields an AP improvement of 4.6 over MaskTrack R-CNN (10.8 vs. 15.4), and CSipMask yields an AP improvement of 4.1 over SipMask (10.2 vs. 14.3). Moreover, the experiments show that TFC boosts the performance at all occlusion levels, with more significant improvements under heavy and moderate occlusion (see Fig. 11 for details). We also evaluate the proposed CMaskTrack R-CNN and CSipMask on the YouTube-VIS dataset. As shown in Table 3, CMaskTrack R-CNN and CSipMask surpass their corresponding baselines by 1.8 and 2.6 AP, respectively, which demonstrates the flexibility and generalization power of the proposed feature calibration module.

To present qualitative evaluation results on OVIS, some examples from CMaskTrack R-CNN are given in Fig. 7, including 5 successful cases (a)–(e) and 3 failure cases (f)–(h). In (a), the car in the yellow mask first entirely blocks the car in the red mask in the 2nd frame, then is entirely blocked by the car in the purple mask in the 4th frame. Remarkably, even in this extreme case, all the cars are well tracked. In (b), CMaskTrack R-CNN successfully tracks the bear in the yellow mask, which is partially occluded by both another object (the bear in the purple mask) and the background (the tree). In (d), we present a crowded scene where almost all the ducks are correctly detected and tracked. In (f), two persons and two bicycles heavily overlap each other; CMaskTrack R-CNN fails to track the person and segment the bicycle. In (g), when two cars intersect, severe occlusion leads to failures in detection and tracking. In (h), although humans can sense that there are two persons with hats at the bottom, CMaskTrack R-CNN cannot detect or track them because the visible cues are inadequate.

Table 5 The results of training with and without augmented image sequences
Table 6 Adaptive NMS oracle results of MaskTrack R-CNN on OVIS

4.3 Discussions

Oracle Results We conduct image oracle and identity oracle experiments to explore the impact of image-level prediction and cross-frame association on OVIS performance. To enable comparison with the YouTube-VIS dataset (Yang et al., 2019), we use MaskTrack R-CNN for these experiments. Following Yang et al. (2019), in the image oracle experiments, we use ground-truth bounding boxes, masks, and category labels to replace the predictions of MaskTrack R-CNN, and then track those ground-truth bounding boxes with the tracking branch. In the identity oracle experiment, we first assign each per-frame prediction to the closest ground-truth bounding box, and then aggregate the bounding boxes with the same identity through the video.
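The identity oracle assignment can be sketched as follows (a simplified illustration: "closest" is implemented here as highest box IoU, which is an assumption, and the function names are ours):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign_identities(pred_boxes, gt_boxes, gt_ids):
    """Identity oracle: give each per-frame prediction the identity of its
    highest-IoU ground-truth box; unmatched predictions get None."""
    assigned = []
    for p in pred_boxes:
        ious = [iou(p, g) for g in gt_boxes]
        best = max(range(len(ious)), key=ious.__getitem__, default=None)
        assigned.append(gt_ids[best] if best is not None and ious[best] > 0 else None)
    return assigned
```

Boxes assigned the same identity across frames are then aggregated into a single track.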

The results are shown in Table 4. On the OVIS dataset, the image oracle and identity oracle experiments obtain 58.4 and 25.5 AP, respectively. This demonstrates that image-level prediction, which mainly involves per-frame object segmentation and classification, is more critical to the performance of occluded video instance segmentation. It can be expected that more advanced image-based techniques will be explored to approach this upper limit. Interestingly, both oracle experiments achieve lower performance on OVIS than on YouTube-VIS, which shows that OVIS is more challenging than YouTube-VIS for both image-level prediction and cross-frame association. Moreover, in the identity oracle experiments, the AP on YouTube-VIS barely improves (only 4% over the MaskTrack R-CNN baseline), while the AP on OVIS improves greatly (121% over MaskTrack R-CNN), which demonstrates that the tracking task on OVIS is much more difficult than that on YouTube-VIS.

Effect of Leveraging Image Datasets Owing to the high cost of exhaustively annotating high-quality video segmentation masks, a shortage of videos is a common problem among existing video segmentation datasets. The lack of diversity in video scenes may limit the generalization capability of models trained on those datasets. We therefore further train several models with both the video data in OVIS and additional augmented image sequences/pairs synthesized from large-scale image instance segmentation datasets. In our experiments, the proportions of video data and augmented image data are 65% and 35%, respectively. The pseudo image sequences are generated from the COCO (Lin et al., 2014) dataset by on-the-fly random perspective and affine transformations. The evaluation results are shown in Table 5. By leveraging the augmented image sequences, all three baseline methods achieve remarkable AP improvements, which can serve as a reference for future research.
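The on-the-fly synthesis can be sketched as follows (a simplified illustration: the transform parameter ranges are assumptions, and a real pipeline warps the full image and mask, not just polygon vertices):

```python
import math
import random

def random_affine(max_rot_deg=10, max_scale=0.1, max_shift=20):
    """Sample a small random 2x3 affine matrix (rotation, scale, translation)."""
    a = math.radians(random.uniform(-max_rot_deg, max_rot_deg))
    s = 1.0 + random.uniform(-max_scale, max_scale)
    tx = random.uniform(-max_shift, max_shift)
    ty = random.uniform(-max_shift, max_shift)
    return [[s * math.cos(a), -s * math.sin(a), tx],
            [s * math.sin(a),  s * math.cos(a), ty]]

def warp_points(points, m):
    """Apply a 2x3 affine matrix to a list of (x, y) points (e.g. a mask polygon)."""
    return [(m[0][0] * x + m[0][1] * y + m[0][2],
             m[1][0] * x + m[1][1] * y + m[1][2]) for x, y in points]

# A COCO image plus its warped copy forms one pseudo frame pair:
polygon = [(100, 100), (200, 100), (200, 200), (100, 200)]
frame2_polygon = warp_points(polygon, random_affine())
```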

Analysis of NMS Threshold Non-Maximum Suppression (NMS) is a necessary post-processing step for most detection methods.

To test the impact of the NMS threshold on occlusion handling, inspired by Liu et al. (2019a), we design the adaptive NMS oracle experiment. Specifically, for each ground-truth bounding box, we calculate the maximum IoU d between it and all other ground-truth boxes. The NMS threshold of all predicted boxes that correspond to this ground-truth box is then set to \(\max (d, 0.5)\). In this way, a larger NMS threshold is applied to predictions in dense scenes, which prevents NMS from removing true positives that are close to other ground-truth boxes.
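The per-box threshold assignment can be sketched as follows (a minimal illustration; the function names are ours):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def adaptive_nms_thresholds(gt_boxes, base=0.5):
    """Per-ground-truth NMS threshold: max(d, base), where d is the maximum
    IoU between this box and all other ground-truth boxes."""
    thresholds = []
    for i, g in enumerate(gt_boxes):
        d = max((iou(g, h) for j, h in enumerate(gt_boxes) if j != i), default=0.0)
        thresholds.append(max(d, base))
    return thresholds
```

Predictions matched to an isolated ground-truth box keep the default threshold of 0.5, while those matched to a heavily overlapped box get a higher one.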

As presented in Table 6, based on MaskTrack R-CNN, the adaptive NMS oracle improves \(\text {AP}_{{\textit{MO}}}\) and \(\text {AP}_{{\textit{HO}}}\) by 0.2, indicating that adaptively using a higher NMS threshold improves the performance in occluded scenes. However, even though the real density (Liu et al., 2019a) of boxes is known exactly in this oracle experiment, \(\text {AP}_{{\textit{SO}}}\) decreases from 23.0 to 22.8, and the overall AP only improves from 10.8 to 11.2, which shows that adjusting the NMS threshold is not a bottleneck on OVIS.

One interpretation is that adjusting the NMS threshold is more important for tasks that require detecting amodal bounding boxes (which additionally contain the occluded invisible parts), such as the full-body bounding boxes in crowded pedestrian detection datasets (Shao et al., 2018; Zhang et al., 2017). For two occluded objects, the IoU of their amodal bounding boxes is much higher than the IoU of boxes covering only the visible parts (like the boxes in OVIS). In addition, some learnable NMS methods (Hosang et al., 2017; Liu et al., 2019a) have been proposed, and many new methods (Carion et al., 2020; Fang et al., 2021) based on set prediction do not even need NMS post-processing. These new methods deserve further exploration on OVIS.

Table 7 Error analysis under different occlusion levels

Error Analysis To explore the detailed influence of occlusion levels on video instance segmentation, in this subsection we analyze the frame-level error rates of classification, segmentation, and tracking under different occlusion levels. A segmentation error means that the IoU between the predicted mask of an object and its ground-truth mask is less than 0.5, and tracking errors are reflected by the ID switch rate.

Formally, we denote the predicted masks and labels in all frames as \(M=\{m_1,m_2,\ldots ,m_n\}\) and \(Y=\{y_1,y_2,\ldots ,y_n\}\), respectively, where n is the number of predictions, and the corresponding matched ground-truth masks and labels as \(M^*=\{m^*_1,m^*_2,\ldots ,m^*_n\}\) and \(Y^*=\{y^*_1,y^*_2,\ldots ,y^*_n\}\), respectively.

For the classification error rate, we consider the predicted objects whose IoU with their matched ground-truth is greater than 0.5, and count the proportion of classification errors among them:

$$\begin{aligned} \text {E}_{cls}=\frac{|\{m_j |\text {IoU}(m_j,m^*_j)>0.5 \wedge y_j\not = y^*_j \}|}{|\{m_i | \text {IoU}(m_i,m^*_i)>0.5\}|}. \end{aligned}$$

For the segmentation error rate, following Bolya et al. (2020), we consider masks whose IoU with their matched ground-truth is greater than 0.1. A mask \(m_i\) is counted as a segmentation error if its IoU with the corresponding ground-truth \(m^*_i\) is less than 0.5. The segmentation error rate is then calculated as

$$\begin{aligned} \text {E}_{seg}=\frac{|\{m_j |0.1<\text {IoU}(m_j,m^*_j)<0.5\}|}{|\{m_i | \text {IoU}(m_i,m^*_i)>0.1\}|}. \end{aligned}$$

The ID switch rate is the ratio of ID switches over the tracking sequences of all instances. Following Voigtlaender et al. (2019b), the predicted ID of a ground-truth instance in a frame is defined as the tracking ID of the closest predicted mask. If the ID of a ground-truth instance differs from that of its latest tracked predecessor, it is counted as an ID switch.
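The three error rates above can be sketched in code as follows (a simplified illustration over pre-matched predictions; the data layout and function names are ours):

```python
def classification_error_rate(preds):
    """preds: list of (iou, pred_label, gt_label) for matched predictions.
    E_cls = fraction of well-localized predictions (IoU > 0.5) with a wrong label."""
    good = [(p, g) for iou, p, g in preds if iou > 0.5]
    return sum(1 for p, g in good if p != g) / len(good)

def segmentation_error_rate(preds):
    """E_seg = fraction of matched masks (IoU > 0.1) whose IoU is below 0.5."""
    matched = [iou for iou, _, _ in preds if iou > 0.1]
    return sum(1 for iou in matched if iou < 0.5) / len(matched)

def id_switch_rate(track_ids):
    """track_ids: per-frame predicted tracking IDs of one ground-truth instance
    (None where it is unmatched). A switch is counted whenever the ID differs
    from that of the latest tracked predecessor."""
    switches, total, last = 0, 0, None
    for tid in track_ids:
        if tid is None:
            continue
        total += 1
        if last is not None and tid != last:
            switches += 1
        last = tid
    return switches / total if total else 0.0
```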

Based on the error rates defined above, we evaluate MaskTrack R-CNN and CMaskTrack R-CNN. In addition, we define a baseline named “MaskTrack R-CNN + LSS + DCN” by applying the local sampling strategy (LSS) and one deformable convolution (DCN) layer to the query frame. By comparing “MaskTrack R-CNN + LSS + DCN” with our method, we can isolate the performance gain brought purely by temporal feature calibration.

As shown in Table 7, all three types of error rates increase significantly as the occlusion level increases. Among them, the segmentation error rate increases the most, from 12.1% to 34.1% for MaskTrack R-CNN, which demonstrates that severe occlusion greatly increases the difficulty of segmentation. In this sense, accurately localizing objects helps mitigate the impact of occlusion. Meanwhile, among the three error types, the classification error rate is much higher than the segmentation and tracking error rates, so better classification is important for improving the overall performance.

One could also observe that: (1) in terms of classification error rate, segmentation error rate, and ID switch rate alike, the gain of our method over “MaskTrack R-CNN + LSS + DCN” grows as the occlusion level increases (e.g., CMaskTrack R-CNN decreases the classification error rate by 2.7%, 3.9%, and 6.6%, respectively); (2) in terms of segmentation error rate and ID switch rate, the gain of “MaskTrack R-CNN + LSS + DCN” over the “MaskTrack R-CNN” baseline does not change much as the occlusion level increases (e.g., it decreases the segmentation error rate by 6.1%, 6.4%, and 6.5%, respectively); (3) in terms of classification error rate, the gain of “MaskTrack R-CNN + LSS + DCN” over the “MaskTrack R-CNN” baseline even decreases as the occlusion level increases (no occlusion: 11.1%, slight occlusion: 8.1%, severe occlusion: 4.3%).

Comparing Observations (1), (2), and (3), one can conclude that the TFC module brings larger improvements in occluded scenes than other training strategies (e.g., the local sampling strategy) or model structures (e.g., applying deformable convolution to the query frame). The same conclusion holds if we compare the relative error reduction rates.

Effect of Better Feature Representations To test the effect of better feature representations on occlusion handling, we further try the Swin-T (Liu et al., 2019b) and ResNeXt-101 (Xie et al., 2017) backbones with MaskTrack R-CNN and QueryVIS. As shown in Table 8, both Swin-T and ResNeXt-101 achieve substantial improvements (about 4 AP) on OVIS. These larger backbones also bring clear AP improvements at all occlusion levels.

Table 8 Effect of larger backbones

Effect of Larger Input Resolutions We replace the \(640\times 360\) input resolution with \(1280\times 720\), which is similar to the input resolution commonly used for COCO (Lin et al., 2014). As shown in Table 9, when the input resolution increases, the performance improves only slightly (0.5 AP for MaskTrack R-CNN Yang et al. 2019 and 0.3 AP for SipMask Cao et al. 2020).

Table 9 Effect of larger input resolutions
Table 10 Effect of three existing occlusion handling methods that are specifically designed for image-level detection tasks

Methods Specifically Designed for Occlusion We also migrate three image-level detection methods to the CMaskTrack R-CNN model: (1) the repulsion loss (Wang et al., 2018b), which pushes predicted boxes away from other ground-truth boxes; (2) the compact loss (Zhang et al., 2018), which enforces proposals to locate closely and compactly around the corresponding ground-truth; and (3) the occluder branch (Ke et al., 2021) (without extra designs such as the non-local operation (Wang et al., 2018a) and boundary prediction), which learns the features of occluders with an additional branch and then fuses the features of occluders and occludees. The repulsion loss and compact loss are specifically designed for crowded pedestrian detection, while the occluder branch targets the occlusion problem of common objects.

As shown in Table 10, the compact loss and occluder branch improve \(\text {AP}_{{\textit{HO}}}\) by 0.4 and 0.3, respectively, while their overall AP improvements are marginal. We believe more gains can be achieved by developing more sophisticated occlusion handling algorithms and by leveraging occluded data (see Sect. 5 for a discussion of future work).

Per-class Results The per-class AP scores of CMaskTrack R-CNN are shown in Fig. 8. It shows that the Top-5 challenging categories are Bicycle, Turtle, Motorcycle, Giraffe, and Bird. The confusion matrix is also given in Fig. 9. As it shows, most categories can be correctly classified except for some visually similar category pairs (e.g., Poultry and Bird, Bicycle and Motorcycle).

Fig. 8
figure 8

Per-class AP of CMaskTrack R-CNN on OVIS

Fig. 9
figure 9

Confusion matrix for classification

Table 11 Effect of the local sampling strategy on the OVIS validation set
Table 12 Effect of the local sampling strategy and the comparison of different feature fusion methods

Ablation Study of the TFC Module To verify the design of the TFC module, we first test the effect of the local sampling strategy for reference frames during training. As shown in Table 11, by sampling the reference frames locally within \(\varepsilon _\mathrm{\text {train}}=5\) frames instead of from the whole video, MaskTrack R-CNN, SipMask, and QueryVIS obtain significant AP improvements of 2.7, 2.6, and 1.7, respectively, which demonstrates that local sampling of reference frames during training is necessary and beneficial for learning to track objects in the long videos of OVIS.
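The local sampling strategy can be sketched as follows (a minimal illustration; the function name and signature are ours):

```python
import random

def sample_reference_frame(query_idx, num_frames, eps_train=5):
    """Local sampling strategy: draw the reference frame uniformly from the
    frames within eps_train of the query frame (excluding the query itself)."""
    lo = max(0, query_idx - eps_train)
    hi = min(num_frames - 1, query_idx + eps_train)
    candidates = [i for i in range(lo, hi + 1) if i != query_idx]
    return random.choice(candidates)
```

With global sampling the reference frame could land tens of seconds away, where object appearance has changed drastically; restricting the range keeps the tracking signal learnable.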

We further study the temporal feature calibration module against a few alternatives. The first option is a naive combination that sums up the features of the query frame and the reference frame without any feature alignment. The second option replaces the correlation operation in our module with the element-wise difference between feature maps, similar to the operation used in Bertasius and Torresani (2020). In Table 12, we denote these two options as “\(+\) Uncalibrated Addition” and “\(+\) \(\text {Calibration}_{\text {diff}}\)”, respectively, and our module as “\(+\) \(\text {Calibration}_{\text {corr}}\)”.

As we can see, with the enhanced MaskTrack R-CNN (trained with the local sampling strategy for reference frames) as the base model, the naive “\(+\) Uncalibrated Addition” combination even degrades the final AP, because directly adding uncalibrated features from other frames may introduce noise into the object localization process. In contrast, applying feature calibration improves the performance: “\(+\) \(\text {Calibration}_{\text {corr}}\)” achieves an AP of 15.4, an improvement of 1.9 over the baseline without feature fusion and 1.0 over “\(+\) \(\text {Calibration}_{\text {diff}}\)”. We argue that the correlation operation provides a richer context for feature calibration because it calculates the similarity between each query position and its neighboring positions. Tested on a P100 GPU, CMaskTrack R-CNN runs at 16 and 7 fps with \(\text {Calibration}_{\text {diff}}\) and \(\text {Calibration}_{\text {corr}}\), respectively.
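For intuition, a simplified pure-Python version of the correlation operation might look like the following (the real module operates on convolutional feature maps and feeds the resulting cost volume into deformable convolution, which we omit; all names here are ours):

```python
def local_correlation(query, reference, k=1):
    """Correlation between the query feature at each position and the reference
    features in its (2k+1) x (2k+1) neighborhood. `query` and `reference` are
    H x W grids of feature vectors (lists of floats); out-of-range neighbors
    contribute zero. Returns an H x W grid of (2k+1)^2 similarity scores."""
    h, w = len(query), len(query[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            scores = []
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        scores.append(sum(a * b for a, b in
                                          zip(query[y][x], reference[ny][nx])))
                    else:
                        scores.append(0.0)
            row.append(scores)
        out.append(row)
    return out
```

Unlike an element-wise difference, which yields one residual vector per position, the correlation yields a score for every neighboring offset, giving the calibration step explicit information about where the matching content moved.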

We also conduct experiments to analyze the influence of the reference frame range \(\varepsilon _{test}\), where \(\varepsilon _{test}=0\) means applying the deformable convolutional layer to the query frame itself. As shown in Fig. 10, the AP increases with \(\varepsilon _{test}\), reaching its highest value at \(\varepsilon _{test}=5\). Even with \(\varepsilon _{test}=1\), the performance exceeds that of \(\varepsilon _{test}=0\), which demonstrates that calibrating features from adjacent frames is beneficial to video instance segmentation.

Fig. 10
figure 10

Results of different reference frame range \(\varepsilon _{test}\) on the OVIS validation set. Notably, \(\varepsilon _{test}=0\) indicates applying the deformable convolutional layer to the query frame itself, without leveraging adjacent frames

To further compare the improvement of TFC across occlusion levels, we evaluate the relative AP gain at each occlusion level for a fair comparison. As shown in Fig. 11, we report the relative gain while varying \(\varepsilon _{test}\); the larger \(\varepsilon _{test}\) is, the more temporal context is aggregated. The relative gain of \(\text {AP}_{{\textit{HO}}}\) is much higher than that of \(\text {AP}_{{\textit{MO}}}\), and the relative gain of \(\text {AP}_{{\textit{SO}}}\) is the smallest once temporal context is considered (\(\varepsilon _{test}>0\)). This result demonstrates the effectiveness of temporal feature aggregation for occlusion handling.

Fig. 11
figure 11

Relative gain of different occlusion levels with increasing reference frame range \(\varepsilon _{test}\)

5 Future Directions

In the future, many interesting issues can still be studied and many difficulties remain to be addressed with OVIS, such as:

Occlusion-aware Models Effectively handling occlusions is one of the most direct ways to improve performance on OVIS. In terms of occlusion-aware models, there are a few directions worth exploiting in future work. For example, compositional models (Kortylewski et al., 2020a, b, 2021) might be a good choice as they are robust to partial occlusion. It is also interesting to test whether completing the invisible parts of occluded objects (a.k.a. de-occlusion, Zhan et al. 2020) is useful in this scenario.

Occluded Data Generation Due to the high cost of annotation, video instance segmentation datasets are relatively smaller than image datasets. Some works (DeVries & Taylor, 2017; Yun et al., 2019; Dwibedi et al., 2017; Ghiasi et al., 2021) augment common datasets (e.g., COCO, Lin et al. 2014) with partial occlusions, and others (Nikolenko, 2019; Kar et al., 2019; Devaranjan et al., 2020) synthesize structured amodal data of occluded scenes using simulators. It can be anticipated that utilizing such data with proper training paradigms will improve performance on VIS.

Learning from Occlusion Annotations In OVIS, a coarse annotation of the occlusion level (no occlusion, slight occlusion, or severe occlusion) is given for each object. As prior knowledge that can be accessed during training, such annotations deserve special attention, and learning paradigms that can exploit this information are worth studying.

Large Scale Model Pre-Training According to our experiments, joint training with image datasets improves the performance. With the development of self-supervised learning (He et al., 2021), exploiting unlimited amounts of unlabeled data for model pre-training and then transferring the pre-trained model to OVIS could largely enhance the discriminative power of frame embeddings.

Dataset Versatility Finally, we are also interested in formalizing an experimental track of OVIS for video object segmentation, in unsupervised, semi-supervised, and interactive settings. It is also of great importance to extend OVIS to video panoptic segmentation (Kim et al., 2020). We believe the OVIS dataset will trigger more research on understanding videos in complex and diverse scenes.

6 Conclusions

In this work, we target video instance segmentation in occluded scenes and contribute a large-scale dataset called OVIS. OVIS consists of 296k high-quality instance masks covering 5223 heavily occluded instances. While being only the second benchmark dataset for this task after YouTube-VIS, OVIS is designed to examine the ability of current video understanding systems to handle object occlusions. A general conclusion is that baseline performance on OVIS is far below that on YouTube-VIS, which suggests that more effort should be devoted in the future to tackling object occlusion or de-occluding objects (Zhan et al., 2020). We also explore ways of leveraging temporal context cues to alleviate occlusion and conduct a comprehensive analysis of occlusion handling on OVIS.