Object removal from complex videos using a few annotations

We present a system for the removal of objects from videos. As input, the system only needs a user to draw a few strokes on the first frame, roughly delimiting the objects to be removed. To the best of our knowledge, this is the first system allowing the semi-automatic removal of objects from videos with complex backgrounds. The key steps of our system are the following: after initialization, segmentation masks are first refined and then automatically propagated through the video. Missing regions are then synthesized using video inpainting techniques. Our system can deal with multiple, possibly crossing objects, with complex motions, and with dynamic textures. This results in a computational tool that can alleviate tedious manual operations for editing high-quality videos.


Introduction
In this paper, we propose a system to remove one or several objects from a video, starting with only a few user annotations. More precisely, the user only needs to approximately delimit, in the first frame, the objects to be edited. Then, these annotations are refined and propagated through the video. One or several objects can then be removed automatically. This results in a flexible computational video editing tool, with numerous potential applications. Removing unwanted objects (such as a boom microphone) or people (such as an unwanted wanderer) is a common task in video post-production. Such tasks are critical given the time constraints of movie production and the prohibitive costs of reshooting complex scenes. They are usually achieved through extremely tedious and time-consuming frame-by-frame processes, for instance using the Rotobrush tool from Adobe After Effects [2] or professional visual effects software such as SilhouetteFX or Mocha. More generally, the proposed system paves the way to sophisticated movie editing tasks, ranging from crowd suppression to unphysical scene modifications, and has potential applications for multi-layered video editing.
Two main challenges arise in developing such a system. First, no part of the objects to be edited may be left over by the tracking part of the algorithm; otherwise, the remaining fragments are propagated and enlarged by the completion step, resulting in unpleasant artifacts. Second, the human visual system is very good at spotting temporal discontinuities and aberrations, which makes the completion step a difficult one. We address both of these issues in this work.
The first step of our system consists of transforming a rough user annotation into a mask that accurately represents the object to be edited. For this, we use a classical strategy relying on a CNN-based edge detector, followed by a watershed transform yielding superpixels, which are eventually selected by the user to refine the segmentation mask. After this step, a label is given to each object. The second step is the temporal propagation of the labels. There we make use of state-of-the-art advances in CNN-based multiple object segmentation. Besides, our approach includes an original and crucial algorithmic brick which consists in learning the transition zones between objects and the background, in such a way that the objects are fully covered by the propagated masks. We call the resulting brick a smart dilation, by analogy with the dilation operators of mathematical morphology. Our last step is then to remove some or all of the objects from the video, depending on the user's choice. For this, we employ two strategies: a motion-based pixel propagation for static backgrounds and a patch-based video completion for dynamic textures. Both methods rely heavily on the knowledge of the segmented objects. This interplay between object segmentation and the completion scheme improves the method in many ways: it allows for better video stabilization, for a faster and more accurate search for similar patches, and for a more accurate foreground/background separation. These improvements yield completion results with very little or no temporal incoherence.
We illustrate the effectiveness of our system through several challenging cases including severe camera shake, complex and fast object motions, crossing objects, and dynamic textures. We evaluate our method on various datasets, both for object segmentation and for object removal. Moreover, we show on several examples that our system yields comparable or better results than state-of-the-art video completion methods applied to manually segmented masks. This paper is organized as follows: first, we briefly review related works (Section 2). Next, we introduce our proposed approach, which includes three steps: first frame annotation, object segmentation, and object removal (Section 3). Finally, we show experimental results as well as evaluations and comparisons with other state-of-the-art methods. A shorter version of this work can be found in [40].

Related works
The proposed computational editing approach is related to several families of works that we now briefly review.

Video object segmentation
Video object segmentation, the process of extracting space-time segments corresponding to objects, is a widely studied topic whose complete review is beyond the scope of this paper. For a long time, such methods were not accurate enough to avoid using green-screen compositing to extract objects from videos. Significant progress on supervised segmentation was achieved by the end of the 2000s, see e.g. [2], and in particular the use of supervoxels became the most flexible way to incorporate user annotations in the segmentation process [44,78]. Other efficient approaches to the supervised object segmentation problem are introduced in [49,53].
A real breakthrough occurred with approaches relying on Convolutional Neural Networks (CNNs). In the DAVIS-2016 challenge [63], the most efficient methods were all CNN-based, both for the unsupervised and the semi-supervised tasks. For the semi-supervised task, where a first frame annotation is available, methods mostly differ in the way they train the networks. The One Shot Video Object Segmentation (OSVOS) method, introduced in [8], starts from a pre-trained network and retrains it using a large video dataset, before fine-tuning it per-video using the annotation of the first frame to focus on the object being segmented. With a similar approach, [62] relies on an additional mask layer to guide the network. The method in [7] further improves the results of OSVOS with the help of Multi Networks Cascade (MNC) [20].
All these approaches work on a per-frame basis without explicitly enforcing temporal coherence, and can therefore deal with large displacements and occlusions.
However, since their backbone is a network used for semantic segmentation, they cannot distinguish between instances of the same class or between objects that resemble each other.
Another family of works deals with the segmentation of multiple objects. Compared with the single object segmentation problem, an additional difficulty here is to distinguish between different object instances, which may have similar colors and may cross each other. Classical approaches include graph-based segmentation using color or motion information [42,58,87], the tracking of segmentation proposals [17,45], or bounding box guided segmentation [22,71].
The DAVIS 2017 challenge [66] established a ranking between methods aiming at the semi-supervised segmentation of multiple objects. Again, the most efficient methods were CNN-based. It is proposed in [77] to modify the OSVOS network [8] to work with multiple labels and to perform online fine-tuning to boost performance. In [37], the networks introduced in [74] are adapted to the purpose of multiple object segmentation through the heavy use of data augmentation, still using the annotation of the first frame. The authors of this work also exploit motion information by adding optical flow information to the network. This method is further improved in [46] by using a deeper architecture and a re-identification module to avoid propagating errors. This last method achieved the best performance in the DAVIS-2017 challenge [66]. With a different approach, Hu et al. [31] employ a recurrent network exploiting long-term temporal information.
Recently, with the release of a large-scale video object segmentation dataset called YouTube Video Object Segmentation (YouTube-VOS) [83], many further improvements have been made in the field. Among them, one of the most notable works is PreMVOS [48], which won the 2018 DAVIS challenge [9] and the YouTube-VOS challenge [83].
In PreMVOS, the algorithm first generates a set of accurate segmentation mask proposals for all objects in each frame of a video. To achieve this, a variant of the Mask R-CNN [29] object detector is used to generate coarse object proposals; then a fully convolutional refinement network, inspired by [82] and based on the DeepLabv3+ [11] architecture, produces accurate pixel masks for each proposal. Second, these proposals are selected and merged into accurate and temporally consistent pixel-wise object tracks over the video sequence. In contrast with PreMVOS, which focuses on accuracy, some methods trade off accuracy for speed. These methods take the first frame with its mask annotation either as guidance to slightly adjust the parameters of the segmentation model [85] or as a reference for segmenting the following frames without tuning the segmentation model [12,13,57].
Although these methods yield impressive results in terms of segmentation accuracy, they may not be the optimal solutions for the problem we consider in this paper. As said above, when removing objects from a video, it is crucial for the video completion step that no part of the removed objects remains after segmentation. Said differently, we are in a context where recall is much more important than precision; see Section 4.2 for the definitions of these metrics. In the experimental section, we compare our segmentation approach to several state-of-the-art methods with the aim of optimizing a criterion which penalizes under-detection of objects.

Video editing
Recently, advances in both the analysis and the processing of videos have fueled the emerging field of computational video editing. Examples include, among others, tools for the automatic, dialogue-driven selection of scenes [41], time slice video synthesis [19], or methods for the separate editing of reflectance and illumination components [6]. It is proposed in [89] to accurately identify the background in videos to either improve the stabilization process or proceed to tasks such as background suppression or multi-layered editing. In a sense, our work is more challenging since we need to identify moving objects with enough accuracy so that they can be removed seamlessly.
Because we learn a transition zone between objects and the background, our work is also related to image matting techniques [43], and their extension to videos [16] as a necessary first step for editing and compositing tasks.Lastly, since we deal with semantic segmentation and multiple objects, our work is also related to the soft semantic segmentation recently introduced for still images [1].

Video inpainting
Image inpainting, also called image completion, refers to the task of reconstructing missing or damaged image regions by taking advantage of the image contents outside these missing regions.
The first approaches were variational [50] or PDE-based [4], and dedicated to the preservation of geometry. They were followed by patch-based methods [18,23], inherited from texture synthesis methods [24]. Some of these methods have been adapted to videos, often by mixing pixel-based approaches for reconstructing the background and greedy patch-based strategies for moving objects [60,61]. In the same vein, different methods have been proposed to improve or speed up the reconstruction of the background [26,30], with the strong limitation that the background should be static. Other methods yield excellent results in restricted cases, such as the reconstruction of cyclic motions [36].
Another family of works, which performs very well when the background is static, relies on motion-based pixel propagation. The idea is to first infer a motion field outside and inside the missing regions. Using the completed motion field, pixel values from outside the missing region are then propagated inside it. For example, Grossauer et al. describe in [28] a method for removing blotches and scratches in old movies using optical flow. A limitation of this work is that the estimation of the optical flow suffers from the presence of the scratches. Using a similar idea, but avoiding computing the optical flow directly in the missing regions, several methods try to restore the motion field inside these missing regions by gradually propagating motion vectors [51], by sampling spatio-temporal motion patches [72,73], or by interpolating the missing motion [5,88].
In parallel, it was proposed in [79] to address the video inpainting problem as a global patch-based optimization problem, yielding unprecedented time coherence at the expense of very heavy computational costs. The method in [54] was developed from this seminal contribution, by accelerating the process and taking care of dynamic texture reconstruction. Other state-of-the-art strategies rely on a global optimization procedure, taking advantage of either shift-maps [27] or an explicit flow field [32]. This last method arguably has the best results in terms of temporal coherence, but since it relies on two-dimensional patches, it is not suitable for the reconstruction of dynamic backgrounds. Recently, it was proposed in [39] to improve the global strategy of [54] by incorporating the optical flow in a systematic way. This approach has the ability to reconstruct complex motions as well as dynamic textures.
Let us add that the most recent approaches to image inpainting rely on convolutional neural networks and have the ability to infer elements that are not present in the image at hand [33,59,76].To the best of our knowledge, such approaches have not been adapted to videos because their training cost is prohibitive.
In this work, we will propose two complementary ways to perform the inpainting step needed to remove objects in videos. A first method is fast and relies on a frame-by-frame completion of the optical flow, followed by the propagation of voxel values. This approach is inspired by the recently introduced method [5], itself sharing ideas with the approach from [32] and yielding impressive gains in terms of computational times. Such approaches are computationally efficient but not able to deal with moving backgrounds and dynamic textures. For these complex cases, we rely on a more sophisticated (and much slower) second approach extending the ideas we initially developed in [39].

Proposed method
The general steps of our method are as follows: (a) First, the user draws a rough outline of each object of interest in one or several frames, for instance in the first one (Section 3.1); (b) These approximate outlines are refined by the system, then propagated to all remaining frames using different labels for different objects (Section 3.2); (c) If some errors are detected, the user may manually correct them in one or several frames (using step (a)) and propagate these edits to the other frames (using step (b)); (d) Finally, the user selects which of the selected objects he/she wants to remove, and the system removes the corresponding regions in the whole video, reconstructing the missing parts in a plausible way (Sections 3.3.1 and 3.3.2). For this last step, two options are available: a fast one for static backgrounds and a more involved one for dynamic backgrounds.
In the first step, most methods only select the object to be removed. There are, however, several advantages to tracking multiple objects with different labels:
1. It gives more freedom to the user for the inpainting step, with the possibility to produce various results depending on which objects are removed; in addition, objects which are labeled but not removed are considered as important by the system and are therefore better preserved during the inpainting of other objects.
2. It may produce better segmentation results than tracking a single object, in particular when several objects have similar appearance.
3. It facilitates video stabilization and therefore increases the temporal coherence during the inpainting step, as shown in the results (Section 4.3).
4. It is of interest for other applications, e.g., action recognition or scene analysis.
The illustration of these steps can be found on the supplementary website https://object-removal.telecom-paristech.fr/

First frame annotation
A classical way to cut out an object in a frame involves commercial tools such as the Magic Wand of Adobe Photoshop, which is fast and convenient. However, this classical method requires many refinement steps and is not accurate for complex objects. To increase the precision and reduce the user's intervention, many methods have been proposed where interactive image segmentation is performed using scribbles, point clicks, superpixels, etc. Among them, some state-of-the-art annotators achieve a high degree of precision by using edge detectors to find the contour map and create a set of object proposals from this map [35]; the appropriate regions are then selected by the user using point clicks. The main drawbacks of these approaches are a large computation time and a weak level of user input. In order to balance human effort and accuracy, we adopt a fast and simple algorithm. Our system first generates a set of superpixels from the first image, then the user can select suitable superpixels by simply drawing a coarse contour around each object. The set of superpixels is created using an edge-based approach. More precisely, the FCN-based edge detector network introduced in [80] is applied to the first image, and its output is a probability map of edges. Superpixels are extracted from this map by the well-known watershed transform [52], which runs directly on the edge scores. There are two main advantages of using this CNN-based method to compute the edge map:
1. It has shown superior performance over traditional boundary detection methods that use local features such as colors and depths. In particular, it is much more accurate.

2. It is extremely fast: one forward pass of the network takes about 2 ms, hence the annotation step is performed in real time and is highly interactive.
After computing all superpixels, the user selects the suitable ones by drawing a contour around each target object to get rough masks. Superpixels which overlap these masks by more than 80 percent are selected. The user can also refine the mask by adding or removing superpixels using mouse clicks. As a result, accurate masks for all objects of interest are extracted in a frame within a few seconds of interactive annotation.
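The following sketch illustrates the superpixel generation and the 80%-overlap selection rule described above. It assumes the CNN edge probability map has already been computed; the marker threshold and function names are illustrative, not the exact implementation.

```python
import numpy as np
from skimage.measure import label
from skimage.segmentation import watershed

def select_superpixels(edge_prob, user_mask, overlap_thresh=0.8):
    """Refine a rough user-drawn mask by selecting watershed superpixels.

    edge_prob : (H, W) float array, edge probability map from the CNN detector.
    user_mask : (H, W) bool array, rough region enclosed by the user's contour.
    """
    # Watershed on the edge scores: flat, edge-free zones act as seeds (illustrative threshold).
    markers = label(edge_prob < 0.05)
    superpixels = watershed(edge_prob, markers)   # one integer label per superpixel

    refined = np.zeros_like(user_mask, dtype=bool)
    for sp_id in np.unique(superpixels):
        sp = superpixels == sp_id
        overlap = np.logical_and(sp, user_mask).sum() / sp.sum()
        if overlap > overlap_thresh:              # keep superpixels mostly inside the stroke
            refined |= sp
    return refined
```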

Objects segmentation
In this step, we start from the object masks computed on the first frame using the method described in the previous section, and we aim at inferring a full space-time segmentation of each object of interest in the whole video.We want our segmentation to be as accurate as possible, in particular without false negatives.
Doing this in complex videos with several objects which occlude each other is an extremely challenging task.
As described in Section 2, CNNs have made important breakthroughs in semantic image segmentation, with extensions to video segmentation in the last two years [9,64,66]. However, current CNN-based semantic segmentation algorithms are still essentially image-based, and do not take global motion information sufficiently into account. As a consequence, semantic segmentation algorithms cannot deal with sequences where: (a) several instances of similar objects need to be distinguished; and (b) these objects may eventually cross each other. Examples of such sequences are Les Loulous, introduced in [54], or Museum and Granados-S3, introduced in [26,27].
On the other hand, more classical video tracking techniques like optical-flow based propagation or global graph-based optimization do take global motion information into account [84]. Nevertheless, they are most often based on bounding boxes or rough descriptors and do not provide a precise delineation of object contours. Two recent attempts to adapt video-tracking concepts to provide a precise multi-object segmentation [68,75] fail completely when objects cross each other, as in the Museum, Granados-S3 or Loulous sequences.
In the rest of this section, we describe a novel hybrid technique which combines the benefits of classical video tracking with those of CNN-based semantic segmentation. The structure of our hybrid technique is shown in Figure 1. CNN-based modules are depicted in green and red, and their inner structure is described in Section 3.2.1 and Figure 2. Modules that are inspired from video-tracking concepts are depicted in blue and are detailed in Section 3.2.2.
Note that the central part of Figure 1 operates on a frame-by-frame basis. Each segmentation proposal by the Multi-OSVOS network (in green), or by the Mask propagation module (in blue), is improved by the Refinement network (in red). In the right part of the figure, the Mask linking module (in blue) builds a graph that links all segmentation proposals from the previous steps, and makes a global decision on the optimal segmentation for each of the K objects to be tracked. Finally, the Keyframe extraction module is required to set sensible temporal limits to the Mask propagation iterations, and the final post-processing module further refines the result with the objective of maximizing the recall, which is much more important than precision in the case of video inpainting. All these modules are explained in more detail in the next sections.

Semantic segmentation networks
Our system uses two different semantic segmentation networks: a multi-OSVOS network and a refinement network.Both operate on a frame by frame basis.
Our implementation of multi-OSVOS computes K+1 masks for each frame: K masks for the K objects of interest and one novel additional mask covering the objects' boundaries. We call this latter mask a smart dilation layer; it is key to guaranteeing that the segmentation does not miss any part of the objects, which is especially difficult in the presence of motion blur.
While the multi-OSVOS network provides a first prediction, the refinement network takes mask predictions as an additional guidance input and improves those predictions based on image content, similarly to [62].
Training these networks is a challenging task, because the only labeled example we can rely on (for supervised training) is the first annotated frame and the corresponding K masks.The next paragraphs focus on our networks' architectures and on semi-supervised training techniques that we use to circumvent the training difficulty.
Multi-OSVOS network. The training technique of our semantic segmentation networks is mainly inspired from the OSVOS network [8], a breakthrough which achieved the best performance in the DAVIS-2016 challenge [63]. The OSVOS network uses a transfer learning technique for image segmentation: the network is first pre-trained on a large database of labeled images. After training, this so-called parent network can roughly separate all foreground objects from the background. Next, the parent network is fine-tuned using the first frame annotation (annotation mask and image) in order to improve the segmentation of a particular object of interest. OSVOS has proven to be a very fast and accurate semi-supervised method to obtain a background/foreground separation. Our Multi-OSVOS network uses a similar transfer learning technique, yet with several important differences:
• Our network can identify different objects separately (instead of producing a simple foreground/background segmentation) and provides a smart dilation mask, i.e. a smart border which covers the interfaces between segmented objects and the background and greatly reduces the number of false negative pixels. The ground truth for this smart dilation mask is defined in the fine-tuning step by a 7-pixel wide dilation of the union of all object masks.
• Unlike OSVOS, which uses a fully convolutional network (FCN) [47], our network uses the Deeplab v2 [10] architecture as the parent model, since it outperforms FCN on some common datasets such as PASCAL VOC 2012 [25].
• In the fine-tuning training step, we adopt a data augmentation technique in the spirit of Lucid Tracker [37]: we remove all objects from the first frame using Newson et al.'s image inpainting algorithm [55], then the removed objects undergo random geometric deformations (affine and thin plate deformations), and eventually they are Poisson blended [65] over the reconstructed background. This is a sensible way of generating large amounts of labeled training data with an appearance similar to what the network might observe in the following frames.
The smart dilation mask is of particular importance to ensure that segmentation masks do not miss any part of the object, which is typically difficult in the presence of motion blur. A typical example can be seen in Figure 3, where some parts of the man's hands and legs cannot be captured by simply dilating the output mask, because motion blur leads to partially transparent zones which are not recognized by the network as part of the man's body. With the smart dilation mask, the missing parts are properly captured, and there are no leftover pixels.
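A minimal sketch of how the smart dilation ground truth could be built from the per-object masks, following the dilation of the union described above; the structuring element and the choice of returning the full dilated union (rather than only the border band) are assumptions of this sketch.

```python
import numpy as np
import cv2

def smart_dilation_target(object_masks, size=7):
    """Ground truth for the (K+1)-th 'smart dilation' channel.

    object_masks : list of (H, W) binary arrays, one per object.
    Returns the union of all object masks dilated by a `size` x `size`
    structuring element, so that the object/background transition zone
    (e.g. motion blur) is covered.
    """
    union = np.zeros_like(object_masks[0], dtype=np.uint8)
    for m in object_masks:
        union |= (m > 0).astype(np.uint8)
    dilated = cv2.dilate(union, np.ones((size, size), np.uint8))
    # Alternative: keep only the transition band with `dilated & ~union`.
    return dilated
```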
Refinement network. The multi-OSVOS network can separate objects and background precisely, but it relies exclusively on how they appear in the annotated frame, without consideration of their position, shape or motion cues across frames. Therefore, when objects have similar appearance, multi-OSVOS fails to separate individual object instances. In order to take such cues into account, we propagate and compare the prediction of multi-OSVOS across frames using video tracking techniques (Section 3.2.2), and then we double-check and improve the result after each tracking step using the refinement network described below.
The refinement network has the same architecture as the multi-OSVOS network, except that (a) it takes an additional input, namely mask predictions for the K foreground objects from another method, and (b) it does not output the (K+1)-th smart dilation mask, which does not require any further improvement for our purposes.
Training is performed in exactly the same way as for multi-OSVOS, except that the training set has to be augmented with inaccurate input mask predictions. These should not be exactly the same as the output masks, otherwise the network would learn to perform a trivial operation ignoring the RGB information. Such inaccurate input mask predictions are created by applying relevant random degradations to ground truth masks, e.g. small translations, affine and thin-plate spline deformations, followed by a coarsening step (morphological contour smoothing and dilation) to remove details of the object contour; finally, some random tiny square blocks are added to simulate common errors in the output of multi-OSVOS. The ground truth output masks in the training dataset are also dilated by a structuring element of size 7 × 7 pixels in order to have a safety margin which ensures that the mask does not miss any part of the object.
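The following sketch shows one way such degraded input masks might be generated; all numeric parameters (rotation range, kernel size, block size) are illustrative and not the values used in the actual training.

```python
import numpy as np
import cv2

def degrade_mask(gt_mask, rng=np.random):
    """Simulate an inaccurate mask prediction from a ground-truth mask."""
    h, w = gt_mask.shape
    m = gt_mask.astype(np.uint8)

    # 1. Small random affine perturbation (translation / rotation / scale).
    angle = rng.uniform(-5, 5)
    tx, ty = rng.uniform(-0.02 * w, 0.02 * w), rng.uniform(-0.02 * h, 0.02 * h)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, rng.uniform(0.95, 1.05))
    M[:, 2] += (tx, ty)
    m = cv2.warpAffine(m, M, (w, h), flags=cv2.INTER_NEAREST)

    # 2. Coarsening: morphological smoothing and dilation remove contour details.
    k = np.ones((9, 9), np.uint8)
    m = cv2.morphologyEx(m, cv2.MORPH_CLOSE, k)
    m = cv2.dilate(m, k)

    # 3. Add a few tiny square blocks to mimic spurious multi-OSVOS responses.
    for _ in range(rng.randint(0, 4)):
        y, x = rng.randint(0, h - 8), rng.randint(0, w - 8)
        m[y:y + 8, x:x + 8] = 1
    return m
```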

Multiple object tracking
As a complement to CNN-based segmentation we use more classical video tracking techniques in order to take global motion and position information into account.
The simplest ingredient of our object tracking subsystem is a motion-based mask propagation technique that uses a patch-based similarity measure to propagate a known mask to the consecutive frames.
It corresponds to block (b) in Figure 1 and it will be described in more detail below.
This simple scheme alone can provide results similar to other object tracking methods such as SeamSeg [68] or ObjectFlow [75].In particular it is able to distinguish between different instances of similar objects, based on motion and position.
However, it loses track of the objects when they cross each other, and it accumulates errors. To prevent this from happening, we complement the mask propagation module with five coherence reinforcement steps:
Semantic segmentation: The refinement network (Section 3.2.1) is applied to the output of each mask propagation step in order to avoid errors accumulating from one frame to the next.
Keyframe extraction: Mask propagation is effective only when it propagates from frames where object masks are accurate (especially when objects do not cross each other). Frames where this is detected to be true are labeled as keyframes, and mask propagation is performed only between pairs of successive keyframes.
Mask linking: When the mask propagation step is not sure about which decision to make, it provides not one but several mask candidates for each object. A graph-based technique links together all these mask candidates. This way, the decision on which mask candidate is the best for a given object in a given frame is taken based on global motion and appearance information.
Post-processing: After mask linking, a series of post-processing steps are performed that use the original Multi-OSVOS result to expand the labelling to unlabelled regions.
Interactive correction: In some situations where errors appear, the user can manually correct them on one frame and this correction is propagated to the remaining frames by the propagation module.
The following paragraphs describe in detail the inner workings of the four main modules of our multiple object tracking subsystem.
Keyframe extraction. A frame t is a keyframe for an object i ∈ {1, . . ., K} if the mask of this particular object is known or can be computed with high accuracy.
All frames where the object masks were manually provided by the user are considered keyframes.This is usually the first frame or very few representative frames.
The remaining frames are considered keyframes for a particular object when the object is clearly isolated from other objects and the mask for this object can be computed easily. To quantify this criterion, we rely on the multi-OSVOS network, which returns K+1 masks O^i_t for each frame t and i ∈ {1, . . ., K+1}. This allows us to compute the global foreground mask F_t = O^1_t ∪ · · · ∪ O^K_t.
Mask propagation. Masks are propagated forwards and backwards between keyframes to ensure temporal coherence. More specifically, the forward propagation proceeds as follows: given the mask M_t at frame t, the propagated mask M_{t+1} is constructed with the help of a patch-based nearest neighbor shift map φ_t from frame t+1 to frame t, defined as
φ_t(p) := argmin_δ ∑_{q ∈ N_p} ‖u_{t+1}(q) − u_t(q + δ)‖²,
i.e. the shift δ that minimizes the squared Euclidean distance between the patch centered at pixel p in frame t+1 and the patch around p+δ in frame t. In this expression, N_p denotes a square neighborhood of given size centered at p, and D_t(p) is the associated patch in frame t, i.e. D_t(p) = u_t(N_p), with u_t the RGB image corresponding to frame t. The ℓ2-metric between patches is denoted by d. To improve robustness and speed, this shift map is often computed using an approximate nearest neighbor search algorithm such as Coherency Sensitive Hashing (CSH) [38] or FeatureMatch [67]. To capture the connectivity of patches across frames in the video, two additional terms are used in [68] for space and time consistency: the first penalizes the absolute shift and the second penalizes neighbourhood incoherence, ensuring that adjacent patches flow coherently. Moreover, to reduce the patch space dimension and to speed up the search, all patches are represented by lower-dimensional features, e.g. their main components in the Walsh-Hadamard space; see [68] for more details. We use this model to compute our shift map.
Once the shift map has been computed, we propagate the mask as follows. Let u_t(p) be the RGB value of pixel p in frame t; the similarity between a patch D_{t+1}(p) in frame t+1 and its nearest neighbour D_t(p + φ_t(p)) in frame t is measured from the patch distance d(D_{t+1}(p), D_t(p + φ_t(p))). Using this similarity measure, the mask M_{t+1} is propagated from M_t. The final propagated mask M_{t+1} is obtained by a series of morphological operations, including opening and hole filling, applied to the raw propagated mask, followed by the refinement network to correct remaining errors.
Then M_{t+1} is iteratively propagated to the next frame t+2 using the same procedure, until we reach the next keyframe.
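A brute-force sketch of the shift-map computation and label transfer defined above is given below. In practice, the search is performed with an approximate nearest-neighbour algorithm (CSH or FeatureMatch) on Walsh-Hadamard features with the spatio-temporal consistency terms of [68], and the result is cleaned by morphology and the refinement network; the patch size and search radius here are illustrative only.

```python
import numpy as np

def propagate_mask(u_t, u_t1, mask_t, half_patch=2, radius=8):
    """Propagate mask_t (frame t) to frame t+1 with a brute-force shift map.

    u_t, u_t1 : (H, W, 3) float arrays, frames t and t+1.
    mask_t    : (H, W) bool array, object mask at frame t.
    """
    H, W, _ = u_t.shape
    mask_t1 = np.zeros((H, W), dtype=bool)
    for y in range(half_patch, H - half_patch):
        for x in range(half_patch, W - half_patch):
            patch1 = u_t1[y - half_patch:y + half_patch + 1, x - half_patch:x + half_patch + 1]
            best, best_shift = np.inf, (0, 0)
            # Exhaustive search for the shift minimizing the squared patch distance (phi_t(p)).
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if yy < half_patch or yy >= H - half_patch or xx < half_patch or xx >= W - half_patch:
                        continue
                    patch0 = u_t[yy - half_patch:yy + half_patch + 1, xx - half_patch:xx + half_patch + 1]
                    d = np.sum((patch1 - patch0) ** 2)
                    if d < best:
                        best, best_shift = d, (dy, dx)
            # The pixel inherits the label of its nearest-neighbour patch centre in frame t.
            mask_t1[y, x] = mask_t[y + best_shift[0], x + best_shift[1]]
    return mask_t1
```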
Although this mask propagation approach is useful, several artifacts may occur when objects cross each other: the propagation algorithm may lose track of an occluded object or it could mistake one object for the other.
To avoid such errors, mask propagation is performed in both the forward and backward directions between keyframes. This gives, for each object, two candidate masks at each frame t: M^1_t = M^FW_t, the one that has been forward-propagated from a previous keyframe t′ < t, and M^2_t = M^BW_t, the one that has been backward-propagated from an upcoming keyframe t″ > t. In order to circumvent both lost and mistaken objects, we consider for each object two additional candidate masks. The decision between these four mask candidates for each frame and each object is deferred to the next step, which makes that decision based on a global optimization.
Mask linking. After the backward and forward propagation, each object has 4 mask proposals (except at keyframes, where it has a single mask proposal). In order to decide which mask to pick for each object at each frame, we use a graph-based data association technique (GMMCP) [21] that is especially well-suited for video tracking problems. This technique not only allows us to select among the 4 candidates for a given object in a given frame; it is also capable of correcting erroneous object-mask assignments in a given frame, based on global similarity computations between mask proposals along the whole sequence. The underlying generalized maximum multi-clique problem is clearly NP-hard, but the problem itself is of sufficiently small size to be handled effectively by a fast Binary Integer Program, as in [21].
Formally, we define a complete undirected graph G = (V, E), where V is a set of vertices, each vertex corresponding to a mask proposal. Vertices in the same frame are grouped together to form a cluster. E is the set of edges connecting any two different vertices. Each edge e ∈ E is weighted by a score measuring the similarity between the two masks it connects. This score is detailed in the next paragraph. All vertices in different clusters are connected together. The objective is to pick a set of K cliques that maximize the total similarity score, with the restriction that each clique contains exactly one vertex from each cluster. Each selected clique represents the most coherent tracking of an object across all frames.
Region similarity for mask linking. In our graph-based technique, a score needs to be specified to measure the similarity between two masks and the associated image data. This similarity must be robust to illumination changes, shape deformations and occlusions. Many previous approaches in multiple object tracking [21,69] have focused on global information from the appearance model, typically a global histogram, or on motion information (given by the optical flow or a simple constant velocity assumption). However, when dealing with large displacements and an unstable camera, the constant velocity assumption is invalid and optical flow estimation is hard to apply. Furthermore, using only global information is not sufficient, since our object regions may already resemble each other in global appearance. To overcome this challenge, we define our similarity score as a combination of global and local features. More precisely, each region R is described by four elements: the corresponding mask M, its global HSV histogram H, a set P of SURF keypoints [3] inside it, and a set E of vectors which connect each keypoint to the centroid of the mask. The similarity between two regions is then defined as a combination of a global term and a local term: the global term is based on the cosine distance d_c between the two HSV histograms, which encodes global color information, the local term S_P is computed from keypoint matching, and a balance coefficient α specifies the contribution of each component. S_P is computed as S_P = ∑_{i,j} γ_{ij} w_{ij}, where γ_{ij} is the indicator function which is set to 1 if the two keypoints p_i and p_j match, and to zero otherwise. This indicator is weighted by w_{ij}, based on the position of the matching keypoints with respect to the centroid of the region: it decreases exponentially with the cosine distance d_c between the corresponding centroid vectors, with a constant σ controlling the decay.
Fig. 4 Mask proposals are linked across frames to form a graph. The goal is then to select a clique from this graph minimizing the overall cost. As a result, a best candidate is picked for each frame, ensuring that the same physical object is tracked.
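The sketch below illustrates the spirit of this score; it does not reproduce the exact published weighting. It assumes a simple convex combination of the colour and keypoint terms, uses ORB as a stand-in for SURF, and all thresholds and constants are illustrative.

```python
import numpy as np
import cv2

def region_similarity(img1, mask1, img2, mask2, alpha=0.5, sigma=0.5):
    """Similarity between two mask proposals: global colour term + local keypoint term S_P."""
    def hsv_hist(img, mask):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1, 2], mask.astype(np.uint8), [8, 8, 8],
                         [0, 180, 0, 256, 0, 256])
        return h.flatten() / (h.sum() + 1e-8)

    def cosine_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Global colour term: masked HSV histograms compared with the cosine distance d_c.
    color_sim = 1.0 - cosine_dist(hsv_hist(img1, mask1), hsv_hist(img2, mask2))

    # Local term S_P: matched keypoints weighted by agreement of keypoint-to-centroid vectors.
    orb = cv2.ORB_create()
    g1, g2 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY), cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
    kp1, des1 = orb.detectAndCompute(g1, mask1.astype(np.uint8))
    kp2, des2 = orb.detectAndCompute(g2, mask2.astype(np.uint8))
    s_p = 0.0
    if des1 is not None and des2 is not None:
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
        c1 = np.array(np.nonzero(mask1)).mean(axis=1)[::-1]   # centroid (x, y) of region 1
        c2 = np.array(np.nonzero(mask2)).mean(axis=1)[::-1]
        for m in matches:
            e1 = np.array(kp1[m.queryIdx].pt) - c1             # keypoint-to-centroid vectors
            e2 = np.array(kp2[m.trainIdx].pt) - c2
            s_p += np.exp(-cosine_dist(e1, e2) / (2 * sigma))  # weight w_ij
        s_p /= max(len(matches), 1)

    return alpha * color_sim + (1 - alpha) * s_p
```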
Post-processing. At this point, we already have K masks for the K objects in all frames of the video. We now perform a post-processing step to make sure that our final masks cover all the details of the objects. This is very important for video object removal, since any missing detail can cause perceptually annoying artifacts in the object removal result. This post-processing includes two main steps. The first step is to give a label to each region of the global foreground mask F_t = O^1_t ∪ · · · ∪ O^K_t (the union of all object masks produced by multi-OSVOS for frame t) which does not have any label yet. To this end, we proceed as follows: first, we compute the connected components C of all masks O^i_t and try to assign a label to all pixels in each connected component. For this, we consider the masks M^j_t that were obtained for the same frame t (and possibly another object class j) by the mask linking method. A connected component is considered as isolated if C ∩ M^j_t is empty for all j. For non-isolated components, a label is assigned by a voting scheme based on the ratio r_j(C) = |C ∩ M^j_t| / |C|, i.e. the assigned label for region C is ĵ = argmax_j r_j(C), the one with the highest ratio. Moreover, if r_j(C) > 80%, then region C is also assigned label j regardless of the voting result, which may lead to multiple labels per pixel.
In the second step, we perform a series of morphological operations, namely opening and hole filling. Finally, we dilate each object mask again with a structuring element of size 9 × 9, this time allowing overlaps between objects.
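A simplified sketch of the voting rule of the first post-processing step, including the 80% multi-label rule, is given below; the mask representation and names are illustrative.

```python
import numpy as np
from scipy import ndimage

def assign_labels(foreground, object_masks, ratio_thresh=0.8):
    """Label unlabelled regions of the global foreground mask F_t.

    foreground   : (H, W) bool, union of all multi-OSVOS masks O^i_t.
    object_masks : dict {label j: (H, W) bool}, masks M^j_t from mask linking.
    Returns dict {label j: (H, W) bool} with the extra regions attached.
    """
    out = {j: m.copy() for j, m in object_masks.items()}
    components, n = ndimage.label(foreground)                 # connected components of F_t
    for c in range(1, n + 1):
        comp = components == c
        ratios = {j: (comp & m).sum() / comp.sum() for j, m in object_masks.items()}
        best_j = max(ratios, key=ratios.get)
        if ratios[best_j] == 0:
            continue                                          # isolated component: leave unlabelled
        out[best_j] |= comp                                   # voting: highest-overlap label wins
        for j, r in ratios.items():                           # 80% rule: allow multiple labels
            if r > ratio_thresh:
                out[j] |= comp
    return out
```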

Object removal
Following the method from the previous section, all selected objects have been segmented along the complete video sequence. From the corresponding masks, the user can then decide which objects should be removed. This last step is performed thanks to video inpainting techniques that we now detail. First, we present a simple inpainting method adapted to the case where the background is static (or can be stabilized) and revealed at some point in the sequence. This first method is fast and relies on the reconstruction of a motion field. Then, we present a more involved method for the case where the background is moving, possibly with complex motion as in the case of dynamic textures.

Static background
We assume for this first inpainting method that the background is visible at least in some frames (for instance because the object to be removed is moving over a large enough distance).We also assume that the background is rigid and that its motion is only due to the camera motion.In this case, the best option to perform inpainting is to copy the visible parts of the background into the missing regions, from either past or future frames.For this, the idea is to rely on a simple optical-flow pixel propagation technique.Motion information is used to track the missing pixels and establish a trajectory from the missing region toward the source region.
Overview of the method. Our optical flow-based pixel propagation approach is composed of three main steps, as illustrated in Figure 6. After stabilizing the video to compensate for camera movements, we use FlowNet 2.0 to estimate forward and backward optical flow fields. These optical flow fields are then inpainted using a classical image inpainting method to fill in the missing information. Next, these inpainted motion fields are concatenated to create a correspondence map between pixels in the inpainting region and known pixels. Lastly, missing pixels are reconstructed by a copy-paste scheme followed by Poisson blending to reduce artifacts.
Motion field reconstruction. A possible approach to optical flow inpainting is smooth interpolation, for instance in the framework of a variational approach, by ignoring the data term and using only the smoothness term in the missing regions, as proposed in [5,88]. However, this approach leads to an over-smoothed and unreliable optical flow.
Therefore, we choose to reconstruct the optical flow using more sophisticated image inpainting techniques. More specifically, we first compute, outside the missing region, forward/backward optical flow fields between consecutive frames using the FlowNet approach from [34]. We then rely on the image inpainting method from [55] to interpolate these motion fields.
Optical flow-based pixel reconstruction. Once the motion field inside the missing region is filled, it is used to propagate pixel values from the source toward the missing regions. To do so, we map each pixel in the missing region to a pixel in the source region. This map is obtained by accumulating the optical flow field from frame to frame (with bilinear interpolation). We do this for both the forward and the backward optical flow, which leads to two correspondence maps: a forward map and a backward map. From either map, we can reconstruct missing pixels with a simple copy-paste method, using the known values outside the missing region.
We perform two passes: first a forward pass using the forward map to reconstruct the occlusion, then a backward pass using the backward map. After these two passes, the remaining missing information corresponds to parts that are never revealed in the video. To reconstruct this information, we first use the image inpainting method from [55] to complete one keyframe, which is chosen to be the middle frame of the video, and then propagate information from this frame to the other frames of the video using the forward and backward maps.
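A simplified sketch of the forward pass is shown below: each missing pixel follows the inpainted forward flow until it lands on a visible pixel, whose colour is copied back. Bilinear interpolation is replaced here by nearest-neighbour rounding, and all names are illustrative.

```python
import numpy as np

def reconstruct_forward(frames, masks, flows_fwd, max_steps=100):
    """Fill missing pixels by following the (already inpainted) forward flow.

    frames    : list of (H, W, 3) arrays.
    masks     : list of (H, W) bool arrays (True = missing / to be removed).
    flows_fwd : list of (H, W, 2) arrays, flow from frame t to frame t+1 as (dx, dy).
    """
    out = [f.copy() for f in frames]
    T = len(frames)
    H, W, _ = frames[0].shape
    for t in range(T):
        ys, xs = np.nonzero(masks[t])
        for y0, x0 in zip(ys, xs):
            x, y, s = float(x0), float(y0), t
            for _ in range(max_steps):
                if s + 1 >= T:
                    break
                dx, dy = flows_fwd[s][int(round(y)), int(round(x))]
                x, y, s = x + dx, y + dy, s + 1           # follow the trajectory to frame s+1
                xi, yi = int(round(x)), int(round(y))
                if not (0 <= xi < W and 0 <= yi < H):
                    break
                if not masks[s][yi, xi]:                   # first visible pixel on the trajectory
                    out[t][y0, x0] = frames[s][yi, xi]     # copy-paste (nearest neighbour here,
                    break                                  # bilinear interpolation in practice)
    return out
```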
Poisson blending. Videos in real life often contain illumination changes, especially when they are recorded outdoors. This is problematic for our approach, which simply copy-pastes pixel values: when the illumination of the sources differs from the illumination of the restored frame, visible artifacts may appear across the border of the occlusion. A common way to resolve this is to apply a blending technique, e.g. Poisson blending [65], which fuses a source image and a target image in the gradient domain. However, performing Poisson blending frame-by-frame may affect the temporal consistency. To maintain it, we adopt the recent method of Bokov et al. [5], which takes into account the information of the previous frame. In this method, a regularizer is introduced which penalizes discrepancies between the reconstructed colors and their corresponding colors in the optical-flow-aligned previous frame. More specifically, given the colors of the current and previous inpainted frames I_t(p) and I_{t−1}(p), respectively, the refined Poisson-blended image is obtained by minimizing a discretized energy functional [5] over the missing region Ω_t, with boundary conditions on its outer-boundary pixels ∂Ω_t, a target gradient field G_t(p), and the optical flow O_t(p) at position p between frames t−1 and t. Weights w^PB_p, defined from the usual Poisson-blended image I^PB, are used to weight the reconstruction results from the previous frame I_{t−1} in the boundary conditions; a constant σ^PB controls the strength of the temporal-consistency enforcement. These weights allow the method to better deal with global illumination changes while enforcing temporal stability.
This Poisson blending technique is applied at every pixel propagation step to support the copy-paste framework.

Dynamic background
The simple optical flow-based pixel propagation method that we proposed in Section 3.3.1 can produce plausible results if the video contains only a static background and simple camera motion. More involved methods are needed to deal with large pixel displacements and complex camera movements. They are typically based on the joint estimation of optical flow and color information inside the occlusion, see for instance [32,81]. However, when the background is dynamic or contains moving objects, these latter methods often fail to capture oscillatory patterns in the background. In that situation, global patch-based methods are preferred. They rely on the minimization of a global energy computed over space-time patches. This idea was first proposed in [79], later improved in [54], and recently improved again by Le et al. [39].
Let us briefly describe the method proposed in [39]. A prior stabilization process is applied to compensate for the instabilities due to camera movements (see below for the improvement proposed in the current work). Then a multiscale coarse-to-fine scheme is used to compute a solution to the inpainting problem. The general structure of this scheme is the following: at each scale of a multiscale pyramid, we alternate until convergence the computation of an optimal shift map between pixels in the inpainting domain and pixels outside it (using a metric between patches which involves image colors, texture features, and optical flow), and the update of image colors inside the inpainting domain (using a weighted average of the values provided by the shift map). A key to the quality of the final result is the coarse initialization of this scheme; it is obtained by progressively filling in the inpainting domain (at the coarsest scale) using patch matching and (mapped) neighbor averaging, together with a priority term based on optical flow. The heavy use of optical flow at each scale greatly helps to enforce temporal consistency, even in difficult cases such as dynamic backgrounds or complex motions. In particular, the method can reconstruct moving objects even when they interact with each other. The whole method is computationally heavy, but the speed is significantly boosted when all steps are parallelized.
We have recently brought several improvements to the method of [39]. The first one concerns video stabilization: in general, patch-based video inpainting techniques require a good video stabilization as a pre-processing step to compensate for patch deformations due to camera motions [56,70]. This video stabilization is usually done by calculating a homography between two consecutive frames using keypoint matching, followed by a RANSAC algorithm to remove outliers [15]. However, large moving objects appearing in the video may reduce the performance of such an approach, because too many keypoints may be selected on these objects and prevent the homography from being estimated accurately from the background. This problem can be solved by simply ignoring all segmented objects when computing the homography. This is easy to do: since we already have the masks of the selected objects, we just have to remove all keypoints which are covered by the masks. This is an advantage of our approach, in which both segmentation and inpainting are addressed jointly.
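A sketch of such background-only homography estimation is shown below, using ORB keypoints restricted to the complement of the object masks and RANSAC to discard remaining outliers; the detector choice and parameter values are illustrative.

```python
import numpy as np
import cv2

def stabilizing_homography(frame_prev, frame_cur, object_mask_prev, object_mask_cur):
    """Estimate the background homography between two frames, ignoring keypoints
    inside the segmented object masks so foreground motion does not bias it."""
    orb = cv2.ORB_create(2000)
    bg_prev = (~object_mask_prev).astype(np.uint8)        # detect only outside the objects
    bg_cur = (~object_mask_cur).astype(np.uint8)
    kp1, des1 = orb.detectAndCompute(cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY), bg_prev)
    kp2, des2 = orb.detectAndCompute(cv2.cvtColor(frame_cur, cv2.COLOR_BGR2GRAY), bg_cur)
    if des1 is None or des2 is None:
        return np.eye(3)                                   # not enough background texture
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC discards remaining outliers (e.g. unmasked moving pixels).
    H, _ = cv2.findHomography(dst, src, cv2.RANSAC, 3.0)
    return H                                               # warping frame_cur with H aligns it to frame_prev
```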
Background/foreground inpainting: In addition to the stabilization improvement, the multiple segmentation masks are also helpful for inpainting the background and the foreground separately. More precisely, we first inpaint the background, ignoring all pixels contained in segmented objects. After that, we inpaint in priority the segmented objects that we want to keep and which are partially occluded. This increases the quality of the reconstruction, both for the background and for the objects. Furthermore, it reduces the risk of blending segmented objects which are partially occluded, because segmented objects have separate labels. In particular, it is extremely helpful when several objects overlap.
Let us finally mention another advantage of our joint tracking/inpainting method: objects are better segmented and thus easier to inpaint, since it is a well-known fact that the inpainting of a missing domain may be of lower quality if the boundary values are not suitable. In our case, the temporal continuity of the segmented objects and the use of different labels for different objects have a huge impact on the quality of the inpainting.

Results
We first evaluate our results for the segmentation step of the proposed method, for which we provide quantitative and visual results as well as comparisons with state-of-the-art methods. We then provide several visual results for the complete object removal process, again comparing with the most efficient methods. These visual comparisons are given as isolated frames in the paper, and it is of course more informative to watch the complete videos in the supplementary material; see https://object-removal.telecom-paristech.fr/.
We consider various datasets: we use sequences from the DAVIS-2016 challenge [63], from the MOViCs [14] dataset, and from the ObMIC [86] dataset; we also consider classical sequences from the papers [26] and [54]. Finally, we provide several new challenging sequences containing strong appearance changes, motion blur, objects with similar appearance and possibly crossing each other, as well as complex dynamic textures.
Concerning the number of annotated frames: unless otherwise stated, only the first frame is annotated by the user in all experiments. In some examples (e.g. CAMEL), not all objects are visible in the first frame and we use another frame for the annotations. In a few examples, we annotate more than one frame (e.g. the first and last frames in TEDDY BEAR-FIRE and JUMPING GIRL-FIRE) in order to illustrate the flexibility of the system for correcting errors.

Implementation details
For the segmentation part, we use the Deeplab v2 [10] architecture for the multi-OSVOS and refinement networks. We initialize the network using the pre-trained model provided by [10] and then adapt it to video using the training set of DAVIS-2016 [64] and the train-val set of DAVIS-2017 [66] (from which we exclude the validation set of DAVIS-2016). For the data augmentation procedure, we generate 100 pairs of images and ground truths from the first frame annotation, following the same protocol as in [37]. For the patch-based mask propagation and mask linking, we evolved from the implementations of [68] and [21], respectively.
For the video inpainting step, we use the default parameters from our previous work [39]. In particular, the patch size is set to 5, and the number of levels in the multi-scale pyramid is 4.
For a typical sequence with resolution 854 × 480 and 100 frames, the full computational time is of the order of 45 minutes for segmentation plus 40 minutes for inpainting on a Core i7 CPU machine with 32 GB of RAM and a GTX 1080 GPU. While this is a limitation of the approach, the complete object removal is about one order of magnitude faster than the single completion step of the state-of-the-art methods [54] or [32]. While interactive editing is out of reach for now, these computational times allow the offline post-processing of sequences.

Object segmentation
For the proposed object removal system, and as explained in detail above, the most crucial point is that the segmentation masks must completely cover the considered objects, including motion and transition blur. Otherwise, unacceptable artifacts remain after the full object removal procedure (see Figure 13 for an example). In terms of performance evaluation, this means that we favor recall over precision, as defined below. This also means that the ground truth provided with classical datasets may not be fully adequate to evaluate segmentation in the context of object removal, because it does not include the transition zones induced by, e.g., motion blur. For this reason, recent video inpainting methods that use these databases to avoid the tedious manual selection of objects usually start from a dilation of the ground truth. In our case, a dilation is learned by our architecture (smart dilation) at the segmentation step, as explained above. For these reasons, we compare our method with state-of-the-art object segmentation methods after various dilations and on dilated versions of the ground truth. We also provide visual results on our supplementary website: https://object-removal.telecom-paristech.fr/.
Evaluation metrics. We briefly recall here the evaluation metrics used in this work: some of them are the same as in the DAVIS-2016 challenge [63], and we also add other metrics that are specialized for our task. The goal is to compare the computed segmentation mask (SM) to the ground truth mask (GT). The recall is defined as the ratio between the area of the intersection of SM and GT, and the area of GT. The precision is the ratio between the area of the intersection and the area of SM. Finally, the IoU (intersection over union), or Jaccard index, is defined as the ratio between the areas of the intersection and of the union.
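These three metrics amount to simple area ratios on binary masks; a minimal sketch, assuming the masks are boolean arrays:

```python
import numpy as np

def segmentation_scores(sm, gt):
    """Recall, precision and IoU between a computed mask SM and the ground truth GT."""
    sm, gt = sm.astype(bool), gt.astype(bool)
    inter = np.logical_and(sm, gt).sum()
    union = np.logical_or(sm, gt).sum()
    recall = inter / gt.sum()        # fraction of the object that is covered (critical here)
    precision = inter / sm.sum()     # fraction of the mask that is actually object
    iou = inter / union              # Jaccard index
    return recall, precision, iou
```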
Single object segmentation. We use the DAVIS-2016 [63] validation set and compare our approach to recent semi-supervised state-of-the-art techniques (SeamSeg [68], ObjectFlow [75], MSK [64], OSVOS [8] and onAVOS [77]) using the pre-computed segmentation masks provided by the authors. As explained above, we consider a dilated version of the ground truth (we use a dilation by a 15 × 15 structuring element, as in [32,39]). Therefore, we apply a dilation of the same size to the masks from all the competing methods. In our case, this dilation has both been learned (size 7 × 7) and applied as a post-processing step (size 9 × 9). Since the composition of two dilations with such sizes yields a dilation of size 15 × 15, the comparison is fair.
Tab. 1 Quantitative evaluation of our object segmentation method compared to other state-of-the-art methods, on the single object DAVIS-2016 [63] validation set. As explained in the text, the main objective when performing object removal is to achieve high recall scores.
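As a quick sanity check of the composition argument above, the snippet below verifies (with SciPy, for square structuring elements) that a 7 × 7 dilation followed by a 9 × 9 dilation equals a single 15 × 15 dilation.

```python
import numpy as np
from scipy import ndimage

m = np.zeros((31, 31), dtype=bool)
m[15, 15] = True  # single pixel: the dilations trace out the structuring elements
two_step = ndimage.binary_dilation(ndimage.binary_dilation(m, np.ones((7, 7))), np.ones((9, 9)))
one_step = ndimage.binary_dilation(m, np.ones((15, 15)))
assert np.array_equal(two_step, one_step)   # dilation by 7x7 then 9x9 equals dilation by 15x15
```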
Table 1 shows the comparisons using the three above-mentioned metrics. Our method has the best recall score overall, therefore achieving its objective. The precision score remains very competitive. Besides, our method outperforms OSVOS [8] and MSK [64], which have a similar neural network backbone architecture (VGG16), on all metrics. The precision and IoU scores compare favorably with onAVOS [77], which uses a deeper and more advanced network. Table 2 provides a comparison between OSVOS [8] and our approach on two sequences from [27]. These sequences were manually segmented by the authors of [27] for video inpainting purposes. On such extremely conservative segmentation masks (in the sense that they over-detect the object), the advantage of our method is particularly strong.
As a further experiment, we investigate the ability of dilations of various sizes to improve the recall without degrading the precision too much. For this, we plot precision-recall curves as a function of the structuring element size (ranging from 1 to 30). To include our method on this graph, we start from our original method (highlighted with a green square) and apply to it either erosions with a radius ranging from 1 to 15, or dilations with a radius ranging from 1 to 15; again, this makes sense since our method has learned a dilation whose equivalent radius is 15. Results are displayed in Figure 8. As can be seen from this figure, our method is the best in terms of recall, and the recall increases significantly with the dilation size. With the sophisticated onAVOS method, on the other hand, the recall increases slowly, and the precision drops drastically as the dilation size increases. Basically, these experiments show that the performance achieved by our system for the full coverage of a single object (that is, with as few missed pixels as possible) cannot be obtained from state-of-the-art object segmentation methods by using simple dilation techniques.
Tab. 3 Quantitative evaluation of our object segmentation method compared to other state-of-the-art methods, on two multiple object datasets (MOViCs [14] and ObMIC [86]).
Multiple object segmentation. Next, we perform the same experiments on datasets containing videos with multiple objects. Since the test ground truth was not yet available (at the time of this writing) for the DAVIS-2017 dataset, and since our network was trained on the train-val set of this dataset, we consider two other datasets: MOViCs [14] and ObMIC [86]. These datasets include multiple objects, but only have one label per sequence. To evaluate the multiple object situations, we only kept sequences containing more than one object, and then manually re-annotated the ground truth, giving different labels to different instances. Observe that these datasets contain several major difficulties such as large camera displacements, motion blur, similar appearances, and crossing objects. Results are summarized in Table 3. From this table, roughly the same conclusions as in the single object case can be drawn, namely the superiority of our method in terms of recall, without sacrificing much of the precision.
Some qualitative results of our video segmentation technique are shown in Figure 7. In the first two rows, we show some frames corresponding to the single object case, on the DAVIS-2016 dataset [63]. The last three rows show multiple objects segmentation results on MOViCs [14], ObMIC [86] and Granados's sequences [26], respectively. We observe on these examples that our approach yields full object coverage, even with complex motion and motion blur. This is particularly noticeable on the sequences KITE-SURF and PARAGLIDING-LAUNCH. In the multiple objects case, the examples illustrate the capacity of our method to deal with complex occlusions. This cannot be achieved with mask tracking methods such as ObjectFlow [75] or SeamSeg [68]. The OSVOS method [8] yields some confusion between objects, probably because temporal continuity is not taken into account by this approach.

Object removal
Next, we evaluate the complete object removal pipeline. We consider both of the inpainting versions that we have introduced. We use the simple, optical-flow based method introduced in Section 3.3.1 for sequences having a static background; we refer to this fast method as the static version. We use the more complex method derived from [39] and detailed in Section 3.3.2 for more involved sequences, exhibiting challenging situations such as dynamic background, camera instability, complex motions, and crossing objects; we refer to this second, slower version as the dynamic version.
In Figure 9, we display examples of both single and multiple object removal, through several representative frames. The video results can be fully viewed on the supplementary website. The first sequence, BLACKSWAN (DAVIS-2016), shows that our method (dynamic version) can plausibly reproduce dynamic textures. In the second sequence, COWS (DAVIS-2016), the method yields good results, with a stable background and continuity of the geometrical structures, despite a large occlusion implying that some regions are hidden throughout the whole sequence. We then turn to the case of multiple object removal. In the sequence CAMEL (DAVIS-2017), we show the removal of one static object, a challenging case since the background information is missing at places. On this example, the direct use of the inpainting method from [39] results in some undesired artifacts when the second camel enters the occlusion. By using multiple object segmentation masks to separate background and foreground, we obtain a much more stable background. The last two examples are from an original video. This sequence again highlights that our method can deal with dynamic textures and hand-held cameras.
Comparison with state-of-the-art inpainting methods
In these experiments, we compare our results with the state-of-the-art video inpainting methods [32] and [54]. First, we provide a visual comparison between our optical flow-based pixel propagation (that is, the static approach) and the method of Huang et al. [32], using a video with a static background. Figure 10 shows some representative frames of the sequence HORSE-JUMP-HIGH. In this sequence, we get a comparable result using our simple optical flow-based pixel propagation approach. Our advantage is the considerable reduction of the computational time: with a non-optimized version of the code, our method takes approximately 30 minutes, while [32] takes about 3 hours to complete this sequence.

Fig. 9 Visual illustrations of our object removal system.
Fig. 10 Qualitative comparison with Huang's method [32]. From top to bottom: our segmentation mask, result from [32] performed on a manually segmented mask, our inpainting result performed on our mask.
Fig. 11 Qualitative comparison with Huang's method [32] on a video with a dynamic background. From left to right: our segmentation mask, result from [32], our inpainting result performed on our mask.
Fig. 12 Qualitative comparison with Newson et al.'s method [54]. Top: our segmentation masks; red and green masks denote different objects, and the yellow region is the overlap between the two objects. Middle: results from [54] performed on our segmentation masks. Bottom: our inpainting results performed on the same masks.
Next, we qualitatively compare our method with [32] when reconstructing dynamic backgrounds. We use the code released by the authors on several sequences, with the default parameters. In general, Huang et al. [32] fail to generate convincing dynamic textures. This can be explained by the fact that their algorithm relies on dense flow fields to guide the completion, and these fields are often unreliable for dynamic textures. Moreover, they fill the hole by sampling only 2D patches from the source regions, and therefore the periodic repetition of the background is not captured. Our method, on the other hand, fills the missing dynamic textures in a plausible way.
Figure 11 shows representative frames of the reconstructed sequence TEDDY-BEAR, which was recorded indoors. This sequence is especially challenging because of the presence of both dynamic and static textures, as well as illumination changes. Our method yields a convincing reconstruction of the fire, contrary to [32]. The complete video can be seen on the supplementary website.
We also compare our results with the video inpainting technique from [54].
Figure 12 shows some representative frames of the sequence PARK-COMPLEX, which is taken from [27] and is modified to focus on the moment where the objects occlude each other. In this example, the method of [54] cannot reconstruct the moving man on the right, who is occluded by the man on the left. This is because the background behind this man changes over time (from tree to wall). Since Newson et al.'s method [54] treats the background and the foreground similarly, the algorithm cannot reconstruct the situation "man in front of the wall", because it never observes this situation elsewhere in the video. Our method, by making use of the optical flow and thanks to the object segmentation map, can reconstruct the "man" and the "wall" independently, yielding a plausible reconstruction.
Impact of the segmentation masks on the inpainting performance
In these experiments, we highlight the advantages of using the segmentation masks of multiple objects to improve the video inpainting results.
First, we emphasize the need for masks which fully cover the objects to be removed. Figure 13 (top) demonstrates a situation where some object details (the waving hand in this case) are not covered by the mask (here computed with the state-of-the-art OSVOS method [8]). This situation leads to a very unpleasant artifact when video inpainting is performed. Thanks to the smart dilation introduced in the previous sections, our segmentation mask fully covers the object to be removed, yielding a more plausible video after the inpainting step.
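The gist of the smart dilation target can be sketched as follows: the transition zone around the objects is given its own label, so that the network learns to cover it. This is a minimal sketch under our own assumptions (the function name, labelling convention and border size are ours, not the exact construction used in our training):

import numpy as np
from scipy.ndimage import binary_dilation

def add_border_class(label_map, size=7):
    # Pixels in the dilation of the foreground that belong to no
    # object are marked with a dedicated border class, so the network
    # learns to cover the object/background transition.
    # Convention here: 0 = background, 1..K = objects, K+1 = border.
    k = int(label_map.max())
    foreground = label_map > 0
    se = np.ones((size, size), dtype=bool)
    border = binary_dilation(foreground, structure=se) & ~foreground
    out = label_map.copy()
    out[border] = k + 1
    return out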
Object segmentation masks can also be helpful for the video stabilization step. Indeed, large foreground objects can have a strong effect on the stabilization procedure, yielding a poor stabilization of the background, which in turn yields bad inpainting results. In contrast, if the stabilization is estimated only from the background, the final object removal results are much better. This situation is illustrated in the supplementary material.
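A simple way to restrict the stabilization to the background is to mask the foreground out of the motion estimation. The following OpenCV sketch illustrates the idea (names and parameter values are ours, not the exact implementation of our system):

import cv2
import numpy as np

def background_homography(prev_gray, next_gray, fg_mask):
    # Estimate a stabilizing homography from background pixels only.
    # fg_mask: uint8 mask (255 on the dilated foreground); excluding it
    # prevents large moving objects from biasing the motion model.
    bg_mask = cv2.bitwise_not(fg_mask)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8,
                                  mask=bg_mask)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    H, _ = cv2.findHomography(pts[good], nxt[good], cv2.RANSAC, 3.0)
    return H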
To further investigate the advantage of using multiple segmentation masks to separate background from foreground in the video completion algorithm, we compare our method with the direct application of the inpainting method from [39], without separating objects and background. Representative frames of both approaches are shown in Figure 14.
Clearly, [39] produces artifacts when the moving objects (the two characters) overlap the occlusion, due to patches from these moving objects being propagated within the occlusion during the nearest neighbor search step. Our method, on the other hand, does not suffer from this problem, because we reconstruct background and moving objects separately. This way, the background is more stable, and the moving objects are well reconstructed.
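The masking involved in this separation can be sketched as follows: when completing the background, patch centres whose window touches a moving object (or the occlusion itself) are excluded from the set of admissible sources. A minimal sketch with names of our own (the actual search in [39] operates on 3D spatio-temporal patches):

import numpy as np
from scipy.ndimage import binary_dilation

def valid_source_centres(fg_masks, occlusion, patch=5):
    # fg_masks, occlusion: (T, H, W) boolean arrays.
    # A centre is a valid source only if its patch window avoids every
    # foreground pixel and the occlusion, so object patches can never
    # be copied into the background.
    forbidden = fg_masks | occlusion
    se = np.ones((1, patch, patch), dtype=bool)  # per-frame spatial window
    return ~binary_dilation(forbidden, structure=se)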

Conclusion, limitations and discussion
In this paper, we have presented a full system performing object removal in videos. The input of the system is made of a few strokes provided by the user to indicate the objects to be removed. To the best of our knowledge, this is the first system of this kind, even though Adobe has recently announced that it is developing such a tool, under the name Cloak. The approach can deal with multiple, possibly crossing objects, and can reproduce complex motions and dynamic textures.
Although our method achieves good visual results on different datasets, it still suffers from a few limitations.
Fig. 13 Results of object removal using masks computed by OSVOS (top) and ours (bottom). From left to right: segmentation mask, the resulting object removal on one frame, zooms. We can see that when the segmentation mask does not fully cover the object (OSVOS), the resulting video contains visible artifacts (the hand of the man remains after object removal).

Fig. 14
Fig. 14 The advantage of using the segmentation masks to separate background and foreground. Left: without separating background and foreground, the result has many artifacts. Right: background and foreground are well recovered when reconstructed independently.
First, parts of the objects to be edited may be missed by the segmentation masks. In such cases, as already emphasized, the inpainting step of the algorithm amplifies the remaining parts, creating strong artifacts. This is an intrinsic difficulty of the semi-supervised object removal task, and room remains for further improvement. Further, the system is still relatively slow, and in any case far from real time. Accelerating the system could allow for interactive scenarios where the user gradually corrects the segmentation-inpainting loop.
The segmentation of shadows is also not flawlessly performed by our system, especially when the shadows are weakly contrasted. Being able to deal with such cases would be a desirable property of the system. This problem can be seen in several examples provided in the supplementary material.
Concerning the inpainting module, the user currently has to choose between the fast motion-based version (which works better for static backgrounds) and the slower patch-based version, which is required in the presence of complex dynamic backgrounds. An integrated method combining the advantages of both would be preferable. Huang's method [32] makes a nice attempt in this direction, but its use of 2D patches is not sufficient to correctly inpaint complex dynamic textures, which are more plausibly inpainted by our 3D patch-based method.
Another limitation occurs in cases where the background is never revealed, specifically when semantic information would be needed. Such difficult cases are gradually being solved for single images by using CNN-based inpainting schemes [33]. While the training step of such methods is still out of reach for videos as of today, developing an object removal scheme fully relying on neural networks is an exciting research direction.

Fig. 1
Fig. 1 General pipeline of our object segmentation method. Given the input video and annotations in the first frame, our algorithm alternates two CNN-based semantic segmentation steps (multi-OSVOS network in green and refining network in red) with four video-tracking steps (depicted as blue blocks): (a) keyframe extraction, (b) mask propagation, (c) mask linking and (d) post-processing. These steps are detailed in Section 3.2.

Fig. 2
Fig. 2 The two networks used in the general pipeline presented in Figure 1. Left: multi-OSVOS network; right: refinement network. They serve different purposes: the multi-OSVOS network helps us separate background and objects, while the refinement network is used to fine-tune a rough input mask.

Fig. 3
Fig. 3 Advantages of using the smart dilation mask, i.e. a smart border layer in the output map of our multi-OSVOS network. (a) The border is obtained by simply dilating the output map of the network: some parts of the objects are not covered. (b) The border layer is learned by the network: the transition region is covered.

Fig. 5
Fig. 5 Region description: each region is described by a global histogram, a set of SURF keypoints (yellow points), and a set of vectors connecting each keypoint to the centroid of the region.

Fig. 6
Fig. 6 The global pipeline of the optical flow-based propagation approach for reconstructing a static background: from the input video (a), forward/backward optical flow fields are estimated by FlowNet 2.0 (b), and are then inpainted by an image inpainting algorithm (c). From these optical flow fields, pixels from the source region are propagated into the missing region (d).

Fig. 8
Fig. 8 Precision-recall curves for different methods with different dilation sizes.
To verify if this frame is a keyframe for object i ∈ {1, …, K}, we proceed as follows:
1. Compute the connected components of O_i. Let O_i′ represent the largest connected component.
2. Compute the set of connected components of the global foreground mask F and call it 𝓕.
3. For each connected component O′ ∈ 𝓕, compute the overlap ratio with the current object, r_i(O′) = |O_i′ ∩ O′| / |O′|. If r_i(O′) > 80% and both O_i′ and O′ are isolated from the remaining objects, then this is a keyframe for object i.
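A minimal sketch of this test (names are ours; the "isolated from the remaining objects" condition is omitted for brevity):

import numpy as np
from scipy.ndimage import label

def is_keyframe(object_mask, foreground_mask, tau=0.8):
    # object_mask: boolean mask O_i of object i in this frame.
    # foreground_mask: boolean mask F of all objects.
    comps_i, n = label(object_mask)
    if n == 0:
        return False
    # Largest connected component O_i' of the object mask.
    sizes = np.bincount(comps_i.ravel())[1:]
    largest = comps_i == (np.argmax(sizes) + 1)
    # Overlap ratio of O_i' with each component O' of F.
    comps_f, m = label(foreground_mask)
    for j in range(1, m + 1):
        comp = comps_f == j
        if np.logical_and(largest, comp).sum() / comp.sum() > tau:
            return True
    return False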