Lucid Data Dreaming for Video Object Segmentation

Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k–100k annotated frames) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20×–1000× less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize—“lucid dream” (in a lucid dream the sleeper is aware that he or she is dreaming and is sometimes able to control the course of the dream)—plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and how much general “objectness” knowledge are required for the video object segmentation task.


Introduction
Anna Khoreva¹ · Rodrigo Benenson² · Eddy Ilg³ · Thomas Brox³ · Bernt Schiele¹
¹Max Planck Institute for Informatics, Germany ²Google ³University of Freiburg, Germany

Figure 1: Starting from scarce annotations we synthesize in-domain data to train a specialized pixel-level object tracker for each dataset or even each video.

In recent years the field of object tracking in videos has transitioned from bounding box [30,32,31] to pixel-level tracking [34,52,46,69]. Given a first frame labelled with the foreground object masks, one aims to find the corresponding object pixels in future frames. Tracking objects at the pixel level enables a finer understanding of videos and is helpful for tasks such as video editing, rotoscoping, and summarisation.
Top performing results are currently obtained using convolutional networks (convnets) [27,6,28,3,21,41]. Like most deep learning techniques, convnets for pixel-level object tracking benefit from large amounts of training data. Current state-of-the-art methods rely, for instance, on pixel-accurate foreground/background annotations of ∼2k video frames [27,6] or ∼10k images [28]. Labelling videos at the pixel level is a laborious task (compared e.g. to drawing bounding boxes for detection), and creating a large training set requires significant annotation effort.
In this work we aim to reduce the necessity for such large volumes of training data. It is traditionally assumed that convnets require large training sets to perform best. We show that for video object tracking having a larger training set is not automatically better and that improved results can be obtained by using 20× ∼ 100× less training data than previous approaches [6,28]. The main insight of our work is that for pixel-level object tracking using few training frames (1 ∼ 100) in the target domain is more useful than using large training volumes across domains (1k∼10k).
To ensure a sufficient amount of training data close to the target domain, we develop a new technique for synthesizing training data particularly tailored for the object tracking scenario. We call this data generation strategy "lucid dreaming", where the first frame and its annotation mask are used to generate plausible future frames of the videos. The goal is to produce a large training set of reasonably realistic images which capture the expected appearance variations in future video frames, and thus is, by design, close to the target domain.
Our approach is suitable for both single and multiple object tracking. Enabled by the proposed data generation strategy and the efficient use of optical flow, we are able to achieve high quality results while using only ∼100 individual annotated training frames. Moreover, in the extreme case with only a single annotated frame and zero pre-training (i.e. without ImageNet pre-training), we still obtain competitive tracking results.
In summary, our contributions are the following:
1. We propose "lucid data dreaming", an automated approach to synthesize training data for convnet-based pixel-level object tracking that leads to top results for both single and multiple object tracking.
2. We conduct an extensive analysis to explore the factors contributing to our good results.
3. We show that training a convnet for object tracking can be done with only a few annotated frames.
We hope these results will counter the trend towards ever larger training sets, and popularize the design of trackers with lighter training needs.

Related work
Box-level tracking. Classic work on video object tracking focused on bounding box tracking. Many of the insights from these works have been re-used for pixel-level tracking. Traditional box tracking smoothly updates across time a linear model over hand-crafted features [22,5,32]. Since then, convnets have been used as improved features [13,37,70], and eventually to drive the tracking itself [21,3,64,40,41]. Contrary to traditional box trackers (e.g. [22]), convnet-based approaches need additional data for pre-training and learning the task.
Pixel-level tracking. In this paper we focus on generating a foreground versus background pixel-wise object labelling for each video frame starting from a first manually annotated frame. Multiple strategies have been proposed to solve this task.
Box-to-segment: First a box-level track is built, and a space-time grabcut-like approach is used to generate per frame segments [75].
Video saliency: Instead of tracking, these methods extract the main foreground object as a pixel-level space-time tube. Both hand-crafted models [16,43] and trained convnets [65,26] have been considered. Because these methods ignore the first frame annotation, they fail in videos where multiple salient objects move (e.g. a flock of penguins).
Video segmentation methods partition the space-time volume, and then the tube overlapping most with the first frame annotation is selected as tracking output [19,47,7].
Mask propagation: Appearance similarity and motion smoothness across time is used to propagate the first frame annotation across the video [38,72,66]. These methods usually leverage optical flow and long term trajectories.
Convnets: Following the trend in box-level tracking, convnets have recently been proposed for pixel-level tracking. [6] trains a generic object saliency network, and fine-tunes it per-video (using the first frame annotation) to make the output sensitive to the specific object instance being tracked. [28] uses a similar strategy, but also feeds the mask from the previous frame as guidance for the saliency network. Finally [27] mixes convnets with ideas of bilateral filtering. Our approach also builds upon convnets. What makes convnets particularly suitable for the task is that they can learn the common statistics of appearance and motion patterns of objects, as well as what makes objects distinctive from the background, and exploit this knowledge when tracking a particular object. This aspect gives convnets an edge over traditional techniques based on low-level hand-crafted features.
Our network architecture is similar to [6,28]. Other than implementation details, there are three differentiating factors. One, we use a different strategy for training: [6,27] rely on consecutive video training frames and [28] uses an external saliency dataset, while our approach focuses on using the first frame annotations provided with each targeted video benchmark without relying on external annotations. Two, our approach exploits optical flow better than these previous methods. Three, we describe an extension to seamlessly handle multiple object tracking.
Interactive video segmentation. Interactive segmentation [39,25,59,71] considers more diverse user inputs (e.g. strokes), and requires interactive processing speed rather than maximal quality. Although our technique can be adapted for varied inputs, we focus on maximizing quality for the non-interactive case with no additional hints along the video.
Semantic labelling. Like other convnets in this space [27,6,28], our architecture builds upon the insights from semantic labelling networks [78,36,74,2]. Because of this, the flurry of recent developments should directly translate into better tracking results. For the sake of comparison with previous work, we build upon the well established VGG DeepLab architecture [9].

Synthetic data. Like our approach, previous works have also explored synthesizing training data. Synthetic renderings [42], video game environments [54], and mixed synthetic and real images [67,10,14] have shown promise, but require task-appropriate 3D models. Compositing real world images provides more realistic results, and has shown promise for object detection [17,63], text localization [20] and pose estimation [48].
The closest work to ours is [44], which also generates video-specific training data using the first frame annotations. They use human skeleton annotations to improve pose estimation, while we employ mask annotations to improve object tracking.

LucidTracker
Section 3.1 describes the network architecture used, and how RGB and optical flow information are fused to predict the next frame segmentation mask. Section 3.2 discusses the different training modalities employed with the proposed object tracking system. Section 4 discusses the training data generation, and Sections 5 and 6 report results for single and multiple object tracking, respectively.

Architecture
Approach. We model pixel-level object tracking as a mask refinement task (a mask being a binary foreground/background labelling of the image) based on appearance and motion cues. From frame t−1 to frame t the estimated mask M_{t−1} is propagated to frame t, and the new mask M_t is computed as a function of the previous mask, the new image I_t, and the optical flow F_t, i.e. M_t = f(I_t, F_t, M_{t−1}). Since objects tend to move smoothly through space and time, there are few changes from frame to frame, and mask M_{t−1} can be seen as a rough estimate of M_t. We thus require our trained convnet to learn to refine rough masks into accurate masks. Fusing the complementary image I_t and optical flow F_t exploits the information inherent to video and enables the model to segment both static and moving objects well.
Note that this approach is incremental, does a single forward pass over the video, and keeps no explicit model of the object appearance at frame t. In some experiments we adapt the model f per video, using the annotated first frame I 0 , M 0 . However, in contrast to traditional techniques [22], this model is not updated while we process the video frames, thus the only state evolving along the video is the mask M t−1 itself.
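The incremental, single-forward-pass scheme above can be sketched as a simple loop in which the mask is the only evolving state; here `refine` is a hypothetical stand-in for the trained convnet f(I_t, F_t, M_{t−1}), not the actual model:

```python
import numpy as np

def track(frames, flows, mask0, refine):
    """Single forward pass over the video: the only evolving state
    is the mask. `refine` stands in for the trained convnet
    f(I_t, F_t, M_{t-1}) and is supplied by the caller."""
    masks = [mask0]
    for img, flow in zip(frames[1:], flows[1:]):
        prev = masks[-1]                       # rough estimate of M_t
        masks.append(refine(img, flow, prev))  # M_t = f(I_t, F_t, M_{t-1})
    return masks

# Toy usage: an "identity" refiner simply propagates the first-frame mask.
frames = [np.zeros((4, 4, 3)) for _ in range(3)]
flows = [np.zeros((4, 4, 2)) for _ in range(3)]
mask0 = np.eye(4)
masks = track(frames, flows, mask0, refine=lambda i, f, m: m)
```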
First frame. In the video object tracking task the mask for the first frame M 0 is given. This is the standard protocol of the benchmarks considered in sections 5 & 6. If only a bounding box is available on the first frame, then the mask could be estimated using grabcut-like techniques [55,62].
RGB image I. Typically a semantic labeller generates pixel-wise labels based on the input image (e.g. M = g(I)). We use an augmented semantic labeller with an input layer modified to accept 4 channels (RGB + previous mask) so as to generate outputs based on the previous mask estimate, e.g. M_t = f_I(I_t, M_{t−1}). Our approach is general and can leverage any existing semantic labelling architecture. We select the DeepLabv2 architecture with VGG base network [9], which is comparable to [27,6,28]; FusionSeg [26] uses ResNet.
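One way to realize the 4-channel input layer is to widen the pretrained first convolution. The text does not specify how the extra channel is initialized; the zero initialization below is an assumption, chosen so that the pretrained RGB response is initially unchanged and the mask channel is learnt during fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained first-layer weights: (out_ch, in_ch=3, kH, kW).
w_rgb = rng.normal(size=(64, 3, 3, 3))

# Append a 4th input slice for the previous mask channel. Zeros leave
# the pretrained RGB response untouched at the start of fine-tuning.
w_mask = np.zeros((64, 1, 3, 3))
w_rgbm = np.concatenate([w_rgb, w_mask], axis=1)  # (64, 4, 3, 3)
```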
Optical flow F. We use flow in two complementary ways. First, to obtain a better initial estimate of M_t we warp M_{t−1} using the flow F_t: M_t = f_I(I_t, w(M_{t−1}, F_t)); we call this "mask warping". Second, we use flow as a direct source of information about the mask M_t. As can be seen in Figure 2, when the object is moving relative to the background, the flow magnitude provides a very reasonable estimate of the mask M_t. We thus consider using a convnet specifically for mask estimation from flow, M_t = f_F(F_t, w(M_{t−1}, F_t)), and merge it with the image-only version by naive averaging of the two outputs. We use the state-of-the-art optical flow estimation method FlowNet2.0 [23], which is itself a convnet that computes F_t = h(I_{t−1}, I_t) and is trained on synthetic renderings of flying objects [42]. For the optical flow magnitude computation we subtract the median motion for each frame, average the magnitudes of the forward and backward flow, and scale the values per frame to [0, 255], bringing them to the same range as the RGB channels.
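The flow-magnitude preprocessing can be sketched as follows; the min-max rescaling to [0, 255] is an assumed implementation detail, since the text only states that values are scaled per frame to that range:

```python
import numpy as np

def flow_magnitude_channel(flow_fw, flow_bw):
    """Turn forward/backward flow fields (H, W, 2) into one magnitude
    channel in [0, 255]: subtract the per-frame median motion, average
    the forward and backward magnitudes, then rescale per frame."""
    def mag(flow):
        # remove the dominant (median) motion, e.g. camera motion
        f = flow - np.median(flow.reshape(-1, 2), axis=0)
        return np.linalg.norm(f, axis=-1)

    m = 0.5 * (mag(flow_fw) + mag(flow_bw))
    m -= m.min()
    if m.max() > 0:                 # assumed min-max rescaling
        m *= 255.0 / m.max()
    return m
```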
The loss function is the sum of cross-entropy terms over each pixel in the output map (all pixels are equally weighted).
In our experiments f_I and f_F are trained independently, via some of the modalities listed in Section 3.2. Our two-stream architecture is illustrated in Figure 3a.
We also explored expanding our network to accept 5 input channels (RGB + previous mask + flow magnitude) in one stream: M_t = f_{I+F}(I_t, F_t, w(M_{t−1}, F_t)), but did not observe much difference in performance compared to naive averaging; see the experiments in Section 5.4.3. Our one-stream architecture is illustrated in Figure 3b. The one-stream network is more affordable to train and makes it easy to add extra input channels, e.g. additional semantic information about objects.

Multiple objects. The proposed framework can easily be extended to multiple object tracking. Instead of having one additional channel for the previous frame mask, we provide the mask of each object in a separate channel, expanding the network to accept 3+N input channels (RGB + N object masks): {M_t^1, ..., M_t^N} = f(I_t, w({M_{t−1}^1, ..., M_{t−1}^N}, F_t)), where N is the number of objects annotated in the first frame.
For the multiple object tracking task we employ the one-stream architecture, using optical flow F and semantic segmentation S as additional input channels: {M_t^1, ..., M_t^N} = f(I_t, F_t, S_t, w({M_{t−1}^1, ..., M_{t−1}^N}, F_t)). This allows us to complement the appearance model with semantic priors and motion information. See Figure 4 for an illustration.
In our preliminary experiments, using a single network for all objects provided better results than tracking multiple objects separately, one at a time, and it avoids the need to design a merging strategy amongst overlapping tracks.
Semantic labels S. To compute the pixel-level semantic labelling S_t = h(I_t) we use the state-of-the-art convnet PSPNet [78], trained on Pascal VOC12 [15]. Pascal VOC12 annotates 20 categories, yet we want to track any type of object. S_t can also provide information about instances of unknown categories by describing them as a spatial mixture of known ones (e.g. a sea lion might look like the torso of a dog with the head of a cat). As long as the predictions are consistent through time, S_t will provide a useful cue for tracking. Note that we only use S_t for the multi-object tracking challenge, discussed in Section 6. In the same way as for the optical flow, we scale S_t to bring all the channels to the same range.
We additionally experiment with ensembles of different model variants, which makes the system more robust to the challenges inherent in videos. For our main results on the multiple object tracking task we consider an ensemble of four models, differing in their use of the optical flow F and semantic segmentation S input channels, where we merge the outputs of the models by naive averaging. See Section 6 for more details.
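The naive averaging merge can be sketched minimally as below; the 0.5 threshold on the averaged foreground probability is an assumption, as the text does not state how the averaged output is binarized:

```python
import numpy as np

def ensemble_average(prob_maps):
    """Merge per-pixel foreground probabilities from several model
    variants by naive averaging; threshold at 0.5 (assumed) to obtain
    the final binary mask."""
    return np.mean(prob_maps, axis=0) > 0.5
```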
Post-processing. As a final stage of our pipeline, we refine the generated mask M_t per frame using DenseCRF [29]. This adjusts small image details that the network might not be able to handle. It is known by practitioners that DenseCRF is quite sensitive to its parameters and can easily worsen results. We use our lucid dreams to handle the per-dataset CRF tuning too, see Section 3.2.
We refer to our full f_{I+F} system as LucidTracker, and as LucidTracker− when no post-processing is used. The usage of S_t or model ensembles will be explicitly stated.

Training modalities
Multiple modalities are available to train a tracker. Training-free approaches (e.g. BVS [38], SVT [72]) are fully hand-crafted systems with hand-tuned parameters, and thus do not require training data. They can be used as-is over different datasets. Supervised methods can also be trained to generate a dataset-agnostic model that can be applied over different datasets. Instead of using a fixed model for all cases, it is also possible to obtain specialized per-dataset models, either via self-supervision [73,45,76,79] or by using the first frame annotation of each video in the dataset as a training/tuning set. Finally, inspired by traditional tracking techniques, we also consider adapting the model weights to the specific video at hand, thus obtaining per-video models. Section 5 reports new results over these four training modalities (training-free, dataset-agnostic, per-dataset, and per-video).

Figure 4: Extension of LucidTracker to multiple objects. The previous frame mask for each object is provided in a separate channel. We additionally explore using optical flow F and semantic segmentation S as additional inputs. See §3.1.
Our LucidTracker obtains best results when first pre-trained on ImageNet, then trained per-dataset using all first frame annotations together, and finally fine-tuned per-video for each evaluated sequence. The post-processing DenseCRF stage is automatically tuned per-dataset. The experimental Section 5 details the effect of these training stages. Surprisingly, we can obtain reasonable performance even when training from only a single annotated frame (without ImageNet pre-training, i.e. zero pre-training); this result goes against the intuition that convnets require large training data to provide good results.
Unless otherwise stated, we fine-tune per-video models relying solely on the first frame I 0 and its annotation M 0 . This is in contrast to traditional techniques [22,5,32] which would update the appearance model at each frame I t .

Lucid data dreaming
To train the function f one would think of using ground truth data for M_{t−1} and M_t (like [3,6,21]); however such data is expensive to annotate and rare. [6] thus trains on a set of 30 videos (∼2k frames) and requires the model to transfer across multiple test sets. [28] side-steps the need for consecutive frames by generating synthetic masks M_{t−1} from a saliency dataset of ∼10k images with their corresponding masks M_t. We propose a new data generation strategy that reaches better results using only ∼100 individual training frames.
Ideally training data should be as similar as possible to the test data; even subtle differences may affect quality (e.g. training on static images for testing on videos under-performs [61]). To ensure our training data is in-domain, we propose to generate it by synthesizing samples from the provided annotated frame (first frame) in each target video. This is akin to "lucid dreaming" as we intentionally "dream" the desired data by creating sample images that are plausible hypothetical future frames of the video. The outcome of this process is a large set of frame pairs in the target domain (2.5k pairs per annotation) with known optical flow and mask annotations, see Figure 5.
Synthesis process. The target domain for a tracker is the set of future frames of the given video. Traditional data augmentation via small image perturbations is insufficient to cover the expected variations across time, thus a task-specific strategy is needed. Across the video the tracked object might change in illumination, deform, translate, be occluded, be seen from different points of view, and evolve on top of a dynamic background. All of these aspects should be captured when synthesizing future frames. We achieve this by cutting out the foreground object, in-painting the background, perturbing both foreground and background, and finally recomposing the scene. This process is applied twice with randomly sampled transformation parameters, resulting in a pair of frames (I_{τ−1}, I_τ) with known pixel-level ground-truth mask annotations (M_{τ−1}, M_τ), optical flow F_τ, and occlusion regions. The object position in I_τ is uniformly sampled, but the changes between I_{τ−1} and I_τ are kept small to mimic the usual evolution between consecutive frames. In more detail, starting from an annotated image:
1. Illumination changes: we globally modify the image by randomly altering saturation S and value V (from HSV colour space) via x = a·x^b + c, where a ∈ 1±0.05, b ∈ 1±0.3, and c ∈ ±0.07.
2. Fg/Bg split: the foreground object is removed from the image I_0 and a background image is created by in-painting the cut-out area [12].
3. Object motion: we simulate motion and shape deformations by applying global translation as well as affine and non-rigid deformations to the foreground object. For I_{τ−1} the object is placed at any location within the image with a uniform distribution, and in I_τ with a translation of ±10% of the object size relative to τ−1. In both frames we apply random rotation ±30°, scaling ±15%, and thin-plate spline deformations [4] of ±10% of the object size.
4. Camera motion: we additionally transform the background using affine deformations to simulate camera view changes. We apply random translation, rotation, and scaling within the same ranges as for the foreground object.
5. Fg/Bg merge: finally (I_{τ−1}, I_τ) are composed by blending the perturbed foreground with the perturbed background using Poisson matting [60].
Using the known transformation parameters we also synthesize ground-truth pixel-level mask annotations (M_{τ−1}, M_τ) and optical flow F_τ. Figure 5 shows example results. Although our approach does not capture appearance changes due to point of view, occlusions, or shadows, we see that even this rough modelling is effective for training our tracking models.
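Step 1 (illumination changes) can be sketched as follows, assuming the S and V channels are normalized to [0, 1] and that out-of-range values are clipped; both details are assumptions, since the text specifies only the transform x = a·x^b + c and the parameter ranges:

```python
import numpy as np

def jitter_illumination(img_hsv, rng):
    """Randomly alter the S and V channels of an HSV image via
    x -> a * x**b + c, with a in 1±0.05, b in 1±0.3, c in ±0.07.
    Channels are assumed normalized to [0, 1]; results are clipped."""
    out = img_hsv.astype(float).copy()
    for ch in (1, 2):  # channel 1 = saturation, channel 2 = value
        a = rng.uniform(0.95, 1.05)
        b = rng.uniform(0.70, 1.30)
        c = rng.uniform(-0.07, 0.07)
        out[..., ch] = np.clip(a * out[..., ch] ** b + c, 0.0, 1.0)
    return out
```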
The number of synthesized images can be arbitrarily large. We generate 2.5k pairs per annotated video frame. This training data is, by design, in-domain with regard to the target video. The experimental Section 5 shows that this strategy is more effective than using thousands of manually annotated images from close-by domains.
The same strategy for data synthesis can be employed for the multiple object tracking task. Instead of manipulating a single object we handle multiple ones at the same time, applying independent transformations to each of them. We model occlusion between objects by adding a random depth ordering, obtaining both partial and full occlusions in the training set. Including occlusions in the lucid dreams allows the model to better handle plausible interactions of objects in future frames. See Figure 6 for examples of the generated data.
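The random depth ordering can be sketched as a small compositing step that turns N binary object masks into a single label map where nearer objects occlude farther ones; the exact compositing used for the lucid dreams is not spelled out in the text, so this is an illustrative sketch:

```python
import numpy as np

def compose_with_occlusions(masks, rng):
    """Given N binary object masks (N, H, W), sample a random depth
    ordering and return a label map in which nearer objects occlude
    farther ones (0 = background, i+1 = object i)."""
    order = rng.permutation(len(masks))  # far-to-near depth order
    label = np.zeros(masks[0].shape, dtype=int)
    for i in order:  # later (nearer) objects overwrite earlier ones
        label[masks[i] > 0] = i + 1
    return label
```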

Single object tracking results
We present here a detailed empirical evaluation on three different datasets for the single object tracking task: given a first frame labelled with the foreground object mask, the goal is to find the corresponding object pixels in future frames. (Section 6 will discuss the multiple objects case.)

Experimental setup
Datasets. We evaluate our method on three video object segmentation datasets: DAVIS 16 [46], YouTubeObjects [52,24], and SegTrack v2 [34]. The goal is to track an object through all video frames given a foreground object mask in the first frame. These three datasets provide diverse challenges with a mix of high and low resolution web videos, single or multiple salient objects per video, videos with flocks of similar looking instances, longer (∼ 400 frames) and shorter (∼10 frames) sequences, as well as the usual tracking challenges such as occlusion, fast motion, illumination, view point changes, elastic deformation, etc.
The DAVIS 16 [46] video segmentation benchmark consists of 50 full-HD videos of diverse object categories with all frames annotated with pixel-level accuracy, where one single or two connected moving objects are separated from the background. The number of frames in each video varies from 25 to 104.
YouTubeObjects [52,24] includes web videos from 10 object categories. We use the subset of 126 video sequences with mask annotations provided by [24] for evaluation, where one single object or a group of objects of the same category are separated from the background. In contrast to DAVIS 16 these videos have a mix of static and moving objects. The number of frames in each video ranges from 2 to 401.

SegTrack v2 [34] consists of 14 videos with multiple object annotations for each frame. For videos with multiple objects each object is treated as a separate problem, resulting in 24 sequences. The length of each video varies from 21 to 279 frames. The images in this dataset have low resolution and some compression artefacts, making it hard to track the object based on its appearance.
The main experimental work is done on DAVIS 16 , since it is the largest densely annotated dataset out of the three, and provides high quality/high resolution data. The videos for this dataset were chosen to represent diverse challenges, making it a good experimental playground.
We additionally report on the two other datasets as complementary test set results.
Evaluation metric. To measure the accuracy of video object tracking we use the mean intersection-over-union overlap (mIoU) between the per-frame ground truth object mask and the predicted segmentation, averaged across all video sequences. We have noticed disparate evaluation procedures used in previous work, and we report here a unified evaluation across datasets. When possible, we re-evaluated certain methods using results provided by their authors. For all three datasets we follow the DAVIS 16 evaluation protocol, excluding the first frame from evaluation and using all other frames from the video sequences, independent of object presence in the frame.
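A minimal sketch of the mIoU metric over one sequence follows; the convention for frames where both prediction and ground truth are empty (counted as a perfect match here) is our assumption, as the protocol text does not state it:

```python
import numpy as np

def miou(pred_masks, gt_masks):
    """Mean intersection-over-union over the frames of one sequence
    (the first frame is excluded by the caller, per the DAVIS protocol).
    Each mask is a binary (H, W) array."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        # assumed convention: empty prediction + empty gt counts as 1.0
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))
```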
Training details. For training all the models we use SGD with mini-batches of 10 images and a fixed learning policy with initial learning rate of 10 −3 . The momentum and weight decay are set to 0.9 and 5 · 10 −4 , respectively.
Models using pre-training are initialized with weights trained for image classification on ImageNet [58]. We then train per-dataset for 40k iterations with the RGB+Mask branch f I and for 20k iterations for the Flow+Mask f F branch. When using a single stream architecture (Section 5.4.3), we use 40k iterations.
Models without ImageNet pre-training are initialized using the "Xavier" strategy [18]. The per-dataset training needs to be longer, using 100k iterations for the f I branch and 40k iterations for the f F branch.
For per-video fine-tuning 2k iterations are used for f_I. To keep the computational cost lower, the f_F branch is kept fixed across videos.
All training parameters are chosen based on DAVIS 16 results. We use identical parameters on YouTubeObjects and SegTrack v2 , showing the generalization of our approach.
It takes ∼3.5h to obtain each per-video model, including data generation, per-dataset training, per-video fine-tuning, and per-dataset grid search of CRF parameters (averaged over DAVIS 16, amortising the per-dataset training time over all videos). At test time our LucidTracker runs at ∼5s per frame, including the optical flow estimation with FlowNet2.0 [23] (∼0.5s) and CRF post-processing [29] (∼2s).

Table 1 presents our main result and compares it to previous work. Our full system, LucidTracker, provides the best tracking quality across three datasets while being trained on each dataset using only one frame per video (50 frames for DAVIS 16, 126 for YouTubeObjects, 24 for SegTrack v2), which is 20×∼100× less than the top competing methods. Ours is the first method to reach >75 mIoU on all three datasets.

Key results
Oracles and baselines. Grabcut oracle computes grabcut [55] using the ground truth bounding boxes (box oracle). This oracle indicates that on the considered datasets separating foreground from background is not easy, even if a perfect box-level tracker was available. We provide three additional baselines. "Saliency" corresponds to using the generic (training-free) saliency method EQCut [1] over the RGB image I_t. "Flow saliency" does the same, but over the optical flow magnitude F_t. Results indicate that the objects being tracked are not particularly salient in the image. On DAVIS 16 motion saliency is a strong signal but not on the other two datasets. Saliency methods ignore the first frame annotation provided for the tracking task. We also consider the "Mask warping" baseline which uses optical flow to propagate the mask estimate from t to t+1 via simple warping M_t = w(M_{t−1}, F_t). The bad results of this baseline indicate that the high quality flow [23] that we use is by itself insufficient to solve the tracking task, and that indeed our proposed convnet does the heavy lifting.
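The mask warping baseline M_t = w(M_{t−1}, F_t) can be sketched with nearest-neighbour backward warping; the interpolation scheme is an assumption, as the text does not specify how w(·) is implemented:

```python
import numpy as np

def warp_mask(mask, flow):
    """Propagate M_{t-1} to frame t by backward warping with the flow
    field F_t of shape (H, W, 2) = (dx, dy), using nearest-neighbour
    sampling; pixels mapping outside the image become background."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # backward warp: look up where each target pixel came from
    src_x = np.rint(xs - flow[..., 0]).astype(int)
    src_y = np.rint(ys - flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(mask)
    out[valid] = mask[src_y[valid], src_x[valid]]
    return out
```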
The large fluctuation of the relative baseline results across the three datasets empirically confirms that each of them presents unique challenges.
Comparison. Compared to flow propagation methods such as BVS, N15, ObjFlow, and STV, we obtain better results because we build per-video a stronger appearance model of the tracked object (embodied in the fine-tuned model). Compared to convnet learning methods such as VPN, OSVOS, and MaskTrack, we require significantly less training data, yet obtain better results. Figure 7 provides qualitative results of LucidTracker across the three different datasets. Our system is robust to various challenges present in videos. It handles well camera view changes, fast motion, object shape deformation, out-of-view scenarios, multiple similar looking objects, and even low quality video. We provide a detailed error analysis in Section 5.5.
Conclusion. We show that top results can be obtained while using less training data. This shows that our lucid dreams leverage the available training data better. We report top results for this task while using only 24∼126 training frames.

Table 2: Ablation study of training modalities. ImageNet pre-training and per-video tuning provide additional improvement over per-dataset training. Even with one frame annotation for only per-video tuning we obtain good performance. See §5.3.1.

Ablation studies
In this section we explore in more details how the different ingredients contribute to our results.

Effect of training modalities
Training from a single frame. In the bottom row of Table 2 ("only per-video tuning"), the model is trained per-video without ImageNet pre-training nor per-dataset training, i.e. using a single annotated training frame. Our network is based on VGG16 [9] and contains ∼20M parameters, all effectively learnt from a single annotated image that is augmented to become 2.5k training samples (see Section 4). Even with such a minimal amount of training data, we still obtain surprisingly good performance (compare 80.5 on DAVIS 16 to others in Table 1). This shows how effective the proposed lucid dreaming training strategy is by itself. We also compared a simpler data augmentation scheme (e.g. without non-rigid deformations nor scene re-composition) with the full synthesis process described in Section 4: having a sophisticated data generation process directly impacts the tracking quality.
Conclusion. Surprisingly, we discovered that per-video training from a single annotated frame already provides much of the information needed for the tracking task. Additionally using ImageNet pre-training and per-dataset training provides complementary gains.

Effect of optical flow
Table 3 shows the effect of optical flow on LucidTracker results. Comparing our full system to the "No OF" row, we see that the effect of optical flow varies across datasets, from a minor improvement on YouTubeObjects to a major difference on SegTrack v2. On this last dataset, using mask warping is particularly useful too. We additionally explored tuning the optical flow stream per-video, which resulted in a minor improvement (83.7 → 83.9 mIoU on DAVIS 16).
Our "No OF" results can be compared to OSVOS [6], which does not use optical flow. However, OSVOS uses a per-frame mask post-processing based on a boundary detector (trained on further external data), which provides a ∼2 percentage point gain. Accounting for this, our "No OF" (and no CRF) result matches theirs on DAVIS 16 and YouTubeObjects despite using significantly less training data (see Table 1, e.g. 79.8 − 2 ≈ 77.8 on DAVIS 16).
Table 4 shows the effect of using different optical flow estimation methods. For the LucidTracker results, FlowNet2.0 [23] was employed. We also explored using EpicFlow [53], as in [28]. Table 4 indicates that an optical flow estimation that is robust across datasets is crucial to the performance (FlowNet2.0 provides a ∼1.5–15 point gain on each dataset). We found EpicFlow to be brittle when going across different datasets, providing an improvement on DAVIS 16 only.
Conclusion. The results show that flow provides a complementary signal to the RGB image alone, and that a robust optical flow estimation across datasets is crucial. Despite its simplicity, our fusion strategy (f I + f F) provides gains on all datasets and leads to competitive results.
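The mask warping w(M t−1, F t) used above can be sketched as backward warping of the previous mask along the flow field. The snippet below is an illustration only: the flow convention and interpolation (nearest neighbour here) are assumptions, not necessarily the exact choices of the paper:

```python
import numpy as np

def warp_mask(prev_mask, flow):
    """Backward-warp the previous frame's mask with optical flow.

    Assumes flow[y, x] = (dx, dy) maps pixel (x, y) of the current
    frame to (x + dx, y + dy) in the previous frame; sampling is
    nearest neighbour with border clamping.
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_mask[src_y, src_x]

# sanity check: zero flow leaves the mask unchanged
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
assert np.array_equal(warp_mask(mask, np.zeros((4, 4, 2))), mask)
```

Warping the mask before feeding it to the network gives the model a motion-compensated guess of where the object is, which is why it helps most on fast-motion datasets such as SegTrack v2.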

Effect of CRF tuning
As a final stage of our pipeline, we refine the generated mask using DenseCRF [29] per frame. This captures small image details that the network might have missed. It is known by practitioners that DenseCRF is quite sensitive to its parameters and can easily worsen results. We use our lucid dreams to enable automatic per-dataset CRF tuning. Following [9] we employ a grid search scheme for tuning the CRF parameters. Once the per-dataset tracking model is trained, we apply it over a subset of its training set (5 random images from the lucid dreams per video sequence), apply DenseCRF with the given parameters over this output, and then compare to the lucid dream ground truth.
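This tuning loop amounts to a plain grid search scored by mask IoU against the lucid-dream ground truth. A minimal sketch, where `refine(mask, params)` is a hypothetical stand-in for DenseCRF post-processing (the parameter names and grid are assumptions, not the paper's actual values):

```python
import itertools
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def tune_crf(val_pairs, refine, param_grid):
    """Grid-search post-processing parameters on held-out lucid dreams.

    val_pairs: (predicted_mask, ground_truth) pairs sampled from the
    model's own synthetic training data.
    refine:    hypothetical DenseCRF-like function (mask, params) -> mask.
    """
    best_params, best_score = None, -1.0
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = np.mean([iou(refine(pred, params), gt)
                         for pred, gt in val_pairs])
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because the lucid dreams come with ground truth, this selection needs no extra human annotation: the model is scored against data it synthesized itself.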
The impact of tuning the DenseCRF post-processing parameters is shown in Table 5 and Figure 8. Table 5 indicates that without per-dataset tuning DenseCRF under-performs. Our automated tuning procedure obtains consistent gains without the need for case-by-case manual tuning.
Conclusion. Using default DenseCRF parameters can degrade performance. Our lucid dreams enable per-dataset CRF tuning, which further improves the results.

Additional experiments
Other than adding or removing ingredients, as in Section 5.3, we also want to understand how the training data itself affects the obtained results.

Generalization across videos
Comparing the top and bottom parts of the table, we see that when the annotated images from the test set videos are not included, tracking quality drops drastically (e.g. 68.7 → 36.4 mIoU). Conversely, on the subset of videos for which the first frame annotation is used for training, the quality is much higher and improves as the training samples become more specific (in-domain) to the target video. Training with an additional frame from each video (we added the last frame of each train video) significantly boosts the resulting within-video quality (e.g. row top-30-2, 65.4 → 74.3 mIoU), because the training samples better cover the test domain.
Conclusion. These results show that, when using RGB information (I t), increasing the number of training videos does not improve the resulting quality of our system. Even within a dataset, properly using the training sample(s) from within each video matters more than collecting more videos to build a larger training set.

Generalization across datasets
Section 5.4.1 explored the effect of changing the volume of training data within one dataset; Table 7 compares results when using different datasets for training. Results are obtained using a base model with RGB and flow (M t = f (I t , F t , M t−1 ), no warping, no CRF), ImageNet pre-training, per-dataset training, and no per-video tuning, so as to accentuate the effect of the training dataset. The best performance is obtained when training on the first frames of the target set. There is a noticeable ∼10 percentage point drop when moving to the second best choice (e.g. 80.9 → 67.0 for DAVIS 16). Interestingly, when putting all the datasets together for training ("all-in-one" row, a dataset-agnostic model) the results degrade, reinforcing the idea that "just adding more data" does not automatically improve performance.
Conclusion. Best results are obtained when using training data that focuses on the test video sequences; using similar datasets or combining multiple datasets degrades the performance of our system.

Experimenting with the convnet architecture
Section 3.1 and Figure 3 described two possible architectures for handling I t and F t . The previous experiments are all based on the two-stream architecture. Table 8 compares two streams versus one stream, where the network accepts 5 input channels (RGB + previous mask + flow magnitude) in a single stream: M t = f I+F (I t , F t , w(M t−1 , F t )). Results are obtained using a base model with RGB and optical flow (no warping, no CRF), ImageNet pre-training, per-dataset training, and no per-video tuning.
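Assembling the 5-channel input of the one-stream variant is just channel stacking. A small sketch, assuming channel-last arrays (the actual channel order in the paper's implementation is an assumption here):

```python
import numpy as np

def build_one_stream_input(rgb, prev_mask_warped, flow_mag):
    """Stack RGB (3 channels), the warped previous-frame mask (1) and
    the optical-flow magnitude (1) into the one-stream network input."""
    return np.concatenate(
        [rgb,
         prev_mask_warped[..., None],  # add trailing channel axis
         flow_mag[..., None]], axis=-1)

# DAVIS-sized frame -> (H, W, 5) network input
x = build_one_stream_input(np.zeros((480, 854, 3)),
                           np.zeros((480, 854)),
                           np.zeros((480, 854)))
assert x.shape == (480, 854, 5)
```

The appeal of this design is that further cues (e.g. a semantic segmentation channel, as in Section 6) can be appended the same way without adding a whole new stream.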
We observe that the one-stream and the two-stream architecture with naive averaging perform on par. Using a one-stream network makes the training more affordable and makes it easier to expand the architecture with additional input channels.
Conclusion. The lighter one-stream network performs as well as the network with two streams. We will thus use the one-stream architecture in Section 6.
Table 9 presents an expanded evaluation on DAVIS 16 using the evaluation metrics proposed in [46]. Three measures are used: region similarity in terms of intersection over union (J, higher is better), contour accuracy (F, higher is better), and temporal instability of the masks (T, lower is better). We outperform the competing methods [28,6] on all three measures. Table 10 reports the per-attribute evaluation as defined in DAVIS 16 . LucidTracker is best on 13 out of 15 video attribute categories, showing that it can handle the various video challenges present in DAVIS 16 .

Error analysis
We present the per-sequence and per-frame results of LucidTracker over DAVIS 16 in Figure 10. On the whole we observe that the proposed approach is quite robust: most video sequences reach an average performance above 80 mIoU.
However, by looking at the per-frame results for each video (blue dots in Figure 10) one can see several frames where our approach fails (IoU below 50) to correctly track the object. Investigating those cases closely, we notice conditions where LucidTracker is more likely to fail. The same behaviour was observed across all three datasets. A few representative failure cases are visualized in Figure 9.
Since we use only the annotation of the first frame for training the tracker, a clear failure case is caused by dramatic viewpoint changes of the object relative to its first frame appearance, as in row 5 of Figure 9. The proposed approach also under-performs when recovering from occlusions: it takes several frames for the full object mask to re-appear (rows 1-2 in Figure 9). This is mainly due to the convnet having learnt to follow the previous frame mask. Augmenting the lucid dreams with plausible occlusions might help mitigate this case. Another failure case occurs when two similar looking objects cross each other, as in row 6 of Figure 9. Here both cues, the previous frame guidance and the appearance learnt via per-video tuning, are no longer discriminative enough to correctly continue tracking. We also observe that LucidTracker struggles to track fine structures or details of the object, e.g. the wheels of the bicycle or motorcycle in rows 1-2 of Figure 9. This issue stems from the choice of the underlying convnet architecture: due to its several pooling layers, spatial resolution is lost and hence the fine details of the object are missed. It could be mitigated by switching to more recent semantic labelling architectures (e.g. [49,8]).
Conclusion. LucidTracker shows robust performance across different videos. However, a few failure cases were observed due to the underlying convnet architecture, its training, or limited visibility of the object in the first frame.

Multiple object tracking results
We present here an empirical evaluation of LucidTracker for multiple object tracking task: given a first frame labelled with the masks of several object instances, one aims to find the corresponding masks of objects in future frames.

Experimental setup
Dataset. For multiple object tracking we use the 2017 DAVIS Challenge on Video Object Segmentation 2 [51] (DAVIS 17 ). Compared to DAVIS 16 this is a larger, more challenging dataset, where the video sequences have multiple objects in the scene. Videos that have more than one visible object in DAVIS 16 have been re-annotated (the objects were divided by semantics) and the train and val sets were extended with more sequences. In addition, two other test sets (test-dev and test-challenge) were introduced. The complexity of the videos has increased with more distractors, occlusions, fast motion, smaller objects, and fine structures. Overall, DAVIS 17 consists of 150 sequences, totalling 10 474 annotated frames and 384 objects.
We evaluate our method on two test sets, test-dev and test-challenge, each consisting of 30 video sequences with on average ∼3 objects per sequence and a sequence length of ∼70 frames. For both test sets only the masks on the first frames are made public; the evaluation is done via an evaluation server.
Our experiments and ablation studies are done on the test-dev set.
Evaluation metric. The accuracy of multiple object tracking is evaluated using the region (J) and boundary (F) measures proposed by the organisers of the challenge. The average of the J and F measures is used as the overall performance score. Please refer to [51] for more details about the evaluation protocol.
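The region measure J is the intersection-over-union of predicted and ground-truth masks, and the ranking score is simply the mean of J and F. A minimal sketch of J and the combined score (computing F itself requires a boundary matching step, omitted here):

```python
import numpy as np

def region_similarity(pred, gt):
    """J measure: intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def overall_score(j, f):
    """Challenge ranking score: average of region (J) and boundary (F)."""
    return (j + f) / 2.0

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1     # 16 foreground pixels
pred = np.zeros((8, 8)); pred[2:6, 2:5] = 1  # 12 pixels, all inside gt
# J = 12 / 16 = 0.75
```

For multiple objects, J and F are computed per object instance and then averaged, per the challenge protocol [51].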
Training details. All experiments in this section are done using the single stream architecture discussed in Sections 3.1 and 5.4.3. For training the models we use SGD with mini-batches of 10 images and a fixed learning policy with an initial learning rate of 10^-3. The momentum and weight decay are set to 0.9 and 5·10^-4, respectively. All models are initialized with weights trained for image classification on ImageNet [58]. We then train per-video for 40k iterations.
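These hyper-parameters correspond to a standard momentum-SGD update with L2 weight decay. A per-parameter sketch of one update step with the stated values (an illustration of the update rule, not the authors' training code; whether their framework couples weight decay into the momentum buffer this way is an assumption):

```python
import numpy as np

LR, MOMENTUM, WEIGHT_DECAY = 1e-3, 0.9, 5e-4  # values from the paper

def sgd_step(w, grad, velocity):
    """One SGD update with momentum and L2 weight decay folded
    into the gradient (the common framework convention)."""
    velocity = MOMENTUM * velocity + grad + WEIGHT_DECAY * w
    return w - LR * velocity, velocity

# single scalar weight, one step
w, v = np.array([1.0]), np.array([0.0])
w, v = sgd_step(w, np.array([0.5]), v)
# w is now 1.0 - 1e-3 * (0.5 + 5e-4 * 1.0) = 0.9994995
```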

Key results
Tables 11 and 12 present the results of the 2017 DAVIS Challenge on the test-dev and test-challenge sets [50]. Our main results for the multi-object tracking challenge are obtained via an ensemble of four different models (f I , f I+F , f I+S , f I+F +S ), see Section 3.1.
The proposed system, LucidTracker, provides the best tracking quality on the test-dev set and shows competitive performance on the test-challenge set, holding the second place in the competition. The full system is trained using the standard ImageNet pre-training initialization, Pascal VOC12 semantic annotations for the S t input (∼10k annotated images), and one annotated frame per test video, 30 frames total on each test set. As discussed in Section 6.3, even without S t LucidTracker obtains competitive results (only 2 score points drop).
The top entry lixx [35] uses a deeper convnet model (ImageNet pre-trained ResNet) and a similar pixel-level tracking architecture, trains it over external segmentation data (using ∼120k pixel-level annotated images from MS-COCO and Pascal VOC for pre-training and, akin to [6], fine-tuning on the DAVIS 17 train and val sets, ∼10k annotated frames), and extends it with a box-level object detector (trained over MS-COCO and Pascal VOC, ∼500k bounding boxes) and a box-level object re-identification model trained over ∼60k box annotations (on both images and videos). We argue that our system reaches comparable results with a significantly lower amount of training data. Figure 11 provides qualitative results of LucidTracker on the test-dev set. The video results include successful handling of multiple objects, full and partial occlusions, distractors, small objects, and out-of-view scenarios.
Conclusion. We show that top results for multiple object tracking can be achieved via our approach, which focuses on exploiting as much as possible the available annotation on the first video frame, rather than relying heavily on large external training data. We see that adding extra information (channels) to the system, either optical flow magnitude or semantic segmentation, or both, provides a 1∼2 percentage point improvement. The results show that leveraging semantic priors and motion information provides a complementary signal to the RGB image, and both ingredients contribute to the tracking results.

Ablation study
Combining four different models in an ensemble (f I+F +S + f I+F + f I+S + f I ) enhances the results even further, bringing a 3 percentage point gain.
Conclusion. The results show that both flow and semantic priors provide a complementary signal to the RGB image alone. Despite its simplicity, our ensemble strategy provides an additional gain and leads to competitive results. Notice that even without the semantic segmentation signal S t our ensemble result is competitive.
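One simple way to realize such an ensemble is to average the per-pixel foreground probabilities of the four models before thresholding. The sketch below is a hedged illustration; whether the authors average probabilities or fuse the model outputs differently is an assumption:

```python
import numpy as np

def ensemble_masks(prob_maps, threshold=0.5):
    """Fuse per-model foreground probability maps (one per model,
    e.g. f_I, f_I+F, f_I+S, f_I+F+S) by pixel-wise averaging,
    then threshold to obtain the final binary mask."""
    return (np.mean(prob_maps, axis=0) > threshold).astype(np.uint8)

# four models with different confidences on a tiny 2x2 frame
maps = np.stack([np.full((2, 2), p) for p in (0.9, 0.8, 0.4, 0.3)])
mask = ensemble_masks(maps)  # mean = 0.6 > 0.5 everywhere
```

Averaging lets a confident majority override a single model's mistake, which is where the reported ensemble gain plausibly comes from.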

Error analysis
We present the per-sequence results of LucidTracker on DAVIS 17 in Figure 12 (per-frame results are not available from the evaluation server). We observe that this dataset is significantly more challenging than DAVIS 16 (compare to Figure 10), with only 1/3 of the test videos above 80 mIoU. This shows that multiple object tracking is a much more challenging task than tracking a single object.
The failure cases discussed in Section 5.5 still apply to the multiple objects case. Additionally, on DAVIS 17 we observe a clear failure case when tracking similar looking object instances, where the object appearance is not discriminative enough to correctly track the object, resulting in label switches or bleeding of the label onto other look-alike objects. Figure 13 illustrates this case. This issue could be mitigated by using object-level instance identification modules, like [35], or by changing the training loss of the model to more severely penalize identity switches.
Conclusion. In the multiple object case the LucidTracker results remain robust across different videos. As the overall results are lower than in the single object tracking case, there is more room for future improvement in the multiple object pixel-level tracking task.

Conclusion
We have described a new convnet-based approach for pixel-level object tracking in videos. In contrast to previous work, we show that top results for single and multiple object tracking can be achieved without requiring external training datasets (neither annotated images nor videos). Moreover, our experiments indicate that it is not always beneficial to use additional training data: synthesizing training samples close to the test domain is more effective than adding more training samples from related domains. Our extensive analysis decomposed the ingredients that contribute to our improved results, indicating that our new training strategy and the way we leverage additional cues such as semantic and motion priors are key.
Showing that a convnet for object tracking can be trained with only a few (∼100) training samples changes the mindset regarding how much general knowledge about objects is required to approach this problem [28,26], and, more broadly, how much training data is required to train large convnets depending on the task at hand.
We hope these new results will fuel the ongoing evolution of convnet techniques for single and multiple object tracking.