
1 Introduction

The need for weakly-supervised learning for semantic segmentation has been highlighted recently [2–4]. It is particularly important because acquiring a training set by labeling images manually at the pixel level is significantly more expensive than assigning class labels at the image level. Recent segmentation approaches have used weak annotations in several forms: bounding boxes around objects [5, 6], image labels denoting the presence of a category [2, 3], or a combination of the two [1]. All these previous approaches rely solely on annotations in still images, i.e., bounding boxes and image tags, as a weak form of supervision. Additional cues would naturally help address this challenging problem. As noted in [7], motion is one such cue for semantic segmentation, which helps us identify the extent of objects and their boundaries in the scene more accurately. To our knowledge, motion has not yet been leveraged for weakly-supervised semantic segmentation. In this work, we aim to fill this gap by learning an accurate segmentation model with the help of motion cues extracted from weakly-annotated videos.

Fig. 1. Comparison of state-of-the-art fully [8] and weakly [1] supervised methods with our weakly-supervised M-CNN model.

Our proposed framework is based on fully convolutional neural networks (FCNNs) [8–11], which extend deep CNNs and are able to classify every pixel in an input image in a single forward pass. While FCNNs show state-of-the-art results on segmentation benchmark datasets, they require thousands of pixel-level annotated images for training, which limits their utility. Recently, there have been some attempts [1, 3, 12, 13] to train FCNNs with weakly-annotated images, but they remain inferior in performance to their fully-supervised equivalents (see Fig. 1). In this paper, we develop a new CNN variant named M-CNN, which leverages motion cues in weakly-labeled videos, in the form of unsupervised motion segmentation, e.g., [14]. It builds on the FCNN architecture by adding a motion segmentation based label inference step, as shown in Fig. 2. In other words, predictions from the FCNN layers and motion segmentation jointly determine the loss used to learn the network (see Sect. 3.2).

Our approach uses unsupervised motion segmentation from real-world videos, such as the YouTube-Objects [15] and ImageNet-VID [16] datasets, to train the network. In this context, we are confronted with two main challenges. The first is that even the best-performing algorithms cannot produce good motion segmentations consistently, and the second is the ambiguity of video-level annotations, which cannot guarantee the presence of the object in all the frames. We develop a novel scheme to address these challenges automatically, without any manual annotations apart from the labels assigned at the video level, denoting the presence of objects somewhere in the video. To this end, we use motion segmentations as soft constraints in the learning process, and further refine our network by fine-tuning it on a small number of reliable video shots.

We evaluated the proposed method on two related problems: semantic segmentation and video co-localization. When trained on weakly-annotated videos, M-CNN outperforms the state-of-the-art EM-Adapt [1] on the PASCAL VOC 2012 image segmentation benchmark [17]. Furthermore, our trained model, despite using only 150 video labels, achieves performance similar to EM-Adapt trained on more than 10,000 VOC image labels. Augmenting our training set with 1,000 VOC images results in a further gain, achieving the best performance on the VOC 2012 test set in the weakly-supervised setting (see Sect. 4.4). On the video co-localization task, where the goal is to localize common objects in a set of videos, M-CNN substantially outperforms a recent method [18], by over 16 % on the YouTube-Objects dataset.

The contributions of this work are twofold: (i) We present a novel CNN framework for segmentation that integrates motion cues in video as soft constraints. (ii) Experimental results show that our segmentation model, learned from weakly-annotated videos, can indeed be applied to challenging benchmarks, achieving top performance on semantic segmentation as well as video co-localization tasks.

Fig. 2. Overview of our M-CNN framework, where we show only one frame from a video example for clarity. The soft potentials (foreground appearance) computed from motion segmentation and the FCNN predictions (category appearance) jointly determine the latent segmentation (inferred labels) to compute the loss, and thus the network update.

2 Related Work

In addition to fully-supervised segmentation approaches, such as [19, 20], several weakly-supervised methods have been proposed over the years: some of them use bounding boxes [5, 6], while others rely on image labels [2]. Traditional approaches for this task, such as [2], used a variety of hand-crafted visual features, namely SIFT histograms, color and texture, in combination with a graphical or a parametric structured model. Such early attempts have recently been outperformed by FCNN methods, e.g., [1].

The FCNN architecture [1, 3, 8–13, 21] adapts standard CNNs [22, 23] to handle input images of arbitrary size by treating the fully connected layers as convolutions with kernels of appropriate size. This allows them to output scores for every pixel in the image. Most of these methods [8–11, 21] rely on strong pixel-level annotation to train the network.

Attempts [1, 3, 12, 13] to learn FCNNs for the weakly-supervised case use either a multiple instance learning (MIL) scheme [3, 12] or constraints on the distribution of pixel labels [1, 13] to define the loss function. For example, Pathak et al. [12] extend the MIL framework used for object detection [24, 25] to segmentation by treating the pixel with the highest prediction score for a category as its positive sample when computing the loss. Naturally, this approach is susceptible to standard issues suffered by MIL, like converging to the most discriminative parts of objects [24]. An alternative MIL strategy is used in [3], by introducing a soft aggregation function that translates pixel-level FCNN predictions into an image label distribution. The loss is then computed with respect to the image annotation label and backpropagated to update the network parameters. This strategy works better in practice than [12], but requires training images that contain only a single object, as well as explicit background images. Furthermore, it uses a complex post-processing step involving multi-scale segmentations when testing, which is critical to its performance.

Weakly-supervised FCNNs in [1, 13] define constraints on the predicted pixel labels. Papandreou et al. [1] presented an expectation maximization (EM) approach, which alternates between predicting pixel labels (E-step) and estimating FCNN parameters (M-step). Here, the label prediction step is moderated with cardinality constraints, i.e., at least 20 % of the pixels in an image need to be assigned to each of the image-label categories, and at least 40 % to the background. This approach was extended in [13] to include generic linear constraints on the label space, by formulating label prediction as a convex optimization problem. Both these methods showed excellent results on the VOC 2012 dataset, but are sensitive to the linear/cardinality constraints. We address this drawback in our M-CNN framework, where motion cues act as more precise constraints; Fig. 1 shows the improvement due to these constraints. We demonstrate that FCNNs can be trained with videos, unlike all the previous methods restricted to images, and achieve the best performance while using much less training data.

Weakly-supervised learning is also related to webly-supervised learning. Methods following this recent trend [26–29] are kick-started with either a small number of manually annotated examples, e.g., some fully-supervised training examples for the object detection task in [29], or automatically discovered “easy” samples [28], and then trained with a gradually increasing set of examples mined from web resources. However, none of them address the semantic segmentation problem. Other paradigms related to weakly-supervised learning, such as co-localization [15] and co-segmentation [30] require the video (or image) to contain a dominant object class. Co-localization methods aim to localize the common object with bounding boxes, whereas in co-segmentation, the goal is to estimate pixel-wise segment labels. Such approaches, e.g., [15, 31, 32], typically rely on a pre-computed candidate set of regions (or boxes) and choose the best one with an optimization scheme. Thus, they have no end-to-end learning mechanism and are inherently limited by the quality of the candidates.

3 Learning Semantic Segmentation from Video

We train our network by exploiting motion cues from video sequences. Specifically, we extract unsupervised motion segments from video, with algorithms such as [14], and use them in combination with the weak labels at the video level to learn the network. We sample frames from all the video sequences uniformly, and assign them the class label of the video. This collection forms our training dataset, along with their corresponding motion segments.

The parameters of M-CNN are updated with standard mini-batch SGD, similar to other CNN approaches [1], with the gradient of a loss function. Here, the loss measures the discrepancy between the ground truth segmentation label and the label predicted at each pixel. Thus, in order to learn the network for the semantic segmentation task, we need pixel-level ground truth for all the training data. These pixel-level labels are naturally latent variables in the context of weakly-supervised learning, and the task is to estimate them for our weakly-labeled videos. An ideal scenario in this setting would be near-perfect motion segmentations, which could be used directly as ground truth labels. In practice, however, not only are the segmentations far from perfect (see Fig. 3), but they also fail to capture moving objects in many of the shots. This makes a direct use of motion segmentation results suboptimal. To address this, we propose a novel scheme, where motion segments are used only as soft constraints to estimate the latent variables together with object appearance cues.

The other challenges when dealing with real-world video datasets, such as YouTube-Objects and ImageNet-VID, are related to the nature of video data itself. On the one hand, not all parts of a video contain the object of interest. For instance, a video from a show reviewing boats may contain shots with the host talking about the boat, or showing it from the inside for a significant part of the video, content that is unsuitable for learning a segmentation model for the VOC ‘boat’ category. On the other hand, a long video can contain many nearly identical object examples, which leads to an imbalance in the training set. We address both problems by fine-tuning our M-CNN with an automatically selected, small subset of the training data.

3.1 Network Architecture

Our network is built on the DeepLab model for semantic image segmentation [8]. It is an FCNN, obtained by converting the fully-connected layers of the VGG-16 network [33] into convolutional layers, with a few further changes to produce a dense output for an image at its full resolution efficiently. On top of this architecture, we develop a more principled and effective label prediction scheme that uses motion cues to estimate the latent variables, in contrast to the heuristic size constraints of [1], which is also based on DeepLab.

Fig. 3. Examples highlighting the importance of label prediction for handling imprecise motion segmentations (second column). The soft potentials computed from motion segments along with network predictions produce better labels (third column) to learn the network.

3.2 Estimating Latent Variables with Label Prediction

Given an image of N pixels, let \(\mathbf{p}\) denote the output of the softmax layer of the convolutional network. Then, \(p_i^l \in [0,1]\) is the prediction score of the network at pixel i for label l. The parameters of the network are updated with the gradient of the loss function, given by:

$$\begin{aligned} \mathcal {L}(\mathbf{x}, \mathbf{p}) = -\sum _{i=1}^{N} \sum _{l=0}^{L} \delta (x_i - l) \log (p_i^l), \end{aligned}$$
(1)

where \(\mathbf{x}\) denotes ground truth segmentation labels in the fully-supervised case, \(\mathbf{p}\) is the current network prediction, and \(\delta (x_i - l)\) is the Dirac delta function, i.e., \(\delta (x_i - l) = 1\), if \(x_i = l\), and 0 otherwise. The segmentation label \(x_i\) of pixel i takes values from the label set \(\mathbf{L} = \{0, 1,\ldots , L\}\), containing the background class (0) and L object categories. Naturally, in the weakly-supervised case, ground truth segmentation labels are unavailable, and \(\mathbf{x}\) represents latent segmentation variables, which need to be estimated. We perform this estimation with soft motion segmentation cues in this paper.
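For illustration, here is a minimal NumPy sketch of this per-pixel cross-entropy loss; the array shapes and variable names (probs, labels) are our own and not part of the original implementation.

import numpy as np

def segmentation_loss(probs, labels):
    """Cross-entropy loss of Eq. (1), summed over all pixels.

    probs  : (N, L+1) array of softmax outputs p_i^l.
    labels : (N,) array of segmentation labels x_i in {0, ..., L}
             (ground truth if available, otherwise the current estimate
             of the latent variables).
    """
    n = probs.shape[0]
    # delta(x_i - l) picks out the probability of the assigned label
    picked = probs[np.arange(n), labels]
    return -np.sum(np.log(picked + 1e-12))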

Let \(\mathbf{s} = \{s_i | i = 1,\ldots ,N\}\) denote the motion segmentation, where \(s_i \in \{0,1\}\) indicates whether pixel i belongs to the foreground (1) or the background (0). The regions assigned to foreground can represent multiple objects when the video is tagged with more than one object label. A simple way of transforming motion segmentation labels \(s_i\) into latent semantic segmentation labels \(x_i\) is with a hard assignment, i.e., \(x_i = s_i\). This hard assignment is limited to videos containing a single object label, and also assumes that the motion segments are accurate and can be used as they are. We will see in our experiments that this performs poorly on real-world video datasets (cf. ‘M-CNN* hard’ in Table 1). We address this by using motion cues as soft constraints for estimating the label assignment \(\mathbf{x}\), as described below.

Inference of the Segmentation \(\mathbf{x}\). We compute the pixel-level segmentation \(\mathbf{x}\) as the minimum of an energy function \(E(\mathbf{x})\) defined by:

$$\begin{aligned} E(\mathbf{x}) = \sum _{i \in \mathcal {V}} \left( \psi _i^m(z_i) + \alpha \psi _i^{fc}(p_i^{x_i})\right) ~+~\sum _{(i,j) \in \mathcal {E}} \psi _{ij}(x_i,x_j), \end{aligned}$$
(2)

where \(\mathcal {V} = \{1, 2, \ldots , N\}\) is the set of all the pixels, \(z_i\) denotes the RGB color at pixel i and the set \(\mathcal {E}\) denotes all pairs of neighboring pixels in the image. Unary terms \(\psi _i^m\) and \(\psi _i^{fc}\) are computed from motion cues and current predictions of the network respectively, with \(\alpha \) being a scalar parameter balancing their impact. The pairwise term \(\psi _{ij}\) imposes a smoothness over the label space.

The first unary term \(\psi _i^m\) captures the appearance of all foreground objects obtained from motion segments. To this end, we learn two Gaussian mixture models (GMMs), one each for foreground and background, with RGB values of pixel colors, similar to standard segmentation methods [14, 34]. The foreground GMM is learned with RGB values of all the pixels assigned to foreground in the motion segmentation. The background GMM is learned in a similar fashion with the corresponding background pixels. Given the RGB values of a pixel i, \(\psi _i^m(z_i)\) is the negative log-likelihood of the corresponding GMM (background one for \(l=0\) and foreground otherwise). Using motion cues to generate this soft potential \(\psi _i^m\) helps us alleviate the issue of imperfect motion segmentation. The second unary term \(\psi _i^{fc}\) represents the learned object appearance model determined by the current network prediction \(p_i^{x_i}\) for pixel i, i.e., \(\psi _i^{fc}(p_i^{x_i}) = -\log (p_i^{x_i})\).
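As an illustration of these unary terms, the sketch below fits the two color GMMs from a motion segmentation and evaluates \(\psi^m\); it uses scikit-learn's GaussianMixture as a stand-in for the GMM implementation, and the number of mixture components is an assumption, not a detail taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_motion_gmms(rgb, motion_mask, n_components=5):
    """Fit foreground and background color GMMs from a motion segmentation.

    rgb         : (N, 3) array of pixel colors z_i.
    motion_mask : (N,) binary motion labels s_i (1 = foreground).
    """
    fg = GaussianMixture(n_components).fit(rgb[motion_mask == 1])
    bg = GaussianMixture(n_components).fit(rgb[motion_mask == 0])
    return fg, bg

def motion_unary(rgb, fg, bg):
    """psi^m: negative log-likelihood under the label-appropriate GMM.

    Returns an (N, 2) array: column 0 for the background label,
    column 1 shared by all foreground (object) labels of the video.
    """
    return -np.stack([bg.score_samples(rgb), fg.score_samples(rgb)], axis=1)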

The pairwise term is based on a contrast-sensitive Potts model [34, 35] as:

$$\begin{aligned} \psi _{ij}(x_i,x_j) = \lambda (1 - \Delta (i,j)) (1 - \delta (x_i - x_j)) \frac{\exp (-\gamma ||z_i - z_j||^2)}{\text {dist}(i,j)}, \end{aligned}$$
(3)

where \(z_i\) and \(z_j\) are colors of pixels i and j, \(\lambda \) is a scalar parameter to balance the order of magnitude of the pairwise term with respect to the unary term, and \(\gamma \) is a scalar parameter set to 0.5 as in [14]. The function \(\text {dist}(i,j)\) is the Euclidean distance between pixels. The Dirac delta function \(\delta (x_i - x_j)\) ensures that the pairwise cost is only applicable when two neighboring pixels take different labels. In addition to this, we introduce the term \((1 - \Delta (i,j))\), where \(\Delta (i,j) = 1\) if pixels i and j both fall in the boundary region around the motion segment, and 0 otherwise. This accounts for the fact that motion segments may not always respect color boundaries, and allows the minimization algorithm to assign different labels to neighboring pixels around motion edges.
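A direct transcription of this pairwise weight, for a single pair of neighboring pixels, could look as follows; the value of lam is illustrative, while gamma follows the 0.5 used above.

import numpy as np

def pairwise_weight(z_i, z_j, dist_ij, on_motion_boundary, lam=1.0, gamma=0.5):
    """Edge weight of Eq. (3), paid only when x_i != x_j.

    The (1 - Delta(i, j)) factor disables the penalty for pixel pairs
    lying in the boundary region around the motion segment.
    """
    if on_motion_boundary:  # Delta(i, j) = 1
        return 0.0
    diff = np.asarray(z_i, float) - np.asarray(z_j, float)
    contrast = np.exp(-gamma * np.sum(diff ** 2))
    return lam * contrast / dist_ij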

We minimize the energy function (2) with an iterative GrabCut-like [34] approach, wherein we first apply the alpha expansion algorithm [36] to get a multi-label solution, use it to re-estimate the (background and foreground) GMMs, and then repeat the two steps a few times. We highlight the importance of our label prediction technique with soft motion-cue constraints in Fig. 3. Here, the original, binary motion predictions are imprecise (bottom) or incorrect (top), whereas using them as soft constraints in combination with the network prediction results in a more accurate estimation of the latent segmentation variables.
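The overall inference procedure can be summarized by the following structural sketch. It reuses fit_motion_gmms from the earlier snippet; build_unary, build_pairwise and alpha_expansion are placeholders for assembling the potentials of Eq. (2) and for an off-the-shelf multi-label graph-cut solver, so this is an outline of the alternation rather than the paper's actual implementation.

def infer_latent_labels(rgb, motion_mask, net_probs, video_labels,
                        alpha=1.0, n_iters=4):
    """GrabCut-style alternation: graph-cut inference + GMM re-estimation."""
    fg, bg = fit_motion_gmms(rgb, motion_mask)       # initial color models
    labels = motion_mask.copy()                      # initial label estimate
    for _ in range(n_iters):
        # psi^m from the GMMs, alpha * psi^fc from the network predictions
        unary = build_unary(rgb, fg, bg, net_probs, video_labels, alpha)
        pairwise = build_pairwise(rgb, motion_mask)  # Eq. (3) edge weights
        labels = alpha_expansion(unary, pairwise)    # placeholder solver [36]
        # re-estimate the color models from the current segmentation
        fg, bg = fit_motion_gmms(rgb, (labels > 0).astype(int))
    return labels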

3.3 Fine-Tuning M-CNN

We learn an initial M-CNN model from all the videos in the dataset which have sufficient motion information (see Sect. 4.2 for implementation details). To refine this model, we add a fine-tuning step, which updates the parameters of the network with a small set of unique and reliable video examples. This set is built automatically by selecting one shot from each video sequence, namely the one whose motion segment has the highest overlap (intersection over union) score with the current M-CNN prediction. The intuition behind this selection criterion is that our M-CNN has already learned to discriminate the categories of interest from the background, and thus its predictions will overlap most with precise motion segmentations. This refinement leverages the most reliable exemplars and avoids the near duplicates that often occur within one video. In Sect. 4.3 we demonstrate the importance of this step for dealing with real-world, non-curated video data.
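A possible sketch of this selection step is shown below; the data layout (videos as a dictionary of per-shot mask pairs) is purely illustrative, and the 0.2 overlap threshold follows the value reported in Sect. 4.2.

import numpy as np

def binary_iou(a, b):
    """IoU between two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def select_shot_per_video(videos, min_overlap=0.2):
    """Pick, for each video, the shot whose motion segmentation best
    agrees with the current M-CNN prediction.

    videos : dict mapping video id -> list of (motion_mask, pred_mask)
             pairs, one per shot (both binary arrays of the same shape).
    """
    selected = {}
    for vid, shots in videos.items():
        ious = [binary_iou(m, p) for m, p in shots]
        best = int(np.argmax(ious))
        if ious[best] >= min_overlap:   # keep only reliable shots
            selected[vid] = best
    return selected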

4 Results and Evaluation

4.1 Experimental Protocol

We trained our M-CNN in two settings. The first one is on purely video data, and the second on a combination of image and video data. We performed experiments primarily with the weakly-annotated videos in the YouTube-Objects v2.2 dataset [37]. Additionally, to demonstrate that our approach adapts to other datasets automatically, we used the ImageNet video (ImageNet-VID) dataset [16]. The weakly-annotated images to train our network jointly on image and video data were taken from the training part of the PASCAL VOC 2012 segmentation dataset [17] with their image tags only. We then evaluated variants of our method on the VOC 2012 segmentation validation and test sets.

The YouTube-Objects dataset consists of 10 classes, with 155 videos in total. Each video is annotated with one class label and is split automatically into shots, resulting in 2511 shots overall. In a subset of the shots, one frame is annotated with a bounding box; we use these annotations exclusively for evaluating our video co-localization performance in Sect. 4.5. For experiments with ImageNet-VID, we use 795 training videos corresponding to the 10 classes in common with YouTube-Objects. ImageNet-VID has bounding box annotations produced semi-automatically for every frame in a video shot (2120 shots in total). We accumulate these labels over a shot and assign them as class labels for the entire shot. As in the case of YouTube-Objects, we only use class labels at the video level, and none of the available additional annotations.

The VOC 2012 dataset has 20 foreground object classes and a background category. It is split into 1464 training, 1449 validation and 1456 test images. For experiments dealing with the subset of 10 classes in common with YouTube-Objects (see the list in Table 1), we treat the remaining 10 VOC classes as irrelevant: we exclude all the training/validation images which contain only the irrelevant categories, resulting in 914 training and 909 validation images. In images that contain an irrelevant class together with any of the 10 classes in YouTube-Objects, we treat the corresponding pixels as background for evaluation. Some of the state-of-the-art methods [1, 13] use an augmented version of the VOC 2012 dataset, with over 10,000 additional training images [38]. Naturally, the variants trained on this larger dataset perform significantly better than those using the original VOC dataset. We do not use this augmented dataset in our work, yet still report state-of-the-art results, thanks to our motion cues.

The segmentation performance of all the methods is measured as the intersection over union (IoU) score between the predicted segmentation and the ground truth. We compute IoU for each class as well as the average over all the classes, including background, following standard protocols [1, 17]. We also evaluate our segmentation results in the co-localization setting with the CorLoc measure [14, 15, 31], defined as the percentage of images for which the IoU between the ground-truth and predicted bounding boxes exceeds 0.5.
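Both measures are straightforward to implement; the sketch below shows one possible version, with function names and the box convention ((x1, y1, x2, y2) corners) chosen for illustration.

import numpy as np

def class_iou(pred, gt, label):
    """IoU of one class between predicted and ground-truth label maps."""
    p, g = pred == label, gt == label
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union > 0 else float('nan')

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, thresh=0.5):
    """CorLoc: fraction of frames whose predicted box has IoU > thresh."""
    return float(np.mean([box_iou(p, g) > thresh
                          for p, g in zip(pred_boxes, gt_boxes)]))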

4.2 Implementation Details

Motion Segmentation. In all our experiments we used [14], a state-of-the-art method for motion segmentation. We perform two pruning steps before training the network. First, we discard all shots with fewer than 20 frames (\(2 \times \) the batch size of our SGD training). Second, we remove shots lacking relevant motion information, i.e., those where (i) there are almost no motion segments, or (ii) a significant part of the frame is assigned to foreground. We prune these with a simple criterion based on the size of the foreground segments: we keep only the shots where the estimated foreground occupies between 2.5 % and 50 % of the frame area in each frame, for at least 20 contiguous frames in the shot. In cases where motion segmentation fails in the middle of a shot, but recovers later, producing several valid sequences, we keep the longest one. These two steps combined remove about a third of the shots, leaving 1675 and 1691 shots in YouTube-Objects and ImageNet-VID respectively. We sample 10 frames uniformly from each of the remaining shots to train the network.
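The shot pruning and frame sampling can be sketched as follows; the helper name and the mask representation are ours, while the 2.5 %/50 % area bounds, the 20-frame minimum and the 10 sampled frames follow the values above.

import numpy as np

def prune_and_sample(motion_masks, min_frac=0.025, max_frac=0.5,
                     min_len=20, n_samples=10):
    """Return indices of 10 uniformly sampled frames from the longest
    contiguous run of frames with a plausible foreground area, or None
    if the shot should be discarded.

    motion_masks : list of binary (H, W) foreground masks, one per frame.
    """
    valid = [min_frac <= m.mean() <= max_frac for m in motion_masks]
    best, start, best_len = None, None, 0
    for i, v in enumerate(valid + [False]):      # sentinel closes the last run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start > best_len:
                best, best_len = (start, i), i - start
            start = None
    if best is None or best_len < min_len:
        return None                              # shot pruned
    return np.linspace(best[0], best[1] - 1, num=n_samples, dtype=int)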

Training. We use a mini-batch of size 10 for SGD, where each mini-batch consists of the 10 frame samples of one shot. Our CNN learning parameters follow the setting in [1]. The initial learning rate is set to 0.001 and multiplied by 0.1 after a fixed number of iterations. We use a momentum of 0.9 and a weight decay of 0.0005. Also, the loss term \(\delta (x_i - l) \log (p_i^l)\) in (1), computed for each object class l with \(\text {num}_l\) training samples, is weighted by \(\min _{j=1 \dots L}\text {num}_j/\text {num}_l\). This accounts for the imbalanced number of training samples per class in the dataset.
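This per-class weighting amounts to the following computation; the dictionary layout is illustrative.

def class_weights(num_samples):
    """Loss weights min_j(num_j) / num_l for each object class l.

    num_samples : dict mapping class -> number of training samples.
    The rarest class gets weight 1.0; a class with twice as many
    samples gets weight 0.5, and so on.
    """
    min_count = min(num_samples.values())
    return {l: min_count / n for l, n in num_samples.items()}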

In the energy function (2), the parameter \(\alpha \), which controls the relative importance of the current network prediction and the soft motion cues, is set to 1 when training on the entire dataset. It is increased to 2 for fine-tuning, where the predictions are more reliable due to an improved network. We perform 4 iterations of the graph cut based inference algorithm, updating the GMMs at each step. The inference algorithm is either alpha expansion (for videos with multiple object labels) or graph cut (when there is only one object label for the video). Following [14], we learn the GMMs for a frame t with the motion segments from all the 10 frames in a batch, weighting each of them inversely according to their distance from t. The fine-tuning step is performed very selectively, using only the best shot of each video, provided its average overlap is no less than 0.2. A systematic evaluation on the VOC 2012 validation set confirmed that the performance is not sensitive to the number of iterations or the \(\alpha \) parameter. More details on this and on our implementation in the Caffe framework [39] are available online [40].

Table 1. Performance of M-CNN and EM-Adapt variants, trained with YouTube-Objects, on the VOC 2012 validation set. ‘*’ denotes the M-CNN models without fine-tuning. ‘M-CNN* hard’ is the variant without the label prediction step. ‘M-CNN’ is our complete method: with fine-tuning and label prediction.

4.3 Evaluation of M-CNN

We start by evaluating the different components of our M-CNN approach and compare to the state-of-the-art EM-Adapt method, see Table 1. We train EM-Adapt and M-CNN with the pruned shots from our YouTube-Objects training set in two network settings: large and small field of view (FOV). The large FOV is 224\(\times \)224, while the small FOV is 128\(\times \)128. We learn 5 models, which differ in the order of the training samples and in their random transformations (cropping, mirroring), and report the mean score and standard deviation.

The small FOV M-CNN without the fine-tuning step achieves an IoU of 26.7 %, whereas the large FOV variant gives 33.6 % on the PASCAL VOC 2012 validation set. In contrast, EM-Adapt [1] trained on the same dataset performs poorly with large FOV. Furthermore, both EM-Adapt variants perform worse than our M-CNN. This is because EM-Adapt uses a heuristic (where background and foreground are constrained to a fraction of the image area) to estimate the latent segmentation labels, and fails to leverage the weak supervision in our training dataset effectively. Our observation on this failure of EM-Adapt is further supported by the analysis in [1], which notes that a large FOV network performs worse than its small FOV counterpart when only a “small amount of supervision is leveraged”. The label prediction step (Sect. 3.2) proposed in our method leverages the training data better than EM-Adapt, by optimizing an energy function involving soft motion constraints and network responses. We also evaluated the significance of using motion cues as soft constraints (M-CNN*) instead of introducing them as hard labels (M-CNN* hard), i.e., directly using the motion segmentation result as latent labels \(\mathbf{x}\): ‘M-CNN* hard’ achieves 29.9 % compared to 33.6 % with soft constraints. We then take our best variant (M-CNN with large FOV) and fine-tune it, improving the performance further to 41.2 %. In all the remaining experiments, we use the best variants of EM-Adapt and M-CNN.

Table 2. Performance of our M-CNN variants on the VOC 2012 validation set is shown as IoU scores. We also compare with the best variants of EM-Adapt [1] trained on YouTube-Objects (YTube), ImageNet-VID (ImNet), VOC, and augmented VOC (VOC aug.) datasets. \(\dagger \) denotes the average result of 5 trained models.

4.4 Training on Weakly-Annotated Videos and Images

We also trained our M-CNN with weakly-annotated videos and images. To this end, we added the 914 images from the VOC 2012 training set that contain the 10 classes, using only their weak annotations, i.e., image-level labels. In this setting, we first trained the network with the pruned video shots from YouTube-Objects, fine-tuned it with a subset of shots (as described in Sect. 3.3), and then performed a second fine-tuning step with these selected video shots and the VOC images. To estimate the latent segmentation labels, we use our optimization framework (Sect. 3.2) when the training sample comes from the video dataset, and the EM-Adapt label prediction step when it comes from the VOC set. We could alternatively use our framework with only the network prediction component for images, but this is not viable when training on classes without video data, i.e., the remaining 10 classes in VOC. As shown in Table 2, using image data, with additional object instances, improves the IoU score from 41.2 to 47.2. In comparison, EM-Adapt re-trained for 10 classes on the original VOC 2012 achieves only 35.8. Augmenting the dataset with several additional training images [38] improves it to 40.2, but this remains considerably lower than our result. M-CNN trained on ImageNet-VID achieves 39.0 (ImNet in the table), which is comparable to our result with YouTube-Objects. The performance is significantly lower for the motorbike class (15.3 vs 32.4) owing to the small number of video shots available for training: we only have 67 shots here, compared to 272 in YouTube-Objects. Augmenting this dataset with VOC images boosts the performance to 43.7 (VOC+ImNet). Augmenting the training set with additional images (VOC aug.) increases the performance further.

Fig. 4. Sample results on the VOC 2012 validation set. Results of the fully-supervised DeepLab [8], the weakly-supervised EM-Adapt [1] trained on augmented VOC, and our weakly-supervised M-CNN trained on VOC+YouTube-Objects are shown in the 2nd, 3rd and 4th columns respectively. (Best viewed in color.)

Qualitative Results. Figure 4 shows qualitative results of M-CNN (trained on VOC and YouTube-Objects) on a few sample images. Our results show much more accurate object boundaries than the best variant of EM-Adapt [1], which tends to localize the object well but produces a ‘blob-like’ segmentation; see the last three rows of the figure in particular. The first two rows show example images containing multiple object categories. M-CNN recognizes object classes, e.g., the cow in row 4, more accurately than EM-Adapt, which confuses cow (shown in green) with horse (shown in magenta). Furthermore, our segmentation results compare favorably with the fully-supervised DeepLab [8] approach (see rows 3–4), highlighting the impact of motion for learning segmentation. There is scope for further improvement, e.g., overcoming the confusion between similar classes in close proximity to each other, as in the challenging case of cat vs. dog in row 2.

Table 3. Evaluation on the VOC 2012 test set shown as IoU scores.

Comparison to the State of the Art. Table 3 shows the evaluation on the VOC 2012 test set, with our M-CNN trained on all 20 classes, using image and video data for 10 classes and image data only for the other 10. We performed this by uploading our segmentation results to the evaluation server, as ground truth is not publicly available for the test set. We compare with several state-of-the-art methods, with scores taken directly from the respective publications, except for [1] without the post-processing CRF step; this result, shown as ‘[1]’ in the table, is obtained with a model we trained on the augmented VOC dataset. We train M-CNN on all 20 VOC classes by starting from the model trained (and fine-tuned) on YouTube-Objects and performing a second fine-tuning step with both YouTube-Objects videos and VOC images. This achieves 39.8 mean IoU over all the 20 classes, and 49.6 on the 10 classes with video data. This result is significantly better than methods using only weak labels, which achieve 25.7 [12], 35.6 [13] and 35.2 [1]. The improvement shown by our M-CNN is more prominent when we consider the average over the 10 classes with soft motion segmentation cues and the background, with a boost of nearly 10 % and 9 % over [1, 13] respectively. We also show the evaluation of the model trained on ImageNet-VID in the table.

A few methods have used additional information in the training process, such as the size of objects (+ sz in the table), superpixel segmentation (+ sp), or post-processing steps, e.g., introducing a CRF with pairwise terms learned from fully-annotated data (+ CRF), or even strong or full supervision, such as bounding box (+ bb) or pixel-level segmentation (+ seg) annotations. Even though our purely weakly-supervised method is not directly comparable to these approaches, we include their results in the table for completeness. Nevertheless, M-CNN outperforms some of these methods [1, 3], due to our effective learning scheme. Also, as Table 3 shows, the number of training samples used for M-CNN (number of video shots + number of VOC training images) is significantly lower than for all the other methods.

Table 4. Co-localization performance of M-CNN on the YouTube-Objects dataset. We report per class and average CorLoc scores, and compare with state-of-the-art unsupervised and weakly-supervised methods.

4.5 Co-localization

We perform co-localization in the standard setting, where videos contain a common object. Here, we use our M-CNN trained on the YouTube-Objects dataset with 10 categories. We evaluate it on all the frames in YouTube-Objects to obtain prediction scores \(\mathbf{p}_i\) for each pixel i. With these scores, we compute a foreground GMM from pixels with high predictions for the object category, and a background GMM in a similar fashion. These form the unary term \(\psi ^m_i\) in the energy function (2). We then minimize this function with graph cut based inference to compute binary (object vs background) segmentation labels. Since we estimate segmentations for all the video frames, we do this at the superpixel level [42] to reduce the computation cost. We then extract the bounding box enclosing the largest connected component in each frame, and evaluate it following [15]. Quantitative results are summarized as CorLoc scores in Table 4. Our result outperforms the previous state of the art [18] by over 16 %. Performing this experiment with ImageNet-VID data, we obtain 42.1 on average, compared to 37.9 for [14]. ImageNet-VID being a more challenging dataset than YouTube-Objects, both methods achieve lower performance on it.
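Extracting the box from the binary segmentation of a frame can be done as in the following sketch, which relies on SciPy's connected-component labeling; the function name and box convention are ours.

import numpy as np
from scipy import ndimage

def box_from_segmentation(binary_mask):
    """Bounding box (x1, y1, x2, y2) of the largest connected
    foreground component of a binary segmentation mask."""
    labeled, n = ndimage.label(binary_mask)
    if n == 0:
        return None                       # no foreground in this frame
    sizes = ndimage.sum(binary_mask, labeled, index=range(1, n + 1))
    largest = labeled == (int(np.argmax(sizes)) + 1)
    ys, xs = np.where(largest)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())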

5 Summary

This paper introduces a novel weakly-supervised learning approach for semantic segmentation, which uses only class labels assigned to videos. It integrates motion cues computed from video as soft constraints into a fully convolutional neural network. Experimental results show that our soft motion constraints can handle noisy motion information and improve significantly over the heuristic size constraints used by state-of-the-art approaches for weakly-supervised semantic segmentation, i.e., EM-Adapt [1]. We show that our approach outperforms previous state of the art [1, 13] on the PASCAL VOC 2012 image segmentation dataset, thereby overcoming domain-shift issues typically seen when training on video and testing on images. Furthermore, our weakly-supervised method shows excellent results for video co-localization and improves over several methods [14, 18, 31].