1 Introduction

Unsupervised learning is one of the most difficult and interesting problems in computer vision and machine learning today. Many researchers believe that learning from large collections of unlabeled videos could help answer hard questions regarding the nature of intelligence and learning. Moreover, as unlabeled images and videos are easy to collect at relatively low cost, unsupervised learning could be of real practical value in many computer vision and robotics applications. In this article, we propose a novel approach to unsupervised learning that successfully tackles many of the challenges associated with this task. We present a system composed of two main pathways: the teacher branch, which performs unsupervised object discovery in videos or large image collections, and the student branch, which learns from the teacher to segment foreground objects in single images. The unsupervised learning process can continue over several generations of students and teachers. In Algorithm 1, we present the high-level description of our method. Throughout the paper we use the terms “generation” and “iteration” of Algorithm 1 interchangeably. The key aspects of our approach, which ensure improvement in performance from one generation to the next, are: (1) the existence of an unsupervised selection module that is able to pick out good-quality masks generated by the teacher and pass them for training to the next-generation students; (2) training of multiple students with different architectures, which through their diversity help train a better selection module for the next iteration and, together with that selection module, form a more powerful teacher pathway at the next iteration; and (3) access to larger quantities of potentially more complex unlabeled data, which becomes more useful as the generations become stronger.

figure a

Our approach is general in the sense that the student or teacher pathways do not depend on a specific neural network architecture or implementation. Through many experiments and comparisons to state of the art methods, we also show that it is applicable to different tasks in computer vision, such as object discovery in video, unsupervised image segmentation, saliency detection and transfer learning. A preliminary version of our work, presenting an algorithm without learning over several generations and without experiments on saliency detection and transfer learning, appeared at ICCV 2017 (Croitoru et al. 2017).

In Fig. 1 we present a graphic overview of our full system. In the unsupervised training stage the student network (module A) learns, frame by frame, from an unsupervised teacher pathway (modules B and C) to produce similar object masks in single images. Module B discovers objects in images or videos, while module C selects which masks produced by module B are sufficiently good to be passed to module A for training. Thus, the student branch tries to imitate the output of module B for the frames selected by module C, having as input only a single image—the current frame, while the teacher can have access to an entire video sequence.

The strength of the trained student (module A) depends on the performance of module B. However, as we see in experiments, the power of the selection module C allows the newly trained student to outperform its initial teacher, module B. Therefore, throughout the paper we refer to B as the initial “teacher” and to B and C together as the full “teacher pathway”. The method presented in Algorithm 1 follows the main steps of the system as it learns from one iteration (generation) to the next. The steps are discussed in more detail in Sect. 3.

Fig. 1
figure 1

The dual student-teacher system proposed for unsupervised learning to segment foreground objects in images, functioning as presented in Algorithm 1. It has two pathways: along the teacher branch, an object discoverer in videos or large image collections (module B) detects foreground objects. The resulting soft masks are then filtered based on an unsupervised data selection procedure (module C). The resulting final set of pairs—input image (or video frame) and soft mask for that particular frame (which acts as an unsupervised label)—are used to train the student pathway (module A). The whole process can be repeated over several generations. At each generation several student CNNs are trained, then they collectively contribute to train a more powerful selection module C (modeled by a deep neural network, Sect. 4.3) and form an overall more powerful teacher pathway at the next iteration of the overall algorithm

During the first iteration of Algorithm 1, the unsupervised teacher (module B) has access to information over time, in the form of a video. In contrast, the student is deeper in structure, but it has access only to a single image, the current video frame. Thus, the information discovered by the teacher in time is captured by the student in added depth, over neural layers of abstraction. Several student nets with different architectures are trained at the first iteration. In order to use only good-quality masks as supervisory signal, an unsupervised mask selection procedure (very simple at Iteration 1) is applied (module C), as explained in Sect. 4. Once several student nets are trained, they can form (in various ways, as explained in Sects. 4.1 and 5.1) the teacher pathway at the next iteration, along with a stronger unsupervised selection module C, represented by a deep neural network, EvalSeg-Net, trained as explained in detail in Sect. 4.3. In short, EvalSeg-Net learns to predict the agreement among the output masks of the generally diverse students, which statistically takes place when the masks are of good quality. Thus EvalSeg-Net can be used as an unsupervised mask evaluation procedure and a strong selection module. Then, at the next generation, we run the newly formed teacher pathway (modules B and C) on a larger set of unlabeled videos or collections of images, to produce the supervisory signal for the next-generation students. In experiments, we show that the improvement of both modules B and C at the next iterations, together with the increase in the amount of data, are all important, though not all necessary, for increasing accuracy at the next generation.

Note that, while at the first iteration the teacher pathway is required to receive video sequences as input, from the second generation on, it could receive as input large image collections, as well. Due to the very high computational and storage costs, required during training time, we limit our experiments to learning over two generations, but our algorithm is general and could run over many iterations. We show in extensive experiments that even two generations are sufficient to outperform the current state of the art on object discovery in videos and images. We also demonstrate experimentally a solid improvement from one generation to the next for each component involved: the individual students (module A), the teacher (module B), as well as the selection module C.

Now we enumerate the main contributions of our approach and also point out, where it is the case, the key contributions that were not published in our ICCV2017 conference paper (Croitoru et al. 2017):


We introduce a novel approach to unsupervised learning to segment foreground objects in images. The overview of our system and algorithm are presented in Fig. 1 and Algorithm 1. The system has two main pathways—one that acts as a teacher (module B) and discovers objects in videos or large collections of images followed by an unsupervised selection module C that filters out low quality masks, and the other that acts as student and learns from the teacher pathway to detect the foreground objects in single input images. We provide a general algorithm for unsupervised learning over several generations of students and teachers. In addition to our conference paper, we show how to learn an unsupervised mask selection deep network (EvalSeg-Net, see Sect. 4.3), which is important in improving the teacher pathway at the next iteration, over all cases tested: when the teacher (module B) is formed by a single student network, by all students combined into an ensemble, or by all students taken separately. The whole unsupervised training at the second generation is a novelty over the conference work, with significantly improved experimental results (see Sect. 5).


At a higher level, our proposed algorithm is sufficiently general to accommodate different implementations and neural network architectures. In this paper, we also provide a specific implementation, which we describe in detail. We demonstrate its performance on three unsupervised learning tasks, namely video object discovery tested on YouTube Objects (Prest et al. 2012), unsupervised foreground segmentation in images tested on Object Discovery in Internet Images (Rubinstein et al. 2013) and saliency detection tested on Pascal-S (Li et al. 2014), on which we obtain state of the art results. We further apply our approach to a well-known transfer learning setup and obtain competitive results when compared to the top transfer learning methods in the field. We also compare our method experimentally to the work most related to ours, Pathak et al. (2017), on both foreground segmentation and transfer learning tasks, and show that our method obtains better results on foreground object segmentation, while theirs is more effective for transfer learning. To the best of our knowledge, ours is one of the first two methods, along with the work of Pathak et al. (2017), to propose a system that learns to detect and segment foreground objects in images in an unsupervised fashion, with no pre-trained features given or manual labeling, while requiring only a single image at test time. Our experiments on image saliency and transfer learning are completely new additions over our conference paper.

2 Scientific Context

The literature on unsupervised learning follows two main directions. (1) One is to learn powerful features in an unsupervised way and then use them for transfer learning, within a supervised scheme and in combination with different classifiers, such as SVMs or CNNs (Radenović et al. 2016; Misra et al. 2016; Li et al. 2016). (2) The second direction is to discover, at test time, common patterns in unlabeled data, using clustering, feature matching or data mining formulations (Jain et al. 1999; Cho et al. 2015; Sivic et al. 2005).

Belonging to the first category and closely related to our work, the approach in Pathak et al. (2017) proposes a system in which a deep neural network learns to produce soft object masks from an unsupervised module that uses optical flow cues in video. The deep features learned in this manner are then applied to several transfer learning tasks. Their work and ours are probably the first two that show ways to learn in an unsupervised fashion to segment objects in single images. While the two approaches are clearly different at the technical and algorithmic level, we also perform some interesting comparisons in the experiments of Sect. 5.3, on both tasks of transfer learning and foreground object segmentation. Our results reveal that while their approach is better on transfer learning tasks, ours is more effective on unsupervised segmentation as tested on several datasets.

Recently, researchers have started to use the natural, spatial and temporal structure in images and videos as supervisory signals in unsupervised learning approaches that are considered to follow a self-supervised learning paradigm (Raina et al. 2007; Lee et al. 2017; Wang and Gupta 2015). Methods that fall into this category include those that learn to estimate the relative patch positions in images (Doersch et al. 2015), predict color channels (Larsson et al. 2016), solve jigsaw puzzles (Noroozi and Favaro 2016) and inpaint (Pathak et al. 2016). One trend is to use spatial and appearance information collected from raw single images as supervisory signal. In such single-image cases the amount of information that can be learned is limited to a single moment in time, as opposed to the case of learning from video sequences. Using unlabeled videos as input is more closely related to our work and includes learning to predict the temporal order of frames (Lee et al. 2017), generate the future frame (Finn et al. 2016; Xue et al. 2016; Goroshin et al. 2015) or learn from optical flow (Wang and Gupta 2015).

For most of these papers, the unsupervised learning scheme is only an intermediate step to train features that are eventually used on classic supervised learning tasks, such as object classification, object detection or action recognition. Such pre-trained features perform better than randomly initialized ones, as they contain valuable information implicit in the natural structure of the world used as supervisory signal. While the unsupervised features might not contain semantic, class-specific information (Bau et al. 2017), it is clear that they capture general objectness properties, useful for tasks such as segmenting the main objects in the scene or transfer-learning to specific supervised classification problems. In our work, we focus mostly on specific unsupervised tasks on which we perform extensive evaluations, but we also show some results on transfer learning experiments.

The second main approach to unsupervised learning includes methods for image co-segmentation (Joulin et al. 2010, 2012; Kim et al. 2011; Rubinstein et al. 2013; Kuettel et al. 2012; Vicente et al. 2011; Rubio et al. 2012; Leordeanu et al. 2012) and weakly supervised localization (Deselaers et al. 2012; Nguyen et al. 2009; Siva et al. 2013). Earlier methods are based on local features matching and detection of their co-occurrence patterns (Stretcu and Leordeanu 2015; Sivic et al. 2005; Leordeanu et al. 2005; Parikh and Chen 2007; Liu and Chen 2007), while more recent ones (Joulin et al. 2014; Rochan and Wang 2014; Prest et al. 2012) discover object tubes by linking candidate bounding boxes between frames with or without refining their location. Traditionally, the task of unsupervised learning from image sequences has been formulated as a feature matching or data clustering optimization problem, which is computationally very expensive due to its combinatorial nature.

There are also other papers (Lee et al. 2011; Cheng et al. 2017; Dutt Jain et al. 2017; Tokmakov et al. 2017) that tackle unsupervised learning tasks but are not fully unsupervised, using powerful features that are pre-trained in supervised fashion on large datasets, such as ImageNet (Russakovsky et al. 2015) or VOC2012 (Everingham et al. 2015). Such works take advantage of the rich source of supervised information learned from other datasets, through features trained to respond to general object properties over tens or hundreds of object categories. In other work some amount of supervision is necessary, as in Tokmakov et al. (2016), where the proposed system has a motion-CNN that learns from weakly annotated videos and optical flow cues to segment objects in video frames. One key difference from our work is that their approach requires the class labels of the training video frames.

With respect to the end goal, our work is more related to the second research direction, on unsupervised discovery in video. However, unlike that research, we do not discover objects at test time, but during the unsupervised training process, when the student pathway learns to detect foreground objects. Therefore, from the learning perspective, our work is more related to the first research direction based on self-supervised training. While there are published methods that leverage spatiotemporal information in video, our method at the second iteration is able to learn even from collections of unrelated images. Related to our idea of improving segmentations over several iterations, there is the method proposed in Khoreva et al. (2017), which is not unsupervised as it requires the ground truth bounding box information in order to improve the segmentations over several iterations, in conjunction with a modified version of GrabCut (Rother et al. 2004).

3 Overall Approach

We propose a genuinely unsupervised learning algorithm (see Algorithm 1) for foreground object segmentation that offers the possibility to improve over several iterations. Our method combines in complementary ways multiple modules that are well suited for this task.

It starts with a teacher (module B, Fig. 1) that discovers objects in unlabeled videos and produces a soft mask of the foreground object in each frame. There are several available methods for video discovery in the literature, with good performance (Borji et al. 2012; Cheng et al. 2015; Barnich and Van Droogenbroeck 2011). We chose the VideoPCA algorithm introduced as part of the system in Stretcu and Leordeanu (2015) because it is very fast (50–100 fps), uses very simple features (individual pixel colors) and is completely unsupervised, with no usage of supervised pre-trained features. It learns how to separate the foreground from the background by exploiting the spatio-temporal consistency in appearance, shape, movement and location of objects, common in video shots, along with the contrasting properties, in size, shape, motion and location, between the main object and the background scene. Note that it would be much harder, at this first stage, to discover objects in collections of unrelated images, where there is no smooth variation in shape, appearance and location over time. Only at the second iteration of the algorithm is the simpler VideoPCA replaced by a more powerful teacher, which is able to discover objects in collections of images as well.

The resulting soft-masks of lower quality are then filtered out automatically (module C, Fig. 1), using at the first iteration a very simple automatic procedure. Next, the remaining ones are passed to a student ConvNet, which learns to predict object masks in single images. When several student nets of different architectures are trained, they make it possible to learn a stronger selection network (module C, Fig. 1) and to form a more powerful teacher pathway (modules B and C, Fig. 1) for the next generation. Then, the whole process is repeated. As discussed in Sect. 1, three key aspects contribute to improvement at the second iteration: learning a more powerful teacher (module B), which could be formed by a single student model or an entire ensemble; learning a stronger selection module C (modeled by an EvalSeg-Net, Sect. 4.3); and, last but not least, increasing the amount of unlabeled data. As shown in the experiments of Sect. 5.1, bringing in more data helps only at the second generation, when both the teacher and the selection module are improved. In Algorithm 1 we enumerate concisely the main steps of our approach.
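The loop over generations described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `run_generation`, the callable signatures, and the toy teacher, selector and trainer are all hypothetical stand-ins chosen only to make the control flow concrete (teachers label frames, the selector filters the masks, students are trained on the surviving pairs and then become the next teachers).

```python
import numpy as np


def run_generation(frames, teachers, selector, trainers):
    """One generation: teachers (module B) label frames, the selector
    (module C) filters masks, and each trainer fits one student (module A)."""
    pairs = []
    for teacher in teachers:
        masks = teacher(frames)  # a teacher may look at the whole sequence
        for frame, mask in zip(frames, masks):
            if selector(frame, mask):
                pairs.append((frame, mask))
    students = [train(pairs) for train in trainers]
    return students, pairs


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_teacher(frames):
    # Marks pixels brighter than the frame mean as foreground.
    return [(f > f.mean()).astype(np.float32) for f in frames]


def toy_selector(frame, mask):
    # Keeps any mask that contains at least one foreground pixel.
    return bool(mask.any())


def toy_trainer(pairs):
    # "Training" just memorizes the mean mask; a real student is a CNN.
    mean_mask = np.mean([m for _, m in pairs], axis=0)
    return lambda frames: [mean_mask for _ in frames]


frames = [np.arange(16, dtype=np.float32).reshape(4, 4) * k for k in (1, 2, 3)]

# Generation 1: a single video-based teacher, two students.
students1, pairs1 = run_generation(
    frames, [toy_teacher], toy_selector, [toy_trainer, toy_trainer])

# Generation 2: the trained students act as independent teachers.
students2, pairs2 = run_generation(
    frames, students1, toy_selector, [toy_trainer])
```

In this sketch each student at the second generation contributes its own (image, mask) pairs, so the training set can grow with the number of students even when the set of images is unchanged.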

4 System Architecture

We detail the architecture and training process of our system, module by module, as seen in Fig. 1. We first present the student pathway (module A in Fig. 1), which takes as input an individual image (e.g. the current frame in the video) and learns to predict foreground soft-masks from an unsupervised teacher. The teacher (represented by module B) and the selection module C are explained in Sects. 4.2 and 4.3.

Fig. 2
figure 2

Different architectures for the “student” networks, each processing a single image. They are trained to predict the unsupervised label masks given by the teacher pathway, frame by frame. The architectures vary from the more classical baseline LowRes-Net (left), with low resolution output, to more recent architectures, such as the fully convolutional one (middle) and different types of U-Nets (right). For the U-Net architecture the blocks denoted with double arrows can be interchanged to obtain a new architecture. We noticed that on the task of bounding box fitting the simpler low-resolution network performed very well, while being outperformed by the U-Nets on fine object segmentation

4.1 Student Path (Module A): Single-Image Segmentation

The student pathway (module A in Fig. 1) consists of a deep convolutional network. We test different network architectures, some of which are commonly used in the recent literature on semantic image segmentation. We create a small pool of relatively diverse architectures, presented next.

The first convolutional network architecture for semantic segmentation that we test is based on a more traditional CNN design. We term it LowRes-Net (see Fig. 2) due to its low resolution soft-mask output. It has ten layers (seven convolutional, two pooling and one fully connected) and skip connections. Skip connections have proved to offer a boost in performance, as shown in the literature (Raiko et al. 2012; Pinheiro et al. 2016). We also observed a similar improvement in our experiments when using skip connections. The LowRes-Net takes as input a \(128\times 128\) RGB image (along with its hue, saturation and derivatives w.r.t. x and y) and produces a \(32\times 32\) soft segmentation of the main objects present in the image. Because LowRes-Net has a fully connected layer at the top, we reduced the output resolution of the soft-segmentation mask, to limit memory cost. While the derivatives w.r.t. x and y are in principle not needed (as they could be learned by appropriate filters during training), in our tests explicitly providing the derivatives along with HSV, and using skip connections, boosted the accuracy by over \(1\%\). The LowRes-Net has a total of 78M parameters, most of them being in the last, fully connected layer.

The second CNN architecture tested, termed FConv-Net, is fully convolutional (Long et al. 2015), as also presented in Fig. 2. It has a higher resolution output of \(128 \times 128\), with input size of \(256 \times 256\). Its main structure is derived from the basic LowRes-Net model. Different from LowRes-Net, it is missing the fully connected layer at the end and has more parameters in the convolutional layers, for a total of 13M parameters.

We also tested three different nets based on the U-Net (Ronneberger et al. 2015) architecture, which proved very effective in the semantic segmentation literature. Our U-Net networks are: (1) BasicU-Net, (2) DilateU-Net—similar to BasicU-Net but using atrous (dilated) convolutions (Yu and Koltun 2015) in the center module, and (3) DenseU-Net—with dense connections in the down and up modules (Jégou et al. 2017).

The BasicU-Net has 5 down modules with 2 convolutional layers each, with 32, 64, 128, 256 and 512 feature maps, respectively. In the center module the BasicU-Net has two convolutional layers with 1024 feature maps each. The up modules have 3 convolutional layers and the same number of feature maps as the corresponding down modules. The only difference between BasicU-Net and DilateU-Net is that the latter has a different center module, with 6 atrous convolutions of 512 feature maps each. DenseU-Net has 4 down modules with 4 corresponding up modules. Each down and up module has 4 convolutions with skip-connections (as presented in Fig. 2). The modules have 12, 24, 48 and 64 feature maps, respectively. The transition represents a convolution, having the role of reducing the output number of feature maps from each module. The BasicU-Net has 34M parameters, while the DilateU-Net has 18M parameters. DenseU-Net has only 3M parameters, but uses skip-connections inside the up and down blocks in order to make up for the difference in the number of parameters. All three U-Nets have \(256 \times 256\) input and \(256 \times 256\) output. All networks use ReLU activation functions. Please see Fig. 2 for more specific details regarding the architectures of the different models.

Given the current setup, the student nets do not learn to identify specific object classes. They will learn to softly segment the main foreground objects present, regardless of their particular category. The main difference in their performance is in their ability to produce fine object segmentations. While the LowRes-Net tends to provide a good support for estimating the object’s bounding box due to its simpler output, the other ConvNets (especially the U-Nets), with higher resolution, are better at finely segmenting objects. The different student architectures bring diversity to their outputs. Due to the different ways in which the particular models make mistakes, they are stronger when forming an ensemble and can also be used, as seen in Sect. 4.3, to train a different network for segmentation evaluation, used as the new selection module C. As explained later, that network, namely EvalSeg-Net, will learn to predict the output masks agreement among the students, which statistically takes place when the masks are of good quality. In experiments we also show that the student nets outperform their teacher and are able to detect objects from categories that were not seen during training.

Combining several student nets The student networks with different architectures produce varied results that differ qualitatively. While the bounding boxes computed from their soft-masks have similar accuracy, the actual soft-segmentation output looks different. They have different strengths, while making different kinds of mistakes. Their diversity will be the basis for creating the teacher pathway at the next generation (Sects. 4.2 and 4.3).

We experimented with the idea of using several student networks, by combining them to form an ensemble or by letting them produce separate independent segmentations for each image. In our final system we preferred the latter approach, which is more practical, easier to implement and gives the freedom of having the students run independently, in parallel, with no need to synchronize their outputs. As shown in Sects. 4.2 and 4.3, together with the EvalSeg-Net used for selection, independent individual students from Iteration 1 will form the teacher pathway at the next generation. However, note that even a single student net along with the new EvalSeg-Net selector can be effectively used as the next teacher pathway (see experimental Sect. 5.1, Table 6 and Fig. 7).

When forming an actual ensemble, which we term Multi-Net, the final output is obtained by multiplying pixel-wise the soft-masks produced by each individual student net. Thus, only positive pixels, on which all nets agree, survive to the final segmentation. As might be expected, Multi-Net offers robust masks of higher precision than each individual network. However, it might lose details around the border of objects, resulting in a lower recall (see Fig. 5). We provide results of the Multi-Net ensemble only for comparison purposes. Please note, however, that in our final system the output of the ensemble was not used to train the students at the next generation. The students at the second iteration are all trained directly on outputs from individual students at the first iteration, filtered with EvalSeg-Net. As explained in more detail later in this Section, Multi-Net is used only to train the unsupervised selection network, EvalSeg-Net.
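The pixel-wise product used by Multi-Net can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `multi_net` is hypothetical, and the masks are assumed to be rescaled to the range [0, 1] and already resized to a common resolution.

```python
import numpy as np


def multi_net(masks):
    """Combine student soft masks by pixel-wise multiplication.

    masks: list of arrays with identical shape and values in [0, 1]
    (an assumption; the paper's masks are 0-255 valued before rescaling).
    """
    stacked = np.stack(masks, axis=0)
    return np.prod(stacked, axis=0)


student_a = np.array([[1.0, 0.5],
                      [0.0, 1.0]])
student_b = np.array([[1.0, 0.5],
                      [1.0, 0.0]])

ensemble = multi_net([student_a, student_b])
# A pixel stays positive only where every student is positive.
```

Because any single zero suppresses a pixel, the product favors precision over recall, which is consistent with the loss of detail around object borders noted above.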

Technical details: training the students We treat foreground object segmentation as a multidimensional regression problem, where the soft mask given by the unsupervised video segmentation system acts as the desired output. Let \(\mathbf {I}\) be the input RGB image (a video frame) and \(\mathbf {Y}\) be the corresponding 0–255 valued soft segmentation given by the unsupervised teacher for that particular frame. The goal of our network is to predict a soft segmentation mask \(\hat{\mathbf {Y}}\) of width W and height H (where \(W=H=32\) for the basic architecture, \(W=H=128\) for fully convolutional architecture and \(W=H=256\) for U-Net architectures), that approximates as well as possible the mask \(\mathbf {Y}\). For each pixel in the output image, we predict a 0–255 value, so that the total difference between \(\mathbf {Y}\) and \(\hat{\mathbf {Y}}\) is minimized. Thus, given a set of N training examples, let \(\mathbf {I}^{(n)}\) be the input image (a video frame), \({\hat{\mathbf {Y}}}^{(n)}\) be the predicted output mask for \(\mathbf {I}^{(n)}\), \(\mathbf {Y}^{(n)}\) the soft segmentation mask (corresponding to \(\mathbf {I}^{(n)}\)) and \(\mathbf {w}\) the network parameters. \(\mathbf {Y}^{(n)}\) is produced by the video discoverer after processing the video that \(\mathbf {I}^{(n)}\) belongs to. Then, our loss is:


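\[ L(\mathbf {w}) = \frac{1}{N} \sum _{n=1}^{N} \sum _{p=1}^{W \times H} \left( \mathbf {Y}_{p}^{(n)} - \hat{\mathbf {Y}}_{p}^{(n)}(\mathbf {w}, \mathbf {I}^{(n)}) \right) ^{2}, \]

where \(\mathbf {Y}_{p}^{(n)}\) and \(\hat{\mathbf {Y}}_{p}^{(n)}\) denote the p-th pixel of \(\mathbf {Y}^{(n)}\) and \(\hat{\mathbf {Y}}^{(n)}\), respectively.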

We observed that, in our tests, the L2 loss performed better than the cross-entropy loss, due to the fact that the soft-masks used as labels have real values, not discrete ones. Also, they are not perfect, so the idea of thresholding them for training does not perform as well as directly predicting their real values. We train our network using the TensorFlow (Abadi et al. 2015) framework with the Adam optimizer (Kingma and Ba 2014). All models are trained end-to-end using a fixed learning rate of 0.001 for 10 epochs. The training time for any given model is about 3–5 days on an Nvidia GeForce GTX 1080 GPU for the first iteration, and about 2 weeks for the second iteration students.
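The regression objective described in this section (squared differences between the teacher's soft mask and the predicted soft mask, accumulated over pixels) can be sketched in NumPy as follows. The function name `l2_mask_loss` is hypothetical, and averaging over the N training examples is an assumption made here for illustration; the actual training is done in TensorFlow.

```python
import numpy as np


def l2_mask_loss(y_true, y_pred):
    """Squared-error loss between teacher masks and predicted masks.

    y_true, y_pred: arrays of shape (N, H, W).
    Sums squared differences over pixels, then averages over the N
    examples (the averaging is an assumption of this sketch).
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    squared = (y_true - y_pred) ** 2
    per_example = squared.reshape(len(y_true), -1).sum(axis=1)
    return per_example.mean()


teacher_masks = np.zeros((2, 2, 2))   # two tiny 2x2 "soft masks"
student_masks = np.ones((2, 2, 2))    # predictions off by 1 at every pixel
loss = l2_mask_loss(teacher_masks, student_masks)
# Each example has 4 pixels with squared error 1, so the loss is 4.0.
```

Regressing the real-valued masks directly, as in this sketch, avoids having to pick a threshold to binarize imperfect teacher labels.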

Post-processing The student CNN outputs a \(W \times H\) soft mask. In order to fairly compare our models with other methods, we have two different post processing steps: (1) bounding box fitting and (2) segmentation refinement. For fitting a box around the soft mask, we first up-sample the \(W \times H\) output to the original size of the image, then threshold the mask (validated on a small subset), determine the connected components and fit a tight box around each of the components. We perform segmentation refinement (point 2) in a single case, on the Object Discovery in Internet Images dataset as also specified in the experiments section. For that, we use the OpenCV implementation of GrabCut (Rother et al. 2004) to refine our soft mask, up-sampled to the original size. In all other tests we use the original output of the networks.
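The bounding box fitting steps listed above (up-sample, threshold, connected components, tight boxes) can be sketched as follows. This is a simplified illustration, not the authors' code: nearest-neighbor up-sampling and 4-connectivity are assumptions, the helper names are hypothetical, and the threshold would be validated on a small subset as described in the text.

```python
from collections import deque

import numpy as np


def upsample_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize (an assumed interpolation choice)."""
    rows = np.arange(out_h) * mask.shape[0] // out_h
    cols = np.arange(out_w) * mask.shape[1] // out_w
    return mask[rows[:, None], cols[None, :]]


def fit_boxes(soft_mask, out_h, out_w, threshold):
    """Return one tight (top, left, bottom, right) box per connected
    component of the thresholded, up-sampled mask (4-connectivity)."""
    binary = upsample_nearest(soft_mask, out_h, out_w) >= threshold
    visited = np.zeros_like(binary, dtype=bool)
    boxes = []
    for r0, c0 in zip(*np.nonzero(binary)):
        if visited[r0, c0]:
            continue
        visited[r0, c0] = True
        queue = deque([(r0, c0)])
        top, left, bottom, right = r0, c0, r0, c0
        while queue:  # breadth-first flood fill of one component
            r, c = queue.popleft()
            top, bottom = min(top, r), max(bottom, r)
            left, right = min(left, c), max(right, c)
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < out_h and 0 <= nc < out_w
                if inside and binary[nr, nc] and not visited[nr, nc]:
                    visited[nr, nc] = True
                    queue.append((nr, nc))
        boxes.append((int(top), int(left), int(bottom), int(right)))
    return boxes


soft = np.zeros((4, 4))
soft[0, 0] = 200          # a small object in the top-left corner
soft[2:4, 2:4] = 180      # a larger object in the bottom-right corner
boxes = fit_boxes(soft, out_h=8, out_w=8, threshold=128)
```

Box coordinates are inclusive pixel indices in the up-sampled image, so they can be compared directly with annotations given at the original image size.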

4.2 Teacher (Module B): Unsupervised Object Discovery

There are several methods available for discovering objects and salient regions in images and videos (Borji et al. 2012; Cheng et al. 2015; Hou and Zhang 2007; Jiang et al. 2013; Cucchiara et al. 2003; Barnich and Van Droogenbroeck 2011) with reasonably good performance. More recent methods for foreground object discovery, such as Papazoglou and Ferrari (2013), are both relatively fast and accurate, with a runtime of around 4 seconds per frame. However, that runtime is still prohibitive for training the student CNN, which requires millions of images. For that reason, at the first generation (Iteration 1 of Algorithm 1) we used for module B in Fig. 1 the VideoPCA algorithm, which is a part of the whole system introduced in Stretcu and Leordeanu (2015). It has lower accuracy than the full system, but it is much faster, running at 50–100 fps. At this speed we can produce one million unsupervised soft segmentations in a reasonable time of about 5–6 h.
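The runtime figures quoted above can be checked with a short calculation. The sketch below uses only the numbers stated in the text (one million frames, 50–100 fps, and roughly 4 seconds per frame for the slower method); the helper name `hours_to_process` is hypothetical.

```python
def hours_to_process(num_frames, frames_per_second):
    """Wall-clock hours needed to process num_frames at a given rate."""
    return num_frames / frames_per_second / 3600.0


num_frames = 1_000_000

hours_at_50_fps = hours_to_process(num_frames, 50)    # about 5.6 hours
hours_at_100_fps = hours_to_process(num_frames, 100)  # about 2.8 hours

# A method running at 4 seconds per frame processes 0.25 frames per second.
days_at_4_s_per_frame = hours_to_process(num_frames, 0.25) / 24  # about 46 days
```

The quoted 5–6 h therefore corresponds to the lower end of the 50–100 fps range, while a 4 s per frame method would need on the order of weeks for the same number of frames.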

VideoPCA The main idea behind VideoPCA is to model the background in video frames with Principal Component Analysis. It finds initial foreground regions as parts of the frames that are not reconstructed well with the PCA model. Foreground objects are smaller than the background, have contrasting appearance and more complex movements. They could be seen as outliers, within the larger background scene. That makes them less likely to be captured well by the first PCA components. Thus, for each frame, an initial soft-mask is produced from an error image, which is the difference between the original image and the PCA reconstruction. These error images are first smoothed with a large Gaussian filter and then thresholded. The binary masks obtained are used to learn color models of foreground and background, based on which individual pixels are classified as belonging to foreground or not. The object masks obtained are further multiplied with a large centered Gaussian, based on the assumption that foreground objects are often closer to the image center. These are the final masks produced by VideoPCA. For more technical details, the reader is invited to consult Stretcu and Leordeanu (2015). In this work, we use the method exactly as found online (see Footnote 1), without any parameter tuning.
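The first stage of this idea, the PCA reconstruction error image, can be sketched in NumPy. This is a simplified illustration and not the VideoPCA implementation: it works on grayscale frames, omits the Gaussian smoothing, thresholding, color models and center prior, and the function name `pca_reconstruction_error` and the number of components are hypothetical choices.

```python
import numpy as np


def pca_reconstruction_error(frames, num_components):
    """Per-pixel absolute error between frames and their reconstruction
    from the first principal components of the video.

    frames: array of shape (T, H, W), grayscale (an assumption here).
    """
    t, h, w = frames.shape
    x = frames.reshape(t, h * w).astype(np.float64)
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are the principal directions of the frame set.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]
    reconstruction = centered @ basis.T @ basis + mean
    return np.abs(x - reconstruction).reshape(t, h, w)


# Toy video: a fixed background whose brightness drifts over time,
# plus a small bright "object" that appears in frame 3 only.
background = np.linspace(50.0, 150.0, 64).reshape(8, 8)
video = np.stack([background * (1.0 + 0.1 * k) for k in range(6)])
video[3, 2, 2] += 50.0

error = pca_reconstruction_error(video, num_components=1)
peak = np.unravel_index(error[3].argmax(), error[3].shape)
# The largest error in frame 3 falls on the object pixel (2, 2).
```

The dominant component absorbs the slowly varying background, so the small outlier object is reconstructed poorly and stands out in the error image.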

Teacher at the next generation At the next iteration of Algorithm 1, VideoPCA (in module B) is replaced by student nets trained at the previous generation. We tested three different ideas. The first is to use a single student network and combine it with the more powerful selection module to form a stronger full teacher pathway (modules B and C). While this approach is very effective and proves the relevance of selection, it is not the most competitive. Using all student nets is always more powerful, and this can be done in two ways, as discussed in the previous Section. One possibility is to create the Multi-Net ensemble by multiplying their outputs. The other, equally powerful but easier to implement, is to use all student nets independently and allow each image to have several output masks, as separate (input image, soft mask) pairs for training the next generation. We prefer the latter approach which, in combination with the EvalSeg-Net network, constitutes the full teacher pathway at the second iteration. Next, we present in detail how we perform mask selection and how we train EvalSeg-Net.

Fig. 3

Quality of soft masks versus degree of selection at module C. When selectivity increases, the true quality of the training frames that pass through selection improves. At the first iteration of Algorithm 1 we select masks using a simple selection procedure based on the mean value of non-zero mask pixels. At the second iteration, we select masks using the much more powerful EvalSeg-Net. The plots are computed using results from the VID dataset, where there is an annotation for each input frame. Note the superior quality of masks selected at the second iteration (red vs. blue lines, in the left plot). We have also compared the simple “mean” based selection procedure used at iteration 1 (yellow line) with EvalSeg-Net used at iteration 2 (red line), on the same soft masks from iteration 2. The EvalSeg-Net is clearly more powerful, which justifies its use at the second iteration when it replaces the very simple “mean” based procedure (Color figure online)

4.3 Unsupervised Soft Masks Selection (Module C)

The performance of the student net is influenced by the quality of the soft masks provided as labels by the teacher branch. The cleaner the masks, the better the chances that the student learns to segment objects well in images. VideoPCA tends to produce good results if the object present in the video stands out well against the background scene, in terms of motion and appearance. However, if the object is occluded at some point, does not move w.r.t. the scene or has a similar appearance to its background, the resulting soft masks might be poor. In the first generation, we used a simple measure of mask quality to select only the good soft-masks for training the student pathway, based on the following observation: when VideoPCA masks are close to the ground truth, the average of their nonzero values is usually high. Thus, when the discoverer is confident, it is more likely to be right. The average value of non-zero pixels in the soft mask is then used as a score indicator for each segmented frame. Only masks of sufficient quality according to this indicator are selected and used for training the student nets. This represents module C in Fig. 1 at the first generation of Algorithm 1. While effective at iteration 1, this simple average of mask values cannot capture the goodness of a segmentation at the higher level of overall shape. At the next iterations, we therefore explore new ways to improve it.
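The first-generation score described above can be sketched in a few lines. The function name `mean_nonzero_score` and the convention of returning 0 for an empty mask are assumptions made for illustration.

```python
import numpy as np

def mean_nonzero_score(soft_mask):
    """Average value of the non-zero pixels of a soft mask.

    Returns 0.0 for an all-zero mask (an assumption; the text does not
    specify this corner case).
    """
    mask = np.asarray(soft_mask, dtype=np.float64)
    nonzero = mask[mask > 0]
    return float(nonzero.mean()) if nonzero.size else 0.0
```

A confident mask with a few pixels near 1 scores higher than a diffuse mask with many low-valued pixels, which is the behavior the selection step relies on.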

Training EvalSeg-Net At the next iterations, we propose an unsupervised way for learning the EvalSeg-Net to estimate segmentation quality. As mentioned previously, Multi-Net provides masks of higher quality as it cancels errors from individual student nets. Thus, we use the cosine similarity between a given individual segmentation and the ensemble Multi-Net mask, as a cost for “goodness” of segmentation. Having this unsupervised segmentation cost we train the EvalSeg-Net deep neural net to predict it. As previously mentioned, this net acts as an automatic mask evaluation procedure, which in subsequent iterations becomes module C in Fig. 1, replacing the simple mask average value used at Iteration 1. Only masks that pass a certain threshold are used for training the student path. As it turns out in experiments, EvalSeg-Net becomes an effective selection procedure (module C) that improves the teacher pathway regardless of the teacher module B used.

The architecture of EvalSeg-Net is similar to LowRes-Net (Fig. 2), with the difference that the input channel containing image derivatives is replaced by the actual soft-segmentation that requires evaluation and it does not have skip connections. After the last fully connected layer (size 512) we add a last one-neuron layer to predict the segmentation quality score, which is a single real valued number.

Let \(\mathbf {I}\) be an input RGB image, \(\mathbf {S}\) an input soft-mask, \(\hat{\mathbf {Y}}=\prod _{i=1}^{5}{\hat{\mathbf {Y}}_{N_i}}\) be the output of our Multi-Net where \(\hat{\mathbf {Y}}_{N_i}\) denotes the output of network \(N_i\). We treat the segmentation “goodness” evaluation task as a regression problem where we want to predict the cosine similarity between \(\mathbf {S}\) and \(\hat{\mathbf {Y}}\). So, our loss for EvalSeg-Net is defined as follows:


where K represents the number of training examples and \(\hat{o}^{(k)}(\mathbf {w}, \mathbf {I}^{(k)}, \mathbf {S}^{(k)})\) represents the output of EvalSeg-Net for image \(\mathbf {I}^{(k)}\) and soft mask \(\mathbf {S}^{(k)}\).
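As an illustration of this regression objective, the sketch below computes the cosine similarity target between a soft mask and the Multi-Net output and averages a squared error over the K training examples. The squared-error form and the function names are assumptions made for the sketch; the text only states that EvalSeg-Net is trained to regress the cosine similarity.

```python
import numpy as np

def cosine_similarity(s, y, eps=1e-12):
    """Cosine similarity between two soft masks, flattened to vectors."""
    s = np.ravel(np.asarray(s, dtype=np.float64))
    y = np.ravel(np.asarray(y, dtype=np.float64))
    denom = max(np.linalg.norm(s) * np.linalg.norm(y), eps)
    return float(s @ y / denom)

def evalseg_regression_loss(predicted_scores, masks, ensemble_masks):
    """Mean squared error between predicted scores and cosine targets.

    predicted_scores: K outputs of the quality network (stand-in values here).
    masks: K soft masks S^(k); ensemble_masks: K Multi-Net outputs.
    Assumption: a squared error averaged over the K examples.
    """
    targets = np.array([cosine_similarity(s, y)
                        for s, y in zip(masks, ensemble_masks)])
    predicted = np.asarray(predicted_scores, dtype=np.float64)
    return float(np.mean((predicted - targets) ** 2))
```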

Given a certain metric for segmentation evaluation (depending on the learning iteration), we keep only the soft masks above a threshold for each dataset [e.g. VID (Russakovsky et al. 2015), YTO (Prest et al. 2012), YouTube Bounding Boxes (Real et al. 2017)]. In the first iteration, this threshold was obtained by sorting the VideoPCA soft-masks based on their score and keeping only the top 10 percentile, while on the second iteration we validate a threshold (\(=0.8\)) on a small dataset and select each mask independently by using this threshold on the single value output of EvalSeg-Net.
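The two selection rules can be sketched as follows, given one quality score per mask. The function names are hypothetical, and treating a score equal to the threshold as accepted is an assumption.

```python
import numpy as np

def select_top_fraction(scores, fraction=0.10):
    """Indices of the best-scoring masks (iteration 1 style selection)."""
    scores = np.asarray(scores, dtype=np.float64)
    k = max(1, int(round(fraction * scores.size)))
    order = np.argsort(scores)[::-1]  # highest score first
    return np.sort(order[:k])

def select_above_threshold(scores, threshold=0.8):
    """Indices of masks whose score reaches a fixed threshold
    (iteration 2 style selection, each mask judged independently)."""
    scores = np.asarray(scores, dtype=np.float64)
    return np.flatnonzero(scores >= threshold)
```

The first rule depends on the whole score distribution of a dataset, while the second can be applied to each mask as soon as it is produced.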

Mask selection evaluation In Fig. 3 we present the dependency of segmentation performance w.r.t ground truth object boxes (used only for evaluation) versus the percentile p of masks kept after the automatic selection, for each generation. We notice the strong correlation between the percentage of frames kept and the quality of segmentations. It is also evident that the EvalSeg-Net is vastly superior to the simpler procedure used at iteration 1. EvalSeg-Net is able to correctly evaluate soft segmentations even in more complex cases (see Fig. 4).

Fig. 4

Qualitative results of the unsupervised EvalSeg-Net used for measuring segmentation “goodness” and filtering bad masks (Module C, iteration 2). For each input image we present five soft-masks candidates (from first iteration students) along with their “goodness” scores given by EvalSeg-Net, in decreasing order of scores. Note the effectiveness of EvalSeg-Net at ranking soft segmentations

Even though we can expect to improve the quality of the unsupervised masks by drastically pruning them (e.g. keeping a smaller percentage), the fewer we are left with, the less training data we get, increasing the chance to overfit. We make up for the losses in training data by augmenting the set of training masks and by also enlarging the actual unlabeled training set at the second generation. There is a trade-off between level of selectivity and training data size: the more selective we are about what masks we accept for training, the more videos we need to collect and process through the teacher pathway, to obtain the sufficient training data size.

Data augmentation A drawback of the teacher at the first learning iteration (VideoPCA) is that it can only detect the main object if it is close to the center of the image. The assumption that the foreground is close to the center is often true and indeed helps that method, which has no deep learned knowledge, to produce soft masks with a relatively high precision. Not surprisingly, it often fails when the object is not in the center, therefore its recall is relatively low. Our data augmentation procedure addresses this limitation and can be concisely described as follows: randomly crop patches of the input image, covering 80% of the original image and scale up the patch to the expected input size. This produces slightly larger objects at locations that cover the whole image area, not just the center. As experiments show, the student net is able to see objects at different locations in the image, unlike its raw teacher (VideoPCA at iteration 1), which is strongly biased towards the image center.
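A minimal sketch of this augmentation is given below. It assumes that "80% of the original image" refers to the crop area (so each side is scaled by the square root of 0.8), that the same crop is applied to the image and its soft mask, and that nearest-neighbor resampling is used for upscaling. None of these details is specified in the text.

```python
import numpy as np

def random_crop_and_upscale(image, soft_mask, area_fraction=0.8, rng=None):
    """Randomly crop an image and its soft mask, then scale both back
    to the original size with nearest-neighbor sampling.

    image: array (H, W, C); soft_mask: array (H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = soft_mask.shape
    side = np.sqrt(area_fraction)
    ch = max(1, int(round(h * side)))
    cw = max(1, int(round(w * side)))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    # Map every output row/column back to a row/column inside the crop.
    rows = top + (np.arange(h) * ch) // h
    cols = left + (np.arange(w) * cw) // w
    return image[rows][:, cols], soft_mask[rows][:, cols]
```

Because the crop position is random, an object found near the center by the teacher ends up at varying locations in the training images.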

At the second generation, the teacher branch is superior at detecting objects at various locations and scales in the image. Therefore, while artificial data augmentation remains useful (as is usually the case in deep learning), its importance diminishes at the second iteration of learning (Algorithm 1). Adding more unlabeled data helps at both generations up to a point. If more difficult training cases are added, they improve learning only at the second generation, as discussed in the experimental Section (Table 5).

4.4 Implementation Pipeline

Now that we have presented in technical detail all major components of our system, we concisely present the actual steps taken in our experiments, in sequential order, and show how they relate to our general Algorithm 1 for unsupervised learning to segment foreground objects.

  1. Run VideoPCA on input images from VID and YouTube Objects datasets (Algorithm 1, Iteration 1, Step 1).

  2. Select VideoPCA masks using first generation selection procedure (Algorithm 1, Iteration 1, Step 2).

  3. Train first generation student ConvNets on the selected masks, namely LowRes-Net, FConv-Net, BasicU-Net, DilateU-Net and DenseU-Net (Algorithm 1, Iteration 1, Step 3).

  4. Create first generation student ensemble Multi-Net by multiplying the outputs of all students and train EvalSeg-Net to predict the similarity between a particular mask and the mask produced by Multi-Net (Algorithm 1, Iteration 1, Step 4).

  5. Add new data from YouTube Bounding Boxes (Algorithm 1, Iteration 1, Step 5).

  6. Return to Step 1, the teacher pathway: predict multiple soft-masks per input image on the enlarged unlabeled video set, using the student nets from Iteration 1 (Module B, Iteration 2), which will then be selected with EvalSeg-Net at Module C (Algorithm 1, Iteration 2, Step 1).

  7. Select only sufficiently good masks evaluated with EvalSeg-Net (Algorithm 1, Iteration 2, Step 2).

  8. Train the second generation students on the newly selected masks. We use the same architectures as in Iteration 1 (Algorithm 1, Iteration 2, Step 3).
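The control flow of the steps above can be summarized by the following schematic loop. It is only a sketch of the sequencing: every argument is a hypothetical stand-in supplied by the caller (for example, a VideoPCA wrapper as the initial teacher and the mean-based rule as the initial selector), and no actual network training is shown.

```python
def run_teacher_student_generations(data_per_iteration, teachers, select,
                                    train_students, train_selector):
    """Schematic teacher-student loop over several generations.

    data_per_iteration: list of image lists, the new data added at each iteration.
    teachers: initial list of callables, image -> soft mask.
    select: initial callable, (image, mask) -> bool.
    train_students: callable, list of (image, mask) pairs -> list of students.
    train_selector: callable, list of students -> new select callable.
    All of these are illustrative placeholders.
    """
    data, students = [], []
    for new_data in data_per_iteration:
        data = data + list(new_data)                    # enlarge the unlabeled set
        candidates = [(img, teacher(img))               # Step 1: teacher masks
                      for img in data for teacher in teachers]
        kept = [(img, m) for img, m in candidates       # Step 2: selection
                if select(img, m)]
        students = train_students(kept)                 # Step 3: train students
        select = train_selector(students)               # Step 4: new selector
        teachers = students                             # students become teachers
    return students
```

Using all students independently as teachers, as in the loop above, yields several candidate masks per image at the second iteration.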

The method presented in the introduction (Algorithm 1) is a general algorithm for unsupervised learning from video to segment objects in single images. It presents a sequence of high level steps followed by different modules of an unsupervised learning system. The modules are complementary to each other and function in tandem, each focusing on a specific aspect of the unsupervised learning process. Thus, we have a module for generating data, where soft-masks are produced, a module that selects good quality masks, and a module for training the next generation students. While our concept is first presented in high level terms, we also present a specific implementation that represents the first two iterations of the algorithm. Although our implementation is costly during training, in terms of storage and computation time, it is very fast at test time.

Computation and storage costs During training, the computation time for passing through the teacher pathway during the first iteration of Algorithm 1 is about 2–3 days: it requires processing data from the VID and YTO datasets, including running the VideoPCA module. Afterwards, training the first iteration students, with access to 6 GPUs, takes about 5 days: 6 GPUs are needed for training the 5 different student architectures, since training FConv-Net requires two GPUs in parallel. Next, training the EvalSeg-Net requires 4 additional days on one GPU. At the second iteration, processing the data through the teacher pathway takes about 1 week on 6 GPUs in parallel. It is more costly due to the larger training set, from which only a small percentage (about 10%) is kept after selection with EvalSeg-Net in order to obtain 1M training examples in the end. Finally, training the second generation students takes 2 additional weeks. In conclusion, the total computation time required for training, with full access to 6 GPUs, is about 5 weeks, when everything is optimized. The total storage cost is about 4 TB. At test time the student nets are fast, taking approximately 0.02 s per image, while the ensemble nets take around 0.15 s per image.

5 Experimental Analysis

In the first set of experiments we evaluate the impact of the different components of our system. We experimentally verify that at each iteration the students perform better than their teachers. Then, we test the ability of the system to improve from one generation to the next. We also test the effects of data selection and increasing training data size. Then, we compare the performances of each individual network and their combined ensembles.

In Sect. 5.2, we compare our algorithm to state of the art methods on object discovery in videos and images. We perform tests on three datasets: YouTube Objects (Prest et al. 2012), Object Discovery in Internet Images (Rubinstein et al. 2013) and Pascal-S (Li et al. 2014). In Sect.  5.3, we verify that our unsupervised deep features are also useful on a well-known transfer learning task for object detection on the Pascal VOC2012 dataset (Everingham et al. 2010).

Datasets Unsupervised learning requires large quantities of unlabeled video data. We have chosen for training data, videos from three large datasets: ImageNet VID dataset (Russakovsky et al. 2015), YouTube Objects (YTO) (Prest et al. 2012) and YouTube Bounding Boxes (YouTubeBB) (Real et al. 2017). VID is one of the largest video datasets publicly available, being fully annotated with ground truth bounding boxes. The dataset consists of about 4000 videos, having a total of about 1.2M frames. The videos contain objects that belong to 30 different classes. Each frame could have zero, one or multiple objects annotated. The benchmark challenge associated with this dataset focuses on the supervised object detection and recognition problem, which is different from the one that we tackle here. Our system is not trained to identify different object categories, so we do not report results compared to the state of the art on object class recognition and detection, on this dataset.

Table 1 Results of our networks and ensembles on YouTube Objects v1 (Prest et al. 2012) dataset (CorLoc metric) at both iterations (generations)

YouTube Objects (YTO) is a challenging video dataset with objects undergoing strong changes in appearance, scale and shape, going in and out of occlusion against a varying, often cluttered background. YTO is at its second version now and consists of about 2500 videos, having a total of about 700K frames. It is specifically created for unsupervised object discovery, so we perform comparisons to state of the art on this dataset.

YouTube Bounding Boxes (YTBB or YouTubeBB) is a large scale video dataset, having approximately 240k videos with single-object bounding box annotations. We use a subset of the large number of videos to augment our existing video database. In this dataset there are 23 types of object categories often undergoing strong changes in appearance, scale and shape, making it the most difficult dataset used in our foreground object segmentation setup.

For unsupervised training of our system we used approximately 200k frames (after selection) from videos chosen from each dataset (120k from VID and 80k from YTO), at learning iteration 1—those frames which survived after the data selection module. At the second learning iteration, besides improving the classifier, it is important to have access to larger quantities of new unlabeled data. Therefore, for training the second generation of classifiers we enlarge our training dataset to 1 million soft-masks, as follows: 600k frames from VID + YTO and 400k from the YouTubeBB dataset—those frames which survived after filtering with the EvalSeg-Net data selection module. For experiments presenting results without selection, the frames were randomly chosen from each set, VID, YTO or YouTubeBB, until the total of 1M was reached. We did not add more frames due to heavy computation and storage limitations.

Evaluation metrics We use different kinds of metrics in our experiments, which depend on the specific task that requires either bounding box fitting or fine segmentation:

  • CorLoc: for evaluating the detection of bounding boxes, the most commonly used metric is CorLoc. It is defined as the percentage of images correctly localized according to the PASCAL criterion: \(\frac{|B_p \cap B_{GT}|}{|B_p \cup B_{GT}|} \ge 0.5\), where \(B_p\) is the predicted bounding box and \(B_{GT}\) is the ground truth bounding box.

  • F-\(\beta = \frac{(1+\beta ^{2})\, precision \times recall}{\beta ^{2} \times precision + recall}\) for evaluating the segmentation score on the Pascal-S dataset. We use the official evaluation code when reporting results. As in all previous works, we set \(\beta ^2=0.3\).

  • P-J metric: P refers to the precision per pixel, while J is the Jaccard similarity (the intersection over union between the output mask and the ground truth segmentation). We use this metric only on Object Discovery in Internet Images. For computing the reported results we use the official evaluation code.

  • MAE: Mean Absolute Error is defined as the average pixel-wise absolute difference between the predicted mask and the ground truth. Unlike the other metrics, a lower value is better for this metric.

  • mean IoU score is defined as \(\frac{|G \cap Y|}{|G \cup Y|}\) where G represents the ground truth and Y the predicted mask.

  • mAP represents the mean average precision. It is used when reporting results for the transfer learning experiments on the Pascal VOC 2012 dataset.
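For reference, the simpler metrics in the list above can be sketched as below. These are not the official evaluation scripts used for the reported results. The box format `(x_min, y_min, x_max, y_max)` with continuous coordinates and the 0.5 binarization threshold in `mask_iou` are assumptions made for illustration.

```python
import numpy as np

def box_iou(box_p, box_gt):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    ih = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = iw * ih
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of images whose predicted box has IoU >= threshold."""
    hits = [box_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return 100.0 * sum(hits) / len(hits)

def f_beta(precision, recall, beta_sq=0.3):
    """F-beta measure with beta^2 = 0.3 by default."""
    denom = beta_sq * precision + recall
    return (1.0 + beta_sq) * precision * recall / denom if denom > 0 else 0.0

def mean_absolute_error(pred, gt):
    """Average pixel-wise absolute difference (lower is better)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean(np.abs(pred - gt)))

def mask_iou(pred, gt, threshold=0.5):
    """IoU between two masks after binarization at `threshold` (assumed)."""
    p = np.asarray(pred) >= threshold
    g = np.asarray(gt) >= threshold
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else 0.0
```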

Table 2 Results of our networks and ensemble on Object Discovery in Internet Images (Rubinstein et al. 2013) dataset (CorLoc metric) at both iterations
Table 3 Results of our networks and ensemble on Pascal-S (Li et al. 2014) dataset (F-\(\beta \) metric) at both iterations
Fig. 5

Visual comparison between models at each iteration (generation). The Multi-Net, shown for comparison, represents the pixel-wise multiplication between the five models. Note the superior masks at the second generation students, with better shapes, fewer holes and sharper edges. Also note the relatively poorer recall of the ensemble Multi-Net, which produces smaller, eroded masks

5.1 Ablation Study

Student versus Teacher In Fig. 8 we present qualitative results on the VID dataset, comparing the students with VideoPCA and across iterations. We can see that the masks produced by VideoPCA are of lower quality, often having holes, non-smooth boundaries and strange shapes. In contrast, the students (at both iterations) learn more general shape and appearance characteristics of objects in images, reminiscent of the grouping principles governing the basis of visual perception as studied by the Gestalt psychologists (Rock and Palmer 1990) and the more recent work on the concept of “objectness” (Alexe et al. 2010). The object masks produced by the students are simpler, with very few holes, have nicer and smoother shapes and capture well the foreground-background contrast and organization. Another interesting observation is that the students are sometimes able to detect multiple objects, a feature that is less commonly achieved by the teacher.

A key fact in our experiments with learning over two generations is that every single module becomes better from one iteration to the next: all individual models and the selector (Module C), all improve and each contributes, in a complementary way, along with the addition of extra unlabelled data, to the overall improvement at the next iteration. The result suggests that we can repeat the process over several iterations and continue to improve. It is also encouraging that the individual nets, which see a single image, are able to generalize and detect objects better than what the initial VideoPCA teacher discovers in videos.

As seen in Tables 1, 2, 3 and Fig. 5, at the second generation we obtain a clear gain over the first, on all experiments and datasets. The left plot in Fig. 3 shows the significant improvement of the unsupervised selection network at iteration 2 (EvalSeg-Net) over the simple selection procedure (based on the mean value of non-zero mask pixels) used at iteration 1.

Our proposed algorithm starts from a completely unsupervised object discoverer in video (VideoPCA) and is able to train neural nets for foreground object segmentation, while improving their accuracy over two generations. It uses the students from iteration 1 as teachers at iteration 2. At the second iteration, it also uses more unlabeled training data and it is better at automatically filtering out poor quality segmentations.

Training data size versus Learning iteration Next we consider the influence of increasing the data size from one iteration to the next vs. learning from a more powerful teacher pathway. In order to better understand the importance of each, we have tested our models at each iteration with two training data sets: a smaller set consisting of 200k images (only from VID + YTO datasets) and a larger dataset formed by increasing the dataset size to 1M frames by adding frames from VID, YTO and also from YouTubeBB. Each generation of student nets is trained using the teacher and selection method corresponding to that particular iteration. We present the results in Table 4 (mean CorLoc and standard deviation over five students).

The results are interesting: at Iteration 2, as expected we obtain better accuracy when adding more data (by \(1.1\%\)). However, at Iteration 1 adding more data helps initially (as seen in Table 5 when using the LowRes-Net model), but as data becomes more difficult the performance may drop. We have a similar change in performance on tests on image segmentation on the Object Discovery dataset, using the same three training sets on Iteration 1 as in Table 5, where the LowRes-Net model initially improves performance (meanP from 87.7 to \(88.4\%\) and meanJ from 61.2 to \(62.3\%\)), then it starts losing accuracy as data increases from 200k to 1M frames (meanP goes down to \(86.8\%\) and meanJ goes down to \(60.7\%\)).

This phenomenon, related to observations in Tokmakov et al. (2016) when working with more difficult images, is probably due to the weaker teacher path at iteration 1. Images in YouTubeBB are significantly more difficult than the ones from the initial 200k frames set. Neither VideoPCA, nor the very simple selection method used along the teacher pathway, at Iteration 1, are powerful enough to cope with these images and produce good training masks. Therefore, even though we have more masks to train on, their quality is poorer and the overall result degrades.

Table 4 Influence of adding more unlabeled data, tested on YTO dataset
Table 5 Influence of adding more unlabeled training data at iteration 1 and iteration 2 for the LowRes-Net student net, evaluated on YTO with the CorLoc metric

On the other hand, the second iteration, with a stronger teacher, which is able to produce and select good masks on the more difficult frames from YouTubeBB set, is able to take advantage of the extra large amounts of unlabeled data. It is important to increase the data from one generation to the next in order to avoid simply imitating the teacher of the previous generation. The idea of increasing the data size and complexity in stages, from fewer easy cases to many and more complex ones, is also related to insights from curriculum learning (Bengio et al. 2009).

Fig. 6

Impact of data selection for both iterations. Data selection (module C) strongly affects the results at each iteration. The results from iteration 2 with no selection are only slightly better than the ones from iteration 1 with selection. Note that the students trained with selection at iteration 1 become the teacher at the second iteration (module B), without selection. The slight improvement is due to the increase in the training data size. The results represent the average over 10 classes on YouTube Objects using CorLoc percentage metric

Fig. 7

Comparison across two generations (blue line—first iteration; red line—second iteration) when the individual model DilateU-Net trained at Iteration 1, becomes the teacher for the second generation. DilateU-Net is helped along the teacher pathway at the second iteration, by the EvalSeg-Net selection module, which explains the improvement from one iteration to the next. Note that in this case DilateU-Net improves while being trained, from scratch, on its own good masks allowed to pass by EvalSeg-Net. Also note that individual students (for which we report average values) outperform the teacher on both iterations. The plots are computed over results on the YouTube Objects dataset using the CorLoc metric (percentage) (Color figure online)

Data selection versus Teacher Data selection is important (see Figs. 3, 6 and 7 and Table 6). The more selective we are, when accepting or rejecting soft-masks used for training, the better the end result. Also note that being more selective means decreasing the training set. There is a trade-off between selectivity and training data size.

Table 6 Different results, averaged over all student nets after being trained at Iteration 2 with different teachers, with or without data selection by EvalSeg-Net

We study the impact of data selection (Module C) along the teacher pathway w.r.t. the masks produced by the teacher (Module B), which could be a group of students or a single student net learned at the previous iteration. We want to better understand the roles of the two modules in learning and how they can work best in combination. We did the following experiments: (1) we trained all our student models at iteration 2 with soft-segmentations extracted from Multi-Net created from students trained at iteration 1 (active module B), but no data selection applied (no module C); (2) then we performed the same experiment as above, but with data selection using EvalSeg-Net (active module C), such that only the masks that passed through selection were used for training; (3) we trained all our models with soft-segmentation masks obtained from a single student, DilateU-Net (active module B) and selected using EvalSeg-Net (active module C); and (4) we used as teacher all student models acting independently, with EvalSeg-Net active, such that for each input image we could have several masks. As stated before, this setup is our choice in the final system.

Table 7 Results on YouTube Objects dataset, versions v1 (Prest et al. 2012; first eight entries) and v2.2 (Kalogeiton et al. 2016; last five entries)

For these experiments we used a small set of 200k images for training. We report average CorLoc on the YTO dataset for all 5 students trained at Iteration 2 with different choices of teacher pathways (Table 6). The results indicate the power of data selection, which could overcome the advantage brought by an ensemble. The ensemble is generally stronger than each individual, as it outputs the mask that represents the multiplication of the student soft segmentations. While its output is more robust to noise, it guarantees neither agreement between student models nor quality. In fact, the final mask obtained by multiplication could be destroyed in the process. For example, a good mask produced by one of the students would be lost through multiplication if the other students disagree with it. This is a limitation of an ensemble which could be overcome by our approach, in which all students are allowed to speak, independently and separately. The EvalSeg-Net, which is a mask selection network trained to predict the agreement among the student models, brings in novel, complementary information, and its output is strongly correlated with the goodness of segmentation (Figs. 3, 6). Such a network could be used to select only good masks. Thus, any teacher, be it a single model or an ensemble, is more powerful in combination with the selection module than without it.
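The limitation of the multiplicative ensemble can be illustrated with a toy example. The helper names below are hypothetical and the masks are made-up values, not outputs of the actual student nets.

```python
import numpy as np

def multi_net_mask(student_masks):
    """Pixel-wise product of the students' soft masks (Multi-Net style)."""
    return np.prod(np.stack(student_masks, axis=0), axis=0)

def independent_pairs(image, student_masks):
    """Alternative teacher: one (image, soft mask) pair per student,
    each to be judged separately by the selection module."""
    return [(image, mask) for mask in student_masks]

# Toy case: the first pixel belongs to the object, the second does not.
good = np.array([[0.9, 0.0]])
failed = np.array([[0.0, 0.0]])
# Five agreeing students: the product erodes the object response.
eroded = multi_net_mask([good] * 5)
# One failing student: the product removes the object entirely.
destroyed = multi_net_mask([good] * 4 + [failed])
# With independent pairs, the four good masks are still available.
pairs = independent_pairs("image", [good] * 4 + [failed])
```

Even when all five students agree, the product lowers the object response from 0.9 to about 0.59. A single failing student drives it to zero, whereas the independent pairs keep the four good masks available for selection.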

The performance of each trained student is boosted through the selection process by \(0.8\%\) on average, when the Multi-Net is used as teacher. The relevance of selection could also be seen in the fact that even a simpler teacher with a single model and no ensemble (third row) can be more effective than the ensemble by itself (by \(0.2\%\)). The role of selection is again evident when we compare the average results of models at Iteration 1 and those at Iteration 2 when trained by a single model from Iteration 1 (DilateU-Net) with selection, with an increase by \(3.7\%\) (compare results in Tables 6 and 7).

Fig. 8

Qualitative results on the VID dataset (Russakovsky et al. 2015) on input image (a) as compared to the iteration 1 teacher–VideoPCA (b). For each iteration, we show results of the best individual models (c, d), in terms of CorLoc metric. Note the superior quality of our models compared to the VideoPCA. We also present the ground truth bounding boxes (e). For more qualitative results please visit our project page https://sites.google.com/view/unsupervisedlearningfromvideo

Maybe the most conclusive result in favor of selection is when the student model (DilateU-Net) itself improves its own performance when trained (from scratch) on its own outputs from Iteration 1, used as teacher, with selection (third row in Table 6), by no less than \(3.89\%\), increasing the CorLoc on YTO from \(61.8\%\) at Iteration 1, to \(65.7\%\) at Iteration 2. This improvement can also be seen in Fig. 7 where we presented the results having DilateU-Net acting as a teacher in the second iteration.

The fourth row presents the case when we do not use the Multi-Net ensemble (as teacher), and let all segmentations from all models pass through selection, as explained in Sect. 4.3. As we see, the performance of this approach is almost identical to that of using the ensemble with selection (compare rows 2 and 4 in Table 6). As previously mentioned, this approach is also more efficient: while for Multi-Net we need to wait for all 5 models to produce an output before we can consider a mask for selection, in this “All models” case we can pass masks through the selection process as they are produced, in parallel, without having to synchronize all five. For this reason, as discussed previously, this is our first choice, generally being referred to as “our proposed approach” in the paper and tested in the next Sections.

Analysis of different network architectures As seen in Tables 1, 2 and 3, different network architectures yield different results, while the ensemble always outperforms individual models. Our experiments show that different architectures are better at different tasks. LowRes-Net, for example, performs well on the task of box fitting since that does not require a fine, sharp object mask. On the other hand, when evaluating the exact segmentation, nets with higher resolution output, such as the ones based on the U-Net design which are more specialized for this task, perform better. Among those, qualitatively, we observed that DenseU-Net produces masks with fewer “holes” when compared to DilateU-Net, which turns out to be the top model for segmentation. The quantitative differences between architectures are shown in Tables 1, 2 and 3, while the qualitative differences can be seen in Figs. 5 and 8.

5.2 Experiments on Foreground Segmentation

Object discovery in video We first performed comparisons with methods specifically designed for object discovery in video. For that, we chose the YouTube Objects dataset and compared our system against the best methods on this dataset in the literature (Table 7). Evaluations are conducted on both versions of the YouTube Objects dataset, YTOv1 (Prest et al. 2012) and YTOv2.2 (Kalogeiton et al. 2016). On YTOv1 we follow the same experimental setup as Jun Koh et al. (2016) and Prest et al. (2012), by running experiments on all annotated frames from the training split. We have not included in Table 7 the results reported by Stretcu and Leordeanu (2015) because they use a different setup, testing on all videos from YTOv1. It is important to stress again that while the methods presented here for comparison have access to whole video shots, ours only needs a single image at test time. Despite this limitation, our method outperforms the others on 7 out of 10 classes and has the best overall average performance. Note that even our baseline LowRes-Net at the first iteration achieves top performance. The feed-forward CNN processes each image in 0.02 s, being at least one to two orders of magnitude faster than all other methods (see Table 7). We also mention that in all our comparisons, while our system is faster at test time, it takes much longer during its training phase and requires large quantities of unsupervised training data.

Object discovery in images We compare our system against other methods that perform object discovery in images. We use two different datasets for this comparison: Object Discovery in Internet Images and Pascal-S datasets. We report results using metrics that are commonly used for these tasks, as presented at the beginning of the experimental section.

Object Discovery in Internet Images is a representative benchmark for foreground object detection in single images. This set contains internet images and is annotated with highly detailed segmentation masks. In order to enable comparison with previous methods, we use the 100-image subsets provided for each of the three categories: airplane, car and horse. The methods evaluated on this dataset in the literature aim to discover either the bounding box of the main object in a given image or its fine segmentation mask. We evaluate our system on both. Note that, unlike other works, we do not need a collection of images at test time, since each image can be processed independently by our system. Therefore, unlike other methods, our performance is not affected by the structure of the image collection or the number of classes of interest present in the collection.

Table 8 Results on the object discovery in internet images (Rubinstein et al. 2013) dataset (CorLoc metric)
Table 9 Segmentation results on the object discovery in internet images (Rubinstein et al. 2013) dataset (P, J metrics)

In Table 8 we present the performance of our method as compared to other unsupervised object discovery methods in terms of CorLoc on the Object Discovery dataset. We compare our predicted box against the tight box fitted around the ground-truth segmentation, as done in Cho et al. (2015) and Tang et al. (2014). Our system can be considered in the mixed-class category: it does not depend on the structure of the image collection and treats each image independently. The performance of the other algorithms degrades as the number of main categories in the collection increases (some are not even tested by their authors on the mixed-class case), which is not the case with our approach.
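As a rough illustration of this box-based evaluation, the following Python sketch fits a tight box around a ground-truth mask and counts a localization as correct when the intersection over union with the predicted box exceeds 0.5, which is the usual PASCAL criterion behind CorLoc. The function names and the inclusive pixel-coordinate convention are assumptions made here for illustration; this is not the evaluation code used in the experiments, and the way the predicted box is obtained from a soft mask is left outside the sketch.

```python
import numpy as np


def tight_box(mask):
    """Tight box (x0, y0, x1, y1), inclusive, around nonzero pixels; None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def box_iou(a, b):
    """Intersection over union of two boxes given in inclusive pixel coordinates."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def corloc(pred_boxes, gt_masks, iou_threshold=0.5):
    """Percentage of images whose predicted box overlaps the tight
    ground-truth box with IoU above the threshold (assumed 0.5)."""
    hits = 0
    for box, gt in zip(pred_boxes, gt_masks):
        gt_box = tight_box(gt)
        if box is not None and gt_box is not None and box_iou(box, gt_box) > iou_threshold:
            hits += 1
    return 100.0 * hits / len(gt_masks)
```

Because each image is scored on its own, the measure does not depend on how the images are grouped into collections.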

We obtain state of the art results on all classes, improving by \(6\%\) over the method of Cho et al. (2015). When the method in Cho et al. (2015) is allowed to see a collection of images that is limited to a single majority class, its performance improves and it equals ours on one class. However, our method needs no information other than the input image at test time.

We also tested our method on the task of fine foreground object segmentation and compared to the best performers in the literature on the Object Discovery dataset in Table 9. For refining our soft masks we apply the GrabCut method, as it is available in OpenCV. We evaluate based on the same P, J evaluation metric as described by Rubinstein et al. (2013)—the higher P and J, the better. In Figs. 9 and 10 we present some qualitative results for each class. As mentioned previously, these segmentation experiments on Object Discovery in Internet Images are the only ones on which we apply GrabCut as a post-processing step, as also used by all other methods presented in Table 9.
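For reference, the two metrics can be sketched in Python as follows: P is the fraction of pixels, foreground and background, that are labeled correctly, and J is the Jaccard similarity (intersection over union) between the predicted and ground-truth foreground. The function names and the convention of returning 1.0 when both masks are empty are assumptions made here for illustration, not part of the original evaluation protocol.

```python
import numpy as np


def p_metric(pred, gt):
    """Fraction of pixels (foreground and background) labeled correctly."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return float((pred == gt).mean())


def j_metric(pred, gt):
    """Jaccard similarity between predicted and ground-truth foreground."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)
```

P rewards correct background pixels as well, so it can remain high on images with small objects even when the mask is poor; J is the stricter of the two in that case.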

Another important dataset used for the evaluation of a related task, that of salient object detection, is the Pascal-S dataset, consisting of 850 images annotated with segmentation masks. As seen in Table 10, we achieve top results on all three metrics against methods that do not use any supervised pre-trained features. Being a foreground object detection method, our approach is usually biased towards the main object in the image, even though it can also detect multiple ones. Images in Pascal-S usually contain more than one object, so we consider our results very encouraging, being close to those of approaches that use features pre-trained in a supervised manner. Also note that we did not use GrabCut for these experiments.

For the single image experiments, our system was trained, as discussed before, on other datasets, which consist of video (VID, YTO and YTBB). It has not previously seen any of the images in the Pascal-S or Object Discovery datasets during training.

Fig. 9
figure 9

Qualitative results on the Object Discovery dataset on input image (a) as compared to (b) Rubinstein et al. (2013). For both iterations, we present the results of the top model (c, d), without using GrabCut. We also present the results when GrabCut is used with the top model (e) and the ground truth segmentation (f). Note that our models are able to segment objects from classes that were not present in the training set (examples on the right side). Also, note that the initial VideoPCA teacher cannot be applied on single images

Fig. 10
figure 10

Qualitative results on the Object Discovery in Internet Images (Rubinstein et al. 2013) dataset. For each example we show the input RGB image and immediately below our segmentation result, with GrabCut post processing for obtaining a hard segmentation. Note that our method produces good quality segmentation results, even in images with cluttered background (Color figure online)

Table 10 Results on the PASCAL-S dataset compared against other unsupervised methods
Table 11 Comparison with state of the art on VOC 2012 (Everingham et al. 2010) using Fast R-CNN initialized with different methods
Table 12 Comparison with Pathak et al. (2017) on unsupervised object discovery

5.3 Experiments on Transfer Learning

While the main focus of the paper is unsupervised learning of foreground object segmentation, we also want to test the usefulness of our features in a transfer learning setup, namely for the task of multiple object detection. For this purpose, we follow the well known experimental setup used in the recent transfer learning literature, in which an AlexNet-like (Krizhevsky et al. 2012) network, initialized in an unsupervised way, is fine-tuned on supervised object detection within the Fast R-CNN framework (Girshick 2015) (see Footnote 2). We closely follow the work of Pathak et al. (2017), with code, documentation and training data available online, which, as mentioned in the related work section, also starts by learning from videos to segment objects in single images in an unsupervised fashion. In these experiments, we adapted the last part of AlexNet in the same way, in order to produce a soft segmentation of the image (instead of an output class). We used the same base architecture as the methods we compared against, to make sure that the results come from the learned features, not from the architecture we used.

Initial unsupervised training for object segmentation We used the adapted AlexNet-based model (which we term AlexNet-Seg) described in this section as a student in our unsupervised learning framework at Iteration 2. Thus, AlexNet-Seg is trained by the unsupervised teacher pathway at our Iteration 2; in this case, the teacher is a single network, namely DilateU-Net at module B, combined with the EvalSeg-Net mask selector at module C. In order to see how the actual training data influences the final transfer learning outcome, we experimented with both our data and the data used by Pathak et al. (2017), which is obtained from the YFCC100m dataset (Thomee et al. 2015) and contains 1.6M frames. The results are presented in Table 13.

As it is, our unsupervised learning system prefers to segment main, foreground objects in images and it is less versatile at segmenting complex images containing many objects. Since the images in Pascal VOC 2012 contain complex scenes with multiple objects and the final transfer learning task is multiple object detection, we also tested the case in which we adapted our system to better cope with multiple objects. For that, we divide each training image into 5 large patches (a grid with one patch at each corner and one in the middle, each crop being about 60% of the original size in both dimensions), which we pass through the teacher pathway at Iteration 2. The results are combined into a single image by superimposing the soft masks and taking the maximum over all of them at each location in the original image. Thus, we obtain soft segmentations that better capture multiple objects in the input image. Note that the original image passed through the teacher pathway without the 5-point grid division is referred to as the Single Object (SO) teacher in Tables 12 and 13 and Fig. 11, while the 5-grid version just described is referred to as the Multiple Object (MO) teacher in the same Tables and Figures. We train the AlexNet-Seg student on these multiple object soft-segs. We thus transfer knowledge from our unsupervised student models to AlexNet-Seg and prepare it for the next task, of supervised object detection.
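The 5-patch procedure can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: `segment_fn` is a hypothetical stand-in for the teacher pathway and is assumed to return a soft mask of the same height and width as the crop it receives; the function names, the rounding of the crop size and the zero initialization of the fused mask (which presumes soft masks in the range 0 to 1) are likewise illustrative choices.

```python
import numpy as np


def five_crop_boxes(height, width, scale=0.6):
    """Boxes (top, left, bottom, right) for four corner crops and one center crop."""
    ch, cw = int(round(height * scale)), int(round(width * scale))
    tops = [0, 0, height - ch, height - ch, (height - ch) // 2]
    lefts = [0, width - cw, 0, width - cw, (width - cw) // 2]
    return [(t, l, t + ch, l + cw) for t, l in zip(tops, lefts)]


def fuse_multi_object_mask(image, segment_fn, scale=0.6):
    """Run segment_fn on each crop and keep the per-pixel maximum of the
    soft masks, placed back at the crop's location in the original image."""
    height, width = image.shape[:2]
    fused = np.zeros((height, width), dtype=np.float32)
    for top, left, bottom, right in five_crop_boxes(height, width, scale):
        soft = segment_fn(image[top:bottom, left:right])
        fused[top:bottom, left:right] = np.maximum(
            fused[top:bottom, left:right], soft
        )
    return fused
```

With crops of about 60% of each dimension, the four corner patches already cover every pixel of the image, so each location receives at least one prediction and the central patch adds a further one where objects are most likely to appear.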

Transferring to object detection Like the other methods we compare to, we conduct transfer learning experiments on the Pascal VOC 2012 (Everingham et al. 2010) dataset. We train on the train split of VOC 2012 and report our results on the validation split. We also use multi-scale training and testing and remove difficult objects during training. We report the comparison results in Table 11. We see that the unsupervised knowledge learned by our approach is indeed useful for transfer learning, as our results are in the top 3 among current published methods. This is interesting, as our unsupervised learning algorithm is mainly designed for foreground object segmentation, not classification.

Table 13 Comparison between different types of training images and training data for the AlexNet-Seg student we used
Fig. 11
figure 11

Representative visual results on Pascal VOC 2012. We show the output of the unsupervised teacher pathway used in transfer learning for both the original case (SO: DilateU-Net) and for the multiple objects scenario (MO: DilateU-Net outputs combined on a 5-grid), the trained students for both cases (AlexNet-Seg(SO) and AlexNet-Seg(MO)), as well as the output of Pathak et al. (2017)

Foreground segmentation versus Multiple object detection The transfer learning task on which we test our approach is about both localization (detection) and classification. In this context, as already discussed in the introduction section, the work of Pathak et al. (2017) is most related to ours in the sense that they also learn from video to segment objects in single images in an unsupervised manner. They do so by using a teacher that produces soft masks from optical flow in video. Beyond the theoretical connection between the two works, we wanted to better understand how the two approaches relate on actual foreground segmentation experiments. Their method generally produces masks that cover larger areas in the image than ours and are better suited for transfer learning experiments, as the results show.

On the other hand, when tested on foreground segmentation tasks, our approach, in turn, seems to yield better results (see Table 12). The results we obtained are in agreement with observations made by the authors when testing their method on detecting main objects in single frames against human annotations (e.g. Precision: 29.9, Recall: 59.3, Mean IoU: 24.8). Their high recall and lower precision agree with our observations that their segmentation covers larger parts of the image, while ours provides sharper and smaller masks. This observation led us towards extracting large crops on a 5-point grid and combining the results (termed AlexNet-Seg MO). As seen in the experiments, taking multiple outputs over the grid eventually brought a relatively small improvement of \(0.6\%\) (see Table 13). In Fig. 11 we present qualitative results of our DilateU-Net teacher for the AlexNet-Seg student trained with a single foreground object detected (termed SO Teacher) and our 5-grid multiple objects (termed MO Teacher) segmentation result. We also present the results of our AlexNet-Seg student in both cases, as well as the outputs of Pathak et al. (2017) for comparison. While our method has better results on the task of segmentation, their method is more suited for transfer learning experiments. We suspect that their larger masks (Fig. 11), with lower precision but relatively high recall and high confidence values, could be more flexible and less conservative for the final transfer learning stage, where multiple objects need to be detected over the whole image. At the same time, ours is specialized in obtaining generally sharper and better quality foreground segmentation masks. Overall, the transfer learning experiments show that our approach is suited for such a task, as we obtain a performance that is in the top three among the state of the art methods using the same experimental setup.

5.4 Concluding Remarks on Experiments

One of the interesting conclusions of our experimental analysis is that the system is able to improve its performance from Iteration 1 to Iteration 2. Several factors are involved in this outcome, which we believe are related through the following chain: (1) multiple students of diverse structures ensure diversity and somewhat independent mistakes; (2) in turn, point (1) makes possible the unsupervised training of a mask selection module that learns to predict agreements; (3) thus, the selection module from (2) becomes a good mask evaluation network; (4) once the evaluation network from (3) is available, we can add larger and potentially more complex data, from which we select a larger set of good object masks covering more interesting cases for the next iteration; (5) finally, (4) ensures the improvement at the next iteration, at which point we can return to (1).

6 Short Discussion on Unsupervised Learning

The ultimate goal of unsupervised learning might not be about matching the performance of the supervised case but rather about reaching beyond the capabilities of the classical supervised scenario. An unsupervised system should be able to learn and recognize different object classes, such as animals, plants and man-made objects, as they evolve and change over time, from the past and into the unknown future. It should also be able to learn about new classes that might be formed, in relation to others, maybe known ones. We see this case as fundamentally different from the supervised one, in which the classifier is forced to learn from a distribution of samples that is fixed and limited to a specific period of time, namely the period in which the human labeling was performed. Therefore, in the supervised learning paradigm a car from the future should not be classified as a car, because it is not a car according to the supervised distribution of cars given at present training time, when human annotations are collected. On the other hand, a system that learns by itself should be able to track how cars have been changing in time and recognize such objects as “cars”, with no step-by-step human intervention.

Current unsupervised learning methods might still not be able to learn profound semantic information (Bau et al. 2017), but the ability to learn to segment foreground objects in an unsupervised fashion constitutes evidence that we are moving in the right direction. In order to understand and learn about semantic classes, the system would need to learn by itself about how such objects interact with each other and what role they play within the larger spatiotemporal story. While our unsupervised methods are still far from reaching this level of interpretation, the ability to learn about and detect objects that constitute the foreground within their local spatial context could constitute an important building block. It is an element that could be used to further learn about more complex interactions and behaviour in both space and time.

From the larger spatiotemporal perspective, unsupervised learning is about continuous learning and adaptation to huge quantities of data that are perpetually changing. Human annotation is extremely limited in an ocean of data and not able to provide the so called “ground truth” information continuously. Therefore, unsupervised learning, and especially its weaker version—learning from large quantities of data with minimal human intervention—will soon become a core part, larger than the supervised one, in the future of artificial intelligence.

7 Conclusions and Future Work

In this article, we present a novel and effective approach to learning from large collections of images and videos, in an unsupervised fashion, to segment foreground objects in single images. We present a relatively general algorithm for this task, which offers the possibility of learning several generations of students and teachers. We demonstrate in practice that the system improves its performance over the course of two generations. We also test the impact of the different system components on performance and show state of the art results on three different datasets, while also showing top performance on challenging transfer learning experiments. Our system is one of the first in the literature that learns to detect and segment foreground objects in images in an unsupervised fashion, with no pre-trained features given or manual labeling, while requiring only a single image at test time.

The convolutional networks trained along the student pathway are able to learn general “objectness” characteristics, which include good form, closure, smooth contours, as well as contrast with the background. What the simpler initial VideoPCA teacher discovers over time, the deep, complex student is able to learn across several layers of image features at different levels of abstraction. Our results on transfer learning experiments are also encouraging and show additional cases in which such a system could be useful. In future work we plan to further grow our computational and storage capabilities to demonstrate the power of our unsupervised learning algorithm along many generations of student and teacher networks. We believe that our approach, tested here in extensive experiments, could bring a valuable contribution to computer vision research.