Unsupervised learning of foreground object detection

Unsupervised learning poses one of the most difficult challenges in computer vision today. The task has immense practical value, with many applications in artificial intelligence and emerging technologies, as large quantities of unlabeled videos can be collected at relatively low cost. In this paper, we address the unsupervised learning problem in the context of detecting the main foreground objects in single images. We train a student deep network to predict the output of a teacher pathway that performs unsupervised object discovery in videos or large image collections. Our approach is different from published methods on unsupervised object discovery. We move the unsupervised learning phase to training time; at test time we apply only the standard feed-forward processing along the student pathway. This strategy has the benefit of allowing increased generalization possibilities during training, while remaining fast at testing. Our unsupervised learning algorithm can run over several generations of student-teacher training. Thus, a group of student networks trained in the first generation collectively create the teacher at the next generation. In experiments our method achieves top results on three current datasets for object discovery in video, unsupervised image segmentation and saliency detection. At test time the proposed system is fast, being one to two orders of magnitude faster than published unsupervised methods.


Introduction
Unsupervised learning is one of the most difficult and interesting problems in computer vision and machine learning today. Many researchers believe that learning from large collections of unlabeled videos could help decode hard questions regarding the nature of intelligence and learning. Moreover, as unlabeled videos are easy to collect at relatively low cost, unsupervised learning could be of real practical value in many computer vision and robotics applications. In this article we propose a novel approach to unsupervised learning that successfully tackles many of the challenges associated with this task. We present a system that is composed of two main pathways, one that performs unsupervised object discovery in videos or large image collections along the teacher branch, and the other, the student branch, which learns from the teacher to detect foreground objects in single images. Our approach is general in the sense that the student or teacher pathways do not depend on a specific neural network architecture or implementation. Also, our approach allows the unsupervised learning process to continue over several generations of students and teachers. In Algorithm 1 we present the high level description of our method. We will use throughout the paper the terms "generation" and "iteration" of Algorithm 1 interchangeably. A preliminary version of this work, without presenting the possibility of learning over several generations and with fewer experimental results appeared at ICCV 2017 (Croitoru et al (2017)).
In Figure 1 we present a graphic overview of our full system. In the unsupervised training stage the student network (module A) learns, frame by frame, from an unsupervised teacher pathway (modules B and C) to produce similar object masks in single images. The student branch tries to imitate for each frame the output of the teacher, while having as input only a single image, the current frame. The teacher, on the other hand, has access to an entire video sequence. The method presented in Algorithm 1 follows the main steps of the system as it learns from one iteration (generation) to the next. The steps are discussed in more detail in Section 3.
During the first iteration of Algorithm 1, the unsupervised teacher pathway has access to information over time (a video). In contrast, the student is deeper in structure, but it has access only to a single image, the current video frame. Thus, the information discovered by the teacher in time is captured by the student in added depth, over neural layers of abstraction. Several student nets with different architectures are trained at the first iteration. In order to use only good quality masks as supervisory signal, an unsupervised mask selection procedure is applied, as explained in Section 4. Once several student nets are trained, their output is combined to form the teacher at the next iteration. Then, at the next generation, we run the newly formed teacher on a larger set of unlabeled videos to produce the supervisory signal for the next generation of students. Note that while at the first iteration the teacher pathway is required to receive video sequences as input, from the second generation on it could also receive large image collections as input. Due to the very high computational and storage costs required at training time, we limit our experiments to learning over two generations, but our algorithm is general and could run over many iterations. We show in extensive experiments that even two generations are sufficient to significantly outperform the current state of the art on object discovery in video and images. We also demonstrate a solid improvement from one generation to the next. Now we enumerate the main contributions of our approach: 1) We introduce a novel approach to unsupervised learning from videos to detect foreground objects in images. The overview of our system and algorithm are presented in Figure 1 and Algorithm 1.
The system has two main pathways: one that acts as a teacher and discovers objects in videos or large collections of images, and the other that acts as student and learns from the teacher to detect the foreground objects in single input images. We provide a general algorithm for unsupervised learning over several generations of students and teachers. We experiment with different types of student nets and show how they collectively work together to form the teacher at the next generation. This is done in conjunction with a novel unsupervised soft-mask selection scheme. We demonstrate experimentally that within a generation the students are more powerful than their teachers, while both pathways improve significantly from one generation to the next.
Fig. 1 The dual student-teacher system proposed for unsupervised learning to detect foreground objects in images, functioning as presented in Algorithm 1. It has two pathways: along the teacher branch, an object discoverer in videos or large image collections (module B) detects foreground objects. The resulting soft masks are then filtered based on an unsupervised data selection procedure (module C). The resulting final set of pairs, each consisting of an input image (a video frame) and the soft mask for that particular frame (which acts as an unsupervised label), is used to train the student pathway (module A). The whole process can be repeated over several generations. At each generation several student CNNs are trained, then they collectively contribute to form a more powerful teacher at the next iteration of the overall algorithm.
2) At the higher level, our proposed algorithm is sufficiently general to accommodate different implementations and neural network architectures. In this paper, we also provide a specific implementation which we describe in detail.
We demonstrate its performance on three recent datasets, namely YouTube Objects (Prest et al (2012)), Object Discovery in Internet Images (Rubinstein et al (2013)) and Pascal-S, on which we obtain state of the art results. To the best of our knowledge, it is the first system that learns to detect and segment foreground objects in images in an unsupervised fashion, with no pre-trained features given or manual labeling, while requiring only a single image at test time.

Scientific context
The literature on unsupervised learning follows two main directions. 1) One is to learn powerful features in an unsupervised way and then use them for transfer learning, within a supervised scheme and in combination with different classifiers, such as SVMs or CNNs (Radenović et al (2016); Misra et al (2016); Li et al (2016)). 2) The second direction is to discover, at test time, common patterns in unlabeled data, using clustering, feature matching or data mining formulations (Jain et al (1999); Cho et al (2015); Sivic et al (2005)).
Belonging to the first category and closely related to our work, the approach in Pathak et al (2017) proposes a system in which a deep neural network learns to produce soft object masks from an unsupervised module that uses optical flow cues in video. The deep features learned in this manner are then applied to several transfer learning tasks. Different from their work, we provide a more general approach that could learn in an unsupervised manner over several generations. From an experimental point of view, while Pathak et al (2017) test their work on a supervised transfer learning task, we evaluate ours on specific unsupervised foreground object detection and segmentation tasks and demonstrate state of the art performance, often by a large margin.
Recently, researchers have started to use the natural, spatial and temporal structure in images and videos as supervisory signals in unsupervised learning approaches that are considered to follow a self-supervised learning paradigm (Raina et al (2007); Lee et al (2017); Wang and Gupta (2015a)). Methods that fall into this category include those that learn to estimate the relative patch positions in images (Doersch et al (2015)), predict color channels (Larsson et al (2016)), solve jigsaw puzzles (Noroozi and Favaro (2016)) and inpaint (Pathak et al (2016)). One trend is to use as supervisory signal the spatial and appearance information collected from raw single images. In such single-image cases the amount of information that can be learned is limited to a single moment in time, as opposed to the case of learning from video sequences. Using unlabeled videos as input is more closely related to our work and includes learning to predict the temporal order of frames (Lee et al (2017)), generate the future frame (Finn et al (2016); Xue et al (2016); Goroshin et al (2015)) or learn from optical flow (Wang and Gupta (2015b)).
For most of these papers, the unsupervised learning scheme is only an intermediate step to train features that are eventually used on classic supervised learning tasks, such as object classification, object detection or action recognition. Such pre-trained features perform better than randomly initialized ones, as they contain valuable semantic information implicit in the natural structure of the world used as supervisory signal. In our work, we focus mostly on specific unsupervised tasks on which we perform extensive evaluations, but we also show some results on transfer learning experiments.
The second main approach to unsupervised learning includes methods for image co-segmentation (Joulin et al; Leordeanu et al (2012)) and weakly supervised localization (Deselaers et al (2012); Nguyen et al (2009); Siva et al (2013)). Earlier methods are based on local feature matching and detection of their co-occurrence patterns (Stretcu and Leordeanu (2015); Sivic et al (2005); Leordeanu et al (2005); Parikh and Chen (2007); Liu and Chen (2007)), while more recent ones (Joulin et al (2014); Rochan and Wang (2014)) discover object tubes by linking candidate bounding boxes between frames with or without refining their location. Traditionally, the task of unsupervised learning from image sequences has been formulated as a feature matching or data clustering optimization problem, which is computationally very expensive due to its combinatorial nature.
There are also other papers (Lee et al (2011);Cheng et al (2017); Dutt Jain et al (2017); Tokmakov et al (2017)) that tackle unsupervised learning tasks but are not fully unsupervised, using powerful features that are pre-trained in supervised fashion on large datasets, such as ImageNet (Russakovsky et al (2015)) or VOC2012 (Everingham et al (2015)). Such works take advantage of the rich source of supervised information learned from other datasets, through features trained to respond to general object properties over tens or hundreds of object categories.
With respect to the end goal, our work is more related to the second research direction, on unsupervised discovery in video. However, unlike that research, we do not discover objects at test time, but during the unsupervised training process, when the student pathway learns to detect foreground objects. Therefore, from the learning perspective, our work is more related to the first research direction, based on self-supervised training.

Overall approach
We propose a genuine unsupervised learning algorithm for foreground object detection that offers the possibility to improve over several iterations. Our method combines in complementary ways multiple modules that are well suited for this task. It starts with a teacher pathway that discovers objects in unlabeled videos and produces a soft mask of the foreground object in each frame. The resulting soft-masks of lower quality are then filtered out automatically. Next, the remaining ones are passed to a student ConvNet, which learns to predict object masks in single images. When several student nets of different architectures are learned they form a new teacher for the next generation, then the whole process is repeated. At the next iteration we bring in more unlabeled data, we learn in an unsupervised fashion a better data selection mechanism and ultimately train more powerful student networks. In Algorithm 1 we enumerate concisely the main steps of our approach.

Algorithm 1 Unsupervised learning of foreground object detection
Step 1: perform unsupervised object discovery in unlabeled videos, along the teacher pathway (module B in Figure 1).
Step 2: automatically filter out poor soft masks produced at the previous step (module C in Figure 1).
Step 3: use the remaining masks as supervisory signal for training one or more student nets, along the student pathway (module A in Figure 1).
Step 4: use the ensemble of student nets to form a new teacher and learn a more powerful soft-mask selector, for the next iteration (referred to as a novel student-teacher generation).
Step 5: extend the unlabeled video dataset and return to Step 1 to train the next generation (note that from this step forward, the training dataset can also be extended with collections of unlabeled images, not just videos).
Now we present the main algorithm in more detail. At Step 1 we start with an object discoverer in video sequences. There are several available methods for video discovery in the literature, with good performance (Borji et al (2012); Cheng et al (2015); Barnich and Van Droogenbroeck (2011)). We chose the VideoPCA algorithm introduced as part of the system in Stretcu and Leordeanu (2015) because it is very fast (50-100 fps), uses very simple features (individual pixel colors) and it is completely unsupervised, with no usage of supervised pre-trained features. It learns how to separate the foreground from the background. It exploits the spatio-temporal consistency in appearance, shape, movement and location of objects, common in video shots, along with the contrasting properties, in size, shape, motion and location, between the main object and the background scene. Note that it would be much harder, at this first stage, to discover objects in collections of unrelated images, where there is no smooth variation in shape, appearance and location over time. Only at the second iteration of the algorithm, the simpler VideoPCA is replaced with a more powerful ensemble of student nets which is able to discover objects in collections of images as well.
The teacher branch produces soft foreground masks, one per frame, which are not always of good quality. Thus, at Step 2, we use, during the first iteration, a simple and effective way to filter out poor masks. Only at the second iteration are we able to learn a more powerful soft-mask selector (see Section 4.2.1). The soft-masks that pass the filtering phase are then used (Algorithm 1, Step 3) to train the student pathway. As we want the student branch to learn general visual properties of objects in images, we limit its access to a single input image.
Our approach offers the possibility of improving performance by training a next generation of object detectors. In experiments, we found that there are three key aspects, which are effective at improving generalization at the next iteration: 1) we need to train several student nets (at module A), preferably of different architectures, which are stronger in combination than separately. Then, they become the teacher (module B) at the next iteration; 2) we train, also in an unsupervised fashion, a better soft-mask selector (module C); 3) it is preferred to increase the unlabeled training set at the next iteration, for improved generalization.
Having access to the complete training set at the very first iteration could be useful, but it is not optimal. At that stage, the teacher is still weak and imposes a certain limitation on how much could be learned from the data, no matter how large that data is. Getting access to a larger unlabeled training dataset is more effective at the second iteration, when the teacher pathway is significantly stronger. The idea of gradually increasing the complexity in the training set is also related to curriculum learning (Bengio et al (2009)), when we start with simpler cases then add more difficult ones. Increasing the strength of the teacher pathway improves the quality of the supervisory signal, while introducing more unlabeled data increases variety. Both act together in order to improve generalization.
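The generational loop of Algorithm 1 can be summarized in a short Python sketch. This is only an illustration of the control flow under simplifying assumptions, not the implementation used in our experiments: all names (`run_generations`, `train_students`, `learn_selector`) are hypothetical, and the teacher is modeled as a per-image function even though the first-generation teacher actually processes whole videos.

```python
from typing import Callable, List, Sequence, Tuple

# All names here are hypothetical placeholders used for illustration only.
Mask = List[List[float]]                     # soft mask as nested lists
Teacher = Callable[[object], List[Mask]]     # image -> candidate soft masks
Selector = Callable[[object, Mask], bool]    # accepts or rejects one mask
Student = Callable[[object], Mask]           # image -> predicted soft mask


def run_generations(
    datasets: Sequence[Sequence[object]],
    teacher: Teacher,
    selector: Selector,
    train_students: Callable[[List[Tuple[object, Mask]]], List[Student]],
    learn_selector: Callable[[List[Student]], Selector],
) -> List[List[Student]]:
    """Run one student-teacher generation per entry in `datasets`."""
    history = []
    for images in datasets:                  # Step 5: next unlabeled set
        pairs = []
        for image in images:
            for mask in teacher(image):      # Step 1: teacher proposes masks
                if selector(image, mask):    # Step 2: drop poor masks
                    pairs.append((image, mask))
        students = train_students(pairs)     # Step 3: fit the student nets
        history.append(students)
        # Step 4: the students become the next teacher, with a new selector.
        teacher = lambda image, nets=students: [net(image) for net in nets]
        selector = learn_selector(students)
    return history
```

Passing a larger dataset as the second entry of `datasets` corresponds to extending the unlabeled training set only once the teacher has become stronger.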

System architecture
We now detail the architecture and training process of our system, module by module, as seen in Figure 1. We first present the student pathway (module A in Figure 1), which takes as input an individual image (e.g. the current frame in the video) and learns to predict foreground soft-masks from an unsupervised teacher. The teacher pathway (represented by modules B and C in Figure 1) is explained in detail in Section 4.2.

Student path: single-image segmentation
The student processing pathway (module A in Figure 1) consists of a deep convolutional network. We test different neural network architectures, some of which are commonly used in the recent literature on semantic image segmentation. We create a small pool of relatively diverse architectures, presented next.
The first convolutional network architecture for semantic segmentation that we test is based on a more traditional CNN design. We term it LowRes-Net (see Figure 2) due to its low resolution soft-mask output. It has ten layers (seven convolutional, two pooling and one fully connected) and skip connections. Skip connections have proved to offer a boost in performance, as shown in the literature (Raiko et al (2012); Pinheiro et al (2016)). We also observed a similar improvement in our experiments when using skip connections. The LowRes-Net takes as input a 128 × 128 RGB image (along with its hue, saturation and derivatives w.r.t. x and y) and produces a 32 × 32 soft segmentation of the main objects present in the image. Because LowRes-Net has a fully connected layer at the top, we reduced the output resolution of the soft-segmentation mask, to limit memory cost. While the derivatives w.r.t. x and y are in principle not needed (as they could be learned by appropriate filters during training), in our tests explicitly providing the derivatives along with HSV and using skip connections boosted the accuracy by over 1%. The LowRes-Net has a total of 78M parameters, most of them being in the last, fully connected layer.
Fig. 2 Different architectures for the "student" networks, each processing a single image. They are trained to predict the unsupervised label masks given by the teacher pathway, frame by frame. The architectures vary from the more classical baseline LowRes-Net (left), with low resolution output, to more recent architectures, such as the fully convolutional one (middle) and different types of U-Nets (right). For the U-Net architecture the blocks denoted with double arrows can be interchanged to obtain a new architecture. We noticed that on the task of bounding box fitting the simpler low-resolution network performed very well, while being outperformed by the U-Nets on fine object segmentation.
The second CNN architecture tested, termed FConv-Net, is fully convolutional (Long et al (2015)), as also presented in Figure 2. It has a higher resolution output of 128x128, with input size 256x256. Its main structure is derived from the basic LowRes-Net model. Different from LowRes-Net, it is missing the fully connected layer at the end and has more parameters in the convolutional layers, for a total of 13M parameters.
We also tested three different nets based on the U-Net (Ronneberger et al (2015)) architecture, which proved very effective in the semantic segmentation literature. Our U-Net networks are: 1) BasicU-Net, 2) DilateU-Net, similar to BasicU-Net but using atrous (dilated) convolutions (Yu and Koltun (2015)) in the center module, and 3) DenseU-Net, with dense connections in the down and up modules (Jégou et al (2017)).
The BasicU-Net has 5 down modules with 2 convolutional layers each, with 32, 64, 128, 256 and 512 feature maps, respectively. In the center module the BasicU-Net has two convolutional layers with 1024 feature maps each. The up modules have 3 convolutional layers and the same number of feature maps as the corresponding down modules. The only difference between BasicU-Net and DilateU-Net is that the latter has a different center module, with 6 atrous convolutions and 512 feature maps each. Then, DenseU-Net has 4 down modules with 4 corresponding up modules. Each down and up module has 4 convolutions with skip connections (as presented in Figure 2). The modules have 12, 24, 48 and 64 feature maps, respectively. The transition represents a convolution, having the role of reducing the output number of feature maps from each module. The BasicU-Net has 34M parameters, while the DilateU-Net has 18M parameters. DenseU-Net has only 3M parameters, but uses skip connections inside the up and down blocks in order to make up for the difference in the number of parameters. All three U-Nets have 256x256 input and same resolution output. All networks use ReLU activation functions. Please see Figure 2 for more specific details regarding the architectures of the different models.
Given the current setup, the student nets do not learn to identify specific object classes. They will learn to softly segment the main foreground objects present, regardless of their particular category. The main difference in their performance is in their ability to produce fine object segmentations. While the LowRes-Net tends to provide a good support for estimating the object's bounding box due to its simpler output, the other ConvNets (especially the U-Nets), with higher resolution, are better at finely segmenting objects. Due to the different ways in which the particular models make mistakes, they are always stronger when forming an ensemble. In experiments we also show that they outperform their teacher and are able to detect objects from categories that were not seen during training.

Student networks ensemble
The pool of student networks with different architectures produces varied results that differ qualitatively. While the bounding boxes computed from their soft-masks have similar accuracy, the actual soft-segmentation outputs look different. The networks have different strengths and make different kinds of mistakes. This observation immediately suggests that they should be stronger in combination, so we have experimented with the idea of combining them into an ensemble. We propose two types of ensembles.
The first one, termed Multi-Net, outputs a soft-mask that is obtained by multiplying pixel-wise the soft-masks produced by each individual student net. Thus, only positive pixels, on which all nets agree, survive to the final segmentation. Multi-Net offers robust masks of significantly higher quality. In Section 4.2.1 we show how Multi-Net can be effectively used to learn in an unsupervised fashion, a network (EvalSeg-Net) for evaluating the goodness of a specific segmentation. That network is an important part of the next generation teacher pathway and replaces module C at the next iteration.
The second approach to forming an ensemble is to use EvalSeg-Net in order to select the best soft-mask from the pool of masks generated by the student nets. We term this ensemble system, MultiSelect-Net. Quantitatively, MultiSelect-Net and Multi-Net perform similarly, but Multi-Net tends to produce fuzzier masks due to the additional multiplication of the student's soft-masks.
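The two ensembles can be illustrated with a minimal NumPy sketch. It assumes the student soft-masks have been resized to a common resolution and rescaled to [0, 1], and that per-mask quality scores (as EvalSeg-Net would provide) are already available; the function names are hypothetical.

```python
import numpy as np


def multi_net(masks: np.ndarray) -> np.ndarray:
    """Pixel-wise product of student soft masks: (num_nets, H, W) -> (H, W)."""
    return np.prod(masks, axis=0)


def multi_select_net(masks: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Return the single student mask with the highest quality score."""
    return masks[int(np.argmax(scores))]


# Two toy 2x2 soft masks standing in for the outputs of two student nets.
masks = np.array([
    [[1.0, 0.8], [0.0, 0.5]],
    [[0.5, 1.0], [0.9, 0.5]],
])
product = multi_net(masks)  # a pixel survives only if every net responds there
best = multi_select_net(masks, np.array([0.3, 0.7]))  # assumed quality scores
```

In the toy example the bottom-left pixel is zeroed by the product because one net assigns it no foreground evidence, which is the agreement effect described for Multi-Net.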

Training the student ConvNets
We treat foreground object segmentation as a multidimensional regression problem, where the soft mask given by the unsupervised video segmentation system acts as the desired output. Let $I$ be the input RGB image (a video frame) and $Y$ be the corresponding 0-255 valued soft segmentation given by the unsupervised teacher for that particular frame. The goal of our network is to predict a soft segmentation mask $\hat{Y}$ of width $W$ and height $H$ (where $W = H = 32$ for the basic architecture, $W = H = 128$ for the fully convolutional architecture and $W = H = 256$ for the U-Net architectures) that approximates as well as possible the mask $Y$. For each pixel in the output image, we predict a 0-255 value, so that the total difference between $Y$ and $\hat{Y}$ is minimized. Thus, given a set of $N$ training examples, let $I^{(n)}$ be the input image (a video frame), $\hat{Y}^{(n)}$ be the predicted output mask for $I^{(n)}$, $Y^{(n)}$ the soft segmentation mask (corresponding to $I^{(n)}$) and $w$ the network parameters. $Y^{(n)}$ is produced by the video discoverer after processing the video that $I^{(n)}$ belongs to. Then, our loss is:
$$L(w) = \frac{1}{N} \sum_{n=1}^{N} \sum_{p=1}^{W \times H} \left( Y_p^{(n)} - \hat{Y}_p^{(n)}(w, I^{(n)}) \right)^2$$
where $p$ indexes the pixels of a mask. We observed that in our tests, the L2 loss performed better than the cross-entropy loss, due to the fact that the soft-masks used as labels have real values, not discrete ones. Also, they are not perfect, so the idea of thresholding them for training does not perform as well as directly predicting their real values. We train our network using the TensorFlow (Abadi et al (2015)) framework with the Adam optimizer (Kingma and Ba (2014)). All models are trained end-to-end using a fixed learning rate of 0.001 for 10 epochs. The training time for any given model is about 3-5 days on an Nvidia GeForce GTX 1080 GPU for the first iteration, and about 2 weeks for the second iteration students.
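The L2 regression objective described above can be written as a small NumPy function. This is a sketch for illustration, not the training code: the function name is hypothetical, and averaging over the N examples is an assumed normalization.

```python
import numpy as np


def l2_mask_loss(teacher_masks: np.ndarray, predicted_masks: np.ndarray) -> float:
    """Sum of squared per-pixel differences, averaged over the N examples.

    Both arrays have shape (N, H, W) and hold soft values (here in 0-255).
    The division by N is an assumption made for this sketch.
    """
    diff = teacher_masks.astype(np.float64) - predicted_masks.astype(np.float64)
    return float(np.sum(diff ** 2) / teacher_masks.shape[0])
```

Because the targets are real-valued soft masks rather than binary labels, a regression loss of this kind uses them directly, without the thresholding step that a classification loss would need.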
Post-processing. The student CNN outputs a W × H soft mask. In order to fairly compare our models with other methods, we have two different post processing steps: 1) bounding box fitting and 2) segmentation refinement. For fitting a box around the soft mask, we first up-sample the W × H output to the original size of the image, then threshold the mask (validated on a small subset), determine the connected components and fit a tight box around each of the components. We perform segmentation refinement (point 2) in a single case, on the Internet Images Dataset as also specified in the experiments section. For that, we use the OpenCV implementation of GrabCut (Rother et al (2004)) to refine our soft mask, up-sampled to the original size. In all other tests we use the original output of the networks.
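The bounding box fitting step can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the method description: nearest-neighbour up-sampling by an integer factor, a caller-supplied threshold, and 4-connectivity for the connected components. The function name `fit_boxes` is hypothetical.

```python
from collections import deque

import numpy as np


def fit_boxes(soft_mask: np.ndarray, scale: int, threshold: float):
    """Return tight (row_min, col_min, row_max, col_max) boxes, inclusive."""
    # Nearest-neighbour up-sampling by an integer factor (a simplification).
    full = np.repeat(np.repeat(soft_mask, scale, axis=0), scale, axis=1)
    binary = full >= threshold
    seen = np.zeros(binary.shape, dtype=bool)
    rows, cols = binary.shape
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not binary[r0, c0] or seen[r0, c0]:
                continue
            # Breadth-first flood fill over one 4-connected component.
            queue = deque([(r0, c0)])
            seen[r0, c0] = True
            r_min = r_max = r0
            c_min = c_max = c0
            while queue:
                r, c = queue.popleft()
                r_min, r_max = min(r_min, r), max(r_max, r)
                c_min, c_max = min(c_min, c), max(c_max, c)
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and binary[nr, nc] and not seen[nr, nc]:
                        seen[nr, nc] = True
                        queue.append((nr, nc))
            boxes.append((r_min, c_min, r_max, c_max))
    return boxes
```

Each connected component yields one box, so an image with several separated foreground regions produces several candidate boxes.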

Teacher path: unsupervised discovery in video
There are several methods available for discovering objects and salient regions in images and videos (Borji et al (2012) (2003); Barnich and Van Droogenbroeck (2011)) with reasonably good performance. More recent methods for foreground object discovery, such as Papazoglou and Ferrari (2013), are both relatively fast and accurate, with a runtime of around 4 seconds per frame. However, that runtime is still prohibitive for training the student CNN, which requires millions of images. For that reason, at the first generation (Iteration 1 of Algorithm 1) we used for module B in Figure 1 the VideoPCA algorithm, which is a part of the whole system introduced in Stretcu and Leordeanu (2015). It has lower accuracy than the full system, but it is much faster, running at 50-100 fps. At this speed we can produce one million unsupervised soft segmentations in a reasonable time of about 5-6 hours.
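As a rough check of these runtimes, assuming single-stream processing of one million frames:

```latex
\[
\frac{10^{6}\ \text{frames}}{50\ \text{fps}} = 2 \times 10^{4}\ \text{s} \approx 5.6\ \text{h},
\qquad
\frac{10^{6}\ \text{frames}}{100\ \text{fps}} = 10^{4}\ \text{s} \approx 2.8\ \text{h},
\]
\[
10^{6}\ \text{frames} \times 4\ \text{s/frame} = 4 \times 10^{6}\ \text{s} \approx 46\ \text{days}.
\]
```

The quoted 5-6 hours therefore corresponds to the lower end of the VideoPCA speed range, while a method running at 4 seconds per frame would need on the order of weeks for the same amount of data.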
VideoPCA. The main idea behind VideoPCA is to model the background in video frames with Principal Component Analysis. It finds initial foreground regions as parts of the frames that are not reconstructed well with the PCA model. Foreground objects are smaller than the background, have contrasting appearance and more complex movements. They could be seen as outliers, within the larger background scene. That makes them less likely to be captured well by the first PCA components. Thus, for each frame, an initial soft-mask is produced from an error image, which is the difference between the original image and the PCA reconstruction. These error images are first smoothed with a large Gaussian filter and then thresholded. The binary masks obtained are used to learn color models of foreground and background, based on which individual pixels are classified as belonging to foreground or not. The object masks obtained are further multiplied with a large centered Gaussian, based on the assumption that foreground objects are often closer to the image center. These are the final masks used in our system. For more technical details, the reader is invited to consult Stretcu and Leordeanu (2015). In this work, we use the method exactly as found online, without any parameter tuning.
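The first stage of this idea, the PCA reconstruction error, can be sketched in NumPy. The sketch is not the VideoPCA implementation: it assumes grayscale frames and a caller-chosen number of components, and it omits the Gaussian smoothing, thresholding, color models and center prior described above. The function name is hypothetical.

```python
import numpy as np


def pca_error_maps(frames: np.ndarray, num_components: int) -> np.ndarray:
    """Per-pixel PCA reconstruction error for frames of shape (T, H, W)."""
    t, h, w = frames.shape
    data = frames.reshape(t, h * w).astype(np.float64)
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions of the set of frames.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]
    # Project onto the leading components and map back to pixel space.
    reconstruction = centered @ basis.T @ basis + mean
    return np.abs(data - reconstruction).reshape(t, h, w)
```

Pixels that follow the dominant background variation are reconstructed almost exactly, while a small region that deviates from it in a single frame retains a large error, which is what makes the error image usable as an initial foreground cue.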
Teacher pathway at the next generation: At the next iteration of Algorithm 1, VideoPCA (in module B) is replaced by the student nets trained at the previous iteration, in the following way. While we could use either of the two ensembles, Multi-Net or MultiSelect-Net, as the new module B, we preferred a simpler and more efficient approach. For each unlabeled training image we run all student nets and obtain multiple soft-masks, without combining them to produce a single output per image. Therefore the new module B is the collection of all student nets acting in parallel. Then, their soft-masks are filtered independently (using a given threshold) by the new module C in Figure 1, which is represented at the second iteration by EvalSeg-Net. Note that it is possible in this manner to obtain one, several or no soft segmentations for a given training image. This approach is fast and it offers the advantage of processing data in parallel over multiple GPUs, without having to wait for all student nets to finish for every input image. As our experiments demonstrate, the approach is also effective, with significantly better results at the second generation.
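The per-image labeling performed by the second-generation teacher can be sketched as follows. The function and parameter names are hypothetical; `score` stands in for EvalSeg-Net and `threshold` for the selection threshold.

```python
from typing import Callable, List, Sequence, Tuple


def label_image(
    image: object,
    students: Sequence[Callable],
    score: Callable,
    threshold: float,
) -> List[Tuple[object, object]]:
    """Return (image, mask) training pairs whose predicted quality passes."""
    pairs = []
    for net in students:
        mask = net(image)                      # each student runs independently
        if score(image, mask) >= threshold:    # masks are filtered one by one
            pairs.append((image, mask))
    return pairs
```

Since every mask is judged on its own, the returned list may hold one pair, several pairs or none for a given image, and the students can be evaluated on separate devices without synchronizing per image.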

Unsupervised soft masks selection
The performance of the student net is influenced by the quality of the soft masks provided as labels by the teacher branch. The cleaner the masks, the better the chances that the student learns to segment objects well in images. VideoPCA tends to produce good results if the object present in the video stands out well against the background scene, in terms of motion and appearance. However, if the object is occluded at some point, does not move w.r.t. the scene or has a similar appearance to its background, the resulting soft masks might be poor. In the first generation, we used a simple measure of mask quality to select only the good soft-masks for training the student pathway, based on the following observation: when VideoPCA masks are close to the ground truth, the average of their non-zero values is usually high. Thus, when the discoverer is confident, it is more likely to be right. The average value of non-zero pixels in the soft mask is then used as a score indicator for each segmented frame. Only masks of a certain quality according to this indicator are selected and used for training the student nets. This represents module C in Figure 1 at the first generation of Algorithm 1. While being effective at iteration 1, the simple average of mask values cannot capture the goodness of a segmentation at the higher level of overall shape. At the next iterations, we therefore explore new ways to improve it.
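The first-generation selection rule can be sketched in a few lines of NumPy. The function names and the cut-off `min_score` are hypothetical; the actual selectivity level is a choice made in the experiments.

```python
import numpy as np


def mean_nonzero_score(soft_mask: np.ndarray) -> float:
    """Average of the non-zero mask values; 0.0 for an all-zero mask."""
    nonzero = soft_mask[soft_mask > 0]
    return float(nonzero.mean()) if nonzero.size else 0.0


def select_masks(masks, min_score: float):
    """Return the indices of masks whose score reaches `min_score`."""
    return [i for i, m in enumerate(masks) if mean_nonzero_score(m) >= min_score]
```

The score depends only on how confident the non-zero pixels are, not on where they lie, which is why it cannot judge the overall shape of a segmentation.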
Consequently, at the next iterations we propose an unsupervised way for learning the EvalSeg-Net to estimate segmentation quality. As mentioned previously, Multi-Net provides masks of higher quality as it cancels errors from individual student nets. Thus, we use the cosine similarity between a given individual segmentation and the ensemble Multi-Net mask, as a cost for "goodness" of segmentation. Having this unsupervised segmentation cost we train the EvalSeg-Net deep neural net to predict it. As previously mentioned, this net acts as an automatic mask evaluation procedure, which in subsequent iterations becomes module C in Figure 1, replacing the simple mask average value used at Iteration 1. Only masks that pass a certain threshold are used for training the student path.
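The unsupervised target used to train EvalSeg-Net can be sketched as follows, assuming flattened soft masks stored as lists and the pixel-wise product form of Multi-Net described elsewhere in the paper. The function names are hypothetical, and returning 0.0 for an all-zero mask is a choice made only for this example.

```python
import math
from typing import List, Sequence


def multi_net(masks: Sequence[Sequence[float]]) -> List[float]:
    """Pixel-wise product of the (flattened) soft masks of all student nets."""
    out = [1.0] * len(masks[0])
    for mask in masks:
        out = [a * b for a, b in zip(out, mask)]
    return out


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened masks (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


student_masks = [[1.0, 0.5, 0.0], [1.0, 0.5, 0.2], [0.9, 0.1, 0.0]]
ensemble = multi_net(student_masks)
# One regression target per student mask: its agreement with the ensemble.
targets = [cosine_similarity(m, ensemble) for m in student_masks]
```

A student mask that agrees with the ensemble receives a target close to 1, while a mask that activates regions suppressed by the other students receives a lower one.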
The architecture of EvalSeg-Net is similar to LowRes-Net (Figure 2), with two differences: the input channel containing image derivatives is replaced by the soft segmentation that requires evaluation, and the network has no skip connections. Also, after the last fully connected layer (size 512) we add a final one-neuron layer to predict the segmentation quality score, which is a single real-valued number.
Fig. 3 Purity of soft masks vs. degree of selection. When selectivity increases, the true purity of the training frames improves. Our automatic selection method is not perfect: some low quality masks may have high scores, while other good ones may be ranked lower. At the first iteration of Algorithm 1 we select masks obtained with VideoPCA, while at the second generation we select masks obtained with the teacher at the second generation. The plots are computed using results from the VID dataset, where there is an annotation for each input frame. Note the significantly better quality of masks at the second iteration (red vs. blue lines, in the left plot). We have also compared the simple "mean" based selection procedure used at iteration 1 (yellow line) with EvalSeg-Net used at iteration 2 (red line), on the same soft masks from iteration 2. EvalSeg-Net is more powerful, which justifies its use at the second iteration, when it replaces the simple "mean" based procedure.
Let $I$ be an input RGB image, $S$ an input soft-mask, and $\hat{Y} = \prod_{i=1}^{5} \hat{Y}_{N_i}$ the output of our Multi-Net, where $\hat{Y}_{N_i}$ denotes the output of network $N_i$. We treat the segmentation "goodness" evaluation task as a regression problem in which we want to predict the cosine similarity between $S$ and $\hat{Y}$. The loss for EvalSeg-Net is therefore defined over the $K$ training examples, with $o^{(k)}(w, I^{(k)}, S^{(k)})$ denoting the output of EvalSeg-Net for image $I^{(k)}$ and soft mask $S^{(k)}$.
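For concreteness, one plausible form of this regression objective is written out below. The squared-error form and the $1/K$ normalization are assumptions made for illustration; only the symbols $K$, $w$, $I^{(k)}$, $S^{(k)}$, $o^{(k)}$ and the cosine-similarity target come from the text.

```latex
c^{(k)} = \frac{\langle S^{(k)}, \hat{Y}^{(k)} \rangle}
               {\lVert S^{(k)} \rVert_2 \, \lVert \hat{Y}^{(k)} \rVert_2},
\qquad
L(w) = \frac{1}{K} \sum_{k=1}^{K}
       \left( o^{(k)}\!\left(w, I^{(k)}, S^{(k)}\right) - c^{(k)} \right)^{2}
```

Here $\hat{Y}^{(k)}$ is the Multi-Net output for image $I^{(k)}$, and $c^{(k)}$ is the unsupervised target that EvalSeg-Net learns to predict.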
Given a certain metric for segmentation evaluation (depending on the learning iteration), we keep only the soft masks above a threshold for each dataset (e.g. VID (Russakovsky et al (2015)), YTO (Prest et al (2012)), YouTube Bounding Boxes (Real et al (2017))). In the first iteration this threshold was obtained by sorting the VideoPCA soft-masks by score and keeping only the top 10 percent, while in the second iteration we validate a threshold (= 0.8) on a small dataset and select each mask independently by applying this threshold to the single-value output of EvalSeg-Net.
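The two selection rules can be contrasted in a short sketch. The function names are hypothetical, scores are assumed to be plain floats (the mean-based score at iteration 1, the EvalSeg-Net output at iteration 2), and treating a score equal to the threshold as accepted is an assumption.

```python
from typing import List, Sequence


def select_top_fraction(scores: Sequence[float], fraction: float = 0.10) -> List[int]:
    """Iteration 1 rule: indices of the highest-scoring `fraction` of masks."""
    n_keep = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def select_above_threshold(scores: Sequence[float], threshold: float = 0.8) -> List[int]:
    """Iteration 2 rule: each mask is kept or dropped on its own score."""
    return [i for i, s in enumerate(scores) if s >= threshold]
```

The first rule is relative, so it needs the scores of the whole set before any mask can be accepted; the second is absolute, so masks can be filtered one at a time as they are produced.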
Mask selection evaluation. In Figure 3 we present the dependency of segmentation performance with respect to ground truth object boxes (used only for evaluation) on the percentile p of masks kept after the automatic selection, for both generations. We notice the strong correlation between the percentage of frames kept and the quality of segmentations. It is also evident that EvalSeg-Net is vastly superior to the simpler procedure used at iteration 1. EvalSeg-Net is able to correctly evaluate soft segmentations even in more complex cases (see Figure 4).
Even though we can expect to improve the quality of the unsupervised masks by drastically pruning them (e.g. keeping a smaller percentage), the fewer masks we are left with, the less training data we get, which increases the chance of overfitting. We make up for the loss of training data by augmenting the set of training masks and by enlarging the actual unlabeled training set at the second generation. There is a trade-off between level of selectivity and training data size: the more selective we are about which masks we accept for training, the more videos we need to collect and process through the teacher pathway to obtain a sufficient amount of training data.
Data augmentation. A drawback of the teacher at the first learning iteration (VideoPCA) is that it can only detect the main object if it is close to the center of the image. The assumption that the foreground is close to the center is often true and indeed helps that method, which has no deep learned knowledge, to produce soft masks with a relatively high precision. Not surprisingly, it often fails when the object is not in the center, therefore its recall is relatively low. Our data augmentation procedure addresses this limitation and can be concisely described as follows: randomly crop patches of the input image, covering 80% of the original image, and scale up the patch to the expected input size. This produces slightly larger objects at locations that cover the whole image area, not just the center. As experiments show, the student net is able to see objects at different locations in the image, unlike its raw teacher (VideoPCA at iteration 1), which is strongly biased towards the image center.
Fig. 4 Qualitative results of the unsupervised EvalSeg-Net used for measuring segmentation "goodness" and filtering bad masks (Module C, iteration 2). For each input image we present five soft-mask candidates (from first iteration students) along with their "goodness" scores given by EvalSeg-Net, in decreasing order of scores. Note the effectiveness of EvalSeg-Net at ranking soft segmentations.
At the second generation, the teacher branch is significantly better at detecting objects at various locations and scales in the image. Therefore, while artificial data augmentation remains useful (as is usually the case in deep learning), its importance diminishes at the second iteration of learning (Algorithm 1).
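The crop-and-rescale augmentation can be sketched as below. Several details are assumptions for the example: "80% of the original image" is read as 80% of the image area, the same crop is applied to the image and its soft mask, rescaling uses nearest-neighbour sampling, and all function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple

Grid = List[List[float]]  # single-channel image or soft mask


def random_crop_box(height: int, width: int, area_fraction: float = 0.8,
                    rng: Optional[random.Random] = None) -> Tuple[int, int, int, int]:
    """Pick a random window (top, left, crop_h, crop_w) covering `area_fraction` of the area."""
    rng = rng or random.Random()
    side = math.sqrt(area_fraction)
    crop_h = max(1, round(height * side))
    crop_w = max(1, round(width * side))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return top, left, crop_h, crop_w


def crop_and_resize(grid: Grid, box: Tuple[int, int, int, int],
                    out_h: int, out_w: int) -> Grid:
    """Cut the window out of `grid` and scale it to (out_h, out_w), nearest neighbour."""
    top, left, crop_h, crop_w = box
    return [
        [grid[top + (r * crop_h) // out_h][left + (c * crop_w) // out_w]
         for c in range(out_w)]
        for r in range(out_h)
    ]


def augment(image: Grid, mask: Grid, out_h: int, out_w: int,
            area_fraction: float = 0.8,
            rng: Optional[random.Random] = None) -> Tuple[Grid, Grid]:
    """Apply one random crop to both the image and its soft mask."""
    box = random_crop_box(len(image), len(image[0]), area_fraction, rng)
    return (crop_and_resize(image, box, out_h, out_w),
            crop_and_resize(mask, box, out_h, out_w))
```

Because the window position is random, an object that VideoPCA found near the center ends up at varying offsets in the rescaled training image, which is the effect the augmentation is meant to produce.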

Implementation pipeline
Now that we have presented in technical detail all major components of our system, we concisely present the actual steps taken in our experiments, in sequential order, and show how they relate to our general Algorithm 1 for unsupervised learning to detect foreground objects in images.

1. Run VideoPCA on input images from the VID and YouTube Objects datasets (Algorithm 1, Iteration 1, Step 1).
2. Select VideoPCA masks using the first generation selection procedure (Algorithm 1, Iteration 1, Step 2).
3. Train the first generation student ConvNets on the selected masks, namely LowRes-Net, FConv-Net, BasicU-Net, DilateU-Net and DenseU-Net (Algorithm 1, Iteration 1, Step 3).
4. Create the first generation student ensemble Multi-Net by multiplying the outputs of all students, and train EvalSeg-Net to predict the similarity between a particular mask and the mask of Multi-Net. Create the second ensemble MultiSelect-Net by using EvalSeg-Net in combination with the students' masks (Algorithm 1, Iteration 1, Step 4).
5. Add new data from YouTube Bounding Boxes (Algorithm 1, Iteration 1, Step 5).
6. Return to Step 1, the teacher pathway: predict multiple soft-masks per input image on the enlarged unlabeled video set, using the student nets from Iteration 1 (Module B, Iteration 2), which will then be selected with EvalSeg-Net at Module C (Algorithm 1, Iteration 2, Step 1).
7. Select only sufficiently good masks, as evaluated with EvalSeg-Net (Algorithm 1, Iteration 2, Step 2).
8. Train the second generation students on the newly selected masks. We use the same architectures as in Iteration 1 (Algorithm 1, Iteration 2, Step 3).
9. Create the second generation student ensembles Multi-Net and MultiSelect-Net (Algorithm 1, Iteration 2, Step 4).

The method presented in the introduction (Algorithm 1) is a general algorithm for unsupervised learning from video to detect objects in single images. It presents a sequence of high level steps followed by different modules for an unsupervised learning system. The modules are complementary to each other and function in tandem, each focusing on a specific aspect of the unsupervised learning process. Thus, we have a module for generating data, where soft-masks are produced. There is a module that selects good quality masks. Then, we have a module for training the next generation classifiers.
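The interplay of the three modules over several generations can be summarized in a schematic loop. This is only a sketch of the control flow: `generate_masks`, `select_masks` and `train_students` are hypothetical callables standing in for module B, module C and student training, and the convention that an empty student list means "use the initial discoverer" is an assumption of the example.

```python
from typing import Callable, List, Sequence


def run_generations(
    data_per_generation: Sequence[Sequence],
    generate_masks: Callable[[List, List], List],
    select_masks: Callable[[List, List], List],
    train_students: Callable[[List], List],
) -> List:
    """Run one teacher/student cycle per entry of `data_per_generation`."""
    students: List = []   # empty at generation 1: the teacher is the initial discoverer
    data: List = []
    for new_data in data_per_generation:
        data = data + list(new_data)                    # enlarge the unlabeled set
        candidates = generate_masks(students, data)     # module B: produce soft masks
        selected = select_masks(students, candidates)   # module C: keep good masks
        students = train_students(selected)             # train the next generation
    return students
```

In the pipeline above, the first pass would correspond to steps 1 to 5 and the second pass to steps 6 to 9.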
While our concept is first presented in high level terms, we also present a specific implementation that represents the first two iterations of the algorithm. While our implementation is costly during training, in terms of storage and computation time, at test time it is very fast: 0.02 sec per student net and 0.15 sec per student ensemble.
Computation and storage costs. During training, the computation time for passing through the teacher pathway during the first iteration of Algorithm 1 is about 2-3 days: it requires processing data from the VID and YTO datasets, including running the VideoPCA module. Afterwards, training the first iteration students, with access to 6 GPUs, takes about 5 days; 6 GPUs are needed for training the 5 different student architectures, since training FConv-Net requires two GPUs in parallel. Next, training EvalSeg-Net requires 4 additional days on one GPU. At the second iteration, processing the data through the teacher pathway takes about 3 weeks on 6 GPUs in parallel; it is more costly due to the larger training set, from which only a small fraction (about 10 percent) is selected with EvalSeg-Net. Finally, training the second generation students takes 2 additional weeks. In conclusion, the total computation time required for training, with full access to 6 GPUs, is about 7 weeks when everything is optimized. The total storage cost is about 4TB. At test time the student nets are fast, taking 0.02 sec per image, while the ensemble nets take around 0.15 sec per image.
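The quoted total can be checked by adding the stage durations, assuming the stages run one after another and taking 2.5 days as the midpoint of the 2-3 day estimate:

```latex
\underbrace{2.5}_{\text{teacher, it.\ 1}}
+ \underbrace{5}_{\text{students, it.\ 1}}
+ \underbrace{4}_{\text{EvalSeg-Net}}
= 11.5 \text{ days} \approx 1.6 \text{ weeks},
\qquad
1.6
+ \underbrace{3}_{\text{teacher, it.\ 2}}
+ \underbrace{2}_{\text{students, it.\ 2}}
\approx 6.6 \text{ weeks} \approx 7 \text{ weeks}
```

Roughly three quarters of the training time is therefore spent in the second iteration, where the unlabeled set is much larger.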

Experimental analysis
In the first set of experiments we evaluate the impact of the different components of our system. We experimentally verify that at each iteration the students perform better than their teachers. Then we test the ability of the system to improve from one generation to the next. We also test the effects of data selection and increasing training data size. Then, we compare the performances of each individual network and their combined ensembles.
In Section 5.2, we compare our algorithm to state of the art methods on object discovery in videos and images. We perform tests on three datasets: YouTube Objects (Prest et al (2012)), Object Discovery in Internet images (Rubinstein et al (2013)) and Pascal-S. In Section 5.3 we verify that our unsupervised deep features are also useful in different transfer learning tasks.
Datasets. Unsupervised learning requires large quantities of unlabeled video data. For training data we chose videos from three large datasets: the ImageNet VID dataset (Russakovsky et al (2015)), YouTube Objects (Prest et al (2012)) and YouTube Bounding Boxes (Real et al (2017)). VID is one of the largest video datasets publicly available, being fully annotated with ground truth bounding boxes. The dataset consists of about 4000 videos, with a total of about 1.2M frames. The videos contain objects that belong to 30 different classes. Each frame could have zero, one or multiple objects annotated. The benchmark challenge associated with this dataset focuses on the supervised object detection and recognition problem, which is different from the one that we tackle here. Our system is not trained to identify different object categories, so we do not report results compared to the state of the art on object class recognition and detection on this dataset.
YouTube Objects (YTO) is a challenging video dataset with objects undergoing significant changes in appearance, scale and shape, going in and out of occlusion against a varying, often cluttered background. YTO is at its second version now and consists of about 2500 videos, having a total of about 700K frames. It is specifically created for unsupervised object discovery, so we perform comparisons to state of the art on this dataset.
For unsupervised training of our system we used approximately 190k frames from videos chosen from each dataset (120k from VID and 70k from YTO) at learning iteration 1; these are the frames that survived the data selection module. At the second learning iteration, besides improving the classifier, it is important to have access to larger quantities of new unlabeled data. Therefore, for training the second generation of classifiers we added to the unlabeled training set an additional 1 million soft-masks, as follows: 600k frames from VID and 400k from the YouTube Bounding Boxes dataset (again, those frames that survived filtering with the EvalSeg-Net data selection module). Before data selection, videos were randomly chosen from each set, VID or YouTube Bounding Boxes, until the total of 1M was reached. We did not add more frames due to heavy computation and storage limitations.
Evaluation metrics. We use different kinds of metrics in our experiments, which depend on the specific task that requires either bounding box fitting or fine segmentation:
- CorLoc: for evaluating the detection of bounding boxes the most commonly used metric is CorLoc. It is defined as the percentage of images correctly localized according to the PASCAL criterion: $\frac{|B_P \cap B_{GT}|}{|B_P \cup B_{GT}|} \geq 0.5$, where $B_P$ is the predicted bounding box and $B_{GT}$ is the ground truth bounding box.
- F-β: $F_\beta = \frac{(1+\beta^2)\,\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$, for evaluating the segmentation score on the Pascal-S dataset. We use the official evaluation code when reporting results. As in all previous works, we set $\beta^2 = 0.3$.
- P-J metric: P refers to the precision per pixel, while J is the Jaccard similarity (the intersection over union between the output mask and the ground truth segmentation). We use this metric only on Object Discovery in Internet images. For computing the reported results we use the official evaluation code.
- MAE: Mean Absolute Error is defined as the average pixel-wise absolute difference between the predicted mask and the ground truth. Unlike the other metrics, for this metric a lower value is better.
- Mean IoU: the score is defined as $\frac{|G \cap Y|}{|G \cup Y|}$, where $G$ represents the ground truth and $Y$ the predicted mask.
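The metrics listed above can be written out in a few lines. This sketch is for illustration only and is not the official evaluation code used for the reported numbers; the box format `(x_min, y_min, x_max, y_max)`, the flattened binary masks and all function names are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def corloc(predicted: Sequence[Box], ground_truth: Sequence[Box]) -> float:
    """Percentage of images whose predicted box meets the PASCAL criterion (IoU >= 0.5)."""
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(predicted)


def f_beta(precision: float, recall: float, beta_sq: float = 0.3) -> float:
    """F-measure with beta^2 = 0.3, as used on Pascal-S."""
    denom = beta_sq * precision + recall
    return (1.0 + beta_sq) * precision * recall / denom if denom > 0 else 0.0


def mae(predicted: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Mean absolute pixel-wise difference between two flattened masks."""
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)


def mask_iou(predicted: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    union = sum(1 for p, g in zip(predicted, ground_truth) if p or g)
    return inter / union if union > 0 else 0.0
```

With $\beta^2 = 0.3$ the F-measure weights precision more heavily than recall, so a mask that covers the object only partially but cleanly scores better than one that spills into the background.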

Evaluation of different system components
Student vs. teacher. In Figure 8 we present qualitative results on the VID dataset as compared to VideoPCA. We can see that the masks produced by VideoPCA are of lower quality, often having holes, non-smooth boundaries and strange shapes. In contrast, the students learn more general shape and appearance characteristics of objects in images, reminiscent of the grouping principles governing the basis of visual perception as studied by the Gestalt psychologists (Rock and Palmer (1990)) and the more recent work on the concept of "objectness" (Alexe et al (2010)). The object masks produced by the students are simpler, with very few holes, have nicer and smoother shapes and capture well the foreground-background contrast and organization. Another interesting observation is that the students are able to detect multiple objects, a feature that is less commonly achieved by the teacher.
Table 1 Results on the YouTube Objects v1 (Prest et al (2012)) dataset (CorLoc metric) at both iterations (generations). We present the average CorLoc over all 10 classes of the YTO dataset for each model and ensemble, as well as the average of all single models and the average of the ensembles. As can be seen, at the second generation there is a clear increase in performance for all models. Also note that at the second generation a single model is able to outperform all the methods (single or ensemble) from the first generation.
Results on the Pascal-S dataset (F-β metric), for all of our methods at the first and second generations, as well as their average performance. Note that in this case, since we evaluate actual segmentations and not bounding box fitting, nets with higher resolution output perform better (DenseU-Net, BasicU-Net and DilateU-Net). Again, ensembles outperform single models and the second iteration brings a clear gain in every case.
In Figure 5 we see comparative results between the average of individual models, the ensembles formed and the teacher. Note that the teacher at the next generation reported is the MultiSelect-Net ensemble from the first. We observe that the students at both iterations outperform their respective teachers, which is an interesting and positive outcome. It suggests that we can repeat the process over several iterations and continue to improve. It is also encouraging that the individual nets, which see a single image, are able to generalize and detect objects that are discovered by the teacher in sequences of images.
First vs. next generation. As seen in Tables 1, 2, 3 and Figure 7, at the second generation we obtain a clear gain over the first, on all experiments and datasets. This result supports the value of our proposed algorithm, which starts from a completely unsupervised object discoverer in video (VideoPCA) and is able to train neural nets for foreground object segmentation, while improving their accuracy over two generations. It uses the students from iteration 1 as teachers at iteration 2. At the second iteration, it also uses more unlabeled training data and it is better at automatically filtering out poor quality segmentations.
Impact of data selection. Data selection is important as seen in Figure 6. The more selective we are when we accept or reject soft-masks used for training, the better the end result. Also note that being more selective means decreasing the training set. There is a trade-off between selectivity and training data size.
Neural architecture vs. data. As seen in Tables 1, 2 and 3, different network architectures yield different results, while ensembles always outperform individual models. While the actual CNN architecture has a certain role in performance, another equally important aspect is that of data size. The more data we have, the more selective we can afford to be and the better we can generalize. It is important to increase the data from one generation to the next in order to avoid simply imitating the ensemble of the previous generation. In Tables 4 and 5 we show additional tests with our baseline architecture, LowRes-Net, when trained with training sets of different sizes. Adding new unlabeled data clearly has a positive effect on performance.
Fig. 5 Comparison between the teacher, individual student nets and the ensembles, across two generations (blue line: first iteration; red line: second iteration). Individual students (for which we report average values) outperform the teacher on both iterations, while the ensembles are even stronger than the individual nets. For the second iteration teacher we report the MultiSelect-Net version of the ensemble (since we consider this to be an upper bound). The plots are computed over results on the YouTube Objects dataset using the CorLoc metric (percentage).
Fig. 6 Impact of data selection for both iterations. Data selection (module C) strongly affects the results at each iteration. Note that results from iteration 2 with no selection are slightly better than the ones from iteration 1 with selection. This happens because the unlabeled training data is increased and the second generation (iteration 2) teacher pathway is superior, providing better quality masks for training. The results represent the average over 10 classes on YouTube Objects using the CorLoc percentage metric.
The idea of increasing the data in stages is also related to approaches in curriculum learning (Bengio et al (2009)), where we first learn from easy cases and then move to the more complex ones.
Analysis of different ConvNets. Our experiments show that different architectures are better at different tasks. LowRes-Net, for example, performs well on the task of box fitting, since that does not require a fine, sharp object mask. On the other hand, when evaluating the exact segmentation, nets with higher resolution output, which are more specialized for this task, perform better. Overall, at the second generation, on box fitting the best single net on average is DilateU-Net and the top ensemble is MultiSelect-Net. However, when it comes to evaluating the actual segmentation, the winner is DenseU-Net for single models and Multi-Net for ensembles. In our qualitative results we find that DenseU-Net produces masks with fewer "holes" than DilateU-Net after thresholding and is thus better suited for segmentation evaluation. When evaluating the bounding box, these holes do not affect the box and the best model is DilateU-Net. Also, DenseU-Net tends to output a mask with higher confidence on the whole object, as opposed to BasicU-Net and DilateU-Net, which output masks with lower confidence around some regions of the object (such as the eyes or wheels). This could be another reason why DenseU-Net produces better segmentations. The model that struggles most during the first iteration is FConv-Net, with significant improvement at the second iteration, when the unsupervised training masks are closer to the correct ones. Also note that the baseline LowRes-Net is a top model on box fitting at the first iteration. The quantitative differences between architectures are shown in Tables 1, 2 and 3, while the qualitative differences can be seen in Figure 7.
Table 5 Influence of adding more unlabeled data on the Object Discovery in Internet images dataset (P, J metric). The performance increases by about 1%.
Fig. 7 Visual comparison between models at each iteration (generation). We marked with a purple dot the output of our MultiSelect-Net (the top student soft mask selected with EvalSeg-Net). The Multi-Net represents the pixel-wise multiplication between the five models. Note the superior masks at the second generation, with better shapes, fewer holes and sharper edges.

Comparisons with state of the art
Object discovery in video. We first performed comparisons with methods specifically designed for object discovery in video. For that, we chose the YouTube Objects dataset and compared our method to the best methods on this dataset in the literature (Table 6). Evaluations are conducted on both versions of the YouTube Objects dataset, YTOv1 (Prest et al (2012)) and YTOv2.2 (Kalogeiton et al (2016)). On YTOv1 we follow the same experimental setup as (Jun Koh et al (2016); Prest et al (2012)), by running experiments only on the training videos. We have not included in Table 6 the results reported by Stretcu and Leordeanu (2015) because they use a different setup, testing on all videos from YTOv1. It is important to stress, again, that while the methods presented here for comparison have access to whole video shots, ours only needs a single image at test time. Despite this limitation, our method outperforms the others on 7 out of 10 classes and has the best overall average performance. Note that even our baseline LowRes-Net at the first iteration achieves top performance. The feed-forward CNN processes each image in 0.02 sec, being at least one to two orders of magnitude faster than all other methods (see Table 6). We also mention that in all our comparisons, while our system is faster at test time, it takes much longer during its unsupervised training phase and requires large quantities of unlabeled training data.
Object discovery in images. We compare our system against other methods that perform object discovery in images. We use two different datasets for this comparison: Object Discovery in Internet Images and Pascal-S. We report results using metrics that are commonly used for these tasks, as presented at the beginning of the experimental section.
Object Discovery in Internet Images is a representative benchmark for foreground object detection in single images. This set contains internet images and it is annotated with high detail segmentation masks. In order to enable comparison with previous methods, we use the 100 images subsets provided for each of the three categories: airplane, car and horse. The methods evaluated on this dataset in the literature, aim to either discover the bounding box of the main object in a given image or its fine segmentation mask. We evaluate our system on both. Note that different from other works, we do not need a collection of images during test time, since each image can be processed independently by our system. Therefore, unlike other methods, our performance is not affected by the structure of the image collection or the number of classes of interest being present in the collection.
In Table 7 we present the performance of our method as compared to other unsupervised object discovery methods in terms of CorLoc on the Object Discovery dataset. We compare our predicted box against the tight box fitted around the ground-truth segmentation, as done in Cho et al (2015) and Tang et al (2014). Our system can be considered in the mixed class category: it does not depend on the structure of the image collection. It treats each image independently. The performance of the other algorithms degrades as the number of main categories increases in the collection (some are not even tested by their authors on the mixed-class case), which is not the case with our approach.
Fig. 8 Qualitative results on the VID dataset (Russakovsky et al (2015)). For each iteration we show results of the best individual and ensemble models, in terms of the CorLoc metric. Note the superior quality of our models compared to VideoPCA (iteration 1 teacher). We also present the ground truth bounding boxes. For more qualitative results please visit our project page https://sites.google.com/view/unsupervisedlearningfromvideo
Table 6 Results on YouTube Objects v1 (Prest et al (2012)) and v2.2 (Kalogeiton et al (2016)). We achieve state of the art results on both versions. Please note that the baseline LowRes-Net already achieves top results on v1, while being close to the best on v2.2. We present results of the top individual and ensemble models and also keep the baseline LowRes-Net at both iterations, for reference. Note that complete results on this dataset v1 for all models are also presented in Table 1.
We obtain state of the art results on all classes, improving by a significant margin over the method of Cho et al (2015). When the method in Cho et al (2015) is allowed to see a collection of images that are limited to a single majority class, its performance improves and it is equal with ours on one class. However, our method has no other information necessary besides the input image, at test time.
We also tested our method on the task of fine foreground object segmentation and compared it to the best performers in the literature on the Object Discovery dataset in Table 8. For refining our soft masks we apply the GrabCut method, as it is available in OpenCV. We evaluate based on the same P, J evaluation metric as described by Rubinstein et al (2013): the higher P and J, the better. In Figures 9 and 10 we present some qualitative results for each class. As mentioned previously, these experiments on Object Discovery in Internet Images are the only ones on which we apply GrabCut as a post-processing step, as also used by all competing methods presented in Table 8.
Fig. 9 Qualitative results on the Object Discovery dataset as compared to (B) Rubinstein et al (2013). For both iterations we present results of the top single and ensemble models (C-F), without using GrabCut. We also present results when GrabCut is used with the top ensemble (G). Note that our models are able to segment objects from classes that were not present in the training set (examples on the right side). Also, note that the initial VideoPCA teacher cannot be applied on single images.
Table 7 Results on the Object Discovery in Internet images (Rubinstein et al (2013)) dataset (CorLoc metric). The results obtained in the first iteration are further improved in the second one. We present the best single and ensemble models, along with the baseline LowRes-Net at both iterations. Among the single models DilateU-Net is often the best when evaluating box fitting.
Table 8 Results on the Object Discovery in Internet images (Rubinstein et al (2013)) dataset using the P, J metric on segmentation evaluation. We present results of the top single and ensemble models, along with LowRes-Net at both iterations. On the task of fine object segmentation the best individual model tends to be DenseU-Net, as also mentioned in the text. Note that we applied GrabCut on these experiments only as a post-processing step, since all methods reported in this table also used it.
Another important dataset used for the evaluation of a related task, that of salient object detection, is the Pascal-S dataset, consisting of 850 images. As seen from Table 9, we achieve top results on all three metrics against methods that do not use any supervised pre-trained features. Being a foreground object detection method, our approach is usually biased towards the main object in the image, even though it can also detect multiple ones. Images in Pascal-S usually have more objects, so we consider our results very encouraging, being close to approaches that use features pre-trained in a supervised manner. Also note that we did not use GrabCut for these experiments.
Fig. 10 Qualitative results on the Object Discovery in Internet Images (Rubinstein et al (2013)) dataset. For each example we show the input RGB image and, immediately below, our segmentation result, with GrabCut post-processing for obtaining a hard segmentation. Note that our method produces good quality segmentation results, even in images with cluttered background.
For the single image experiments, our system was trained, as discussed before, on other video datasets (VID, YTO and YouTube Bounding Boxes). It has not previously seen any of the images in the Pascal-S or Object Discovery datasets during training.

Transfer learning experiments
While the focus of the paper is foreground object detection in the unsupervised learning setup, we also want to verify the usefulness of our approach on transfer learning experiments. We design experiments to test two aspects of our system -the actual unsupervised features learned and the final output foreground mask. We perform tests on YouTube Objects v1 dataset, in a relatively standard supervised classification setup, by learning to classify individual video frames with the class given by their parent video shot -for a total of ten classes.
We use the frames from the YTO training videos for training and the ones from the YTO test videos for testing. We test on a frame by frame basis and report the average multiclass classification percentage, that is, how often the correct class is chosen out of ten classes. This problem is difficult for several reasons: 1) the training and testing frames come from different videos in YTO, which vary significantly in appearance and background scene; 2) the object of interest is not present in every frame, which makes the classification rely heavily on the contextual scene; 3) there are multiple objects in many frames, against a cluttered background, while the object of interest goes through different changes in scale, viewpoint and pose. We have two experimental setups for this task, one focused on the pre-trained features and the other on the foreground masks. In the first setup, we replace the last fully connected layer of our baseline model LowRes-Net with a classification part and freeze the network up to a given depth, using as pre-trained features the ones from the unsupervised learning task. Then, we fine-tune the end part on the given supervised classification task. In the second experimental setup we extract features from a VGG network (Simonyan and Zisserman (2014)) pre-trained on ImageNet, from different subwindows of the image, one being the bounding box given by the unsupervised LowRes-Net. Both tests, which are presented next in more detail, show that our approach is useful on transfer learning tasks.
Using the unsupervised features. In this experimental setup, we replace the last fully connected layer with a classification part, composed of a reduction convolutional layer with four filters and a final fully connected layer with 10 neurons. We test various cases by freezing different parts of the LowRes network and fine-tuning the rest on the supervised classification task. The results are presented in Figure 11.
They strongly suggest that the features learned in an unsupervised way from the middle of the network are best suited for semantic classification, and they demonstrate the usefulness of the unsupervised features on the supervised classification task. In all cases when these features are used the results are improved ("concat", "conv2 2", "init pre-trained") except for one case, "conv3 3". This happens because the pre-trained features used in this case are from the top level, where the final segmentation is produced. At that level the semantic information is already lost. In contrast, when features are frozen at the middle of the network, the best results are obtained.
Using the detected foreground bounding box. In these experiments we extract 'fc7' VGG19 features, pre-trained on ImageNet, by passing through VGG19 different subwindows of the image rescaled appropriately, namely the whole image, the center box with height and width equal to half the original image size, and the window cropped according to the bounding box produced by LowRes-Net. We concatenate the features taken from these windows in different combinations and pass them through a last fully connected layer with 10 neurons, which we train on the given classification task. We then test the different combinations as shown in Table 10. When using features extracted from the bounding box fitted with LowRes-Net (alone or in combination with the whole image), we obtain significantly better results compared to the case when windows are extracted from fixed locations only (middle box, whole image or in combination).
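The three subwindows can be illustrated with a short sketch. It covers only the window geometry and leaves out the rescaling and the VGG19 pass. The text does not specify how the bounding box is fitted to the soft-mask, so thresholding at 0.5 and taking the tightest enclosing box is an assumption made here, and all function names are hypothetical.

```python
def whole_image_box(height, width):
    """Box covering the full image as (top, left, bottom, right), bottom/right exclusive."""
    return (0, 0, height, width)


def center_box(height, width):
    """Centered window whose height and width are half of the image size."""
    box_h, box_w = height // 2, width // 2
    top, left = (height - box_h) // 2, (width - box_w) // 2
    return (top, left, top + box_h, left + box_w)


def mask_bounding_box(soft_mask, threshold=0.5):
    """Tightest box around pixels with soft-mask value >= threshold.

    The threshold value is an assumption, not taken from the paper.
    """
    height, width = len(soft_mask), len(soft_mask[0])
    rows = [r for r in range(height) if any(v >= threshold for v in soft_mask[r])]
    cols = [c for c in range(width) if any(row[c] >= threshold for row in soft_mask)]
    if not rows:
        # No pixel reaches the threshold: fall back to the whole image.
        return whole_image_box(height, width)
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


def crop(image, box):
    """Cut the window out of a row-major image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x6 soft-mask standing in for a LowRes-Net prediction (made-up values).
soft_mask = [
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.7, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.0, 0.0],
]

windows = {
    "whole": whole_image_box(4, 6),
    "center": center_box(4, 6),
    "mask_box": mask_bounding_box(soft_mask),
}
foreground_patch = crop(soft_mask, windows["mask_box"])
```

In the setup described above, each window would then be rescaled to the network input size, passed through VGG19, and the resulting 'fc7' features concatenated before the final 10-neuron layer.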

Fig. 11
Transfer learning with pre-trained unsupervised features: the "random freeze" case is when the network is randomly initialized then frozen and the classification part is trained; "random init" is when the whole network and the classification part are randomly initialized and then trained jointly, end-to-end; "concat", "conv2 2" and "conv3 3" refer to cases where the network trained in the unsupervised way is frozen up to and including that specified layer, and the rest, including the classification part, is then trained; the "init pre-trained" case is when the network is initialized with the pre-trained features from LowRes-Net then everything fine-tuned on the classification task. The results indicate that the optimal case is when unsupervised pre-trained features from the middle part are used, which are more likely to be relevant for different semantic classes than the last deep features that produce the final segmentation.

Table 10
Classification experiments using the foreground mask. Different sub-windows of the image are passed through VGG19 and features are extracted for the given classification task. Note that a significant boost is obtained when features are also extracted from the bounding box fitting based on the soft-mask predicted by LowRes-Net.
These results verify that the foreground segmentation mask detected with our models is, as expected, directly related to the main video class and constitutes a valuable source of information in image classification tasks. Overall, the classification experiments presented in this section indicate that the features learned in an unsupervised manner with our algorithm contain relevant semantic information about object classes and could be useful for related supervised learning tasks.

Short discussion on unsupervised learning
The ultimate goal of unsupervised learning might not be about matching the performance of the supervised case but rather about reaching beyond the capabilities of the classical supervised scenario. An unsupervised system should be able to learn and recognize different object classes, such as animals, plants and man-made objects, as they evolve and change over time, from the past and into the unknown future. It should also be able to learn about new classes that might be formed in relation to other, possibly known, ones. We see this case as fundamentally different from the supervised one, in which the classifier is forced to learn from a distribution of samples that is fixed and limited to a specific period of time, namely the period when the human labeling was performed.
Therefore, in the supervised learning paradigm a car from the future should not be classified as a car, because it is not a car according to the supervised distribution of cars given at present training time, when human annotations are collected. On the other hand, a system that learns by itself should be able to track how cars have been changing in time and recognize such objects as "cars", with no step by step human intervention.
From a temporal perspective, unsupervised learning is about continuous learning and adaptation to huge quantities of data that are perpetually changing. Human annotation is extremely limited in an ocean of data and cannot provide the so-called "ground truth" information continuously. Therefore, we expect that unsupervised learning will soon become a core part, larger than the supervised one, in the future of artificial intelligence.

Conclusions and future work
In this article, we present a novel and effective approach to learning from video, in an unsupervised fashion, to detect foreground objects in single images. We present a relatively general algorithm for this task, which offers the possibility of learning over several generations of students and teachers. We demonstrate in practice that the system improves its performance over the course of two generations. We also test the impact of the different system components on performance and show state of the art results on three different datasets. To the best of our knowledge, it is the first system that learns to detect and segment foreground objects in images in an unsupervised fashion, with no pre-trained features given or manual labeling, while requiring only a single image at test time.
The convolutional networks trained along the student pathway are able to learn general "objectness" characteristics, which include good form, closure, smooth contours, as well as contrast with the background. What the simpler initial VideoPCA teacher discovers over time, the deep, complex student is able to learn across several layers of image features at different levels of abstraction. Our results on transfer learning experiments are also encouraging and show additional cases in which such a system could be useful. In future work we plan to further grow our computational and storage capabilities to demonstrate the power of our unsupervised learning algorithm along many generations of student and teacher networks. We believe that our approach, tested here in extensive experiments, will bring a valuable contribution to computer vision research.