Weakly Supervised Learning of Object Segmentations from Web-Scale Video

  • Glenn Hartmann
  • Matthias Grundmann
  • Judy Hoffman
  • David Tsai
  • Vivek Kwatra
  • Omid Madani
  • Sudheendra Vijayanarasimhan
  • Irfan Essa
  • James Rehg
  • Rahul Sukthankar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7583)


We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatio-temporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluating against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
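
The weak-supervision idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features or the classifier, so the sketch assumes synthetic 5-dimensional segment features and a plain logistic regression, and all names (`weak_labels`, `train_logistic`, the toy tags) are hypothetical. Every segment inherits the tag of its video, so background segments in "dog" videos are labeled positive by mistake; the classifier can still rank true object segments higher because only they differ systematically from segments in untagged videos.

```python
import numpy as np

def weak_labels(video_tags, segment_video_ids, target):
    """Each segment inherits its video's tag: 1.0 if the video carries `target`."""
    return np.array([1.0 if target in video_tags[v] else 0.0
                     for v in segment_video_ids])

def train_logistic(X, y, lr=0.5, epochs=300):
    """Plain batch gradient descent for logistic regression (assumed classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic data (assumption): 20 videos tagged "dog", 20 tagged "other".
rng = np.random.default_rng(0)
dim, segs_per_video, objs_per_video = 5, 10, 3
video_tags = {v: ({"dog"} if v < 20 else {"other"}) for v in range(40)}

features, video_ids, is_object = [], [], []
for v in range(40):
    for s in range(segs_per_video):
        # Only a few segments in each tagged video actually show the object.
        obj = ("dog" in video_tags[v]) and s < objs_per_video
        x = rng.normal(size=dim)
        if obj:
            x[0] += 3.0  # object segments differ in one feature dimension
        features.append(x)
        video_ids.append(v)
        is_object.append(obj)

X = np.array(features)
is_object = np.array(is_object)
y = weak_labels(video_tags, video_ids, "dog")  # noisy segment-level labels

w, b = train_logistic(X, y)
scores = X @ w + b

# Within tagged videos, compare true object segments with mislabeled background.
in_tagged = y == 1.0
obj_mean = scores[in_tagged & is_object].mean()
bg_mean = scores[in_tagged & ~is_object].mean()
print("mean score, object segments:    ", obj_mean)
print("mean score, background segments:", bg_mean)
```

In a pipeline of the kind the abstract describes, the highest-scoring segments would serve as object seeds for a subsequent graph-cut refinement; that step is omitted here.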


Object Segmentation, Visual Concept, Video Segmentation, Multiple Instance Learning, Video Stabilization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Glenn Hartmann (1)
  • Matthias Grundmann (2)
  • Judy Hoffman (3)
  • David Tsai (2)
  • Vivek Kwatra (1)
  • Omid Madani (1)
  • Sudheendra Vijayanarasimhan (1)
  • Irfan Essa (2)
  • James Rehg (2)
  • Rahul Sukthankar (1)

  1. Google Research, USA
  2. Georgia Institute of Technology, USA
  3. University of California, Berkeley, USA
