Advertisement

Weakly-Supervised Semantic Segmentation Using Motion Cues

  • Pavel Tokmakov
  • Karteek Alahari
  • Cordelia Schmid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9908)

Abstract

Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues and is learned from video-level weak annotations. Our learning scheme to train the network uses motion segments as soft constraints, thereby handling noisy motion information. When trained on weakly-annotated videos, our method outperforms the state-of-the-art approach [1] on the PASCAL VOC 2012 image segmentation benchmark. We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images. Finally, M-CNN substantially outperforms recent approaches in a related task of video co-localization on the YouTube-Objects dataset.

Keywords

Motion Segment Convolutional Neural Network Soft Constraint Video Shot Motion Segmentation 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

This work was supported in part by the ERC advanced grant ALLEGRO, the MSR-Inria joint project, a Google research award and a Facebook gift. We gratefully acknowledge the support of NVIDIA with the donation of GPUs used for this research.

References

  1. 1.
    Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L.: Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. In: ICCV (2015)Google Scholar
  2. 2.
    Vezhnevets, A., Ferrari, V., Buhmann, J.: Weakly supervised structured output learning for semantic segmentation. In: CVPR (2012)Google Scholar
  3. 3.
    Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: CVPR (2015)Google Scholar
  4. 4.
    Hartmann, G., Grundmann, M., Hoffman, J., Tsai, D., Kwatra, V., Madani, O., Vijayanarasimhan, S., Essa, I., Rehg, J., Sukthankar, R.: Weakly supervised learning of object segmentations from web-scale video. In: ECCV (2012)Google Scholar
  5. 5.
    Monroy, A., Ommer, B.: Beyond bounding-boxes: learning object shape by model-driven grouping. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 580–593. Springer, Heidelberg (2012). doi: 10.1007/978-3-642-33712-3_42 Google Scholar
  6. 6.
    Wu, J., Zhao, Y., Zhu, J., Luo, S., Tu, Z.: MILCut: a sweeping line multiple instance learning paradigm for interactive image segmentation. In: CVPR (2014)Google Scholar
  7. 7.
    Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 282–295. Springer, Heidelberg (2010). doi: 10.1007/978-3-642-15555-0_21 CrossRefGoogle Scholar
  8. 8.
    Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: ICLR (2015)Google Scholar
  9. 9.
    Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. PAMI 35(8), 1915–1929 (2013)CrossRefGoogle Scholar
  10. 10.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  11. 11.
    Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: ICCV (2015)Google Scholar
  12. 12.
    Pathak, D., Shelhamer, E., Long, J., Darrell, T.: Fully convolutional multi-class multiple instance learning. In: ICLR (2015)Google Scholar
  13. 13.
    Pathak, D., Krähenbühl, P., Darrell, T.: Constrained convolutional neural networks for weakly supervised segmentation. In: ICCV (2015)Google Scholar
  14. 14.
    Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: ICCV (2013)Google Scholar
  15. 15.
    Prest, A., Leistner, C., Civera, J., Schmid, C., Ferrari, V.: Learning object class detectors from weakly annotated video. In: CVPR (2012)Google Scholar
  16. 16.
  17. 17.
    Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
  18. 18.
    Kwak, S., Cho, M., Laptev, I., Ponce, J., Schmid, C.: Unsupervised object discovery and tracking in video collections. In: ICCV (2015)Google Scholar
  19. 19.
    Carreira, J., Sminchisescu, C.: CPMC: automatic object segmentation using constrained parametric min-cuts. PAMI 34(7), 1312–1328 (2012)CrossRefGoogle Scholar
  20. 20.
    Carreira, J., Caseiro, R., Batista, J., Sminchisescu, C.: Semantic segmentation with second-order pooling. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7578, pp. 430–443. Springer, Heidelberg (2012). doi: 10.1007/978-3-642-33786-4_32 CrossRefGoogle Scholar
  21. 21.
    Lin, G., Shen, C., van dan Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR (2016)Google Scholar
  22. 22.
    LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)CrossRefGoogle Scholar
  23. 23.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  24. 24.
    Cinbis, R.G., Verbeek, J., Schmid, C.: Multi-fold MIL training for weakly supervised object localization. In: CVPR (2014)Google Scholar
  25. 25.
    Russakovsky, O., Lin, Y., Yu, K., Fei-Fei, L.: Object-centric spatial pooling for image classification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7573, pp. 1–15. Springer, Heidelberg (2012)Google Scholar
  26. 26.
    Chen, X., Shrivastava, A., Gupta, A.: NEIL: Extracting visual knowledge from web data. In: CVPR (2013)Google Scholar
  27. 27.
    Divvala, S.K., Farhadi, A., Guestrin, C.: Learning everything about anything: Webly-supervised visual concept learning. In: CVPR (2014)Google Scholar
  28. 28.
    Chen, X., Gupta, A.: Webly supervised learning of convolutional networks. In: ICCV (2015)Google Scholar
  29. 29.
    Liang, X., Liu, S., Wei, Y., Liu, L., Lin, L., Yan, S.: Towards computational baby learning: A weakly-supervised approach for object detection. In: ICCV (2015)Google Scholar
  30. 30.
    Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs. In: CVPR (2006)Google Scholar
  31. 31.
    Joulin, A., Tang, K., Fei-Fei, L.: Efficient image and video co-localization with Frank-wolfe algorithm. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 253–268. Springer, Heidelberg (2014). doi: 10.1007/978-3-319-10599-4_17 Google Scholar
  32. 32.
    Tang, K.D., Sukthankar, R., Yagnik, J., Li, F.: Discriminative segment annotation in weakly labeled video. In: CVPR (2013)Google Scholar
  33. 33.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)Google Scholar
  34. 34.
    Rother, C., Kolmogorov, V., Blake, A.: Grabcut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)CrossRefGoogle Scholar
  35. 35.
    Boykov, Y., Jolly., M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: ICCV (2001)Google Scholar
  36. 36.
    Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. PAMI 23(11), 1222–1239 (2001)CrossRefGoogle Scholar
  37. 37.
  38. 38.
    Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)Google Scholar
  39. 39.
    Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: ACM Multimedia (2014)Google Scholar
  40. 40.
  41. 41.
    Mostajabi, M., Yadollahpour, P., Shakhnarovich, G.: Feedforward semantic segmentation with zoom-out features. In: CVPR (2015)Google Scholar
  42. 42.
    Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: Slic superpixels compared to state-of-the-art superpixel methods. PAMI 34(11), 2274–2282 (2012)CrossRefGoogle Scholar

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Pavel Tokmakov
    • 1
  • Karteek Alahari
    • 1
  • Cordelia Schmid
    • 1
  1. 1.InriaGrenobleFrance

Personalised recommendations