Weakly-Supervised Video Scene Co-parsing

  • Guangyu Zhong
  • Yi-Hsuan Tsai
  • Ming-Hsuan Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10111)

Abstract

In this paper, we propose a scene co-parsing framework that assigns pixel-wise semantic labels in weakly-labeled videos, i.e., videos for which only video-level category labels are given. To exploit rich semantic information, we first collect all videos that share the same video-level labels and segment them into supervoxels. We then select representative supervoxels for each category via a supervoxel ranking process. This ranking problem is formulated with a submodular objective function, and a scene-object classifier is incorporated to distinguish scenes from objects. To assign each supervoxel a semantic label, we match it to the selected representatives in the feature domain. Each supervoxel is thereby associated with a set of category potentials and assigned the semantic label with the maximum potential. The proposed co-parsing framework extends scene parsing from single images to videos and exploits mutual information among a video collection. Experimental results on the Wild-8 and SUNY-24 datasets show that the proposed algorithm performs favorably against state-of-the-art approaches.
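
The two core steps of the pipeline, representative selection and label assignment, can be illustrated with a minimal sketch. The abstract does not give the exact objective, features, or potentials, so everything concrete here is an assumption: a facility-location objective stands in for the submodular ranking function, a Gaussian kernel stands in for feature-domain matching, and the scene-object classifier is omitted. The function names `gaussian_similarity`, `select_representatives`, and `assign_labels` are hypothetical.

```python
import numpy as np


def gaussian_similarity(a, b, sigma=1.0):
    """Pairwise Gaussian-kernel similarity between rows of a and rows of b.

    Assumption: the kernel is only a stand-in for whatever feature-domain
    similarity is actually used.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def select_representatives(features, k):
    """Greedily maximize the facility-location objective
    F(S) = sum_i max_{j in S} sim[i, j], which is monotone submodular.

    Returns the chosen indices in selection (ranking) order.
    """
    sim = gaussian_similarity(features, features)
    n = sim.shape[0]
    covered = np.zeros(n)             # best similarity of each item to the chosen set
    chosen = np.zeros(n, dtype=bool)  # marks indices that are already selected
    order = []
    for _ in range(min(k, n)):
        # Marginal gain of adding each candidate j to the current set.
        gains = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        gains[chosen] = -np.inf
        j = int(np.argmax(gains))
        order.append(j)
        chosen[j] = True
        covered = np.maximum(covered, sim[:, j])
    return order


def assign_labels(features, representatives):
    """Give each supervoxel one potential per category and take the argmax.

    Assumption: a category's potential is the maximum similarity to any of
    that category's representatives.
    """
    categories = sorted(representatives)
    potentials = np.stack(
        [gaussian_similarity(features, representatives[c]).max(axis=1)
         for c in categories],
        axis=1,
    )
    labels = [categories[i] for i in potentials.argmax(axis=1)]
    return labels, potentials
```

In this sketch, `select_representatives` would be run once per category on the supervoxel features pooled from all videos sharing that label, and the selected rows would then be passed to `assign_labels` as a dictionary keyed by category name. Greedy selection is used because, for monotone submodular objectives, it carries a well-known (1 - 1/e) approximation guarantee.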

Notes

Acknowledgments

This work is supported in part by NSF CAREER grant #1149783, NSF IIS grant #1152576, and gifts from Adobe and Nvidia. G. Zhong is sponsored by the China Scholarship Council and NSFC grant #61572099.

Supplementary material

Supplementary material 1: 416257_1_En_2_MOESM1_ESM.avi (AVI, 17937 KB)
Supplementary material 2: 416257_1_En_2_MOESM2_ESM.pdf (PDF, 3193 KB)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Guangyu Zhong (1, 2)
  • Yi-Hsuan Tsai (1)
  • Ming-Hsuan Yang (1)

  1. UC Merced, Merced, USA
  2. Dalian University of Technology, Dalian, China