Weakly Supervised Learning of Heterogeneous Concepts in Videos

  • Sohil Shah
  • Kuldeep Kulkarni
  • Arijit Biswas
  • Ankit Gandhi
  • Om Deshmukh
  • Larry S. Davis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9910)

Abstract

Typical textual descriptions that accompany online videos are ‘weak’: i.e., they mention the important heterogeneous concepts in the video but not their corresponding spatio-temporal locations. However, certain location constraints on these concepts can be inferred from the description. The goal of this paper is to present a generalization of the Indian Buffet Process (IBP) that can (a) systematically incorporate heterogeneous concepts in an integrated framework, and (b) enforce location constraints, for efficient classification and localization of the concepts in the videos. Finally, we develop posterior inference for the proposed formulation using mean-field variational approximation. Comparative evaluations on the Casablanca and the A2D datasets show that the proposed approach significantly outperforms other state-of-the-art techniques: 24 % relative improvement for pairwise concept classification in the Casablanca dataset and 9 % relative improvement for localization in the A2D dataset as compared to the most competitive baseline.
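
For background, here is a minimal sketch of the standard single-parameter IBP that the paper generalizes, using the usual finite Beta-Bernoulli construction rather than the authors' own formulation: the IBP places a prior on a binary matrix $Z \in \{0,1\}^{N \times K}$, where rows index data items (loosely, video segments here) and columns index latent concepts, via

$\pi_k \sim \mathrm{Beta}(\alpha/K,\, 1), \qquad z_{nk} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k), \qquad k = 1, \dots, K,$

with the IBP recovered in the limit $K \to \infty$. Mean-field variational inference then approximates the posterior with a fully factorized distribution $q(\pi, Z) = \prod_k q(\pi_k) \prod_{n,k} q(z_{nk})$ chosen to maximize the evidence lower bound; the generalization described in the abstract augments this machinery so that heterogeneous concepts and the location constraints inferred from the description can be enforced during inference.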

Notes

Acknowledgments

The work of L.S. Davis was supported by the US Office of Naval Research under grant N000141612713.

Supplementary material

Supplementary material 1 (mp4 673 KB)

Supplementary material 2 (mp4 2111 KB)

Supplementary material 3 (mp4 2934 KB)

Supplementary material 4 (mp4 2239 KB)

Supplementary material 5 (mp4 1194 KB)

Supplementary material 6 (pdf 1430 KB)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Sohil Shah (1)
  • Kuldeep Kulkarni (2)
  • Arijit Biswas (3)
  • Ankit Gandhi (4)
  • Om Deshmukh (4)
  • Larry S. Davis (1)
  1. University of Maryland, College Park, USA
  2. Arizona State University, Tempe, USA
  3. Amazon Development Center India, Bangalore, India
  4. Xerox Research Centre India, Bangalore, India
