Machine Vision and Applications, Volume 25, Issue 1, pp 33–47

Discovering joint audio–visual codewords for video event detection

  • I-Hong Jhuo
  • Guangnan Ye
  • Shenghua Gao
  • Dong Liu
  • Yu-Gang Jiang
  • D. T. Lee
  • Shih-Fu Chang
Special Issue Paper

Abstract

Detecting complex events in videos is intrinsically a multimodal problem, since both the audio and visual channels provide important clues. While conventional methods fuse the two modalities at a superficial level, in this paper we propose a new representation, called bi-modal words, to explore representative joint audio–visual patterns. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. Partitioning the bipartite graph then produces the bi-modal words, which reveal joint patterns across modalities. Different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations. Since it is difficult to predict a suitable number of bi-modal words, we generate bi-modal words at different levels (i.e., codebooks of different sizes) and use multiple kernel learning to combine the resulting representations during event classifier learning. Experimental results on three popular datasets show that the proposed method achieves statistically significant performance gains over methods using individual visual or audio features alone, as well as over existing popular multimodal fusion methods. We also find that average pooling is particularly suitable for the bi-modal representation, and that using multiple kernel learning to combine multimodal representations at various granularities is helpful.
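The pipeline the abstract describes — a bipartite graph between visual and audio codewords, spectral partitioning to obtain bi-modal words, then re-quantization into a bi-modal Bag-of-Words histogram — can be illustrated with a minimal NumPy sketch in the style of bipartite spectral co-clustering (Dhillon, 2001, Ref. 9). Everything here is an illustrative assumption, not the authors' implementation: the toy co-occurrence matrix, the tiny k-means routine, and the sum-pooling re-quantization (the paper itself compares several pooling strategies and favors average pooling).

```python
import numpy as np

def _kmeans(X, k, iters=50, seed=0):
    # Minimal Lloyd's k-means; a stand-in for any off-the-shelf clusterer.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def bimodal_codewords(C, k):
    """Bipartite spectral co-clustering sketch (after Dhillon, 2001).

    C: (n_visual, n_audio) co-occurrence counts between visual and audio
    words (the edge weights of the bipartite graph). Returns a cluster
    ("bi-modal word") assignment for every visual and every audio word.
    """
    d1 = np.sqrt(C.sum(axis=1))            # visual-word degrees
    d2 = np.sqrt(C.sum(axis=0))            # audio-word degrees
    Cn = C / np.outer(d1, d2)              # D1^{-1/2} C D2^{-1/2}
    U, _, Vt = np.linalg.svd(Cn, full_matrices=False)
    # Embed both vertex sets with the 2nd..(l+1)th singular vectors,
    # then cluster them jointly into k bi-modal words.
    l = max(1, int(np.ceil(np.log2(k))))
    Z = np.vstack([U[:, 1:l + 1] / d1[:, None],
                   Vt.T[:, 1:l + 1] / d2[:, None]])
    labels = _kmeans(Z, k)
    nv = C.shape[0]
    return labels[:nv], labels[nv:]

def bimodal_bow(vis_hist, aud_hist, vis_labels, aud_labels, k):
    # Re-quantization: pool each video's original visual/audio word counts
    # into the bi-modal word they were assigned to (sum pooling here).
    h = np.zeros(k)
    np.add.at(h, vis_labels, vis_hist)
    np.add.at(h, aud_labels, aud_hist)
    return h

# Toy example: 40 visual words, 30 audio words, 4 bi-modal words.
rng = np.random.default_rng(0)
C = rng.random((40, 30))                   # hypothetical co-occurrence matrix
vis_labels, aud_labels = bimodal_codewords(C, k=4)
hist = bimodal_bow(np.ones(40), np.ones(30), vis_labels, aud_labels, 4)
```

Running the sketch at several values of `k` would yield the multi-granularity codebooks the paper combines with multiple kernel learning.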

Keywords

Bi-modal words · Multimodal fusion · Multiple kernel learning · Event detection

Acknowledgments

This work is supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20071. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

  1.
  2.
  3.
  4. Bao, L., et al.: Informedia @ TRECVID 2011. In: NIST TRECVID Workshop (2011)
  5. Beal, M., Jojic, N., Attias, H.: A graphical model for audiovisual object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(7), 828–836 (2003)
  6. Boureau, Y.-L., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in visual recognition. In: International Conference on Machine Learning (2010)
  7. Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: European Conference on Computer Vision (2004)
  8. Cristani, M., Bicego, M., Murino, V.: Audio-visual event recognition in surveillance video sequences. IEEE Trans. Multimedia (2007)
  9. Dhillon, I.: Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM Conference on Knowledge Discovery and Data Mining (2001)
  10. Gehler, P., Nowozin, S.: On feature combination for multiclass object detection. In: IEEE International Conference on Computer Vision (2009)
  11. Jhuo, I.-H., Lee, D.-T.: Boosting-based multiple kernel learning for image re-ranking. In: ACM International Conference on Multimedia (2010)
  12. Jiang, W., Cotton, C., Chang, S.-F., Ellis, D., Loui, A.: Short-term audio-visual atoms for generic video concept classification. In: ACM International Conference on Multimedia (2009)
  13. Jiang, W., Loui, A.: Audio-visual grouplet: temporal audio-visual interactions for general video concept classification. In: ACM International Conference on Multimedia (2011)
  14. Jiang, Y.-G., Ye, G., Chang, S.-F., Ellis, D., Loui, A.: Consumer video understanding: a benchmark database and an evaluation of human and machine performance. In: ACM International Conference on Multimedia Retrieval (2011)
  15. Jiang, Y.-G., et al.: Columbia-UCF TRECVID2010 multimedia event detection: combining multiple modalities, contextual concepts, and temporal matching. In: NIST TRECVID Workshop (2010)
  16. Jiang, Y.-G., Bhattacharya, S., Chang, S.-F., Shah, M.: High-level event recognition in unconstrained videos. Int. J. Multimedia Inf. Retr. 2(2), 73–101 (2012)
  17. Kembhavi, A., Siddiquie, B., Miezianko, R., McCloskey, S., Davis, L.S.: Incremental multiple kernel learning for object recognition. In: IEEE International Conference on Computer Vision (2009)
  18. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
  19. Laptev, I., Lindeberg, T.: On space-time interest points. Int. J. Comput. Vision 64(2), 107–123 (2005)
  20. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition (2008)
  21. Liu, J., Shah, M., Kuipers, B., Savarese, S.: Cross-view action recognition via view knowledge transfer. In: IEEE Conference on Computer Vision and Pattern Recognition (2011)
  22. Lütkepohl, H.: Handbook of Matrices. Wiley, Chichester (1997)
  23. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
  24. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. Int. J. Comput. Vision 60(1), 63–86 (2004)
  25. Natarajan, P., et al.: BBN VISER TRECVID 2011 multimedia event detection system. In: NIST TRECVID Workshop (2011)
  26. Pan, S., Ni, X., Sun, J.-T., Yang, Q., Chen, Z.: Cross-domain sentiment classification via spectral feature alignment. In: International World Wide Web Conference (2010)
  27. Pols, L.: Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words. Free University, Amsterdam (1966)
  28. Potamianos, G., Neti, C., Luettin, J., Matthews, I.: Audio-visual automatic speech recognition: an overview. In: Issues in Visual and Audio-Visual Speech Processing (2004)
  29. Rakotomamonjy, A., Bach, F.R., Canu, S., Grandvalet, Y.: SimpleMKL. J. Mach. Learn. Res. 9, 2491–2521 (2008)
  30. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. In: IEEE International Conference on Computer Vision (2009)
  31. Wang, J.-C., Yang, Y.-H., Jhuo, I.-H., Lin, Y.-Y., Wang, H.-M.: The acoustic-visual emotion Gaussians model for automatic generation of music video. In: ACM International Conference on Multimedia (2012)
  32. Ye, G., Liu, D., Jhuo, I.-H., Chang, S.-F.: Robust late fusion with rank minimization. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
  33. Ye, G., Jhuo, I.-H., Liu, D., Jiang, Y.-G., Lee, D.-T., Chang, S.-F.: Joint audio-visual bi-modal codewords for video event detection. In: ACM International Conference on Multimedia Retrieval (2012)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • I-Hong Jhuo (1)
  • Guangnan Ye (3)
  • Shenghua Gao (4)
  • Dong Liu (3)
  • Yu-Gang Jiang (5)
  • D. T. Lee (1, 2)
  • Shih-Fu Chang (3)

  1. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
  2. Department of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan
  3. Department of Electrical Engineering, Columbia University, New York, USA
  4. Advanced Digital Sciences Center, Singapore
  5. School of Computer Science, Fudan University, Shanghai, China