Audio-Visual Event Localization in Unconstrained Videos

  • Yapeng Tian
  • Jing Shi
  • Bochen Li
  • Zhiyao Duan
  • Chenliang Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)

Abstract

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systemically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
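The abstract names two model components: an audio-guided visual attention mechanism and a dual multimodal residual network (DMRN) for fusing the two modalities. The following is a minimal PyTorch sketch of those two ideas, not the authors' released implementation; the feature dimensions, layer sizes, and class names are illustrative assumptions.

```python
# Illustrative sketch only: audio-guided visual attention and a DMRN-style
# residual fusion block, as described in the abstract. All dimensions and
# names are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioGuidedVisualAttention(nn.Module):
    """Pool a spatial grid of visual features into one vector per segment,
    with attention weights conditioned on that segment's audio feature."""

    def __init__(self, visual_dim=512, audio_dim=128, hidden_dim=256):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, hidden_dim)
        self.proj_a = nn.Linear(audio_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, visual, audio):
        # visual: (B, N, visual_dim) -- N spatial locations of a CNN feature map
        # audio:  (B, audio_dim)     -- one audio feature per video segment
        joint = torch.tanh(self.proj_v(visual) + self.proj_a(audio).unsqueeze(1))
        attn = F.softmax(self.score(joint), dim=1)       # (B, N, 1) attention map
        attended = (attn * visual).sum(dim=1)            # (B, visual_dim) pooled feature
        return attended, attn.squeeze(-1)


class DMRNBlock(nn.Module):
    """Dual multimodal residual fusion: each modality is updated with a
    residual computed from the merged audio-visual representation."""

    def __init__(self, dim=256):
        super().__init__()
        self.fc_a = nn.Linear(dim, dim)
        self.fc_v = nn.Linear(dim, dim)

    def forward(self, audio, visual):
        fused = 0.5 * (audio + visual)                    # merge the two streams
        audio_out = torch.tanh(audio + self.fc_a(fused))  # residual update, audio path
        visual_out = torch.tanh(visual + self.fc_v(fused))  # residual update, visual path
        return audio_out, visual_out
```

In this sketch the attention module would be applied per one-second segment before temporal modeling, and the DMRN block keeps separate audio and visual streams while letting each borrow information from the fused representation through a residual connection.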

Keywords

Audio-visual event · Temporal localization · Attention · Fusion

Notes

Acknowledgement

This work was supported by NSF BIGDATA 1741472. We gratefully acknowledge the gift donations of Markable, Inc. and Tencent, and the support of NVIDIA Corporation with the donation of the GPUs used for this research. This article solely reflects the opinions and conclusions of its authors and not those of NSF, Markable, Tencent, or NVIDIA.

Supplementary material

Supplementary material 1: 474176_1_En_16_MOESM1_ESM.pdf (523 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Rochester, Rochester, USA