Audiovisual Annotation Procedure for Multi-view Field Recordings

Conference paper in MultiMedia Modeling (MMM 2019)

Abstract

The audio and video parts of an audiovisual document interact to produce an audiovisual, or multi-modal, perception. Yet automatic analysis of these documents is usually based on separate audio and video annotations. With respect to the audiovisual content, such annotations can be incomplete or irrelevant. Moreover, the expanding possibilities for creating audiovisual documents lead us to consider new kinds of content, including videos filmed in uncontrolled conditions (i.e. field recordings) and scenes filmed from different points of view (multi-view). In this paper we propose an original procedure for producing manual annotations in these contexts, including multi-modal and multi-view documents. The procedure, based on joint audio and video annotations, ensures consistency when considering audio or video alone, and additionally provides audiovisual information at a richer level. Finally, such annotated data enable a range of applications. In particular, we present an example application over a network of recordings in which our annotations allow multi-source retrieval using mono-modal or multi-modal queries.
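The abstract already implies a concrete data shape for such annotations: each recording in the network carries time-aligned audio and video annotation tiers that can be queried separately (mono-modal) or jointly (multi-modal). The following is a minimal sketch of that idea only; all names (Segment, Recording, retrieve) are our own hypothetical illustrations, not the authors' annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch: one annotation tier per modality, kept on the same
# recording so audio-only, video-only, and audiovisual queries all resolve
# against the same time-aligned events.

@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    label: str     # e.g. "car horn" (audio tier) or "car passing" (video tier)

@dataclass
class Recording:
    recording_id: str                        # one sensor in the multi-view network
    audio: List[Segment] = field(default_factory=list)
    video: List[Segment] = field(default_factory=list)

def retrieve(recordings: List[Recording], label: str,
             modality: Optional[str] = None) -> List[str]:
    """Return ids of recordings whose audio tier, video tier, or either
    (modality=None, i.e. a multi-modal query) contains `label`."""
    hits = []
    for rec in recordings:
        tiers = {"audio": rec.audio, "video": rec.video}
        searched = list(tiers.values()) if modality is None else [tiers[modality]]
        if any(seg.label == label for tier in searched for seg in tier):
            hits.append(rec.recording_id)
    return hits

# An event heard on one sensor and seen on another surfaces both
# recordings under a multi-modal query, but only one under a mono query.
network = [
    Recording("cam1", audio=[Segment(3.0, 4.5, "car horn")]),
    Recording("cam2", video=[Segment(2.8, 5.0, "car horn")]),
]
print(retrieve(network, "car horn"))            # ['cam1', 'cam2']
print(retrieve(network, "car horn", "audio"))   # ['cam1']
```

In a real annotation tool the tiers would also carry cross-modal links between corresponding audio and video segments; the point of the sketch is only how mono and multi-modal retrieval can share one annotation store.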



Author information


Corresponding author

Correspondence to Patrice Guyot.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Guyot, P. et al. (2019). Audiovisual Annotation Procedure for Multi-view Field Recordings. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.H., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11295. Springer, Cham. https://doi.org/10.1007/978-3-030-05710-7_33


  • DOI: https://doi.org/10.1007/978-3-030-05710-7_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05709-1

  • Online ISBN: 978-3-030-05710-7

  • eBook Packages: Computer Science, Computer Science (R0)
