Audiovisual Annotation Procedure for Multi-view Field Recordings

  • Patrice Guyot
  • Thierry Malon
  • Geoffrey Roman-Jimenez
  • Sylvie Chambon
  • Vincent Charvillat
  • Alain Crouzil
  • André Péninou
  • Julien Pinquier
  • Florence Sèdes
  • Christine Sénac
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11295)

Abstract

The audio and video parts of an audiovisual document interact to produce an audiovisual, or multi-modal, perception. Yet automatic analyses of these documents are usually based on separate audio and video annotations. With respect to the audiovisual content, such annotations can be incomplete or irrelevant. Moreover, the expanding possibilities for creating audiovisual documents lead us to consider different kinds of content, including videos filmed in uncontrolled conditions (i.e. field recordings) and scenes filmed from different points of view (multi-view). In this paper we propose an original procedure to produce manual annotations in these different contexts, including multi-modal and multi-view documents. The procedure, based on the joint use of audio and video annotations, ensures consistency when considering audio or video alone, and additionally provides audiovisual information at a richer level. Finally, such annotated data enable several applications. In particular, we present an example application on a network of recordings in which our annotations allow multi-source retrieval using mono- or multi-modal queries.
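The example application mentioned above, multi-source retrieval over a network of annotated recordings, can be made concrete with a small sketch. The record layout, the Annotation class, and the find_segments helper below are hypothetical illustrations assuming each annotation carries a modality, a label, a time span, and an optional view identifier; they are not the authors' actual annotation format.

```python
# Minimal sketch of multi-view, multi-modal annotations and a mono- or
# multi-modal retrieval query. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    recording_id: str               # which camera or microphone in the network
    modality: str                   # "audio", "video", or "audiovisual"
    label: str                      # e.g. "car horn", "person running"
    start: float                    # segment start time, in seconds
    end: float                      # segment end time, in seconds
    view_id: Optional[int] = None   # point of view, when several cameras film the scene


def find_segments(annotations: List[Annotation],
                  label: str,
                  modality: Optional[str] = None) -> List[Annotation]:
    """Return all annotated segments matching a label, optionally
    restricted to a single modality (mono-modal query)."""
    return [a for a in annotations
            if a.label == label and (modality is None or a.modality == modality)]


if __name__ == "__main__":
    # Example: retrieve every recording in which a "car horn" was annotated,
    # regardless of modality (multi-modal query).
    corpus = [
        Annotation("cam1", "video", "car horn", 12.0, 14.5, view_id=1),
        Annotation("mic2", "audio", "car horn", 12.1, 14.2),
    ]
    for hit in find_segments(corpus, "car horn"):
        print(hit.recording_id, hit.modality, hit.start, hit.end)
```

Keeping audio and video segments in one shared structure, rather than in two separate annotation files, is what allows a single query to return matches across modalities and views.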

Keywords

Audiovisual · Annotation · Multi-view · Multi-modal · Field recording · Multimedia · Ground truth

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Patrice Guyot (1)
  • Thierry Malon (1)
  • Geoffrey Roman-Jimenez (1)
  • Sylvie Chambon (1)
  • Vincent Charvillat (1)
  • Alain Crouzil (1)
  • André Péninou (1)
  • Julien Pinquier (1)
  • Florence Sèdes (1)
  • Christine Sénac (1)

  1. IRIT, Université de Toulouse, CNRS, Toulouse, France