Audiovisual Annotation Procedure for Multi-view Field Recordings

Abstract
The audio and video components of an audiovisual document interact to produce an audiovisual, or multi-modal, perception. Yet automatic analysis of such documents is usually based on separate audio and video annotations. With respect to the audiovisual content as a whole, these separate annotations may be incomplete or irrelevant. Moreover, the expanding possibilities for creating audiovisual documents lead us to consider different kinds of content, including videos filmed in uncontrolled conditions (i.e., field recordings) and scenes filmed from different points of view (multi-view). In this paper we propose an original procedure for producing manual annotations in different contexts, including multi-modal and multi-view documents. This procedure, based on combining audio and video annotations, ensures consistency when audio or video is considered alone, and additionally provides audiovisual information at a richer level. Finally, such annotated data makes different applications possible. In particular, we present an example application over a network of recordings in which our annotations enable multi-source retrieval using mono-modal or multi-modal queries.
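To make the retrieval scenario concrete, here is a minimal Python sketch of multi-source retrieval over modality-tagged annotations. All names and the data layout (the Annotation record, the retrieve function, the "car_passing" label, and the source identifiers) are illustrative assumptions, not the paper's actual annotation format.

```python
# Sketch only: each annotation carries a label, a modality tag, a source
# (camera/microphone) identifier, and a time span. A query retrieves
# matching segments across all sources of the recording network,
# optionally restricted to certain modalities.
from dataclasses import dataclass

@dataclass
class Annotation:
    label: str        # e.g. "car_passing" (hypothetical label)
    modality: str     # "audio", "video", or "audiovisual"
    source_id: str    # which camera/microphone in the network
    start: float      # segment start time, in seconds
    end: float        # segment end time, in seconds

def retrieve(annotations, label, modalities=("audio", "video", "audiovisual")):
    """Multi-source retrieval: return every segment, across all sources,
    whose label matches the query and whose modality is allowed."""
    return [a for a in annotations if a.label == label and a.modality in modalities]

# Example: a mono-modal (audio-only) query over synchronized sources.
corpus = [
    Annotation("car_passing", "audio", "mic_1", 12.0, 15.5),
    Annotation("car_passing", "video", "cam_2", 12.2, 16.0),
    Annotation("car_passing", "audiovisual", "cam_1", 12.0, 16.0),
]
audio_hits = retrieve(corpus, "car_passing", modalities=("audio",))
print(audio_hits)  # -> only the mic_1 segment
```

A multi-modal query simply widens the set of allowed modalities, so mono-modal and multi-modal retrieval operate over the same annotated corpus.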