Audiovisual Annotation Procedure for Multi-view Field Recordings

Abstract
The audio and video components of an audiovisual document interact to produce an audiovisual, or multi-modal, perception. Yet automatic analysis of such documents is usually based on separate audio and video annotations. With respect to the audiovisual content as a whole, these separate annotations may be incomplete or irrelevant. Moreover, the expanding possibilities for creating audiovisual documents lead us to consider different kinds of content, including videos filmed in uncontrolled conditions (i.e., field recordings) and scenes filmed from different points of view (multi-view). In this paper we propose an original procedure for producing manual annotations in different contexts, including multi-modal and multi-view documents. This procedure, based on combining audio and video annotations, ensures consistency when audio or video is considered alone, and additionally provides audiovisual information at a richer level. Finally, such annotated data makes different applications possible. In particular, we present an example application over a network of recordings in which our annotations enable multi-source retrieval using mono-modal or multi-modal queries.
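To make the retrieval scenario concrete, here is a minimal Python sketch of multi-source retrieval over modality-tagged annotations. All names and the data layout (the Annotation record, the retrieve function, the "car_passing" label, and the source identifiers) are illustrative assumptions, not the paper's actual annotation format.

```python
# Sketch only: each annotation carries a label, a modality tag, a source
# (camera/microphone) identifier, and a time span. A query retrieves
# matching segments across all sources of the recording network,
# optionally restricted to certain modalities.
from dataclasses import dataclass

@dataclass
class Annotation:
    label: str        # e.g. "car_passing" (hypothetical label)
    modality: str     # "audio", "video", or "audiovisual"
    source_id: str    # which camera/microphone in the network
    start: float      # segment start time, in seconds
    end: float        # segment end time, in seconds

def retrieve(annotations, label, modalities=("audio", "video", "audiovisual")):
    """Multi-source retrieval: return every segment, across all sources,
    whose label matches the query and whose modality is allowed."""
    return [a for a in annotations if a.label == label and a.modality in modalities]

# Example: a mono-modal (audio-only) query over synchronized sources.
corpus = [
    Annotation("car_passing", "audio", "mic_1", 12.0, 15.5),
    Annotation("car_passing", "video", "cam_2", 12.2, 16.0),
    Annotation("car_passing", "audiovisual", "cam_1", 12.0, 16.0),
]
audio_hits = retrieve(corpus, "car_passing", modalities=("audio",))
print(audio_hits)  # -> only the mic_1 segment
```

A multi-modal query simply widens the set of allowed modalities, so mono-modal and multi-modal retrieval operate over the same annotated corpus.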