Abstract
Modern computing systems are usually equipped with input devices such as microphones and cameras, which makes it possible to identify the user of such a system. User identification is important in many human-computer interaction (HCI) scenarios, such as speech recognition, activity recognition, transcription of meeting room data, or affective computing, where personalized models may significantly improve the performance of the overall recognition system. This paper deals with audio-visual user identification. The main processing steps are segmentation of the relevant parts of the video and audio streams, extraction of meaningful features, and construction of the overall classifier and fusion architectures. The proposed system has been evaluated on the MOBIO dataset, a benchmark database consisting of real-world recordings collected from mobile devices, e.g. cell phones. Recognition rates of up to 92% could be achieved for the proposed audio-visual classifier system.
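The abstract describes a pipeline that combines separately trained audio and video classifiers into one decision. The paper's exact combination rule is not given here, so the following is only a minimal sketch of one common approach, score-level (late) fusion via a weighted sum of per-user posterior scores; the function name `fuse_scores`, the weight `w_audio`, and the example scores are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def fuse_scores(audio_scores, video_scores, w_audio=0.5):
    """Weighted-sum late fusion of per-user scores from two modalities.

    audio_scores, video_scores: sequences of per-user scores (same length,
    one entry per enrolled user). Returns the identified user index and
    the fused score vector.
    """
    audio = np.asarray(audio_scores, dtype=float)
    video = np.asarray(video_scores, dtype=float)
    fused = w_audio * audio + (1.0 - w_audio) * video
    return int(np.argmax(fused)), fused

# Hypothetical per-user scores from the two modality classifiers
audio = [0.2, 0.7, 0.1]   # audio classifier favours user 1
video = [0.1, 0.6, 0.3]   # video classifier agrees
user_id, fused = fuse_scores(audio, video, w_audio=0.5)
print(user_id)  # -> 1
```

The weight `w_audio` would in practice be tuned on validation data, since the relative reliability of the two modalities varies with recording conditions (e.g. noisy audio on mobile devices).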
Acknowledgement
This work was partially supported by the Transregional Collaborative Research Center SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems", funded by the German Research Foundation (DFG), and by a scholarship of the Landesgraduiertenförderung Baden-Württemberg at Ulm University (M. Kächele).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kächele, M., Meudt, S., Schwarz, A., Schwenker, F. (2015). Audio-Visual User Identification in HCI Scenarios. In: Schwenker, F., Scherer, S., Morency, LP. (eds) Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction. MPRSS 2014. Lecture Notes in Computer Science(), vol 8869. Springer, Cham. https://doi.org/10.1007/978-3-319-14899-1_11
DOI: https://doi.org/10.1007/978-3-319-14899-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14898-4
Online ISBN: 978-3-319-14899-1
eBook Packages: Computer Science (R0)