Abstract
Inferring users’ actions and intentions forms an integral part of the design and development of any human-computer interface. The presence of noisy and at times ambiguous sensory data makes this problem challenging. We formulate a framework for temporal fusion of multiple sensors using input-output dynamic Bayesian networks (IODBNs). We find that contextual information about the state of the computer interface, used as an input to the DBN, and sensor distributions learned from data are crucial for good detection performance. Nevertheless, classical DBN learning methods can cause such models to fail when the data exhibits complex behavior. To further improve the detection rate, we formulate an error-feedback learning strategy for DBNs. We apply this framework to the problem of audio/visual speaker detection in an interactive kiosk application using “off-the-shelf” visual and audio sensors (face, skin, texture, mouth motion, and silence detectors). Detection results obtained in this setup demonstrate numerous benefits of our learning-based framework.
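To make the input-output DBN idea concrete, the toy sketch below (not the paper's actual model) shows a single forward-filtering recursion in which the hidden speaker state is predicted with a transition model conditioned on a discrete context input (the kiosk's interface state) and then corrected by naively fusing several binary detector outputs. All state names, context values, and probabilities are invented for illustration; in the paper these distributions are learned from data.

```python
import numpy as np

# Hypothetical two-state model: 0 = "not speaking to the kiosk", 1 = "speaking to the kiosk".
# Transition matrices are indexed by a discrete context input u_t (the kiosk's interface state),
# which is what distinguishes an input-output DBN from a plain DBN/HMM.
# All numbers below are made up for illustration only.

N_STATES = 2

# A[u][i][j] = P(x_t = j | x_{t-1} = i, u_t = u) for two illustrative contexts:
# u = 0: kiosk idle, u = 1: kiosk is prompting the user (speaking is more likely).
A = np.array([
    [[0.95, 0.05],
     [0.30, 0.70]],
    [[0.70, 0.30],
     [0.10, 0.90]],
])

# Per-sensor likelihoods P(y_k = 1 | x) for binary "off-the-shelf" detector outputs
# (face, skin, texture, mouth motion, audio non-silence). Illustrative values only.
SENSOR_LIK = np.array([
    [0.30, 0.90],   # face detected
    [0.40, 0.85],   # skin detected
    [0.35, 0.80],   # texture cue
    [0.10, 0.75],   # mouth motion
    [0.20, 0.90],   # audio non-silence
])

def obs_likelihood(y):
    """Fuse the K binary sensor readings y, assuming conditional independence given the state."""
    lik = np.ones(N_STATES)
    for k, yk in enumerate(y):
        p = SENSOR_LIK[k]
        lik *= p if yk else (1.0 - p)
    return lik

def iodbn_filter(inputs, observations, prior=(0.9, 0.1)):
    """Forward filtering P(x_t | y_1..t, u_1..t) for the toy input-output DBN."""
    belief = np.array(prior, dtype=float)
    beliefs = []
    for u, y in zip(inputs, observations):
        belief = belief @ A[u]          # predict: transition conditioned on the input u_t
        belief *= obs_likelihood(y)     # correct: fuse all sensor evidence
        belief /= belief.sum()          # normalize
        beliefs.append(belief.copy())
    return np.array(beliefs)

if __name__ == "__main__":
    inputs = [0, 1, 1, 1, 0]           # kiosk interface state over time (illustrative)
    observations = [
        [1, 1, 1, 0, 0],               # face visible, no mouth motion, silence
        [1, 1, 1, 1, 1],               # all cues fire while the kiosk prompts the user
        [1, 1, 1, 1, 1],
        [1, 1, 0, 1, 1],
        [0, 1, 0, 0, 0],
    ]
    for t, b in enumerate(iodbn_filter(inputs, observations)):
        print(f"t={t}: P(speaking) = {b[1]:.3f}")
```

The error-feedback learning strategy mentioned in the abstract would then adjust the learned distributions using the detection errors of this filter, a step deliberately omitted from the sketch.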
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pavlović, V., Garg, A., Rehg, J.M. (2000). Multimodal Speaker Detection Using Input/Output Dynamic Bayesian Networks. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_41
DOI: https://doi.org/10.1007/3-540-40063-X_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41180-2
Online ISBN: 978-3-540-40063-9