Abstract
Multimodal approaches are proposed for segmenting multiple speakers using geometric or statistical techniques. When multiple microphones and cameras are available, 3-D audiovisual tracking is used for source segmentation and array processing. With just a single camera and microphone, an information-theoretic criterion separates speakers in a video sequence and associates the relevant portions of the audio signal with each speaker. Results are shown for each approach, and an initial integration effort is discussed.
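To make the single-camera, single-microphone idea concrete, the following is a minimal sketch of associating an audio track with one of several image regions by mutual information. It is not the chapter's method: it assumes the signals are jointly Gaussian (so mutual information reduces to a function of the correlation coefficient), and the function names, the per-frame "energy" features, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_mutual_information(x, y):
    """Mutual information (in nats) between two 1-D signals.

    Assumption: x and y are jointly Gaussian, so I(x; y) = -0.5 * log(1 - rho^2).
    """
    rho = np.corrcoef(x, y)[0, 1]
    rho2 = min(rho * rho, 1.0 - 1e-12)  # avoid log(0) for perfectly correlated inputs
    return -0.5 * np.log(1.0 - rho2)

def most_informative_region(audio_energy, region_motion):
    """Return the index of the region sharing the most information with the audio,
    together with the score for every region."""
    scores = [gaussian_mutual_information(audio_energy, m) for m in region_motion]
    return int(np.argmax(scores)), scores

# Synthetic data (hypothetical): one region moves with the audio, one does not.
rng = np.random.default_rng(0)
n_frames = 500
audio_energy = rng.standard_normal(n_frames)
speaker_motion = 0.8 * audio_energy + 0.6 * rng.standard_normal(n_frames)
bystander_motion = rng.standard_normal(n_frames)

best, scores = most_informative_region(
    audio_energy, [speaker_motion, bystander_motion]
)
print("region most associated with the audio:", best)
```

The Gaussian assumption keeps the sketch short; a nonparametric density estimate would be needed where the audio-video relationship is not well described by linear correlation.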
Copyright information
© 2005 Springer
About this chapter
Cite this chapter
Darrell, T., Fisher, J.W., Wilson, K.W., Siracusa, M.R. (2005). Geometric and Statistical Approaches to Audiovisual Segmentation. In: van Kuppevelt, J.C.J., Dybkjær, L., Bernsen, N.O. (eds) Advances in Natural Multimodal Dialogue Systems. Text, Speech and Language Technology, vol 30. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3933-6_8
DOI: https://doi.org/10.1007/1-4020-3933-6_8
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-3932-4
Online ISBN: 978-1-4020-3933-1
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)