Geometric and Statistical Approaches to Audiovisual Segmentation

Darrell, Trevor; Fisher, John W.; Wilson, Kevin W.; Siracusa, Michael R.

doi:10.1007/1-4020-3933-6_8

Part of the book series: Text, Speech and Language Technology (TLTB, volume 30)

Abstract

Multimodal approaches are proposed for segmenting multiple speakers using geometric or statistical techniques. When multiple microphones and cameras are available, 3-D audiovisual tracking is used for source segmentation and array processing. With just a single camera and microphone, an information-theoretic criterion separates speakers in a video sequence and associates the relevant portions of the audio signal with each speaker. Results are shown for each approach, and an initial integration effort is discussed.

Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, M.I.T., Cambridge, MA 02139, USA
Trevor Darrell, John W. Fisher III, Kevin W. Wilson & Michael R. Siracusa

Editor information

Editors and Affiliations

Waalre, The Netherlands
Jan C. J. van Kuppevelt
University of Southern Denmark, Odense, Denmark
Laila Dybkjær & Niels Ole Bernsen

About this chapter

Cite this chapter

Darrell, T., Fisher, J.W., Wilson, K.W., Siracusa, M.R. (2005). Geometric and Statistical Approaches to Audiovisual Segmentation. In: van Kuppevelt, J.C.J., Dybkjær, L., Bernsen, N.O. (eds) Advances in Natural Multimodal Dialogue Systems. Text, Speech and Language Technology, vol 30. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3933-6_8

DOI: https://doi.org/10.1007/1-4020-3933-6_8
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-3932-4
Online ISBN: 978-1-4020-3933-1
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
