Geometric and Statistical Approaches to Audiovisual Segmentation

  • Chapter
Advances in Natural Multimodal Dialogue Systems

Part of the book series: Text, Speech and Language Technology (TLTB, volume 30)

Abstract

Multimodal approaches are proposed for segmenting multiple speakers using geometric or statistical techniques. When multiple microphones and cameras are available, 3-D audiovisual tracking is used for source segmentation and array processing. With just a single camera and microphone, an information-theoretic criterion separates speakers in a video sequence and associates the relevant portions of the audio signal with them. Results are shown for each approach, and an initial integration effort is discussed.

© 2005 Springer

About this chapter

Cite this chapter

Darrell, T., Fisher, J.W., Wilson, K.W., Siracusa, M.R. (2005). Geometric and Statistical Approaches to Audiovisual Segmentation. In: van Kuppevelt, J.C.J., Dybkjær, L., Bernsen, N.O. (eds) Advances in Natural Multimodal Dialogue Systems. Text, Speech and Language Technology, vol 30. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3933-6_8
