
The Research Context

Chapter in Cognitively Inspired Audiovisual Speech Filtering

Part of the book series: SpringerBriefs in Cognitive Computation (BRIEFSCC, volume 5)


Abstract

This chapter presents a literature review that places the research proposed in this book in context, building on the background presented in the previous chapters. First, the overall speech processing domain is briefly discussed. The review gives examples of listening devices that use directional microphones, microphone arrays, noise reduction algorithms, and rule-based automatic decision making, demonstrating that the multimodal two-stage framework presented later in this book has established precedent in real-world hearing aid devices. The other aspect vital to the research context of this work is the field of audiovisual speech filtering. The chapter reviews multimodal speech enhancement, discussing the early audiovisual speech filtering systems in the literature and the subsequent development and diversification of the field. A number of state-of-the-art speech filtering systems are examined in depth, particularly multimodal beamforming and Wiener filtering. Finally, several audiovisual speech databases are evaluated.
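The chapter itself only reviews these systems, but as a reference point for the Wiener filtering it discusses, the sketch below shows the classical single-channel Wiener filter that the visually-derived variants build on; those systems replace the noise-only spectral estimate with statistics predicted from visual (lip) features. This is a minimal illustrative sketch, not the book's implementation: the function names, frame sizes, and power-subtraction SNR estimate are all assumptions made for the example.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, gain_floor=0.05):
    """Wiener gain G = SNR / (1 + SNR), with the a-priori SNR
    estimated by power subtraction from the noisy spectrum.
    A spectral floor limits musical-noise artefacts."""
    snr = np.maximum(noisy_psd / (noise_psd + 1e-12) - 1.0, 0.0)
    return np.maximum(snr / (1.0 + snr), gain_floor)

def enhance(noisy, noise_psd, frame=512, hop=256):
    """Frame-based enhancement: window, FFT, apply gain, overlap-add."""
    win = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, hop):
        spec = np.fft.rfft(noisy[start:start + frame] * win)
        spec *= wiener_gain(np.abs(spec) ** 2, noise_psd)
        out[start:start + frame] += np.fft.irfft(spec, n=frame) * win
    return out

# Usage example: a tone buried in white noise, with the noise PSD
# estimated by averaging frames of a speech-free segment.
rng = np.random.default_rng(0)
noise = 0.5 * rng.standard_normal(16000)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
win = np.hanning(512)
noise_psd = np.mean(
    [np.abs(np.fft.rfft(noise[i:i + 512] * win)) ** 2
     for i in range(0, 4096, 256)], axis=0)
denoised = enhance(clean + noise, noise_psd)
```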



Author information


Correspondence to Andrew Abel.


Copyright information

© 2015 The Author(s)

Cite this chapter

Abel, A., Hussain, A. (2015). The Research Context. In: Cognitively Inspired Audiovisual Speech Filtering. SpringerBriefs in Cognitive Computation, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-319-13509-0_3

