A comparison of active shape model and scale decomposition based features for visual speech recognition

  • Iain Matthews
  • J. Andrew Bangham
  • Richard Harvey
  • Stephen Cox
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1407)


Two quite different strategies for characterising mouth shapes for visual speech recognition (lipreading) are compared. The first extracts the parameters required to fit an active shape model (ASM) to the outline of the lips. The second uses a feature derived from a one-dimensional multiscale spatial analysis (MSA) of the mouth region, computed with a new processor derived from mathematical morphology and median filtering. In multispeaker trials using image data only, accuracy on a letters database is 45% with MSA and 19% with ASM. A digits database is an easier task, with accuracies of 77% and 77% respectively. These scores are significant because separate work has demonstrated that even quite low recognition accuracies in the visual channel can be combined with the audio system to improve composite performance [16].
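To make the idea of a one-dimensional multiscale analysis built from morphological operators more concrete, the following is a minimal sketch, not the authors' processor. It assumes a simple cascade in which, at each scale s, a flat opening followed by a flat closing with a window of s + 1 samples removes extrema of extent s, and the difference between successive stages is kept as that scale's detail signal. The function names (`opening`, `closing`, `sieve_decompose`) and the test signal are hypothetical and chosen only for illustration.

```python
def opening(x, k):
    """Flat 1-D morphological opening with a window of k samples."""
    n = len(x)
    if k < 1 or k > n:
        raise ValueError("window length must satisfy 1 <= k <= len(x)")
    # Erosion: minimum over every full window of k consecutive samples.
    mins = [min(x[j:j + k]) for j in range(n - k + 1)]
    # Dilation: each sample takes the largest minimum among the windows covering it.
    out = []
    for i in range(n):
        lo = max(0, i - k + 1)
        hi = min(i, n - k)
        out.append(max(mins[lo:hi + 1]))
    return out


def closing(x, k):
    """Flat 1-D closing, obtained from the opening by duality (negate, open, negate)."""
    return [-v for v in opening([-v for v in x], k)]


def sieve_decompose(x, max_scale):
    """Split x into per-scale detail signals plus a residual.

    Assumption for this sketch: scale s uses an opening then a closing
    with a window of s + 1 samples, applied to the previous stage's output.
    """
    current = list(x)
    details = []
    for s in range(1, max_scale + 1):
        smoothed = closing(opening(current, s + 1), s + 1)
        # Whatever this stage removed is the detail at scale s.
        details.append([a - b for a, b in zip(current, smoothed)])
        current = smoothed
    return details, current


# Hypothetical scanline: one peak of width 1 and one peak of width 2.
signal = [0, 0, 5, 0, 0, 3, 3, 0, 0, 0]
granules, residual = sieve_decompose(signal, 2)

# The decomposition is invertible: the details and the residual sum back to the input.
rebuilt = [sum(vals) for vals in zip(residual, *granules)]
```

In this toy signal the width-1 peak is captured entirely at scale 1 and the width-2 peak at scale 2, which is the property that lets a per-scale summary of the detail signals act as a feature describing how large the structures along a scanline are.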


Keywords (added by machine, not by the authors): Speech Recognition; Automatic Speech Recognition; Visual Speech; Active Shape Model; Gaussian Mode


  1. A. Adjoudani and C. Benoît. On the Integration of Auditory and Visual Parameters in an HMM-based ASR, pages 461–471. In Stork and Hennecke [38], 1996.
  2. J. A. Bangham, T. G. Campbell, and R. V. Aldridge. Multiscale median and morphological filters used for 2D pattern recognition. Signal Processing, 38:387–415, 1994.
  3. J. A. Bangham, P. Chardaire, C. J. Pye, and P. Ling. Multiscale nonlinear decomposition: The sieve decomposition theorem. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(5):529–539, 1996.
  4. J. A. Bangham, R. Harvey, P. Ling, and R. V. Aldridge. Morphological scale-space preserving transforms in many dimensions. Journal of Electronic Imaging, 5(3):283–299, July 1996.
  5. J. A. Bangham, R. Harvey, P. Ling, and R. V. Aldridge. Nonlinear scale-space from n-dimensional sieves. Proc. European Conference on Computer Vision, 1:189–198, 1996.
  6. J. A. Bangham, P. Ling, and R. Young. Multiscale recursive medians, scale-space and transforms with applications to image processing. IEEE Trans. Image Processing, 5(6):1043–1048, 1996.
  7. C. Benoît and R. Campbell, editors. Proceedings of the ESCA Workshop on Audio-Visual Speech Processing, Rhodes, Sept. 1997.
  8. A. Bosson, R. Harvey, and J. A. Bangham. Robustness of scale space filters. In BMVC, volume 1, pages 11–21, 1997.
  9. C. Bregler and S. M. Omohundro. Learning visual models for lipreading. In M. Shah and R. Jain, editors, Motion-Based Recognition, volume 9 of Computational Imaging and Vision, chapter 13, pages 301–320. Kluwer Academic, 1997.
  10. C. Bregler, S. M. Omohundro, and J. Shi. Towards a Robust Speechreading Dialog System, pages 409–423. In Stork and Hennecke [38], 1996.
  11. N. M. Brooke, M. J. Tomlinson, and R. K. Moore. Automatic speech recognition that includes visual speech cues. Proc. Institute of Acoustics, 16(5):15–22, 1994.
  12. C. C. Chibelushi, S. Gandon, J. S. D. Mason, F. Deravi, and R. D. Johnston. Design issues for a digital audio-visual integrated database. In IEE Colloquium on Integrated Audio-Visual Processing, number 1996/213, pages 7/1–7/7, Savoy Place, London, Nov. 1996.
  13. T. Coianiz, L. Torresani, and B. Caprile. 2D Deformable Models for Visual Speech Analysis, pages 391–398. In Stork and Hennecke [38], 1996.
  14. T. F. Cootes, A. Hill, C. J. Taylor, and J. Haslam. The use of active shape models for locating structures in medical images. Image and Vision Computing, 12(6):355–366, 1994.
  15. P. Cosi and E. M. Caldognetto. Lips and Jaw Movements for Vowels and Consonants: Spatio-Temporal Characteristics and Bimodal Recognition Applications, pages 291–313. In Stork and Hennecke [38], 1996.
  16. S. Cox, I. Matthews, and A. Bangham. Combining noise compensation with visual information in speech recognition. In Benoît and Campbell [7], pages 53–56.
  17. N. P. Erber. Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech and Hearing Research, 12:423–425, 1969.
  18. A. J. Goldschen. Continuous Automatic Speech Recognition by Lipreading. PhD thesis, George Washington University, 1993.
  19. R. Harvey, I. Matthews, J. A. Bangham, and S. Cox. Lip reading from scale-space measurements. In Proc. Computer Vision and Pattern Recognition, pages 582–587, Puerto Rico, June 1997. IEEE.
  20. H. J. A. M. Heijmans, P. Nacken, A. Toet, and L. Vincent. Graph morphology. Journal of Visual Communication and Image Representation, 3(1):24–38, March 1992.
  21. M. E. Hennecke, D. G. Stork, and K. V. Prasad. Visionary Speech: Looking Ahead to Practical Speechreading Systems, pages 331–349. In Stork and Hennecke [38], 1996.
  22. A. Hill and C. J. Taylor. Automatic landmark generation for point distribution models. In Proc. British Machine Vision Conference, 1994.
  23. R. Kaucic, B. Dalton, and A. Blake. Real-time lip tracking for audio-visual speech recognition applications. In Proc. European Conference on Computer Vision, volume II, pages 376–387, 1996.
  24. P. K. Kuhl and A. N. Meltzoff. The bimodal perception of speech in infancy. Science, 218:1138–1141, Dec. 1982.
  25. S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62(4):1035–1074, Apr. 1983.
  26. J. Luettin. Towards speaker independent continuous speechreading. In Proc. of the European Conference on Speech Communication and Technology, 1997.
  27. J. Luettin. Visual Speech and Speaker Recognition. PhD thesis, University of Sheffield, May 1997.
  28. K. Mase and A. Pentland. Automatic lipreading by optical-flow analysis. Systems and Computers in Japan, 22(6):67–75, 1991.
  29. I. Matthews, J. A. Bangham, and S. Cox. Scale based features for audiovisual speech recognition. In IEE Colloquium on Integrated Audio-Visual Processing, number 1996/213, pages 8/1–8/7, Savoy Place, London, Nov. 1996.
  30. H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, Dec. 1976.
  31. U. Meier, R. Stiefelhagen, and J. Yang. Preprocessing of visual speech under real world conditions. In Benoît and Campbell [7], pages 113–116.
  32. J. R. Movellan. Visual speech recognition with stochastic networks. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, 1995.
  33. K. K. Neely. Effect of visual factors on the intelligibility of speech. Journal of the Acoustical Society of America, 28(6):1275–1277, Nov. 1956.
  34. J. A. Nelder and R. Mead. A simplex method for function minimisation. Computer Journal, 7(4):308–313, 1965.
  35. E. D. Petajan. Automatic Lipreading to Enhance Speech Recognition. PhD thesis, University of Illinois, Urbana-Champaign, 1984.
  36. G. Potamianos, Cosatto, H. P. Graf, and D. B. Roe. Speaker independent audiovisual database for bimodal ASR. In Benoît and Campbell [7], pages 65–68.
  37. P. L. Silsbee. Computer Lipreading for Improved Accuracy in Automatic Speech Recognition. PhD thesis, The University of Texas, Austin, Dec. 1993.
  38. D. G. Stork and M. E. Hennecke, editors. Speechreading by Humans and Machines: Models, Systems and Applications. NATO ASI Series F: Computer and Systems Sciences. Springer-Verlag, Berlin, 1996.
  39. W. H. Sumby and I. Pollack. Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26(2):212–215, Mar. 1954.
  40. Q. Summerfield. Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd and R. Campbell, editors, Hearing by Eye: The Psychology of Lip-reading, pages 3–51. Lawrence Erlbaum Associates, London, 1987.
  41. S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. The HTK Book. Cambridge University, 1996.
  42. B. P. Yuhas, M. H. Goldstein, Jr., and T. J. Sejnowski. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27:65–71, 1989.

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Iain Matthews (1)
  • J. Andrew Bangham (1)
  • Richard Harvey (1)
  • Stephen Cox (1)
  1. School of Information Systems, University of East Anglia, Norwich, UK
