A Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition

  • Dahai Yu
  • Ovidiu Ghita
  • Alistair Sutherland
  • Paul F. Whelan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5414)


This paper presents a novel visual speech recognition (VSR) system based on Hidden Markov Models (HMM) and a new representation, referred to in this paper as the Visual Speech Unit (VSU), that extends the standard viseme concept. Visemes have been regarded as the smallest elements of speech in the visual domain and have been widely used to model visual speech, but they are problematic when applied to continuous visual speech recognition. To circumvent the problems associated with standard visemes, we propose a new visual speech representation that includes not only the data associated with the articulation of the visemes but also the transitory information between consecutive visemes. To evaluate the proposed representation, an extensive set of experiments was conducted to compare the performance of the visual speech units with that of the standard MPEG-4 visemes. The experimental results indicate that the developed VSR application achieved up to 90% correct recognition when applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only in the range 62-72%.
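As a rough illustration of HMM-based classification in general, the sketch below scores an observation sequence against one HMM per class and picks the class with the highest likelihood. It is not the authors' implementation: the discrete observation symbols, the two toy models (`vsu_a`, `vsu_b`), their parameters, and the function names are all assumptions made for illustration, and the abstract does not specify how the HMMs in the paper are structured or trained.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    n_states = len(start)
    # Initialise with the start distribution weighted by the first emission.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale to avoid underflow and accumulate the log of the scale factor.
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik


def classify(obs, models):
    """Pick the class whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))


# Hypothetical two-state, left-to-right toy models over symbols {0, 1}.
# Each entry is (start probabilities, transition matrix, emission matrix).
LEFT_TO_RIGHT = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "vsu_a": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),  # 0s then 1s
    "vsu_b": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),  # 1s then 0s
}

if __name__ == "__main__":
    print(classify([0, 0, 1, 1], MODELS))
    print(classify([1, 1, 0, 0], MODELS))
```

A real system would use continuous feature vectors extracted from the lip region rather than binary symbols, and the model parameters would be estimated from training data rather than set by hand.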


Keywords: Visual Speech Recognition, Visual Speech Unit, Viseme, EMPCA, HMM, Dynamic Time Warping



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Dahai Yu (1)
  • Ovidiu Ghita (1)
  • Alistair Sutherland (1)
  • Paul F. Whelan (1)
  1. Vision Systems Group, School of Electronic Engineering and Computing, Dublin City University, Dublin, Ireland
