Clustering Persian viseme using phoneme subspace for developing visual speech application


Numerous multimedia applications, such as talking heads, lip reading, lip synchronization, and computer-assisted pronunciation training, have drawn researchers' attention to clustering and analyzing visemes. Because clustering and analyzing visemes are language-dependent processes, we concentrated our research on the Persian language, for which such studies have been lacking. To this end, we propose a novel image-based approach consisting of four main steps: (a) extracting the lip region; (b) obtaining the Eigenviseme of each phoneme while accounting for the coarticulation effect; (c) mapping each viseme into its own subspace and into the other phonemes' subspaces to build a distance matrix from which the distances between viseme clusters are calculated; and (d) comparing the similarity of visemes based on the weight values of their reconstructions. To demonstrate the robustness of the proposed algorithm, three sets of experiments were conducted on Persian and English databases, in which Consonant/Vowel and Consonant/Vowel/Consonant syllables were examined. The results indicate that the proposed method outperforms the state-of-the-art algorithms considered in feature extraction and is comparably effective at generating adequate clusters. Moreover, the results mark a milestone in grouping Persian visemes, as judged against a perceptual test taken by volunteers.
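Steps (b) and (c) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each phoneme's Eigenviseme subspace is obtained by plain PCA over flattened lip-region images, and that the distance between two phonemes is the symmetrized mean reconstruction error when one phoneme's samples are projected into the other's subspace. The function names, the subspace size `k`, and the synthetic data are all hypothetical.

```python
import numpy as np


def eigen_subspace(frames, k):
    """Return (mean, basis) for the top-k PCA subspace of flattened lip images.

    frames has shape (n_samples, n_pixels); basis has shape (k, n_pixels).
    """
    mean = frames.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:k]


def reconstruction_error(frames, mean, basis):
    """Mean squared residual after projecting frames onto a subspace and back."""
    centered = frames - mean
    reconstructed = centered @ basis.T @ basis
    return float(np.mean(np.sum((centered - reconstructed) ** 2, axis=1)))


def distance_matrix(samples, k=2):
    """Build a symmetric phoneme-by-phoneme distance matrix.

    samples maps a phoneme label to an (n_samples, n_pixels) array.
    Entry (a, b) averages the error of a's samples in b's subspace
    and the error of b's samples in a's subspace (an assumed metric).
    """
    names = sorted(samples)
    spaces = {p: eigen_subspace(samples[p], k) for p in names}
    raw = np.array(
        [[reconstruction_error(samples[a], *spaces[b]) for b in names] for a in names]
    )
    return names, (raw + raw.T) / 2.0


# Synthetic stand-in for lip images: "b" and "p" share one subspace, "a" does not.
rng = np.random.default_rng(0)


def synth(mean, basis, n=30):
    coeffs = rng.normal(size=(n, basis.shape[0]))
    noise = 0.01 * rng.normal(size=(n, basis.shape[1]))
    return mean + coeffs @ basis + noise


closed_basis = rng.normal(size=(2, 20))
open_basis = rng.normal(size=(2, 20))
samples = {
    "b": synth(np.zeros(20), closed_basis),
    "p": synth(np.zeros(20), closed_basis),
    "a": synth(np.full(20, 10.0), open_basis),
}
names, dist = distance_matrix(samples, k=2)
```

In this toy setup the entry for the pair ("b", "p") is far smaller than the entry for ("b", "a"), which is the behaviour a viseme grouping relies on. The diagonal is small but not exactly zero, because it holds each phoneme's own residual noise. The resulting matrix could then be passed to any standard clustering routine that accepts pairwise distances.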




  1. C stands for Consonant and V stands for Vowel.



Author information



Corresponding author

Correspondence to Mohammad Aghaahmadi.


About this article

Cite this article

Aghaahmadi, M., Dehshibi, M.M., Bastanfard, A. et al. Clustering Persian viseme using phoneme subspace for developing visual speech application. Multimed Tools Appl 65, 521–541 (2013).



  • Audio/visual processing
  • Computer assisted pronunciation training
  • Eigen space
  • Multimedia systems
  • Persian viseme clustering