An adaptive approach for lip-reading using image and depth data

Multimedia Tools and Applications

Abstract

Lip-reading (LR) systems play an important role in automatic speech recognition when acoustic information is corrupted or unavailable. This article proposes an adaptive LR system for speech segment recognition using image and depth data. In addition to 2D images, the proposed system handles depth data, which are highly informative about 3D lip deformations during speech and are fairly robust to variations in mouth skin color and texture. The proposed system consists of two main steps. First, mouth thumbnails are extracted using 3D face pose tracking. Then, appearance and motion descriptors are computed and combined into a final feature vector describing the uttered speech. The accuracy of the 3D face tracking module is evaluated on the BIWI Kinect Head Pose database. The results show that our method is competitive with other state-of-the-art methods combining image and depth data (2.26 mm mean position error and 3.86° mean orientation error). Additionally, the overall LR system is evaluated on three public LR datasets (MIRACL-VC1, OuluVS, and CUAVE). The results demonstrate that depth data are complementary to 2D image data and reduce the speaker dependency problem in LR. The OuluVS and CUAVE datasets, which contain 2D images only, are used to evaluate the proposed system when depth data are unavailable and to compare it with recent state-of-the-art LR systems. The results show very competitive recognition rates (up to 96% for MIRACL-VC1, 93.2% for OuluVS, and 90% for CUAVE).
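
The second step described above (computing appearance and motion descriptors and combining them into one feature vector) can be illustrated with a minimal sketch. The abstract does not specify which descriptors are used, so everything below is an assumption made for illustration only: the function names are hypothetical, the appearance cue is a simplified gradient-orientation histogram, the motion cue is a histogram of absolute frame differences, and mouth thumbnails are assumed to be already extracted, scaled to the 0 to 255 range, and at least two frames long. It is not the authors' implementation.

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Gradient-orientation histogram weighted by gradient magnitude
    (an assumed, simplified appearance descriptor)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, np.pi), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

def motion_histogram(prev_patch, patch, bins=8):
    """Histogram of absolute frame differences
    (an assumed, crude stand-in for a motion descriptor)."""
    diff = np.abs(patch.astype(float) - prev_patch.astype(float))
    hist, _ = np.histogram(diff, bins=bins, range=(0.0, 255.0))
    return hist / diff.size

def describe_sequence(image_seq, depth_seq=None, bins=8):
    """Concatenate time-averaged appearance and motion descriptors
    for each available modality (image, and depth when present)."""
    parts = []
    for seq in (image_seq, depth_seq):
        if seq is None:
            continue  # depth may be unavailable, as in 2D-only datasets
        appearance = np.mean([orientation_histogram(f, bins) for f in seq], axis=0)
        motion = np.mean(
            [motion_histogram(a, b, bins) for a, b in zip(seq[:-1], seq[1:])], axis=0
        )
        parts.extend([appearance, motion])
    return np.concatenate(parts)
```

With `bins=8`, a sequence with both modalities yields a 32-dimensional vector, and a 2D-only sequence yields the 16-dimensional image part alone, which mirrors how the system is evaluated with and without depth data. The resulting vector would then be passed to a classifier of the reader's choice.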

Notes

  1. The partial derivatives of E_u are analytically computed.

  2. In our set-up, N_stand is experimentally fixed to 20 frames.

  3. MIRACL-VC1 is available at https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1

Author information

Correspondence to Ahmed Rekik, Achraf Ben-Hamadou or Walid Mahdi.

About this article

Cite this article

Rekik, A., Ben-Hamadou, A. & Mahdi, W. An adaptive approach for lip-reading using image and depth data. Multimed Tools Appl 75, 8609–8636 (2016). https://doi.org/10.1007/s11042-015-2774-3

  • DOI: https://doi.org/10.1007/s11042-015-2774-3
