Abstract
Lip-reading (LR) systems play an important role in automatic speech recognition when acoustic information is corrupted or unavailable. This article proposes an adaptive LR system for speech segment recognition using image and depth data. In addition to 2D images, the proposed system handles depth data, which are highly informative about the 3D deformations of the lips during speech and show a certain robustness to variations in mouth skin color and texture. The proposed system is based on two main steps. In the first step, mouth thumbnails are extracted using 3D face pose tracking. In the second step, appearance and motion descriptors are computed and combined into a final feature vector describing the uttered speech. The accuracy of the 3D face tracking module is evaluated on the BIWI Kinect Head Pose database. The results show that our method is competitive with other state-of-the-art methods combining image and depth data (i.e., 2.26 mm and 3.86° for mean position error and mean orientation error, respectively). Additionally, the overall LR system is evaluated using three public LR datasets (i.e., MIRACL-VC1, OuluVS, and CUAVE). The results demonstrate that depth data are complementary to 2D image data and reduce the speaker dependency problem in LR. The OuluVS and CUAVE datasets, which contain 2D images only, are used to evaluate the proposed system when depth data are unavailable and to compare it to recent state-of-the-art LR systems. The results show very competitive recognition rates (up to 96% for MIRACL-VC1, 93.2% for OuluVS, and 90% for CUAVE).
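The two tracking figures quoted above are averages over frames of a position error and an orientation error. The abstract does not give the exact definitions, so the following is only a minimal sketch under stated assumptions: position error is taken as the Euclidean distance between predicted and ground-truth head positions, and orientation error as the absolute difference of Euler angles (yaw, pitch, roll) in degrees, averaged over frames and angles. The function names `mean_position_error` and `mean_orientation_error` are hypothetical and do not come from the article.

```python
import numpy as np


def mean_position_error(pred_xyz, true_xyz):
    """Mean Euclidean distance between predicted and true 3D positions.

    Inputs are (n_frames, 3) arrays; the result is in the input unit
    (e.g. millimetres). Assumed definition, not taken from the article.
    """
    pred = np.asarray(pred_xyz, dtype=float)
    true = np.asarray(true_xyz, dtype=float)
    return float(np.linalg.norm(pred - true, axis=1).mean())


def mean_orientation_error(pred_deg, true_deg):
    """Mean absolute angular difference in degrees.

    Inputs are (n_frames, 3) arrays of Euler angles in degrees.
    Differences are wrapped to [-180, 180) so that 350 and -10 degrees
    count as the same orientation. Assumed definition.
    """
    diff = np.asarray(pred_deg, dtype=float) - np.asarray(true_deg, dtype=float)
    diff = (diff + 180.0) % 360.0 - 180.0
    return float(np.abs(diff).mean())


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the article.
    pos_pred = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
    pos_true = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    ang_pred = [[10.0, 0.0, 0.0], [350.0, 0.0, 0.0]]
    ang_true = [[0.0, 0.0, 0.0], [-10.0, 0.0, 0.0]]
    print(mean_position_error(pos_pred, pos_true))
    print(mean_orientation_error(ang_pred, ang_true))
```

Other conventions exist, such as reporting a separate error per Euler angle or the angle between facing directions, so published figures are comparable only when the same definition is used.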
Notes
The partial derivatives of $E_u$ are analytically computed.
In our set-up, $N_{stand}$ is experimentally fixed to 20 frames.
MIRACL-VC1 is available at https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1
Cite this article
Rekik, A., Ben-Hamadou, A. & Mahdi, W. An adaptive approach for lip-reading using image and depth data. Multimed Tools Appl 75, 8609–8636 (2016). https://doi.org/10.1007/s11042-015-2774-3
DOI: https://doi.org/10.1007/s11042-015-2774-3