An adaptive approach for lip-reading using image and depth data

Rekik, Ahmed; Ben-Hamadou, Achraf; Mahdi, Walid

doi:10.1007/s11042-015-2774-3

Published: 09 July 2015

Volume 75, pages 8609–8636 (2016)
Multimedia Tools and Applications

Abstract

Lip-reading (LR) systems play an important role in automatic speech recognition when acoustic information is corrupted or unavailable. This article proposes an adaptive LR system for speech segment recognition using image and depth data. In addition to 2D images, the proposed system handles depth data, which are very informative about the 3D deformations of the lips during speech and are fairly robust to variations in mouth skin color and texture. The proposed system is based on two main steps. In the first step, mouth thumbnails are extracted using 3D face pose tracking. Then, appearance and motion descriptors are computed and combined into a final feature vector describing the uttered speech. The accuracy of the 3D face tracking module is evaluated on the BIWI Kinect Head Pose database. The results show that our method is competitive with other state-of-the-art methods combining image and depth data (2.26 mm and 3.86° for mean position error and mean orientation error, respectively). Additionally, the overall LR system is evaluated on three public LR datasets (MIRACL-VC1, OuluVS, and CUAVE). The results demonstrate that depth data are complementary to 2D image data and reduce the speaker dependency problem in LR. The OuluVS and CUAVE datasets, which contain 2D images only, are used to evaluate the proposed system when depth data are unavailable and to compare it with recent state-of-the-art LR systems. The results show very competitive recognition rates (up to 96% for MIRACL-VC1, 93.2% for OuluVS, and 90% for CUAVE).
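As an illustration of the two head-pose figures quoted above, the following minimal sketch shows one common way such errors could be computed. The abstract does not state how the errors are aggregated, so this assumes that position error is the Euclidean distance between estimated and ground-truth head positions, and that orientation error is the absolute difference averaged over frames and over the three Euler angles. The function names and the toy values are hypothetical and are not taken from the article.

```python
import numpy as np


def mean_position_error(est_pos, gt_pos):
    """Mean Euclidean distance between estimated and ground-truth
    head positions, in the unit of the inputs (e.g. mm).
    Inputs are (n_frames, 3) arrays. Assumed definition."""
    est_pos = np.asarray(est_pos, dtype=float)
    gt_pos = np.asarray(gt_pos, dtype=float)
    return float(np.linalg.norm(est_pos - gt_pos, axis=1).mean())


def mean_orientation_error(est_angles, gt_angles):
    """Mean absolute angular difference in degrees, averaged over
    frames and over the three Euler angles. Assumed definition."""
    diff = np.asarray(est_angles, dtype=float) - np.asarray(gt_angles, dtype=float)
    # Wrap differences to [-180, 180) so 179 vs -179 counts as 2 degrees.
    diff = (diff + 180.0) % 360.0 - 180.0
    return float(np.abs(diff).mean())


# Toy data (hypothetical): two frames, positions in mm, angles in degrees.
est_pos = [[0.0, 0.0, 1000.0], [3.0, 4.0, 1000.0]]
gt_pos = [[0.0, 0.0, 1000.0], [0.0, 0.0, 1000.0]]
est_ang = [[10.0, 0.0, 0.0], [179.0, 0.0, 0.0]]
gt_ang = [[8.0, 0.0, 0.0], [-179.0, 0.0, 0.0]]

pos_err = mean_position_error(est_pos, gt_pos)
ori_err = mean_orientation_error(est_ang, gt_ang)
print(pos_err, ori_err)
```

Other conventions exist, such as reporting a per-axis error or a geodesic distance between rotations, so figures computed this way are only comparable with results that use the same definition.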

Notes

The partial derivatives of E_u are analytically computed.
In our set-up, N_{stand} is experimentally fixed to 20 frames.
MIRACL-VC1 is available at https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1

Author information

Authors and Affiliations

Multimedia Information Systems and Advanced Computing Laboratory (MIRACL), Sfax University, Pôle Technologique de Sfax, Route de Tunis Km 10, BP 242, 3021, Sfax, Tunisia
Ahmed Rekik & Walid Mahdi
Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 888, Zip Code 21974, Hawiyah Taif, Kingdom of Saudi Arabia
Walid Mahdi
Valeo Driving Assistance Research Center, 34 rue St-André Z.I. des Vignes, 93012, Bobigny, France
Achraf Ben-Hamadou

Corresponding authors

Correspondence to Ahmed Rekik, Achraf Ben-Hamadou or Walid Mahdi.

About this article

Cite this article

Rekik, A., Ben-Hamadou, A. & Mahdi, W. An adaptive approach for lip-reading using image and depth data. Multimed Tools Appl 75, 8609–8636 (2016). https://doi.org/10.1007/s11042-015-2774-3

Received: 03 December 2014
Revised: 23 May 2015
Accepted: 23 June 2015
Published: 09 July 2015
Issue Date: July 2016
DOI: https://doi.org/10.1007/s11042-015-2774-3
