Visual Speech Feature Representations: Recent Advances

Chapter

Abstract

Exploiting the speech information embedded in facial images has been a significant research topic in recent years, because it provides information complementary to acoustic signals for a wide range of automatic speech recognition (ASR) tasks. Visual information is particularly important in many real applications where acoustic signals are corrupted by environmental noise. This chapter reviews the most recent advances in feature extraction and representation for Visual Speech Recognition (VSR). Compared with other surveys published in the past decade, this chapter is more up to date and highlights the strengths of two newly developed approaches for VSR: graph-based learning and deep learning. In particular, we summarise how these two techniques have been used to overcome one of the most challenging difficulties in this area: how to automatically learn good visual feature representations from facial images to replace the widely used handcrafted features. The chapter concludes by discussing potential visual feature representation solutions that may overcome the remaining challenges in this domain.
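To make the contrast between handcrafted and learned visual features concrete, the sketch below pairs a classic Discrete Cosine Transform (DCT) descriptor of a mouth region with a small convolutional bottleneck encoder of the kind this chapter surveys. It is a minimal, illustrative sketch rather than any author's implementation: the 64x64 grayscale ROI, the feature dimensions, and the names dct_features and BottleneckEncoder are assumptions, and the encoder is shown untrained.

```python
import numpy as np
from scipy.fftpack import dct
import torch
import torch.nn as nn


def dct_features(mouth_roi: np.ndarray, block: int = 6) -> np.ndarray:
    """Handcrafted feature: 2-D DCT of a grayscale mouth ROI, keeping
    only the low-frequency block x block corner of coefficients."""
    coeffs = dct(dct(mouth_roi.astype(float), axis=0, norm="ortho"),
                 axis=1, norm="ortho")
    return coeffs[:block, :block].ravel()  # e.g. a 36-D feature vector


class BottleneckEncoder(nn.Module):
    """Learned feature: a tiny convolutional encoder whose final linear
    layer acts as a bottleneck producing a compact feature vector."""

    def __init__(self, feat_dim: int = 36):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, feat_dim),                     # bottleneck
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


if __name__ == "__main__":
    roi = np.random.rand(64, 64)                        # stand-in mouth ROI
    print("handcrafted:", dct_features(roi).shape)      # (36,)
    encoder = BottleneckEncoder()
    x = torch.from_numpy(roi).float().view(1, 1, 64, 64)
    print("learned:", encoder(x).shape)                 # torch.Size([1, 36])
```

In the deep-learning approaches surveyed here, such an encoder would typically be trained, for example as part of an autoencoder or jointly with a recogniser, so that its bottleneck activations replace handcrafted DCT coefficients as the visual feature vector.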

Keywords

Discrete Cosine Transform, Speech Recognition, Automatic Speech Recognition, Convolutional Neural Network, Visual Speech

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Computer Science and Software Engineering, University of Western Australia, Perth, Australia
  2. School of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth, Australia
