A Data Driven Approach to Audiovisual Speech Mapping

  • Andrew Abel
  • Ricard Marxer
  • Jon Barker
  • Roger Watt
  • Bill Whitmer
  • Peter Derleth
  • Amir Hussain
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10023)


The use of visual information in audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics using only temporal visual information, without considering linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various MLP configurations and datasets are evaluated to identify the best-performing setup. The results show that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
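The pipeline outlined in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 16x16 mouth-region images, the 4x4 block of retained DCT coefficients, the 5-frame context window (taken to end at the current frame), the 23 filterbank channels, and the single tanh hidden layer with random, untrained weights. It shows only how 2D-DCT visual features from a window of frames could be stacked and passed through an MLP to give one audio feature vector per frame.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]          # frequency index
    i = np.arange(n)[None, :]          # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)         # DC row uses a different scale
    return m

def dct2_features(image, keep=4):
    """2D-DCT of an image, keeping the keep x keep low-frequency block."""
    rows, cols = image.shape
    coeffs = dct_matrix(rows) @ image @ dct_matrix(cols).T
    return coeffs[:keep, :keep].ravel()

def stack_prior_frames(features, context):
    """Concatenate the `context` most recent frames (ending at frame t)."""
    n_frames = features.shape[0]
    windows = [features[t - context + 1 : t + 1].ravel()
               for t in range(context - 1, n_frames)]
    return np.stack(windows)

def mlp_forward(x, w1, b1, w2, b2):
    """One-hidden-layer MLP with tanh hidden units and linear output."""
    hidden = np.tanh(x @ w1 + b1)
    return hidden @ w2 + b2

# Hypothetical data: 20 video frames of a 16x16 mouth region.
rng = np.random.default_rng(0)
frames = rng.random((20, 16, 16))
visual = np.stack([dct2_features(f, keep=4) for f in frames])   # (20, 16)
inputs = stack_prior_frames(visual, context=5)                  # (16, 80)

# Untrained random weights; 23 output channels is an assumed filterbank size.
n_hidden, n_audio = 32, 23
w1 = rng.normal(scale=0.1, size=(inputs.shape[1], n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_audio))
b2 = np.zeros(n_audio)
estimated_audio = mlp_forward(inputs, w1, b1, w2, b2)           # (16, 23)
```

In practice the weights would be fitted by regression against log filterbank features extracted from the time-aligned audio; that training step is omitted here.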


Audiovisual · Speech processing · Speech mapping · ANNs



This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1 (CogAVHearing). In accordance with EPSRC policy, all experimental data used in the project simulations is publicly available. The authors would also like to gratefully acknowledge Prof. Leslie Smith and Dr Ahsan Adeel at the University of Stirling, Dr Kristína Malinovská at Comenius University in Bratislava, and the anonymous reviewers for their helpful comments and suggestions.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Andrew Abel (1)
  • Ricard Marxer (2)
  • Jon Barker (2)
  • Roger Watt (1)
  • Bill Whitmer (3)
  • Peter Derleth (4)
  • Amir Hussain (1)
  1. University of Stirling, Stirling, Scotland
  2. University of Sheffield, Sheffield, UK
  3. MRC/CSO IHR - Scottish Section, GRI, Glasgow, Scotland
  4. Sonova AG, Staefa, Switzerland
