A Data-Driven Approach to Audiovisual Speech Mapping
The use of visual information in audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics from temporal visual information alone, without recourse to linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various multilayer perceptron (MLP) configurations and datasets are evaluated to identify the best-performing setup, showing that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
Keywords: Audiovisual · Speech processing · Speech mapping · ANNs
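The mapping the abstract describes can be illustrated with a minimal, self-contained sketch: extract low-frequency 2D-DCT coefficients from each (here synthetic) lip-region frame, stack a short window of prior visual frames, and train a small MLP to regress the corresponding log-filterbank audio frame. All specifics below (36 DCT coefficients, 23 filterbanks, a 3-frame context, the synthetic data and network sizes) are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix (n x n).
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m

def dct2_features(frame, keep=6):
    # 2D-DCT of a grayscale mouth-region frame; keep the low-frequency
    # keep x keep block as the visual feature vector.
    h, w = frame.shape
    coeffs = dct_matrix(h) @ frame @ dct_matrix(w).T
    return coeffs[:keep, :keep].ravel()

rng = np.random.default_rng(0)
n_frames, keep, n_filterbanks, context = 200, 6, 23, 3

# Synthetic stand-ins for real lip-region video frames.
video = rng.random((n_frames, 32, 32))
visual = np.array([dct2_features(f, keep) for f in video])  # (200, 36)

# Stack `context` consecutive visual frames as input for each audio frame.
X = np.hstack([visual[i:n_frames - context + i] for i in range(context)])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)  # standardise features

# Synthetic log-filterbank targets: a fixed linear map of the inputs plus noise
# (a real experiment would use filterbank frames from the parallel audio track).
true_map = rng.standard_normal((X.shape[1], n_filterbanks)) * 0.1
Y = X @ true_map + 0.01 * rng.standard_normal((X.shape[0], n_filterbanks))

# One-hidden-layer MLP trained by full-batch gradient descent on squared error.
hidden, lr = 32, 0.01
W1 = rng.standard_normal((X.shape[1], hidden)) * 0.05
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, n_filterbanks)) * 0.05
b2 = np.zeros(n_filterbanks)

losses = []
for epoch in range(500):
    H = np.tanh(X @ W1 + b1)
    pred = H @ W2 + b2
    err = pred - Y
    losses.append(np.mean(err ** 2))
    # Backpropagation through the tanh hidden layer.
    dW2 = H.T @ err / len(X)
    db2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    dW1 = X.T @ dH / len(X)
    db1 = dH.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(f"MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The estimated filterbank frames in `pred` would, in a full pipeline, be converted back to a spectral envelope for resynthesis or used directly for visually-derived filtering.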
This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1 (CogAVHearing, http://cogavhearing.cs.stir.ac.uk). In accordance with EPSRC policy, all experimental data used in the project simulations are available at http://hdl.handle.net/11667/81. The authors would also like to gratefully acknowledge Prof. Leslie Smith and Dr Ahsan Adeel at the University of Stirling, Dr Kristína Malinovská at Comenius University in Bratislava, and the anonymous reviewers for their helpful comments and suggestions.