Abstract
Acoustic speech recognition, a technique for decoding text from speech, has achieved great success in recent years. The trained model of Ping An Technology (ShenZhen) Co., Ltd achieves a word error rate (WER) of 8.4%, which is competitive with popular commercial products. However, this result assumes that the speech is recorded in a quiet environment. In a noisy environment, accuracy decreases by 10%–20%. To improve performance in such environments, we design a multi-modal biometric system that integrates acoustic speech recognition with sentence-level lip-reading. Across several noisy conditions, our integrated system achieves an average WER of 5.7%, a significant improvement over the purely acoustic speech-recognition system.
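For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; the abstract does not describe the authors' own scoring tool or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 5.7% corresponds to roughly 5.7 word-level errors per 100 reference words, averaged over the evaluated utterances.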
Acknowledgments
This work was primarily supported by PingAn Deep Learning Group.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wang, J., Wang, Y., Liu, A., Xiao, J. (2017). Assistance of Speech Recognition in Noisy Environment with Sentence Level Lip-Reading. In: Zhou, J., et al. Biometric Recognition. CCBR 2017. Lecture Notes in Computer Science, vol 10568. Springer, Cham. https://doi.org/10.1007/978-3-319-69923-3_64
DOI: https://doi.org/10.1007/978-3-319-69923-3_64
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69922-6
Online ISBN: 978-3-319-69923-3
eBook Packages: Computer Science (R0)