LipSpeaker: Helping Acquired Voice Disorders People Speak Again
Conference paper
Abstract
In this paper, we present a system called LipSpeaker that helps people with acquired voice disorders communicate in daily life. Users only need to face the camera on their smartphone and use their lips to mime the pronunciation of words. LipSpeaker recognizes the lip movements, converts them to text, and then generates audio to play.
Compared to text, a mel-spectrogram carries more emotional information. To generate smoother and more expressive audio, we therefore also predict mel-spectrograms directly, rather than text, by recognizing the user's lip movements and facial expressions together.
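The pipeline described above can be sketched in code. The following is a minimal, hypothetical illustration of the stages (the function names, crop box, and feature dimensions are illustrative assumptions, not taken from the paper): crop the lip region from each camera frame, encode the clip, and predict a mel-spectrogram that a vocoder such as WaveNet could then turn into audio.

```python
import numpy as np

MEL_BINS = 80  # a typical mel-spectrogram resolution in neural TTS

def crop_lips(frames, box=(40, 80, 30, 70)):
    """Crop a fixed lip region (top, bottom, left, right) from each frame.
    A real system would locate the lips with a landmark detector (e.g. dlib)."""
    t, b, l, r = box
    return frames[:, t:b, l:r]

def encode_clip(lip_frames, feat_dim=128):
    """Stand-in for a learned video encoder (e.g. 3D-CNN + LSTM):
    produces one feature vector per frame."""
    n = lip_frames.shape[0]
    flat = lip_frames.reshape(n, -1).astype(np.float32)
    rng = np.random.default_rng(0)  # fixed random 'weights' for the sketch
    w = rng.standard_normal((flat.shape[1], feat_dim)).astype(np.float32)
    return np.tanh(flat @ w / flat.shape[1])

def predict_mel(features):
    """Stand-in for the decoder that maps frame features to mel frames."""
    rng = np.random.default_rng(1)
    w = rng.standard_normal((features.shape[1], MEL_BINS)).astype(np.float32)
    return features @ w

# One second of grayscale video at 25 fps, 120x120 pixels.
video = np.random.default_rng(2).random((25, 120, 120))
mel = predict_mel(encode_clip(crop_lips(video)))
print(mel.shape)  # one mel frame per video frame
```

In a real system the encoder and decoder would be trained jointly on paired video and audio; the sketch only makes the data flow and tensor shapes concrete.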
Keywords
Accessibility · Disabled people · Lipreading
Copyright information
© Springer Nature Switzerland AG 2019