LipSpeaker: Helping Acquired Voice Disorders People Speak Again

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1088)

Abstract

In this paper, we present a system called LipSpeaker that helps people with acquired voice disorders communicate in daily life. Users simply face the camera on their smartphone and move their lips to imitate the pronunciation of words. LipSpeaker recognizes the lip movements, converts them to text, and then generates audio for playback.

Compared to text, a mel-spectrogram carries more emotional information. To generate smoother and more expressive audio, we therefore also predict mel-spectrograms directly, rather than text, by recognizing the user's lip movements and facial expressions together.
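For readers unfamiliar with the representation, a mel-spectrogram is a short-time Fourier transform whose frequency bins are pooled into perceptually spaced mel bands; this is the intermediate target the paper predicts before vocoding. The sketch below computes a log-mel-spectrogram from raw audio with NumPy only; all parameter values (16 kHz sample rate, 512-point FFT, 80 mel bands) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=80):
    # Windowed STFT magnitudes, frame by frame.
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(signal[start:start + n_fft] * window))
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    power = np.array(frames).T ** 2                     # (n_fft//2 + 1, T)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power     # pool into mel bands
    return np.log(mel + 1e-6)                           # log-mel, (n_mels, T)

# One second of a 440 Hz tone as a stand-in for speech audio.
sr = 16000
t = np.arange(sr) / sr
M = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(M.shape)  # (n_mels, number_of_frames)
```

In a lipreading-to-speech pipeline such as the one described here, a network would predict a matrix of this shape from video frames, and a neural vocoder (e.g. WaveNet) would then synthesize the waveform from it.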

Keywords

Accessibility · Disabled people · Lipreading


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Digital Nature Group, University of Tsukuba, Tsukuba, Japan
  2. Applied Analytics, Columbia University, New York, USA
