Generating synthetic dysarthric speech to overcome dysarthria acoustic data scarcity

Abstract

Dysarthria is a disorder that affects an individual’s speech intelligibility due to the paralysis of muscles and organs involved in the articulation process. As the condition is often associated with physically debilitating disabilities, performing daily tasks can become challenging. Not only do such individuals face communication problems, but interacting with digital devices can also become a burden. For these individuals, speech-to-text and text-to-speech technologies can make a significant difference, as computers and smartphones become a medium through which they can communicate. However, automatic speech recognition (ASR) technologies designed to understand normal speakers are incapable of perceiving dysarthric speech, and attempts to design dysarthric ASR systems have progressed slowly, mainly due to the scarcity of dysarthric speech data. As the performance of these systems relies heavily on dysarthric speech samples for training, generating synthetic dysarthric speech can significantly boost their performance. This paper reports on adapting normal speech generation systems to produce dysarthric speech using transfer learning, evaluated with both subjective and objective measures. The results reveal that the synthetically produced dysarthric speech improved the accuracy of our novel dysarthric ASR by up to 5.67% for severe dysarthria, which has traditionally been the most challenging type of dysarthric speech to recognize. By adopting this study’s findings, other researchers can capture a limited amount of speech from dysarthric individuals, produce an unlimited amount of synthetic dysarthric speech, and use the synthetic data to tackle the data scarcity problem in their own studies.
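To make the transfer-learning idea above concrete, the following is a minimal PyTorch sketch of the pattern: a text-to-spectrogram network is initialized from weights pre-trained on normal speech, its text-encoding layers are frozen, and only the remaining layers are fine-tuned on a small set of dysarthric recordings. Everything here is a placeholder: the toy model stands in for a full neural TTS architecture, and the checkpoint path and training batch are hypothetical; it illustrates the general technique rather than the paper's actual implementation.

```python
# Minimal transfer-learning sketch: adapt a normal-speech TTS model to
# dysarthric speech by freezing the text encoder and fine-tuning the rest.
import torch
import torch.nn as nn


class TinyText2Mel(nn.Module):
    """Toy text-to-mel-spectrogram network standing in for a full TTS model."""

    def __init__(self, vocab_size=64, emb_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # text front-end
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_mels)          # frame-wise mel predictor
                                                          # (attention/alignment omitted)

    def forward(self, text_ids):
        h, _ = self.encoder(self.embed(text_ids))
        return self.decoder(h)                            # (batch, time, n_mels)


model = TinyText2Mel()

# 1) In practice, start from a checkpoint pre-trained on a large normal-speech
#    corpus (hypothetical path):
# model.load_state_dict(torch.load("tts_pretrained_normal_speech.pt"))

# 2) Freeze the text-encoding layers so only the decoder adapts to the
#    limited dysarthric data.
for module in (model.embed, model.encoder):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
criterion = nn.L1Loss()

# 3) Fine-tune on (text, mel) pairs from dysarthric speakers; random tensors
#    stand in for a real, aligned batch here.
text_ids = torch.randint(0, 64, (8, 50))   # 8 utterances, 50 text tokens each
mel_target = torch.randn(8, 50, 80)        # corresponding mel frames

for step in range(10):                      # a few illustrative optimisation steps
    optimizer.zero_grad()
    loss = criterion(model(text_ids), mel_target)
    loss.backward()
    optimizer.step()

# The adapted model can now synthesise dysarthric-style spectrograms for new
# text; after vocoding, the waveforms augment the dysarthric ASR training set.
```

Freezing the pre-trained layers is one common way to avoid overfitting when only a small amount of dysarthric speech is available; other adaptation strategies, such as fine-tuning the whole network with a low learning rate, follow the same pattern.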

Data availability

All datasets mentioned in Sect. 3.2 are available via the citations provided in the text. Likewise, Speech Vision is open source and available via https://github.com/rshahamiri/SpeechVision.

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Seyed Reza Shahamiri.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hu, A., Phadnis, D. & Shahamiri, S.R. Generating synthetic dysarthric speech to overcome dysarthria acoustic data scarcity. J Ambient Intell Human Comput (2021). https://doi.org/10.1007/s12652-021-03542-w

Keywords

  • Dysarthria
  • Data scarcity
  • Synthetic speech
  • Text-to-speech
  • Automatic speech recognition
  • Deep learning