
Comprehensive literature review on children automatic speech recognition system, acoustic linguistic mismatch approaches and challenges

Published in: Multimedia Tools and Applications

Abstract

Automatic Speech Recognition (ASR) systems for children are as important as those for adults, since children increasingly depend on such systems in applications like computer games, reading tutors, and foreign-language learning tools. Consequently, this article presents a comprehensive review of several important aspects of children's speech recognition systems. The acoustic and linguistic challenges of children's speech are examined thoroughly, starting from the basic anatomy of children's articulation organs. The development of children's ASR faces a variety of challenges: collecting children's speech data is a very complex task; the available child corpora are largely not publicly accessible; child speakers differ greatly due to linguistic and acoustic variation; and an ASR system developed for one age group is not suitable for another. All of these challenges are systematically described in this article. Various data augmentation methods are also explored, along with different approaches to developing ASR for children's speech. It has been observed that the lack of publicly accessible child corpora is a significant barrier to children's ASR. Beyond these challenges, the article thoroughly reviews children's ASR for the Punjabi language, which is ranked the tenth most spoken language globally yet is still considered a low-resource language. Further, various approaches for developing children's ASR, such as traditional, hybrid, and end-to-end (E2E) networks, are also reported. In addition, an analytical summary and discussion are included.
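As an illustration of the acoustic-mismatch and data augmentation themes summarised above, the short Python sketch below shows one widely used out-of-domain augmentation idea: perturbing adult recordings toward child-like pitch and tempo to enlarge the training pool for a children's acoustic model. It is a minimal sketch only, assuming the librosa and soundfile packages; the file names and parameter values are illustrative and are not taken from the article.

import librosa
import soundfile as sf

def augment_toward_child(in_wav, out_wav, semitone_shift=4.0, tempo_rate=0.9, sr=16000):
    """Raise the pitch and slow the tempo of an adult utterance to mimic child speech."""
    y, sr = librosa.load(in_wav, sr=sr)                                 # load and resample to 16 kHz
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitone_shift)   # raise F0 by ~4 semitones
    y = librosa.effects.time_stretch(y, rate=tempo_rate)                # rate < 1 slows the speech
    sf.write(out_wav, y, sr)                                            # write the augmented copy

if __name__ == "__main__":
    # Hypothetical adult utterance turned into an extra "child-like" training example.
    augment_toward_child("adult_utt_0001.wav", "adult_utt_0001_childlike.wav")

In practice such perturbed copies are simply added to the adult training data before acoustic-model training, with the amount of pitch and tempo shift tuned on a small child development set.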


Data availability

Data sharing is not applicable to this article: as a review article, it did not generate or use any datasets for experiments.


Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information


Corresponding authors

Correspondence to Kalpna Guleria or Virender Kadyan.

Ethics declarations

Informed consent

All authors have read and agreed to this version of the manuscript.

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sobti, R., Guleria, K. & Kadyan, V. Comprehensive literature review on children automatic speech recognition system, acoustic linguistic mismatch approaches and challenges. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18753-4


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11042-024-18753-4
