Abstract
Automatic Speech Recognition (ASR) for children is as important as ASR for adults, since children increasingly depend on speech-driven applications such as computer games, reading tutors, and foreign-language learning tools. This article therefore presents a comprehensive review of several important aspects of children's speech recognition systems. The acoustic and linguistic challenges of children's speech are examined in detail, beginning with the basic anatomy of children's articulatory organs. Developing children's ASR faces a variety of challenges: collecting children's speech data is a complex task, the available child corpora are rarely publicly accessible, child speakers differ greatly in their linguistic and acoustic characteristics, and an ASR system developed for one age group is often unsuitable for another. All these challenges are systematically described in this article. Various data augmentation methods are also explored, along with different approaches to developing ASR for children's speech. It has been observed that the lack of publicly accessible child corpora is a significant barrier to children's ASR. Beyond the challenges mentioned above, an attempt has been made to thoroughly review children's ASR for the Punjabi language, as Punjabi is ranked the 10th most spoken language globally yet is still considered a low-resource language. Further, various approaches to developing children's ASR, such as traditional, hybrid, and end-to-end (E2E) networks, are reported. In addition, an analytical summary and discussion are included.
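To make the data-augmentation idea concrete, the sketch below illustrates speed perturbation, one commonly used augmentation for adapting adult-trained models to child speech. This is a minimal pure-Python illustration, not an implementation from any system reviewed here; the function name and parameters are our own.

```python
import math

def speed_perturb(signal, factor):
    """Resample a waveform by linear interpolation to change its speed
    (and, as a side effect, its pitch) by `factor`.
    factor > 1.0 speeds the signal up (shorter output);
    factor < 1.0 slows it down (longer output)."""
    n_out = int(round(len(signal) / factor))
    out = []
    for i in range(n_out):
        # Fractional position in the original signal for output sample i.
        pos = i * (len(signal) - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        # Linear interpolation between neighbouring samples.
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

# Example: a 1-second 440 Hz tone at 16 kHz, slowed by 10%.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
slowed = speed_perturb(tone, 0.9)  # longer, lower-pitched variant
```

In practice such perturbed copies of adult or child utterances are added to the training set to increase acoustic variability; toolkit implementations (e.g., resampling-based speed perturbation) operate on the same principle but with higher-quality resampling filters.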
Data availability
Data sharing is not applicable to this article, as no datasets were generated or analysed; this is a review article.
References
Katore M, Bachute MR (2015) Speech based human machine interaction system for home automation. In: 2015 IEEE Bombay Section Symposium (IBSS). pp 1–6. https://doi.org/10.1109/IBSS.2015.7456634
Levis J, Suvorov R (2012) Automatic speech recognition. The encyclopedia of applied linguistics. https://doi.org/10.1002/9781405198431.wbeal0066
Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice-Hall Inc., USA
Kaur AP, Singh A, Sachdeva R, Kukreja V (2023) Automatic speech recognition systems: A survey of discriminative techniques. Multimed Tools Appl 82:13307–13339. https://doi.org/10.1007/s11042-022-13645-x
Ghai S (2011) Addressing pitch mismatch for children’s automatic speech recognition. Dissertation, IIT Guwahati, India
Shahnawazuddin S (2016) Improving children’s mismatched ASR through adaptive pitch compensation. Dissertation, IIT Guwahati, India
Sunil Y, Prasanna SRM, Sinha R (2016) Children’s speech recognition under mismatched condition: a review. IETE J Educ 57:96–108. https://doi.org/10.1080/09747338.2016.1201014
Pons-Salvador G, Zubieta-Méndez X, Frias-Navarro D (2018) Internet Use by Children Aged six to nine: Parents’ Beliefs and Knowledge about Risk Prevention. Child Indic Res 11:1983–2000. https://doi.org/10.1007/s12187-018-9529-4
Forsberg M (2003) Why is speech recognition difficult?. Chalmers University of Technology. https://api.semanticscholar.org/CorpusID:62660
Benzeghiba M, De Mori R, Deroo O et al (2007) Automatic speech recognition and speech variability: A review. Speech Commun 49:763–786. https://doi.org/10.1016/j.specom.2007.02.006
Reynolds DA (2002) An overview of automatic speaker recognition technology. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. pp IV–4072–IV–4075. https://doi.org/10.1109/ICASSP.2002.5745552
Kajarekar SS (2002) Analysis of variability in speech with applications to speech and speaker recognition. Ph. D. Dissertation, Oregon Health & Science University. https://doi.org/10.6083/M4ZP44DZ
Malik M, Malik MK, Mehmood K, Makhdoom I (2021) Automatic speech recognition: a survey. Multimed Tools Appl 80:9411–9457. https://doi.org/10.1007/s11042-020-10073-7
Russell M, D’Arcy S (2007) Challenges for computer recognition of children’s speech. Proc. Speech and Language Technology in Education (SLaTE 2007). Farmington, PA, USA, pp 108–111. https://doi.org/10.21437/SLaTE.2007-26
Russell M, Brown C, Skilling A, et al (1996) Applications of automatic speech recognition to speech and language development in young children. In: Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96. 1;176–179. https://doi.org/10.1109/ICSLP.1996.607069
Hagen A, Pellom B, Cole R (2007) Highly accurate children’s speech recognition for interactive reading tutors using subword units. Speech Commun 49:861–873. https://doi.org/10.1016/j.specom.2007.05.004
Alharbi S, Hasan M, Simons AJH, et al (2018) A lightly supervised approach to detect stuttering in children’s speech. In: Proceedings of Interspeech 2018. ISCA, pp 3433–3437. https://doi.org/10.21437/Interspeech.2018-2155
Mostow J Is ASR accurate enough for automated reading tutors, and how can we tell? http://www.cs.cmu.edu/~listen/pdfs/icslp2006-ASR-metrics.pdf. Accessed 1 May 2023
Li X, Ju Y-C, Deng L, Acero A (2007) Efficient and Robust Language Modeling in an Automatic Children’s Reading Tutor System. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07. pp IV–193–IV–196. https://doi.org/10.1109/ICASSP.2007.367196
Website. https://d3.harvard.edu/platform-digit/submission/hello-barbie-ai-making-childrens-dreams-come-true/. Accessed 27 Dec 2023
Husni H, Jamaludin Z (2009) ASR Technology for Children with Dyslexia: Enabling Immediate Intervention to Support Reading in Bahasa Melayu. Online Submission 6:64–70
Lee K, Hagen A, Romanyshyn N, Martin S, Pellom B (2004) Analysis and Detection of Reading Miscues for Interactive Literacy Tutors. In: Proceedings of the 20th International Conference on Computational Linguistics .pp. 1254–1260. https://doi.org/10.3115/1220355.1220537
Claus F, Rosales HG, Petrick R, Hain HU, Hoffmann R (2013) A survey about databases of children’s speech. Interspeech 2013:2410–2414. https://doi.org/10.21437/Interspeech.2013-561
Kraleva R (2016) Design and development a children’s speech database. arXiv:1605.07735. In: Fourth International Scientific Conference "Mathematics and Natural Sciences" 2011, Bulgaria, Vol. (2), pp. 41–48. https://doi.org/10.48550/arXiv.1605.07735
Ahmed B, Ballard K, Burnham D et al (2021) AusKidTalk: an auditory-visual corpus of 3-to 12-year-old Australian children’s speech. Interspeech 2021:3680–3684. https://doi.org/10.21437/Interspeech.2021-2000
Chen NF, Tong R, Wee D et al (2016) SingaKids-mandarin: Speech corpus of Singaporean children speaking mandarin Chinese. Interspeech 2016:1545–1549. https://doi.org/10.21437/Interspeech.2016-139
Sobti R, Kadyan V, Guleria K (2022) Challenges for Designing of Children Speech Corpora: A State-of-the-Art Review. ECS Trans 107:9053–9064. https://doi.org/10.1149/10701.9053ecst
Bawa P, Kadyan V (2021) Noise robust in-domain children speech enhancement for automatic Punjabi recognition system under mismatched conditions. Appl Acoust 175:107810. https://doi.org/10.1016/j.apacoust.2020.107810
Hasija T, Kadyan V, Guleria K et al (2022) Prosodic Feature-Based Discriminatively Trained Low Resource Speech Recognition System. Sustainability 14:614. https://doi.org/10.3390/su14020614
Leonard R (1984) A database for speaker-independent digit recognition. In: ICASSP ’84. IEEE International Conference on Acoustics, Speech, and Signal Processing. pp 328–331. https://doi.org/10.1109/ICASSP.1984.1172716
Potamianos A, Narayanan S (1998) Spoken dialog systems for children. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181). 1;197–200. https://doi.org/10.1109/ICASSP.1998.674401
Lee S, Potamianos A, Narayanan S (1999) Acoustics of children’s speech: developmental changes of temporal and spectral parameters. J Acoust Soc Am 105:1455–1468. https://doi.org/10.1121/1.426686
Shobaki K, Hosom J-P, Cole RA (2000) The ogi kids’ speech corpus and recognizers.In: Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), 4; 258–261. https://doi.org/10.21437/ICSLP.2000-800
Kazemzadeh A, You H, Iseli M et al (2005) TBALL data collection: the making of a young children’s speech corpus. Interspeech 2005:1581–1584. https://doi.org/10.21437/Interspeech.2005-462
Demuth K, Culbertson J, Alter J (2006) Word-minimality, epenthesis and coda licensing in the early acquisition of English. Lang Speech 49:137–174. https://doi.org/10.1177/00238309060490020201
Batliner A, Blomberg M, D’Arcy S et al (2005) The PF STAR children’s speech corpus. Interspeech 2005:2761–2764. https://doi.org/10.21437/Interspeech.2005-705
Russell M (2006) The PF-STAR British English children’s speech corpus. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=bc6aaefd9163b0b3a72420483411c37ea427c6db. Accessed 27 Dec 2023
Hacker C (2009) Automatic Assessment of Children Speech to Support Language Learning. Logos Verlag, Berlin GmbH
My Science Tutor (MyST) Corpus. http://boulderlearning.com/resources/request-the-myst-corpus/. Accessed 21 Dec 2023
Csatári F, Bakcsi Z, Vicsi K (1999) A Hungarian child database for speech processing applications. In: Sixth European Conference on Speech Communication and Technology, EUROSPEECH 1999. https://www.researchgate.net/publication/221491936_A_hungarian_child_database_for_speech_processing_applications Accessed 27 Dec 2023
Iskra D, Grosskopf B, Marasek K, et al SPEECON -speech databases for Consumer Devices: Database specification and validation. https://repository.ubn.ru.nl/bitstream/handle/2066/76443/76443.pdf. Accessed 1 May 2023
Cincarek T, Shindo I, Toda T et al (2007) Development of preschool children subsystem for ASR and Q&A in a real-environment speech-oriented guidance task. Proc Interspeech 2007:1469–1472. https://doi.org/10.21437/Interspeech.2007-426
Cleuren L, Duchateau J, Ghesquière P, Van hamme H (2008) Children’s oral reading corpus (CHOREC): description and assessment of annotator agreement. In: Proceedings of the Sixth International conference on language resources and evaluation - LREC 2008, Marrakech, Morocco. European Language Resources Association (ELRA), pp 998–1005
Ramteke PB, Supanekar S, Hegde P et al (2019) NITK Kids’ Speech Corpus. Interspeech 2019:331–335. https://doi.org/10.21437/Interspeech.2019-2061
Huber JE, Stathopoulos ET, Curione GM et al (1999) Formants of children, women, and men: the effects of vocal intensity variation. J Acoust Soc Am 106:1532–1542. https://doi.org/10.1121/1.427150
Lee S, Potamianos A, Narayanan S (1997) Analysis of children’s speech: Duration, pitch and formants. In: Fifth European Conference on Speech Communication and Technology (Eurospeech 1997), pp 473–476. https://doi.org/10.21437/Eurospeech.1997-161
Gerosa M, Giuliani D, Brugnara F (2007) Acoustic variability and automatic recognition of children’s speech. Speech Commun 49:847–860. https://doi.org/10.1016/j.specom.2007.01.002
Bickley CA (1989) Acoustic evidence for the development of speech. Technical Report no. 548, Research Laboratory of Electronics, Massachusetts Institute of Technology, USA. http://hdl.handle.net/1721.1/4204
Stemmer G, Hacker C, Steidl S, Nöth E (2003) Acoustic normalization of children’s speech. In: Eighth European Conference on Speech Communication and Technology (Eurospeech 2003), pp 1313–1316. https://doi.org/10.21437/Eurospeech.2003-415
Wilpon JG, Jacobsen CN (1996) A study of speech recognition for children and the elderly. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. 1;349–352. https://doi.org/10.1109/ICASSP.1996.541104
Gerosa M, Giuliani D, Brugnara F (2009) Towards age-independent acoustic modeling. Speech Commun 51:499–509. https://doi.org/10.1016/j.specom.2009.01.006
Farantouri V, Potamianos A, Narayanan S (2008) Linguistic analysis of spontaneous children speech. Proc. First Workshop on Child, Computer and Interaction (WOCCI 2008), paper 04. https://www.isca-archive.org/wocci_2008/farantouri08_wocci.html
Narayanan S, Potamianos A (2002) Creating conversational interfaces for children. IEEE Trans Audio Speech Lang Process 10:65–78. https://doi.org/10.1109/89.985544
Potamianos A, Narayanan S (2007) A review of the acoustic and linguistic properties of children’s speech. In: 2007 IEEE 9th Workshop on Multimedia Signal Processing. pp 22–25. https://doi.org/10.1109/89.985544
Kent RD (1976) Anatomical and neuromuscular maturation of the speech mechanism: evidence from acoustic studies. J Speech Hear Res 19:421–447. https://doi.org/10.1044/jshr.1903.421
Potamianos A, Narayanan S (2003) Robust recognition of children’s speech. IEEE Trans Audio Speech Lang Process 11:603–616. https://doi.org/10.1109/TSA.2003.818026
Li Q, Russell M An analysis of the causes of increased error rates in children’s speech recognition. https://www.isca-speech.org/archive_v0/archive_papers/icslp_2002/i02_2337.pdf. Accessed 2 May 2023. https://doi.org/10.21437/ICSLP.2002-221
D’Arcy SM, Wong LP, Russell MJ Recognition of read and spontaneous children’s speech using two new corpora. https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2004/i04_1473.pdf. Accessed 2 May 2023
Kent RD, Forner LL (1980) Speech segment durations in sentence recitations by children and adults. J Phon 8:157–168. https://doi.org/10.1016/S0095-4470(19)31460-3
Scharenborg O (2007) Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Commun 49:336–347. https://doi.org/10.1016/j.specom.2007.01.009
Klatt DH, Klatt LC (1990) Analysis, synthesis, and perception of voice quality variations among female and male talkers. J Acoust Soc Am 87:820–857. https://doi.org/10.1121/1.398894
Fant G, Liljencrants J, Lin Q-G, Others (1985) A four-parameter model of glottal flow. STL-QPSR 4:1–13
Iseli M, Shue Y-L, Alwan A (2006) Age-and Gender-Dependent Analysis of Voice Source Characteristics. In: 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. pp I–I. https://doi.org/10.1109/ICASSP.2006.1660039
Weinrich B, Salz B, Hughes M (2005) Aerodynamic measurements: normative data for children ages 6:0 to 10:11 years. J Voice 19:326–339. https://doi.org/10.1016/j.jvoice.2004.07.009
Childers DG (1995) Glottal source modeling for voice conversion. Speech Commun 16:127–138. https://doi.org/10.1016/0167-6393(94)00050-K
Gobl C (1989) A preliminary study of acoustic voice quality correlates. STL-QPSR 4:9–21
Karlsson I (1988) Glottal waveform parameters for different speaker types. STL-QPSR 29:61–67
Potamianos A, Narayanan S, Lee S (1997) Automatic speech recognition for children. In: Fifth European Conference on Speech Communication and Technology. researchgate.net. https://doi.org/10.21437/Eurospeech.1997-623
Burnett DC, Fanty M (1996) Rapid unsupervised adaptation to children’s speech on a connected-digit task. In: Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96. 2;1145–1148. https://doi.org/10.1109/ICSLP.1996.607809
Das S, Nix D, Picheny M (1998) Improvements in children’s speech recognition performance. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181). 1;433–436. https://doi.org/10.1109/ICASSP.1998.674460
D’Arcy S, Russell M (2005) A comparison of human and computer recognition accuracy for children’s speech. In: Interspeech 2005. ISCA, ISCA. https://doi.org/10.21437/Interspeech.2005-697
Lee J, Baek S, Kang H-G (2011) Signal and feature domain enhancement approaches for robust speech recognition. In: 2011 8th International Conference on Information, Communications & Signal Processing. pp 1–4. https://doi.org/10.1109/ICICS.2011.6173538
Giuliani D, Gerosa M (2003) Investigating recognition of children’s speech. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03). pp II–137. https://doi.org/10.1109/ICASSP.2003.1202313
Elenius D, Blomberg M (2005) Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year old Children. In: Interspeech. pp 2749–2752. https://doi.org/10.21437/Interspeech.2005-702
Cui X, Alwan A (2006) Adaptation of children’s speech with limited data based on formant-like peak alignment. Comput Speech Lang 20:400–419. https://doi.org/10.1016/j.csl.2005.05.004
Hönig F, Stemmer G, Hacker C, Brugnara F (2005) Revising perceptual linear prediction (PLP). In: Interspeech 2005. ISCA. pp 2997–3000. https://doi.org/10.21437/Interspeech.2005-138
Hagen A, Pellom B, Van Vuuren S, Cole R (2004) Advances in children’s speech recognition within an interactive literacy tutor. In: Proceedings of HLT-NAACL 2004: Short Papers on XX - HLT-NAACL ’04. Association for Computational Linguistics, Morristown, NJ, USA. pp 25–28. https://doi.org/10.3115/1613984.1613991
Yeung G, Fan R, Alwan A (2021) Fundamental Frequency Feature Normalization and Data Augmentation for Child Speech Recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp 6993–6997. https://doi.org/10.48550/arXiv.2102.09106
Kathania HK, Kadiri SR, Alku P, Kurimo M (2022) A formant modification method for improved ASR of children’s speech. Speech Commun 136:98–106. https://doi.org/10.1016/j.specom.2021.11.003
Kathania HK, Shahnawazuddin S, Adiga N, Ahmad W (2018) Role of Prosodic Features on Children’s Speech Recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 5519–5523. https://doi.org/10.1109/ICASSP.2018.8461668
Shahnawazuddin S, Kumar A, Kumar V et al (2022) Robust children’s speech recognition in zero resource condition. Appl Acoust 185:108382. https://doi.org/10.1016/j.apacoust.2021.108382
Tai C-L, Lee H-S, Tsao Y, Wang H-M (2022) Filter-based Discriminative Autoencoders for Children Speech Recognition. arXiv [cs.CL]. https://doi.org/10.48550/arXiv.2204.00164
Shahnawazuddin S, Dey A, Sinha R (2016) Pitch-Adaptive Front-End Features for Robust Children’s ASR. In:Interspeech. pp 3459–3463. https://doi.org/10.21437/Interspeech.2016-1020
Claes T, Dologlou I, ten Bosch L, van Compernolle D (1998) A novel feature transformation for vocal tract length normalization in automatic speech recognition. IEEE Trans Audio Speech Lang Process 6:549–557. https://doi.org/10.1109/89.725321
Gerosa M, Giuliani D (2004) Preliminary investigations in automatic recognition of English sentences uttered by Italian children. In: InSTIL/ICALL Symposium 2004
Shahnawazuddin S, Sinha R, Pradhan G (2017) Pitch-Normalized Acoustic Features for Robust Children’s Speech Recognition. IEEE Signal Process Lett 24:1128–1132. https://doi.org/10.1109/LSP.2017.2705085
Yeung G, Alwan A (2019) frequency normalization technique for kindergarten speech recognition inspired by the role of f0 in vowel perception. In: Interspeech 2019. pp 6–10. https://doi.org/10.21437/Interspeech.2019-1847
Legoh K, Bhattacharjee U, Tuithung T (2015) Features and model adaptation techniques for robust speech recognition: A review. Commun Appl Electron 1:18–31. https://doi.org/10.5120/cae-1507
Cabral JP, Oliveira LC (2005) Pitch-synchronous time-scaling for prosodic and voice quality transformations. In: Interspeech 2005, pp 1137–1140, ISCA, ISCA. https://doi.org/10.21437/Interspeech.2005-209
D’Arcy S, Russell M (2005) A comparison of human and computer recognition accuracy for children’s speech. In: Interspeech. pp 2197–2200. https://doi.org/10.21437/Interspeech.2005-697
Gustafson J, Sjölander K (2002) Voice Transformations For Improving Children’s Speech Recognition In A Publicly Available Dialogue System. In: 7th International Conference on Spoken Language Processing (ICSLP2002 - INTERSPEECH 2002), Denver, Colorado, USA, September 16–20, 2002. International Speech Communication Association, pp 297–300. https://doi.org/10.21437/ICSLP.2002-139
Umesh S, Sinha R, Kumar SVB (2004) An investigation into front-end signal processing for speaker normalization. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. pp I–345. https://doi.org/10.1109/ICASSP.2004.1325993
Bawa P, Kadyan V, Kumar V, Raghuwanshi G (2021) Spectral-warping based noise-robust enhanced children ASR system. Res Square. https://doi.org/10.21203/rs.3.rs-976955/v1
Hayashi G, Katagiri S, Lu X, Ohsaki M (2022) An Investigation of Feature Difference Between Child and Adult Voices Using Line Spectral Pairs. In: Proceedings of the 2022 5th International Conference on Signal Processing and Machine Learning. Association for Computing Machinery, New York, NY, USA, pp 94–100. https://doi.org/10.1145/3556384.3556399
Yadav IC, Kumar A, Shahnawazuddin S, Pradhan G (2018) Non-uniform spectral smoothing for robust children’s speech recognition. In: Interspeech 2018. ISCA, ISCA. https://doi.org/10.21437/Interspeech.2018-1828
Bell P, Fainberg J, Klejch O et al (2021) Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview. IEEE Open J Signal Process 2:33–66. https://doi.org/10.48550/arXiv.2008.06580
Shahnawazuddin S, Sinha R (2018) A fast adaptation approach for enhanced automatic recognition of children’s speech with mismatched acoustic models. Circ Syst Signal Process 37:1098–1115. https://doi.org/10.1007/s00034-017-0586-6
Giuliani D, Gerosa M, Brugnara F (2006) Improved automatic speech recognition through speaker normalization. Comput Speech Lang 20:107–123. https://doi.org/10.1016/j.csl.2005.05.002
Hagen A, Pellom B, Cole R (2003) Children’s speech recognition with application to interactive books and tutors. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721). IEEE, pp 186–191. https://doi.org/10.1109/ASRU.2003.1318426
Cosi P, Pellom B L (2005) Italian children’s speech recognition for advanced interactive literacy tutors. Interspeech 2005, pp 2201–2204. https://doi.org/10.21437/Interspeech.2005-698
Gerosa M, Giuliani D, Narayanan S, Potamianos A(2009) A review of ASR technologies for children’s speech. In: WOCCI ’09: pp 1–8. https://doi.org/10.1145/1640377.1640384
Shahnawazuddin S, Sinha R (2015) Low-memory fast on-line adaptation for acoustically mismatched children’s speech recognition. In: Interspeech 2015. ISCA, pp 1630–1634. https://doi.org/10.21437/Interspeech.2015-377
Jain R, Barcovschi A, Yiwere M, et al (2023) Adaptation of Whisper models to child speech recognition. arXiv:2307.13008. https://doi.org/10.48550/arXiv.2307.13008
Thienpondt J, Demuynck K (2022) Transfer Learning for Robust Low-Resource Children’s Speech ASR with Transformers and Source-Filter Warping. arXiv:2206.09396. https://doi.org/10.48550/arXiv.2206.09396
Gurunath Shivakumar P, Narayanan S (2022) End-to-end neural systems for automatic children speech recognition: An empirical study. Comput Speech Lang 72:101289. https://doi.org/10.1016/j.csl.2021.101289
Pavankumar Dubagunta S, Kabil SH, Magimai-Doss M (2019) Improving Children Speech Recognition through Feature Learning from Raw Speech Signal. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 5736–5740. https://doi.org/10.1109/ICASSP.2019.8682826
Gerosa M, Giuliani D, Brugnara F (2005) Speaker adaptive acoustic modeling with mixture of adult and children’s speech. In: Ninth European Conference on Speech Communication and Technology, Interspeech 2005, pp 2193–2196. https://doi.org/10.21437/Interspeech.2005-696
Kathania HK, Shahnawazuddin S, Ahmad W, et al (2018) Improving Children’s Speech Recognition Through Time Scale Modification Based Speaking Rate Adaptation. In: 2018 International Conference on Signal Processing and Communications (SPCOM). IEEE, pp 257–261. https://doi.org/10.1109/SPCOM.2018.8724465
Shivakumar PG, Potamianos A, Lee S, Narayanan S Improving speech recognition for children using acoustic adaptation and pronunciation modeling. https://apps.dtic.mil/sti/pdfs/AD1171103.pdf. Accessed 3 May 2023.
Shahnawazuddin S, Kathania HK, Singh C et al (2018) Exploring the Role of Speaking-Rate Adaptation on Children’s Speech Recognition. In: 2018 International Conference on Signal Processing and Communications (SPCOM). IEEE, pp 21–25. https://doi.org/10.1109/SPCOM.2018.8724478
Shahnawazuddin S, Kathania HK, Dey A, Sinha R (2018) Improving children’s mismatched ASR using structured low-rank feature projection. Speech Commun 105:103–113. https://doi.org/10.1016/j.specom.2018.11.001
Kim C, Gowda D, Lee D et al (2020) A Review of On-Device Fully Neural End-to-End Automatic Speech Recognition Algorithms. In: 2020 54th Asilomar Conference on Signals, Systems, and Computers. IEEE, pp 277–283. doi:https://doi.org/10.48550/arXiv.2012.07974
Li J (2022) Recent Advances in End-to-End Automatic Speech Recognition. APSIPA Transactions on Signal and Information Processing 11. https://doi.org/10.1561/116.00000050
Chiu C-C, Han W, Zhang Y et al (2019) A Comparison of End-to-End Models for Long-Form Speech Recognition. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, pp 889–896. https://doi.org/10.1109/ASRU46091.2019.9003854
Wang D, Wang X, Lv S (2019) An Overview of End-to-End Automatic Speech Recognition. Symmetry 11:1018. https://doi.org/10.3390/sym11081018
Hinton G, Deng L, Yu D et al (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process Mag 29:82–97. https://doi.org/10.1109/MSP.2012.2205597
Prabhavalkar R, Hori T, Sainath TN, et al (2023) End-to-end speech recognition: A survey. arXiv:2303.03329 [eess.AS]. https://doi.org/10.48550/arXiv.2303.03329
Wei C, Wang Y-C, Wang B, Kuo C-CJ (2023) An overview on language models: Recent developments and outlook. arXiv: 2303.05759 [cs.CL]. https://doi.org/10.48550/arXiv.2303.05759
Jelinek F, Bahl L, Mercer R (1975) Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Trans Inf Theory 21:250–256. https://doi.org/10.1109/TIT.1975.1055384
Och FJ, Ueffing N, Ney H (2001) An Efficient A* Search Algorithm for Statistical Machine Translation. In: Proceedings of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation. https://doi.org/10.3115/1118037.1118045
Federico M (1996) Bayesian Estimation Methods for N-gram Language Model Adaptation. In: Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96. doi:https://doi.org/10.1109/ICSLP.1996.607087
Berger AL, Della Pietra SA, Della Pietra VJ (1996) A Maximum Entropy Approach to Natural Language Processing. Comput Linguist 22:39–71. https://aclanthology.org/J96-1002 Accessed 27 Dec 2023
Mikolov T, Karafiat M, Burget L et al (2010) Recurrent neural network based language model. In: Interspeech 2010, 11th Annual Conference of the International Speech Communication Association, pp 1045–1048. https://doi.org/10.21437/Interspeech.2010-343
Niesler TR, Woodland PC (1996) A variable-length category-based n-gram language model. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. IEEE, 1;164–167. https://doi.org/10.1109/ICASSP.1996.540316
Hochreiter S (1998) The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. Int J Uncertainty Fuzziness Knowledge Based Syst 06:107–116. https://doi.org/10.1142/S0218488598000094
Gulcehre C, Firat O, Xu K, et al (2015) On using monolingual corpora in neural machine translation. arXiv: 1503.03535 [cs.CL]. https://doi.org/10.48550/arXiv.1503.03535
Sriram A, Jun H, Satheesh S, Coates A (2017) Cold Fusion: Training Seq2Seq models together with language models. arXiv: 1708.06426 [cs.CL]. https://doi.org/10.48550/arXiv.1708.06426
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv: 1810.04805 [cs.CL]. https://doi.org/10.48550/arXiv.1810.04805
Radford A, Wu J, Child R, et al Language Models are Unsupervised Multitask Learners. https://api.semanticscholar.org/CorpusID:160025533 Accessed 27 Dec 2023
Salazar J, Liang D, Nguyen TQ, Kirchhoff K (2019) Masked language model scoring. arXiv: 1910.14659 [cs.CL]. https://doi.org/10.48550/arXiv.1910.14659
Kim S, Dalmia S, Metze F (2019) Gated embeddings in end-to-end speech recognition for conversational-context fusion. arXiv: 1906.11604 [cs.CL]. https://doi.org/10.48550/arXiv.1906.11604
Eskenazi M, Pelton G (2002) Pinpointing pronunciation errors in children’s speech: examining the role of the speech recognizer. Proc. ITRW on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology (PMLA 2002), 48–52. https://www.isca-archive.org/pmla_2002/eskenazi02_pmla.html
Ko T, Peddinti V, Povey D, Khudanpur S (2015) Audio augmentation for speech recognition. In: Sixteenth annual conference of the international speech communication association. In: Interspeech 2015, pp 3586–3589. https://doi.org/10.21437/Interspeech.2015-711
Chen G, Na X, Wang Y, et al (2020) Data augmentation for children’s speech recognition -- the “Ethiopian” system for the SLT 2021 Children Speech Recognition Challenge. arXiv: 2011.04547 [cs.SD]. https://doi.org/10.48550/arXiv.2011.04547
Gales MJF, Kim DY, Woodland PC et al (2006) Progress in the CU-HTK broadcast news transcription system. IEEE Trans Audio Speech Lang Process 14:1513–1525. https://doi.org/10.1109/TASL.2006.878264
Lamel L, Gauvain J-L (2002) Automatic processing of broadcast audio in multiple languages. In: 2002 11th European Signal Processing Conference. pp 1–4. https://ieeexplore.ieee.org/document/7072229 Accessed 27 Dec 2023
Qian Y, Yu K, Liu J (2013) Combination of data borrowing strategies for low-resource LVCSR. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. pp 404–409. https://doi.org/10.1109/ASRU.2013.6707764
Jaitly N, Hinton GE (2013) Vocal Tract Length Perturbation (VTLP) improves speech recognition. https://api.semanticscholar.org/CorpusID:14140670 Accessed 27 Dec 2023
Park DS, Chan W, Zhang Y, et al (2019) SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. arXiv: 1904.08779 [eess.AS]. https://doi.org/10.48550/arXiv.1904.08779
Geng M, Xie X, Liu S, et al (2022) Investigation of Data Augmentation Techniques for Disordered Speech Recognition. arXiv: 2201.05562 [cs.SD]. https://doi.org/10.48550/arXiv.2201.05562
Fainberg J, Bell P, Lincoln M, Renals S (2016) Improving Children’s Speech Recognition Through Out-of-Domain Data Augmentation. In: Interspeech. pp 1598–1602. https://doi.org/10.21437/Interspeech.2016-1348
Serizel R, Giuliani D (2014) Deep neural network adaptation for children’s and adults' speech recognition. Deep neural network adaptation for children’s and adults' speech recognition. pp 344–348. https://doi.org/10.12871/clicit2014166 Accessed 27 Dec 2023
Shahnawazuddin S, Deepak KT, Pradhan G, Sinha R (2017) Enhancing noise and pitch robustness of children’s ASR. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 5225–5229. https://doi.org/10.1109/ICASSP.2017.7953153
Shahnawazuddin S, Ahmad W, Adiga N, Kumar A (2020) In-Domain and Out-of-Domain Data Augmentation to Improve Children’s Speaker Verification System in Limited Data Scenario. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 7554–7558. https://doi.org/10.1109/ICASSP40776.2020.9053891
Kadyan V, Bawa P, Hasija T (2022) In domain training data augmentation on noise robust Punjabi Children speech recognition. J Ambient Intell Humaniz Comput 13:2705–2721. https://doi.org/10.1007/s12652-021-03468-3
Shahnawazuddin S, Adiga N, Kumar K et al (2020) Voice conversion based data augmentation to improve children’s speech recognition in limited data scenario. In: Interspeech 2020. ISCA, ISCA. https://doi.org/10.21437/Interspeech.2020-1112
Besacier L, Barnard E, Karpov A, Schultz T (2014) Automatic speech recognition for under-resourced languages: A survey. Speech Commun 56:85–100. https://doi.org/10.1016/j.specom.2013.07.008
Yu C, Kang M, Chen Y et al (2020) Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview. IEEE Access 8:163829–163843. https://doi.org/10.1109/ACCESS.2020.3020421
Ethnologue. https://www.ethnologue.com/insights/continents-most-indigenous-languages/. Accessed 27 Dec 2023
Crystal D (2000) Language death. Cambridge University Press
Kadyan V (2018) Acoustic features optimization for Punjabi automatic speech recognition system. Dissertation, Chitkara University Punjab, India
Hartmann W, Ng T, Hsiao R, Tsakalidis S (2016) Two-Stage Data Augmentation for Low-Resourced Speech Recognition. In: Interspeech 2016, pp 2378–2382. https://doi.org/10.21437/Interspeech.2016-1386
Huang X, Acero A, Hon H-W, Reddy R (2001) Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, 1st edn. Prentice Hall PTR, USA
Singh A, Mehta AS, Ashish KKS et al (2023) Model Adaptation for ASR in low-resource Indian Languages. arXiv: 2307.07948 [eess.AS]. https://doi.org/10.48550/arXiv.2307.07948
Diwan A, Vaideeswaran R, Shah S et al (2021) Multilingual and code-switching ASR challenges for low resource Indian languages. arXiv: 2104.00235 [cs.CL]. https://doi.org/10.48550/arXiv.2104.00235
Thai B, Jimerson R, Ptucha R, Prud’hommeaux E (2020) Fully Convolutional ASR for Less-Resourced Endangered Languages. In: Beermann D, Besacier L, Sakti S, Soria C (eds) Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL). European Language Resources association, Marseille, France, pp 126–130. https://aclanthology.org/2020.sltu-1.17 Accessed 27 Dec 2023
Jimerson R, Prud’hommeaux E (2018) ASR for Documenting Acutely Under-Resourced Indigenous Languages. In: Calzolari N, Choukri K, Cieri C, et al (eds) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), pp 4161–4166. https://aclanthology.org/L18-1657 Accessed 27 Dec 2023
Karunathilaka H, Welgama V, Nadungodage T, Weerasinghe R (2020) Low-resource Sinhala Speech Recognition using Deep Learning. In: 2020 20th International Conference on Advances in ICT for Emerging Regions (ICTer). IEEE, pp 196–201. https://doi.org/10.1109/ICTer51097.2020.9325468
Bataev V, Korenevsky M, Medennikov I, Zatvornitskiy A (2018) Exploring End-to-End Techniques for Low-Resource Speech Recognition. In: Speech and Computer. Springer International Publishing, pp 32–41. https://doi.org/10.48550/arXiv.1807.00868
Dalmia S, Sanabria R, Metze F, Black AW (2018) Sequence-Based Multi-Lingual Low Resource Speech Recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 4909–4913. https://doi.org/10.48550/arXiv.1802.07420
Do C-T, Lamel L, Gauvain J-L (2014) Speech-to-text development for Slovak, a low-resourced language. https://api.semanticscholar.org/CorpusID:7788606 Accessed 27 Dec 2023
Karim H (2020) Best way for collecting data for low-resourced languages. Dissertation, Dalarna University, School of Technology and Business Studies, Microdata Analysis. https://urn.kb.se/resolve?urn=urn:nbn:se:du-35945
Strassel S, Tracey J (2016) LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages. In: Calzolari N, Choukri K, Declerck T, et al (eds) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA), Portorož, Slovenia, pp 3273–3280. https://aclanthology.org/L16-1521 Accessed 27 Dec 2023
Dua M, Aggarwal RK, Kadyan V, Dua S (2012) Punjabi automatic speech recognition using HTK. IJCSI Int J Comput Sci Issues 9(4):359–364
Kumar R, Singh M (2011) Spoken Isolated Word Recognition of Punjabi Language Using Dynamic Time Warp Technique. In: Information Systems for Indian Languages. Springer Berlin Heidelberg, p 301. https://doi.org/10.1007/978-3-642-19403-0_53
Kadyan V, Mantri A, Aggarwal RK (2017) A heterogeneous speech feature vectors generation approach with hybrid hmm classifiers. Int J Speech Technol 20:761–769. https://doi.org/10.1007/s10772-017-9446-9
Guglani J, Mishra AN (2018) Continuous Punjabi speech recognition model based on Kaldi ASR toolkit. Int J Speech Technol 21:211–216. https://doi.org/10.1007/s10772-018-9497-6
Kadyan V, Mantri A, Aggarwal RK (2018) Refinement of HMM Model Parameters for Punjabi Automatic Speech Recognition (PASR) System. IETE J Res 64:673–688. https://doi.org/10.1080/03772063.2017.1369370
Kadyan V, Hasija T, Singh A (2023) Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation. Multimed Tools Appl 82:3973–3994. https://doi.org/10.1007/s11042-022-13435-5
Kaur H, Bhardwaj V, Kadyan V (2021) Punjabi Children Speech Recognition System Under Mismatch Conditions Using Discriminative Techniques. In: Innovations in Computer Science and Engineering. Springer Singapore, pp 195–203. https://doi.org/10.1007/978-981-33-4543-0_21
Bhardwaj V, Bala S, Kadyan V, Kukreja V (2020) Development of Robust Automatic Speech Recognition System for Children’s using Kaldi Toolkit. In: 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA). pp 10–13. https://doi.org/10.1109/ICIRCA48905.2020.9182941
Hasija T, Kadyan V, Guleria K (2021) Out Domain Data Augmentation on Punjabi Children Speech Recognition using Tacotron. J Phys Conf Ser 1950:012044. https://doi.org/10.1088/1742-6596/1950/1/012044
Bhardwaj V, Kukreja V (2021) Effect of pitch enhancement in Punjabi children’s speech recognition system under disparate acoustic conditions. Appl Acoust 177:107918. https://doi.org/10.1016/j.apacoust.2021.107918
Ghai W, Singh N (2013) Phone based acoustic modeling for automatic speech recognition for Punjabi language. J of Speech Sci 3:68–83. https://doi.org/10.20396/joss.v3i1.15040
Taniya, Bhardwaj V, Kadyan V (2020) Deep Neural Network Trained Punjabi Children Speech Recognition System Using Kaldi Toolkit. In: 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA). pp 374–378. https://doi.org/10.1109/ICCCA49541.2020.9250780
Kaur H, Kadyan V (2020) Feature space discriminatively trained Punjabi children speech recognition system using Kaldi toolkit. In: International Conference on Intelligent Communication and Computational Research. pp 1–5. https://doi.org/10.2139/ssrn.3565906
Dua M, Kadyan V, Banthia N, Bansal A, Agarwal T (2022) Spectral warping and data augmentation for low resource language ASR system under mismatched conditions. Appl Acoust 190:108643. https://doi.org/10.1016/j.apacoust.2022.108643
Kadyan V, Shanawazuddin S, Singh A (2021) Developing children’s speech recognition system for low resource Punjabi language. Appl Acoust 178:108002. https://doi.org/10.1016/j.apacoust.2021.108002
Bhardwaj V, Kukreja V, Singh A (2021) Usage of prosody modification and acoustic adaptation for robust automatic speech recognition (ASR) system. Rev D Intell Artif 35:235–242. https://doi.org/10.18280/ria.350307
Hasija T, Kadyan V, Guleria K (2021) Recognition of Children Punjabi Speech using Tonal Non-Tonal Classifier. In: 2021 International Conference on Emerging Smart Computing and Informatics (ESCI). pp 702–706. https://doi.org/10.1109/ESCI50559.2021.9397041
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Ethics declarations
Informed consent
All authors have read and agreed to this version of the manuscript.
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sobti, R., Guleria, K. & Kadyan, V. Comprehensive literature review on children automatic speech recognition system, acoustic linguistic mismatch approaches and challenges. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18753-4