
In domain training data augmentation on noise robust Punjabi Children speech recognition

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Building a successful automatic speech recognition (ASR) engine requires a large amount of training data. Collecting such data increases training complexity and is impractical for low-resource languages like Punjabi, which has no children's speech corpus. Data scarcity and the short vocal tract length of child speakers further degrade system performance under limited-data conditions. Punjabi is also a tonal language, which makes building an optimized ASR system for it especially difficult. In this paper, we explore a fused feature extraction approach based on mel frequency-gammatone frequency cepstral coefficients (MF-GFCC) combined with feature warping to manage this training complexity. We develop a children's ASR engine using data augmentation under limited-data conditions: an in-domain augmentation strategy artificially combines noisy and clean corpora to overcome data scarcity in the training set, and the combined dataset is processed with the fused feature extraction front end. The tonal characteristics of Punjabi and the vocal tract length mismatch of child speakers are addressed by adding pitch features and applying a vocal tract length normalization (VTLN) strategy during training. Training on the combined augmented and original speech reduces word error rate (WER), with relative improvements (RI) of 20.59% in noisy and 19.39% in clean conditions for the hybrid MF-GFCC approach over conventional mel frequency cepstral coefficient (MFCC) and gammatone frequency cepstral coefficient (GFCC) based ASR systems.
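The in-domain augmentation step described in the abstract is straightforward to prototype. Below is a minimal sketch, assuming NumPy and the soundfile package for WAV I/O and mono signals, that adds a noise recording to a clean utterance at a chosen signal-to-noise ratio to produce an artificial noisy copy of a training utterance. The file names and the 10 dB SNR are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library; any WAV reader works


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR (in dB)."""
    # Repeat or truncate the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise files
    # Gain chosen so that 10*log10(clean_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + gain * noise


# Hypothetical file names; in practice this would loop over the clean train set.
clean, sr = sf.read("clean_child_utterance.wav")
noise, _ = sf.read("background_noise.wav")
augmented = mix_at_snr(clean, noise, snr_db=10)
sf.write("augmented_utterance.wav", augmented, sr)
```

In the paper, the pooled clean and augmented sets are then passed through the fused MF-GFCC front end with pitch features and VTLN, and the gains are reported as relative improvement in WER, computed in the usual way as RI = (WER_baseline − WER_proposed) / WER_baseline × 100.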



Author information


Corresponding author

Correspondence to Virender Kadyan.

Ethics declarations

Conflict of interest

The authors have no conflict of interest in this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kadyan, V., Bawa, P. & Hasija, T. In domain training data augmentation on noise robust Punjabi Children speech recognition. J Ambient Intell Human Comput 13, 2705–2721 (2022). https://doi.org/10.1007/s12652-021-03468-3

