
In domain training data augmentation on noise robust Punjabi Children speech recognition

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Building a successful automatic speech recognition (ASR) engine requires a large amount of training data. Collecting such data increases training complexity and is impractical for low-resource languages like Punjabi, which has no children's speech corpus. Data scarcity and the short vocal tract length of child speakers further degrade system performance under limited-data conditions. Punjabi is also a tonal language, which makes building an optimized ASR system for it especially difficult. In this paper, we explore a fused feature extraction approach based on mel frequency-gammatone frequency cepstral coefficients (MF-GFCC) combined with feature warping to manage this training complexity. We develop a children's ASR engine using data augmentation under limited-data conditions: an in-domain augmentation strategy artificially combines noisy and clean corpora to overcome data scarcity in the training set, and the combined dataset is processed with the fused feature extraction front end. The tonal characteristics of Punjabi and the vocal tract length mismatch of child speakers are addressed by adding pitch features and applying a vocal tract length normalization (VTLN) strategy during training. Training on the combined augmented and original speech reduces word error rate (WER), with relative improvements (RI) of 20.59% in noisy and 19.39% in clean conditions for the hybrid MF-GFCC approach over conventional mel frequency cepstral coefficient (MFCC) and gammatone frequency cepstral coefficient (GFCC) based ASR systems.
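The in-domain augmentation step described in the abstract is straightforward to prototype. Below is a minimal sketch, assuming NumPy and the soundfile package for WAV I/O and mono signals, that adds a noise recording to a clean utterance at a chosen signal-to-noise ratio to produce an artificial noisy copy of a training utterance. The file names and the 10 dB SNR are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library; any WAV reader works


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR (in dB)."""
    # Repeat or truncate the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise files
    # Gain chosen so that 10*log10(clean_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + gain * noise


# Hypothetical file names; in practice this would loop over the clean train set.
clean, sr = sf.read("clean_child_utterance.wav")
noise, _ = sf.read("background_noise.wav")
augmented = mix_at_snr(clean, noise, snr_db=10)
sf.write("augmented_utterance.wav", augmented, sr)
```

In the paper, the pooled clean and augmented sets are then passed through the fused MF-GFCC front end with pitch features and VTLN, and the gains are reported as relative improvement in WER, computed in the usual way as RI = (WER_baseline − WER_proposed) / WER_baseline × 100.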



Author information


Corresponding author

Correspondence to Virender Kadyan.

Ethics declarations

Conflict of interest

The authors have no conflict of interest in this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kadyan, V., Bawa, P. & Hasija, T. In domain training data augmentation on noise robust Punjabi Children speech recognition. J Ambient Intell Human Comput 13, 2705–2721 (2022). https://doi.org/10.1007/s12652-021-03468-3

