Circuits, Systems, and Signal Processing

Volume 37, Issue 5, pp 2021–2044

Explicit Pitch Mapping for Improved Children’s Speech Recognition

  • Hemant Kumar Kathania
  • Waquar Ahmad
  • S. Shahnawazuddin
  • A. B. Samaddar

Abstract

Recognizing children’s speech with automatic speech recognition (ASR) systems trained on adults’ speech is a very challenging task. As reported in several earlier works, severely degraded recognition performance is observed in such ASR tasks, mainly due to the gross mismatch in acoustic and linguistic attributes between the two groups of speakers. One identified source of mismatch is that the vocal organs of adult and child speakers differ significantly in dimension. Feature-space normalization techniques are known to effectively address the ill-effects arising from these differences; the two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in average pitch between adult and child speakers. Addressing the ill-effects introduced by these pitch differences is the primary focus of the presented study. To this end, we explore the feasibility of explicitly modifying the pitch of children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, children’s speech is high-pitched in comparison with adults’. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged, while that of the children’s test speech is reduced. Significant improvement in recognition performance is obtained through this explicit pitch reduction. To preserve the critical spectral information and to avoid introducing perceptual artifacts, we exploit timescale modification techniques for explicit pitch mapping. Furthermore, we present two schemes to automatically determine the factor by which the pitch of the given test data should be modified.
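The timescale-modification route to pitch mapping can be illustrated with a minimal sketch. The paper's exact TSM algorithm is detailed in the full text; the version below assumes one common pipeline, a phase-vocoder time stretch followed by resampling, so that lowering the pitch by a factor alpha leaves the utterance duration essentially unchanged:

```python
import numpy as np

def stft(x, win, hop):
    """Short-time Fourier transform with a given analysis window and hop."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, win, hop):
    """Weighted overlap-add inverse STFT."""
    n = len(win)
    out = np.zeros(hop * (len(S) - 1) + n)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):
        out[i * hop:i * hop + n] += np.fft.irfft(spec, n) * win
        norm[i * hop:i * hop + n] += win ** 2
    return out / np.maximum(norm, 1e-8)

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Phase-vocoder time stretch: duration scales by 1/rate, pitch is preserved."""
    win = np.hanning(n_fft)
    S = stft(x, win, hop)
    bin_freq = 2.0 * np.pi * hop * np.arange(S.shape[1]) / n_fft
    phase = np.angle(S[0])
    out = []
    for t in np.arange(0.0, len(S) - 1, rate):
        i, frac = int(t), t - int(t)
        # interpolate magnitudes, propagate phases coherently
        mag = (1.0 - frac) * np.abs(S[i]) + frac * np.abs(S[i + 1])
        out.append(mag * np.exp(1j * phase))
        dphi = np.angle(S[i + 1]) - np.angle(S[i]) - bin_freq
        dphi -= 2.0 * np.pi * np.round(dphi / (2.0 * np.pi))  # wrap to [-pi, pi]
        phase += bin_freq + dphi
    return istft(np.array(out), win, hop)

def pitch_shift(x, alpha, n_fft=1024, hop=256):
    """Scale pitch by alpha (alpha < 1 lowers it) while keeping duration fixed:
    time-stretch so duration becomes alpha*len(x), then resample by alpha."""
    y = time_stretch(x, 1.0 / alpha, n_fft, hop)
    pos = np.arange(int((len(y) - 1) / alpha)) * alpha  # read y at rate alpha
    lo = pos.astype(int)
    frac = pos - lo
    return y[lo] * (1.0 - frac) + y[lo + 1] * frac      # linear interpolation
```

With alpha around 0.6–0.8, a high-pitched test utterance is mapped toward the adult pitch range while the spectral envelope and duration are largely preserved, which is the property the abstract highlights over plain resampling.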
Automatically determining the compensation factor is critical, since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on ASR systems trained on adults’ speech employing different acoustic modeling approaches, viz. Gaussian mixture models (GMM), subspace GMMs and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study their effectiveness, another DNN-based ASR system is developed on a mix of speech data from adult and child speakers. Pitch reduction is observed to be effective even in this case.
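The two automatic factor-selection schemes proposed in the paper are described in the full text. As a purely hypothetical illustration of the idea, one simple scheme would derive the factor from the ratio of the average pitch of the adult training data (the 120 Hz default below is an assumed value, not one from the paper) to the pitch detected in the test utterance:

```python
import numpy as np

def detect_pitch(x, fs, fmin=60.0, fmax=500.0):
    """Median pitch over voiced-like frames, via short-time autocorrelation."""
    frame, hop = int(0.04 * fs), int(0.02 * fs)
    lag_lo, lag_hi = int(fs / fmax), int(fs / fmin)
    f0s = []
    for i in range(0, len(x) - frame, hop):
        seg = x[i:i + frame] - np.mean(x[i:i + frame])
        ac = np.correlate(seg, seg, mode='full')[frame - 1:]
        if ac[0] <= 0:
            continue                       # silent frame
        ac = ac / ac[0]
        lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])
        if ac[lag] > 0.5:                  # crude voicing decision
            f0s.append(fs / lag)
    return float(np.median(f0s)) if f0s else 0.0

def compensation_factor(test_x, fs, train_mean_f0=120.0):
    """alpha < 1 lowers the (typically higher-pitched) child test speech
    toward the assumed average pitch of the adult training data."""
    f0 = detect_pitch(test_x, fs)
    return train_mean_f0 / f0 if f0 > 0 else 1.0  # no modification if unvoiced
```

An unvoiced or adult-pitched input yields a factor near 1, so adult users of the same system would pass through essentially unmodified, which is the deployment concern the abstract raises.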

Keywords

Children’s speech recognition · Acoustic mismatch · Pitch compensation · Timescale modification

Acknowledgements

The authors would like to express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions, which greatly helped improve the quality of the paper.


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Hemant Kumar Kathania (1)
  • Waquar Ahmad (1)
  • S. Shahnawazuddin (2)
  • A. B. Samaddar (3)
  1. Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, Sikkim, India
  2. Department of Electronics and Communication Engineering, National Institute of Technology Patna, Patna, India
  3. Department of Computer Science and Engineering, National Institute of Technology Sikkim, Sikkim, India
