Circuits, Systems, and Signal Processing

Volume 37, Issue 5, pp 2021–2044

Explicit Pitch Mapping for Improved Children’s Speech Recognition

  • Hemant Kumar Kathania
  • Waquar Ahmad
  • S. Shahnawazuddin
  • A. B. Samaddar

Abstract

Recognizing children’s speech with automatic speech recognition (ASR) systems trained on adults’ speech is a very challenging task. As reported in several earlier works, severely degraded recognition performance is observed in such ASR tasks, mainly due to the gross mismatch in acoustic and linguistic attributes between the two groups of speakers. One identified source of mismatch is that the vocal organs of adult and child speakers differ significantly in dimension. Feature-space normalization techniques are known to effectively address the ill-effects arising from these differences; the two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in average pitch between adult and child speakers. Addressing the ill-effects introduced by these pitch differences is the primary focus of the presented study. To this end, we explore the feasibility of explicitly modifying the pitch of children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, children’s speech is high-pitched in comparison with adults’. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged, while that of the children’s test speech is reduced. Significant improvement in recognition performance is obtained through this explicit pitch reduction. To preserve the critical spectral information and to avoid introducing perceptual artifacts, we exploit timescale modification techniques for explicit pitch mapping. Furthermore, we present two schemes to automatically determine the factor by which the pitch of the given test data should be modified.
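The timescale-modification route to pitch mapping can be illustrated with a minimal sketch. The paper's exact TSM algorithm is detailed in the full text; the version below assumes one common pipeline, a phase-vocoder time stretch followed by resampling, so that lowering the pitch by a factor alpha leaves the utterance duration essentially unchanged:

```python
import numpy as np

def stft(x, win, hop):
    """Short-time Fourier transform with a given analysis window and hop."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, win, hop):
    """Weighted overlap-add inverse STFT."""
    n = len(win)
    out = np.zeros(hop * (len(S) - 1) + n)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):
        out[i * hop:i * hop + n] += np.fft.irfft(spec, n) * win
        norm[i * hop:i * hop + n] += win ** 2
    return out / np.maximum(norm, 1e-8)

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Phase-vocoder time stretch: duration scales by 1/rate, pitch is preserved."""
    win = np.hanning(n_fft)
    S = stft(x, win, hop)
    bin_freq = 2.0 * np.pi * hop * np.arange(S.shape[1]) / n_fft
    phase = np.angle(S[0])
    out = []
    for t in np.arange(0.0, len(S) - 1, rate):
        i, frac = int(t), t - int(t)
        # interpolate magnitudes, propagate phases coherently
        mag = (1.0 - frac) * np.abs(S[i]) + frac * np.abs(S[i + 1])
        out.append(mag * np.exp(1j * phase))
        dphi = np.angle(S[i + 1]) - np.angle(S[i]) - bin_freq
        dphi -= 2.0 * np.pi * np.round(dphi / (2.0 * np.pi))  # wrap to [-pi, pi]
        phase += bin_freq + dphi
    return istft(np.array(out), win, hop)

def pitch_shift(x, alpha, n_fft=1024, hop=256):
    """Scale pitch by alpha (alpha < 1 lowers it) while keeping duration fixed:
    time-stretch so duration becomes alpha*len(x), then resample by alpha."""
    y = time_stretch(x, 1.0 / alpha, n_fft, hop)
    pos = np.arange(int((len(y) - 1) / alpha)) * alpha  # read y at rate alpha
    lo = pos.astype(int)
    frac = pos - lo
    return y[lo] * (1.0 - frac) + y[lo + 1] * frac      # linear interpolation
```

With alpha around 0.6–0.8, a high-pitched test utterance is mapped toward the adult pitch range while the spectral envelope and duration are largely preserved, which is the property the abstract highlights over plain resampling.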
Automatically determining the compensation factor is critical, since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on ASR systems trained on adults’ speech employing different acoustic modeling approaches, viz. Gaussian mixture models (GMM), subspace GMMs and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study their effectiveness, another DNN-based ASR system is developed on a mix of speech data from adult and child speakers. Pitch reduction is observed to be effective even in this case.
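The two automatic factor-selection schemes proposed in the paper are described in the full text. As a purely hypothetical illustration of the idea, one simple scheme would derive the factor from the ratio of the average pitch of the adult training data (the 120 Hz default below is an assumed value, not one from the paper) to the pitch detected in the test utterance:

```python
import numpy as np

def detect_pitch(x, fs, fmin=60.0, fmax=500.0):
    """Median pitch over voiced-like frames, via short-time autocorrelation."""
    frame, hop = int(0.04 * fs), int(0.02 * fs)
    lag_lo, lag_hi = int(fs / fmax), int(fs / fmin)
    f0s = []
    for i in range(0, len(x) - frame, hop):
        seg = x[i:i + frame] - np.mean(x[i:i + frame])
        ac = np.correlate(seg, seg, mode='full')[frame - 1:]
        if ac[0] <= 0:
            continue                       # silent frame
        ac = ac / ac[0]
        lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])
        if ac[lag] > 0.5:                  # crude voicing decision
            f0s.append(fs / lag)
    return float(np.median(f0s)) if f0s else 0.0

def compensation_factor(test_x, fs, train_mean_f0=120.0):
    """alpha < 1 lowers the (typically higher-pitched) child test speech
    toward the assumed average pitch of the adult training data."""
    f0 = detect_pitch(test_x, fs)
    return train_mean_f0 / f0 if f0 > 0 else 1.0  # no modification if unvoiced
```

An unvoiced or adult-pitched input yields a factor near 1, so adult users of the same system would pass through essentially unmodified, which is the deployment concern the abstract raises.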

Keywords

Children’s speech recognition · Acoustic mismatch · Pitch compensation · Timescale modification

Acknowledgements

The authors would like to express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions, which greatly helped improve the quality of the paper.


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Hemant Kumar Kathania (1)
  • Waquar Ahmad (1)
  • S. Shahnawazuddin (2)
  • A. B. Samaddar (3)
  1. Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, Sikkim, India
  2. Department of Electronics and Communication Engineering, National Institute of Technology Patna, Patna, India
  3. Department of Computer Science and Engineering, National Institute of Technology Sikkim, Sikkim, India
