
A Perceptually Inspired Data Augmentation Method for Noise Robust CNN Acoustic Models

  • Conference paper
  • Published in: Speech and Computer (SPECOM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)


Abstract

We present a data augmentation method that improves the robustness of convolutional neural network-based speech recognizers to additive noise. The proposed technique has its roots in the input dropout method, as it discards a subset of the input features. However, instead of doing this in a completely random fashion, we introduce two simple heuristics that identify the perceptually more reliable components of the speech spectrum, leaving the remaining components as candidates for dropout. The first heuristic retains spectro-temporal maxima, while the second is based on a crude estimation of spectral dominance. The components selected by the heuristics are always retained, while the dropout step discards or keeps each unselected component in a probabilistic manner. Due to the randomness involved, the whole process can be interpreted as a data augmentation method that perturbs the data by creating new training instances from the existing ones on the fly. We evaluated the method on the Aurora-4 corpus using only the clean training set, and obtained relative word error rate reductions between 22% and 25%.
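
To make the procedure concrete, the sketch below gives one plausible Python reading of the two heuristics and the subsequent dropout step. It is a minimal sketch, not the paper's exact formulation: the log-scaled input, the 8-neighbour definition of a spectro-temporal maximum, the 6 dB per-frame dominance margin, the dropout probability, and the use of the spectrogram minimum as a replacement floor are all illustrative assumptions.

import numpy as np

def spectro_temporal_maxima_mask(spec):
    """First heuristic (assumed form): mark time-frequency bins that are
    local maxima over their 8 neighbours; these are always retained."""
    padded = np.pad(spec, 1, mode="edge")
    mask = np.ones(spec.shape, dtype=bool)
    for dt in (-1, 0, 1):
        for df in (-1, 0, 1):
            if dt == 0 and df == 0:
                continue
            neighbour = padded[1 + dt : 1 + dt + spec.shape[0],
                               1 + df : 1 + df + spec.shape[1]]
            mask &= spec >= neighbour
    return mask

def spectral_dominance_mask(spec, margin_db=6.0):
    """Second heuristic (assumed form): a crude dominance estimate that
    keeps every bin within margin_db of the strongest bin in its frame."""
    frame_max = spec.max(axis=1, keepdims=True)   # strongest bin per frame
    return spec >= frame_max - margin_db

def perceptual_dropout(spec, p_drop=0.5, rng=None):
    """Drop unselected bins with probability p_drop; bins picked by either
    heuristic are always kept. spec is a (frames x bands) log spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    keep = spectro_temporal_maxima_mask(spec) | spectral_dominance_mask(spec)
    random_keep = rng.random(spec.shape) >= p_drop
    # Dropped bins fall back to the global minimum, an assumed silence floor.
    return np.where(keep | random_keep, spec, spec.min())

# Example: perturb a stand-in 100-frame, 40-band log-mel spectrogram.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((100, 40))
augmented = perceptual_dropout(log_mel, p_drop=0.5, rng=rng)

Because the random mask is redrawn on every call, each training pass sees a differently perturbed copy of the same utterance, which is what lets the scheme act as on-the-fly data augmentation.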



Acknowledgments

This research was partially supported by the EU-funded Hungarian grant EFOP-3.6.1-16-2016-00008, and by the National Research, Development and Innovation Office of Hungary (FK 124584). László Tóth was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.

Author information

Corresponding author

Correspondence to László Tóth.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Tóth, L., Kovács, G., Van Compernolle, D. (2018). A Perceptually Inspired Data Augmentation Method for Noise Robust CNN Acoustic Models. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_71

  • DOI: https://doi.org/10.1007/978-3-319-99579-3_71

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99578-6

  • Online ISBN: 978-3-319-99579-3

  • eBook Packages: Computer Science, Computer Science (R0)
