Abstract
Here, we present a data augmentation method that improves the robustness of convolutional neural network-based speech recognizers to additive noise. The proposed technique has its roots in the input dropout method, as it discards a subset of the input features. However, instead of doing this in a completely random fashion, we introduce two simple heuristics that select the less reliable components of the speech spectrum as candidates for dropout. The first heuristic retains spectro-temporal maxima, while the second is based on a crude estimate of spectral dominance. The selected components are always retained, while the dropout step discards or retains the unselected ones probabilistically. Owing to the randomness involved in dropout, the whole process may be interpreted as a data augmentation method that perturbs the data by creating new instances from the existing ones on the fly. We evaluated the method on the Aurora-4 corpus using only the clean training set, and obtained relative word error rate reductions between 22% and 25%.
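The first heuristic described above can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions, not the paper's exact formulation: the input is taken to be a 2-D (frequency × time) spectrogram array, a "spectro-temporal maximum" is approximated as a bin that is at least as large as its 8 neighbours, and the function name `augment` and the drop probability are hypothetical. The second (spectral-dominance) heuristic is omitted for brevity.

```python
import numpy as np

def augment(spec, drop_prob=0.5, rng=None):
    """Perturb a spectrogram by randomly dropping unselected bins.

    Bins that are local spectro-temporal maxima (>= all 8 neighbours)
    are always retained; every other bin is zeroed out independently
    with probability drop_prob, yielding a new training instance.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Replicate the border so edge bins also have 8 neighbours.
    padded = np.pad(spec, 1, mode="edge")
    h, w = spec.shape
    # Stack the 8 neighbour views of every bin: shape (8, h, w).
    neigh = np.stack([
        padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
        for di in (-1, 0, 1) for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    ])
    keep = spec >= neigh.max(axis=0)          # spectro-temporal maxima
    drop = (~keep) & (rng.random(spec.shape) < drop_prob)
    return np.where(drop, 0.0, spec)
```

Because the dropout mask is redrawn on every call, applying `augment` to the same utterance repeatedly produces different perturbed instances on the fly, which is what makes the scheme usable as data augmentation during training.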
Acknowledgments
This research was partially supported by the EU-funded Hungarian grant EFOP-3.6.1-16-2016-00008, and by the National Research, Development and Innovation Office of Hungary (FK 124584). László Tóth was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Tóth, L., Kovács, G., Van Compernolle, D. (2018). A Perceptually Inspired Data Augmentation Method for Noise Robust CNN Acoustic Models. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science(), vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_71
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3