
Training Data Augmentation and Data Selection

Chapter in New Era for Robust Speech Recognition

Abstract

Data augmentation is a simple and efficient technique for improving the robustness of a speech recognizer deployed in mismatched training and test conditions. Our work, conducted during the JSALT 2015 workshop, aimed at developing: (1) data augmentation strategies, including noising and reverberation, tested in combination with two approaches to signal enhancement: carefully engineered WPE dereverberation and a learned DNN-based denoising autoencoder; and (2) a novel technique for extracting an informative vector from a Sequence Summarizing Neural Network (SSNN). Similarly to an i-vector extractor, the SSNN produces a "summary vector" representing an acoustic summary of an utterance. Such a vector can be used directly for adaptation, but its main use, matching the aim of this chapter, is the selection of augmented training data. All techniques were tested on the AMI training set and the CHiME-3 test set.
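To make the two augmentation strategies concrete, here is a minimal sketch of utterance-level noising and reverberation. It assumes numpy/scipy and in-memory waveforms; the function names, gain rule, and peak rescaling are illustrative assumptions, not the chapter's actual pipeline.

    import numpy as np
    from scipy.signal import fftconvolve

    def add_noise(speech, noise, snr_db):
        # Tile or truncate the noise so it covers the whole utterance.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[:len(speech)]
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Pick a gain so that 10*log10(speech_power / scaled_noise_power) == snr_db.
        gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
        return speech + gain * noise

    def add_reverb(speech, rir):
        # Convolve with a room impulse response (e.g. one simulated by RIR-Generator).
        wet = fftconvolve(speech, rir)[:len(speech)]
        # Rescale so the reverberated signal peaks at the dry signal's level.
        return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))

Noise clips could come from sources such as www.freesound.org and impulse responses from RIR-Generator (see the Notes below); the enhancement side (WPE dereverberation, the DNN denoising autoencoder) is beyond this sketch.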


Notes

  1. https://www.iarpa.gov/index.php/research-programs/babel.

  2. www.freesound.org.

  3. https://github.com/ehabets/RIR-Generator.

  4. The Brno University of Technology open i-vector extractor (see http://voicebiometry.org) was used for these experiments.
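Footnote 4 notes that the BUT open i-vector extractor was used alongside the SSNN; either kind of per-utterance vector can drive data selection. The sketch below mean-pools frame-level network outputs into a summary vector, then keeps the augmented utterances whose vectors lie closest, by cosine similarity, to the centroid of the target-domain vectors. Both the pooling stand-in and the similarity criterion are illustrative assumptions; the chapter's actual selection rule may differ.

    import numpy as np

    def summary_vector(frame_outputs):
        # Collapse (num_frames, dim) frame-level outputs into one utterance vector.
        return frame_outputs.mean(axis=0)

    def select_augmented(train_vecs, test_vecs, num_keep):
        # Centroid of the target-domain summary vectors, length-normalized.
        centroid = test_vecs.mean(axis=0)
        centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
        # Cosine similarity of every augmented-training vector to the centroid.
        sims = (train_vecs @ centroid) / (np.linalg.norm(train_vecs, axis=1) + 1e-12)
        # Indices of the num_keep most target-like utterances.
        return np.argsort(-sims)[:num_keep]

In practice the frame outputs would come from the auxiliary summarizing network, and num_keep would be tuned on a development set.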


Acknowledgements

Besides the funding for the JSALT 2015 workshop, BUT researchers were supported by the Czech Ministry of Interior project no. VI20152020025, “DRAPAK,” and by the Czech Ministry of Education, Youth, and Sports from the National Program of Sustainability (NPU II) project “IT4 Innovations Excellence in Science—LQ1602.”

Author information


Corresponding author

Correspondence to Martin Karafiát.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Karafiát, M. et al. (2017). Training Data Augmentation and Data Selection. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_10


  • DOI: https://doi.org/10.1007/978-3-319-64680-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science (R0)
