
Training Data Augmentation and Data Selection

Chapter in New Era for Robust Speech Recognition

Abstract

Data augmentation is a simple and efficient technique for improving the robustness of a speech recognizer deployed in mismatched training and test conditions. Our work, conducted during the JSALT 2015 workshop, aimed at developing: (1) data augmentation strategies, including noising and reverberation, tested in combination with two approaches to signal enhancement: carefully engineered WPE dereverberation and a learned DNN-based denoising autoencoder; and (2) a novel technique for extracting an informative vector from a Sequence Summarizing Neural Network (SSNN). Similarly to an i-vector extractor, the SSNN produces a "summary vector" representing an acoustic summary of an utterance. Such a vector can be used directly for adaptation, but its main use, matching the aim of this chapter, is the selection of augmented training data. All techniques were tested on the AMI training set and the CHiME-3 test set.
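To make the two augmentation strategies concrete, here is a minimal sketch of utterance-level noising and reverberation. It assumes numpy/scipy and in-memory waveforms; the function names, gain rule, and peak rescaling are illustrative assumptions, not the chapter's actual pipeline.

    import numpy as np
    from scipy.signal import fftconvolve

    def add_noise(speech, noise, snr_db):
        # Tile or truncate the noise so it covers the whole utterance.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[:len(speech)]
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Pick a gain so that 10*log10(speech_power / scaled_noise_power) == snr_db.
        gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
        return speech + gain * noise

    def add_reverb(speech, rir):
        # Convolve with a room impulse response (e.g. one simulated by RIR-Generator).
        wet = fftconvolve(speech, rir)[:len(speech)]
        # Rescale so the reverberated signal peaks at the dry signal's level.
        return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))

Noise clips could come from sources such as www.freesound.org and impulse responses from RIR-Generator (see the Notes below); the enhancement side (WPE dereverberation, the DNN denoising autoencoder) is beyond this sketch.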


Notes

  1. https://www.iarpa.gov/index.php/research-programs/babel.

  2. www.freesound.org.

  3. https://github.com/ehabets/RIR-Generator.

  4. The Brno University of Technology open i-vector extractor (see http://voicebiometry.org) was used for these experiments.
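Footnote 4 notes that the BUT open i-vector extractor was used alongside the SSNN; either kind of per-utterance vector can drive data selection. The sketch below mean-pools frame-level network outputs into a summary vector, then keeps the augmented utterances whose vectors lie closest, by cosine similarity, to the centroid of the target-domain vectors. Both the pooling stand-in and the similarity criterion are illustrative assumptions; the chapter's actual selection rule may differ.

    import numpy as np

    def summary_vector(frame_outputs):
        # Collapse (num_frames, dim) frame-level outputs into one utterance vector.
        return frame_outputs.mean(axis=0)

    def select_augmented(train_vecs, test_vecs, num_keep):
        # Centroid of the target-domain summary vectors, length-normalized.
        centroid = test_vecs.mean(axis=0)
        centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
        # Cosine similarity of every augmented-training vector to the centroid.
        sims = (train_vecs @ centroid) / (np.linalg.norm(train_vecs, axis=1) + 1e-12)
        # Indices of the num_keep most target-like utterances.
        return np.argsort(-sims)[:num_keep]

In practice the frame outputs would come from the auxiliary summarizing network, and num_keep would be tuned on a development set.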


Acknowledgements

Besides the funding for the JSALT 2015 workshop, BUT researchers were supported by the Czech Ministry of Interior project no. VI20152020025, “DRAPAK,” and by the Czech Ministry of Education, Youth, and Sports from the National Program of Sustainability (NPU II) project “IT4 Innovations Excellence in Science—LQ1602.”

Author information


Corresponding author

Correspondence to Martin Karafiát.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Karafiát, M. et al. (2017). Training Data Augmentation and Data Selection. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_10


  • DOI: https://doi.org/10.1007/978-3-319-64680-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science (R0)
