Toolkits for Robust Speech Processing

Watanabe, Shinji; Hori, Takaaki; Miao, Yajie; Delcroix, Marc; Metze, Florian; Hershey, John R.

doi:10.1007/978-3-319-64680-0_17

Abstract

Robust automatic speech recognition (ASR) techniques have developed rapidly in recent years, driven by the demand for ASR applications in real environments and helped by publicly available tools developed in the community. This chapter surveys the major toolkits available for robust ASR, covering general ASR toolkits, language model toolkits, speech enhancement/microphone array front-end toolkits, deep learning toolkits, and emergent end-to-end ASR toolkits. The aim of this chapter is to provide information about each toolkit's functionality (features, functions, platform, and language), license, and source location so that readers can easily access these tools to build their own robust ASR systems. Some of the toolkits have been used to build state-of-the-art ASR systems for various challenging tasks. The references in this chapter also include the URLs of the resource webpages.

Notes

1. The MathWorks Inc., http://www.mathworks.com/products/matlab/.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems (2016). arXiv preprint arXiv:1603.04467. https://www.tensorflow.org/
Agarwal, A., Akchurin, E., Basoglu, C., Chen, G., Cyphers, S., Droppo, J., Eversole, A., Guenter, B., Hillebrand, M., Hoens, T.R., et al.: An introduction to computational networks and the computational network toolkit. Microsoft Technical Report MSR-TR-2014-112 (2014). https://github.com/Microsoft/CNTK
Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: International Conference on Implementation and Application of Automata, pp. 11–23. Springer, New York (2007). http://www.openfst.org/
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin (2015). arXiv preprint arXiv:1512.02595. https://github.com/baidu-research/warp-ctc
Anguera, X., Wooters, C., Hernando, J.: Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 15(7), 2011–2022 (2007). http://www.xavieranguera.com/beamformit/
Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949 (2016). https://github.com/rizar/attention-lvcsr
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math compiler in Python. In: Proceedings of the 9th Python in Science Conference, pp. 1–7 (2010). http://deeplearning.net/software/theano/
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in 29th Annual Conference on Neural Information Processing Systems (NIPS) (2015). http://mxnet-mli.readthedocs.io/en/latest/
Chen, X., Liu, X., Qian, Y., Gales, M., Woodland, P.: CUED-RNNLM: an open-source toolkit for efficient training and evaluation of recurrent neural network language models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6000–6004. IEEE, New York (2016). http://mi.eng.cam.ac.uk/projects/cued-rnnlm/
Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a MATLAB-like environment for machine learning. In: BigLearn, NIPS Workshop, EPFL-CONF-192376 (2011). http://torch.ch/
Degottex, G., Kane, J., Drugman, T., Raitio, T., Scherer, S.: COVAREP: a collaborative voice analysis repository for speech technologies. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 960–964. IEEE, New York (2014). http://covarep.github.io/covarep/
ELRA: ELDA Portal. http://www.elra.info/en/
Federico, M., Bertoldi, N., Cettolo, M.: IRSTLM: an open source toolkit for handling large scale language models. In: Interspeech, pp. 1618–1621 (2008). http://hlt-mt.fbk.eu/technologies/irstlm
Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: ICML, vol. 14, pp. 1764–1772 (2014)
Grondin, F., Létourneau, D., Ferland, F., Rousseau, V., Michaud, F.: The ManyEars open framework. Auton. Robot. 34(3), 217–232 (2013). https://sourceforge.net/projects/manyears/
Heafield, K.: KenLM: faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics (2011). http://kheafield.com/code/kenlm/
Hsu, B.J.P., Glass, J.R.: Iterative language model estimation: efficient data structure & algorithms. In: INTERSPEECH, pp. 841–844 (2008). https://github.com/mitlm/mitlm
Idiap Research Institute: Bob 2.4.0 documentation. https://pythonhosted.org/bob/
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014). http://caffe.berkeleyvision.org/
Kumatani, K., McDonough, J., Schacht, S., Klakow, D., Garner, P.N., Li, W.: Filter bank design based on minimization of individual aliasing terms for minimum mutual information subband adaptive beamforming. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1609–1612. IEEE (2008). http://distantspeechrecognition.sourceforge.net/
Lee, K.F., Hon, H.W., Reddy, R.: An overview of the SPHINX speech recognition system. IEEE Trans. Acoust. Speech Signal Process. 38(1), 35–45 (1990). http://cmusphinx.sourceforge.net/
Lee, A., Kawahara, T., Shikano, K.: Julius – an open source real-time large vocabulary recognition engine. In: Interspeech, pp. 1691–1694 (2001). http://julius.osdn.jp/en_index.php
Linguistic Data Consortium: https://www.ldc.upenn.edu/
Maas, A.L., Xie, Z., Jurafsky, D., Ng, A.Y.: Lexicon-free conversational speech recognition with neural networks. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2015). https://github.com/amaas/stanford-ctc
Metze, F., Fosler-Lussier, E.: The speech recognition virtual kitchen: an initial prototype. In: Interspeech, pp. 1872–1873 (2012). http://speechkitchen.org/
Miao, Y., Gowayyed, M., Metze, F.: EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167–174 (2015). https://github.com/srvk/eesen
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, pp. 1045–1048 (2010). http://www.rnnlm.org/
Nakadai, K., Takahashi, T., Okuno, H.G., Nakajima, H., Hasegawa, Y., Tsujino, H.: Design and implementation of robot audition system “HARK” open source software for listening to three simultaneous speakers. Adv. Robot. 24(5–6), 739–761 (2010). http://www.hark.jp/
Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012). http://bass-db.gforge.inria.fr/fasst/
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011). http://kaldi-asr.org/
Rybach, D., Gollan, C., Heigold, G., Hoffmeister, B., Lööf, J., Schlüter, R., Ney, H.: The RWTH Aachen University open source speech recognition system. In: Interspeech, pp. 2111–2114 (2009). https://www-i6.informatik.rwth-aachen.de/rwth-asr/
Schwenk, H.: CSLM – a modular open-source continuous space language modeling toolkit. In: INTERSPEECH, pp. 1198–1202 (2013). http://www-lium.univ-lemans.fr/cslm/
Stolcke, A., et al.: SRILM – an extensible language modeling toolkit. In: Interspeech, vol. 2002, pp. 901–904 (2002). http://www.speech.sri.com/projects/srilm/
Sundermeyer, M., Schlüter, R., Ney, H.: RWTHLM – the RWTH Aachen University neural network language modeling toolkit. In: INTERSPEECH, pp. 2093–2097 (2014). https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php
Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in 29th Annual Conference on Neural Information Processing Systems (NIPS) (2015). http://chainer.org/
Weninger, F., Bergmann, J., Schuller, B.: Introducing CURRENNT – the Munich open-source CUDA RecurREnt neural network toolkit. J. Mach. Learn. Res. 16(3), 547–551 (2015). https://sourceforge.net/projects/currennt/
Yoshioka, T., Nakatani, T., Miyoshi, M., Okuno, H.G.: Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011). http://www.kecl.ntt.co.jp/icl/signal/wpe/
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al.: The HTK Book, vol. 3, p. 175. Cambridge University Engineering Department (2002). http://htk.eng.cam.ac.uk/

Author information

Authors and Affiliations

Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA
Shinji Watanabe, Takaaki Hori & John R. Hershey
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
Yajie Miao & Florian Metze
NTT Corporation, 2-4, Hikaridai, Seika-cho, Kyoto, Japan
Marc Delcroix

Corresponding author

Correspondence to Shinji Watanabe.

Editor information

Editors and Affiliations

Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
Shinji Watanabe
NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
Marc Delcroix
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Florian Metze
Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
John R. Hershey

About this chapter

Cite this chapter

Watanabe, S., Hori, T., Miao, Y., Delcroix, M., Metze, F., Hershey, J.R. (2017). Toolkits for Robust Speech Processing. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_17

DOI: https://doi.org/10.1007/978-3-319-64680-0_17
Published: 26 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)
