Abstract
Robust automatic speech recognition (ASR) techniques have advanced rapidly in recent years, driven by the demand for ASR applications in real environments and supported by publicly available tools developed in the community. This chapter surveys the major toolkits available for robust ASR, covering general ASR toolkits, language model toolkits, speech enhancement/microphone array front-end toolkits, deep learning toolkits, and emerging end-to-end ASR toolkits. The aim of this chapter is to provide information about functionality (features, functions, platform, and language), license, and source location so that readers can easily access these tools to build their own robust ASR systems. Some of the toolkits have been used to build state-of-the-art ASR systems for various challenging tasks. The references in this chapter also include the URLs of the resource webpages.
Notes
1. The MathWorks Inc., http://www.mathworks.com/products/matlab/.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems (2016). arXiv preprint arXiv:1603.04467. https://www.tensorflow.org/
Agarwal, A., Akchurin, E., Basoglu, C., Chen, G., Cyphers, S., Droppo, J., Eversole, A., Guenter, B., Hillebrand, M., Hoens, T.R., et al.: An introduction to computational networks and the computational network toolkit. Microsoft Technical Report MSR-TR-2014-112 (2014). https://github.com/Microsoft/CNTK
Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: International Conference on Implementation and Application of Automata, pp. 11–23. Springer, New York (2007). http://www.openfst.org/
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin (2015). arXiv preprint arXiv:1512.02595. https://github.com/baidu-research/warp-ctc
Anguera, X., Wooters, C., Hernando, J.: Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 15(7), 2011–2022 (2007). http://www.xavieranguera.com/beamformit/
Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949 (2016). https://github.com/rizar/attention-lvcsr
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math compiler in Python. In: Proceedings of the 9th Python in Science Conference, pp. 1–7 (2010). http://deeplearning.net/software/theano/
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in 29th Annual Conference on Neural Information Processing Systems (NIPS) (2015). http://mxnet-mli.readthedocs.io/en/latest/
Chen, X., Liu, X., Qian, Y., Gales, M., Woodland, P.: CUED-RNNLM: an open-source toolkit for efficient training and evaluation of recurrent neural network language models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6000–6004. IEEE, New York (2016). http://mi.eng.cam.ac.uk/projects/cued-rnnlm/
Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a MATLAB-like environment for machine learning. In: BigLearn, NIPS Workshop, EPFL-CONF-192376 (2011). http://torch.ch/
Degottex, G., Kane, J., Drugman, T., Raitio, T., Scherer, S.: COVAREP: a collaborative voice analysis repository for speech technologies. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 960–964. IEEE, New York (2014). http://covarep.github.io/covarep/
ELRA: ELDA Portal. http://www.elra.info/en/
Federico, M., Bertoldi, N., Cettolo, M.: IRSTLM: An open source toolkit for handling large scale language models. In: Interspeech, pp. 1618–1621 (2008). http://hlt-mt.fbk.eu/technologies/irstlm
Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: ICML, vol. 14, pp. 1764–1772 (2014)
Grondin, F., Létourneau, D., Ferland, F., Rousseau, V., Michaud, F.: The ManyEars open framework. Auton. Robot. 34(3), 217–232 (2013). https://sourceforge.net/projects/manyears/
Heafield, K.: KenLM: faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics (2011). http://kheafield.com/code/kenlm/
Hsu, B.J.P., Glass, J.R.: Iterative language model estimation: efficient data structure & algorithms. In: INTERSPEECH, pp. 841–844 (2008). https://github.com/mitlm/mitlm
Idiap Research Institute: Bob 2.4.0 documentation. https://pythonhosted.org/bob/
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014). http://caffe.berkeleyvision.org/
Kumatani, K., McDonough, J., Schacht, S., Klakow, D., Garner, P.N., Li, W.: Filter bank design based on minimization of individual aliasing terms for minimum mutual information subband adaptive beamforming. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1609–1612. IEEE (2008). http://distantspeechrecognition.sourceforge.net/
Lee, K.F., Hon, H.W., Reddy, R.: An overview of the SPHINX speech recognition system. IEEE Trans. Acoust. Speech Signal Process. 38(1), 35–45 (1990). http://cmusphinx.sourceforge.net/
Lee, A., Kawahara, T., Shikano, K.: Julius – an open source real-time large vocabulary recognition engine. In: Interspeech, pp. 1691–1694 (2001). http://julius.osdn.jp/en_index.php
Linguistic Data Consortium: https://www.ldc.upenn.edu/
Maas, A.L., Xie, Z., Jurafsky, D., Ng, A.Y.: Lexicon-free conversational speech recognition with neural networks. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2015). https://github.com/amaas/stanford-ctc
Metze, F., Fosler-Lussier, E.: The speech recognition virtual kitchen: An initial prototype. In: Interspeech, pp. 1872–1873 (2012). http://speechkitchen.org/
Miao, Y., Gowayyed, M., Metze, F.: EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167–174 (2015). https://github.com/srvk/eesen
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, pp. 1045–1048 (2010). http://www.rnnlm.org/
Nakadai, K., Takahashi, T., Okuno, H.G., Nakajima, H., Hasegawa, Y., Tsujino, H.: Design and implementation of robot audition system “hark” open source software for listening to three simultaneous speakers. Adv. Robot. 24(5–6), 739–761 (2010). http://www.hark.jp/
Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012). http://bass-db.gforge.inria.fr/fasst/
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011). http://kaldi-asr.org/
Rybach, D., Gollan, C., Heigold, G., Hoffmeister, B., Lööf, J., Schlüter, R., Ney, H.: The RWTH Aachen University open source speech recognition system. In: Interspeech, pp. 2111–2114 (2009). https://www-i6.informatik.rwth-aachen.de/rwth-asr/
Schwenk, H.: CSLM – a modular open-source continuous space language modeling toolkit. In: INTERSPEECH, pp. 1198–1202 (2013). http://www-lium.univ-lemans.fr/cslm/
Stolcke, A., et al.: SRILM – an extensible language modeling toolkit. In: Interspeech, vol. 2002, pp. 901–904 (2002). http://www.speech.sri.com/projects/srilm/
Sundermeyer, M., Schlüter, R., Ney, H.: RWTHLM – the RWTH Aachen University neural network language modeling toolkit. In: INTERSPEECH, pp. 2093–2097 (2014). https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php
Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in 29th Annual Conference on Neural Information Processing Systems (NIPS) (2015). http://chainer.org/
Weninger, F., Bergmann, J., Schuller, B.: Introducing CURRENNT – the Munich open-source CUDA RecurREnt neural network toolkit. J. Mach. Learn. Res. 16(3), 547–551 (2015). https://sourceforge.net/projects/currennt/
Yoshioka, T., Nakatani, T., Miyoshi, M., Okuno, H.G.: Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011). http://www.kecl.ntt.co.jp/icl/signal/wpe/
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al.: The HTK Book, vol. 3, p. 175. Cambridge University Engineering Department (2002). http://htk.eng.cam.ac.uk/
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Watanabe, S., Hori, T., Miao, Y., Delcroix, M., Metze, F., Hershey, J.R. (2017). Toolkits for Robust Speech Processing. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_17
DOI: https://doi.org/10.1007/978-3-319-64680-0_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)