
Toolkits for Robust Speech Processing


Abstract

Robust automatic speech recognition (ASR) techniques have developed rapidly in recent years, driven by the demand for ASR applications in real environments and aided by publicly available tools developed in the community. This chapter surveys the major toolkits available for robust ASR, covering general ASR toolkits, language model toolkits, speech enhancement/microphone array front-end toolkits, deep learning toolkits, and emerging end-to-end ASR toolkits. The aim of this chapter is to provide information about functionality (features, functions, platform, and language), license, and source location so that readers can easily access these tools to build their own robust ASR systems. Some of the toolkits have been used to build state-of-the-art ASR systems for various challenging tasks. The references in this chapter also include the URLs of the resource webpages.


Notes

  1. The MathWorks Inc., http://www.mathworks.com/products/matlab/.

References

  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems (2016). arXiv preprint arXiv:1603.04467. https://www.tensorflow.org/

  2. Agarwal, A., Akchurin, E., Basoglu, C., Chen, G., Cyphers, S., Droppo, J., Eversole, A., Guenter, B., Hillebrand, M., Hoens, T.R., et al.: An introduction to computational networks and the computational network toolkit. Microsoft Technical Report MSR-TR-2014-112 (2014). https://github.com/Microsoft/CNTK

  3. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: International Conference on Implementation and Application of Automata, pp. 11–23. Springer, New York (2007). http://www.openfst.org/

  4. Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin (2015). arXiv preprint arXiv:1512.02595. https://github.com/baidu-research/warp-ctc

  5. Anguera, X., Wooters, C., Hernando, J.: Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 15(7), 2011–2022 (2007). http://www.xavieranguera.com/beamformit/

  6. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949 (2016). https://github.com/rizar/attention-lvcsr

  7. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math compiler in Python. In: Proceedings of the 9th Python in Science Conference, pp. 1–7 (2010). http://deeplearning.net/software/theano/

  8. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in 29th Annual Conference on Neural Information Processing Systems (NIPS) (2015). http://mxnet-mli.readthedocs.io/en/latest/

  9. Chen, X., Liu, X., Qian, Y., Gales, M., Woodland, P.: CUED-RNNLM: an open-source toolkit for efficient training and evaluation of recurrent neural network language models. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6000–6004. IEEE, New York (2016). http://mi.eng.cam.ac.uk/projects/cued-rnnlm/

  10. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a MATLAB-like environment for machine learning. In: BigLearn, NIPS Workshop, EPFL-CONF-192376 (2011). http://torch.ch/

  11. Degottex, G., Kane, J., Drugman, T., Raitio, T., Scherer, S.: COVAREP: a collaborative voice analysis repository for speech technologies. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 960–964. IEEE, New York (2014). http://covarep.github.io/covarep/

  12. ELRA: ELDA Portal. http://www.elra.info/en/

  13. Federico, M., Bertoldi, N., Cettolo, M.: IRSTLM: an open source toolkit for handling large scale language models. In: Interspeech, pp. 1618–1621 (2008). http://hlt-mt.fbk.eu/technologies/irstlm

  14. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: ICML, vol. 14, pp. 1764–1772 (2014)

  15. Grondin, F., Létourneau, D., Ferland, F., Rousseau, V., Michaud, F.: The ManyEars open framework. Auton. Robot. 34(3), 217–232 (2013). https://sourceforge.net/projects/manyears/

  16. Heafield, K.: KenLM: faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics (2011). http://kheafield.com/code/kenlm/

  17. Hsu, B.J.P., Glass, J.R.: Iterative language model estimation: efficient data structure & algorithms. In: INTERSPEECH, pp. 841–844 (2008). https://github.com/mitlm/mitlm

  18. Idiap Research Institute: Bob 2.4.0 documentation. https://pythonhosted.org/bob/

  19. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014). http://caffe.berkeleyvision.org/

  20. Kumatani, K., McDonough, J., Schacht, S., Klakow, D., Garner, P.N., Li, W.: Filter bank design based on minimization of individual aliasing terms for minimum mutual information subband adaptive beamforming. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1609–1612. IEEE (2008). http://distantspeechrecognition.sourceforge.net/

  21. Lee, K.F., Hon, H.W., Reddy, R.: An overview of the SPHINX speech recognition system. IEEE Trans. Acoust. Speech Signal Process. 38(1), 35–45 (1990). http://cmusphinx.sourceforge.net/

  22. Lee, A., Kawahara, T., Shikano, K.: Julius – an open source real-time large vocabulary recognition engine. In: Interspeech, pp. 1691–1694 (2001). http://julius.osdn.jp/en_index.php

  23. Linguistic Data Consortium: https://www.ldc.upenn.edu/

  24. Maas, A.L., Xie, Z., Jurafsky, D., Ng, A.Y.: Lexicon-free conversational speech recognition with neural networks. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2015). https://github.com/amaas/stanford-ctc

  25. Metze, F., Fosler-Lussier, E.: The speech recognition virtual kitchen: an initial prototype. In: Interspeech, pp. 1872–1873 (2012). http://speechkitchen.org/

  26. Miao, Y., Gowayyed, M., Metze, F.: EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167–174 (2015). https://github.com/srvk/eesen

  27. Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, pp. 1045–1048 (2010). http://www.rnnlm.org/

  28. Nakadai, K., Takahashi, T., Okuno, H.G., Nakajima, H., Hasegawa, Y., Tsujino, H.: Design and implementation of robot audition system “HARK”, open source software for listening to three simultaneous speakers. Adv. Robot. 24(5–6), 739–761 (2010). http://www.hark.jp/

  29. Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012). http://bass-db.gforge.inria.fr/fasst/

  30. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (2011). http://kaldi-asr.org/

  31. Rybach, D., Gollan, C., Heigold, G., Hoffmeister, B., Lööf, J., Schlüter, R., Ney, H.: The RWTH Aachen University open source speech recognition system. In: Interspeech, pp. 2111–2114 (2009). https://www-i6.informatik.rwth-aachen.de/rwth-asr/

  32. Schwenk, H.: CSLM – a modular open-source continuous space language modeling toolkit. In: INTERSPEECH, pp. 1198–1202 (2013). http://www-lium.univ-lemans.fr/cslm/

  33. Stolcke, A., et al.: SRILM – an extensible language modeling toolkit. In: Interspeech, vol. 2002, pp. 901–904 (2002). http://www.speech.sri.com/projects/srilm/

  34. Sundermeyer, M., Schlüter, R., Ney, H.: RWTHLM – the RWTH Aachen University neural network language modeling toolkit. In: INTERSPEECH, pp. 2093–2097 (2014). https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php

  35. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in 29th Annual Conference on Neural Information Processing Systems (NIPS) (2015). http://chainer.org/

  36. Weninger, F., Bergmann, J., Schuller, B.: Introducing CURRENNT – the Munich open-source CUDA RecurREnt neural network toolkit. J. Mach. Learn. Res. 16(3), 547–551 (2015). https://sourceforge.net/projects/currennt/

  37. Yoshioka, T., Nakatani, T., Miyoshi, M., Okuno, H.G.: Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio Speech Lang. Process. 19(1), 69–84 (2011). http://www.kecl.ntt.co.jp/icl/signal/wpe/

  38. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al.: The HTK Book, vol. 3, p. 175. Cambridge University Engineering Department (2002). http://htk.eng.cam.ac.uk/


Author information

Correspondence to Shinji Watanabe.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Watanabe, S., Hori, T., Miao, Y., Delcroix, M., Metze, F., Hershey, J.R. (2017). Toolkits for Robust Speech Processing. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_17

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0
