Quaero 2010 Speech-to-Text Evaluation Systems

Abstract

Quaero is a French program with German participation, within which KIT is also working on Automatic Speech Recognition for audio data from various sources on the World Wide Web. In this paper we describe the development of our English and German speech recognition systems for the 2010 Quaero evaluation, for which we used, at least in part, the XC4000 HPC cluster at KIT. Both recognition systems were trained with the Janus Recognition Toolkit developed at the Interactive Systems Laboratory, and both extend the 2009 evaluation systems. Both systems use various front-ends, state-of-the-art acoustic models that include discriminative training, and very large language models that require the use of shared memory. Both systems also make use of domain-specific acoustic and language model training material that became available for the 2010 evaluation. In total, the expansion of the systems and the addition of domain-dependent training material led to significantly improved performance over the 2009 systems.

Keywords

Linear Discriminant Analysis; Automatic Speech Recognition; Acoustic Model; Word Error Rate; Automatic Speech Recognition System
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Sebastian Stüker (1)
  • Kevin Kilgour (1)
  • Florian Kraft (1)
  1. Research Group 3-01 ‘Multilingual Speech Recognition’, Karlsruhe Institute of Technology, Karlsruhe, Germany
