Abstract
Quaero is a French program with German participation, within which KIT is also working on the problem of automatic speech recognition for audio data from various sources on the World Wide Web. In this paper we describe the development of our English and German speech recognition systems for the 2010 Quaero evaluation, for which we used, at least in part, the XC4000 HPC cluster at KIT. Both recognition systems were trained with the Janus Recognition Toolkit developed at the Interactive Systems Laboratory, and both extend the 2009 evaluation systems. Both systems use various front-ends, state-of-the-art acoustic models that include discriminative training, and very large language models that require the use of shared memory. Both systems also make use of domain-specific acoustic and language model training material that became available for the 2010 evaluation. In total, the expansion of the systems and the addition of domain-dependent training material led to significantly improved performance over the 2009 systems.
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stüker, S., Kilgour, K., Kraft, F. (2012). Quaero 2010 Speech-to-Text Evaluation Systems. In: Nagel, W., Kröner, D., Resch, M. (eds) High Performance Computing in Science and Engineering '11. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23869-7_44
DOI: https://doi.org/10.1007/978-3-642-23869-7_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23868-0
Online ISBN: 978-3-642-23869-7
eBook Packages: Mathematics and Statistics (R0)