Quaero 2010 Speech-to-Text Evaluation Systems
Quaero is a French program with German participation, within which KIT works on automatic speech recognition for audio data from various sources on the World Wide Web. In this paper we describe the development of our English and German speech recognition systems for the 2010 Quaero evaluation, for which we used, at least in part, the XC4000 HPC cluster at KIT. Both recognition systems were trained with the Janus Recognition Toolkit developed at the Interactive Systems Laboratory, and both are extensions of the 2009 evaluation systems. Both systems use multiple front-ends, state-of-the-art acoustic models that include discriminative training, and very large language models that require the use of shared memory. Both systems also make use of domain-specific acoustic and language model training material that became available for the 2010 evaluation. In total, the extension of the systems and the addition of domain-dependent training material led to significantly improved performance over the 2009 systems.
Keywords: Linear Discriminant Analysis; Automatic Speech Recognition; Acoustic Model; Word Error Rate; Automatic Speech Recognition System