The ISL RT-06S Speech-to-Text System

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4299)

Abstract

This paper describes the 2006 lecture and conference meeting speech-to-text system developed at the Interactive Systems Laboratories (ISL) for the individual head-mounted microphone (IHM), single distant microphone (SDM), and multiple distant microphone (MDM) conditions. The system was evaluated in the RT-06S Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years: improved acoustic and language models, cross-adaptation between systems with different front-ends and phoneme sets, and the use of various automatic speech segmentation algorithms.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fügen, C. et al. (2006). The ISL RT-06S Speech-to-Text System. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_36

  • DOI: https://doi.org/10.1007/11965152_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69267-6

  • Online ISBN: 978-3-540-69268-3

  • eBook Packages: Computer Science, Computer Science (R0)
