
Speech Recognition

  • Gernot A. Fink
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

Today, a number of commercial speech recognition systems are available on the market and, just recently, speech-enabled assistive services were introduced for smartphones. Nevertheless, the problem of automatic speech recognition should by no means be considered solved.

Building a competitive speech recognition system requires the integration of a multitude of techniques. Probably the best-documented research systems are those developed by Hermann Ney and colleagues at the former Philips Research Lab in Aachen, Germany, and later at RWTH Aachen University. This chapter puts the emphasis on the work at RWTH Aachen University; however, many aspects of the systems within this research tradition are identical to those developed at Philips. We then present the speech recognizer of BBN, which, in contrast to most systems developed by private companies, is documented in several scientific publications. The chapter concludes with a description of our own speech recognition system, developed on the basis of ESMERALDA.

Keywords

Linear Discriminant Analysis; Speech Recognition; Language Model; Automatic Speech Recognition; Speech Recognition System

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Gernot A. Fink, Department of Computer Science, TU Dortmund University, Dortmund, Germany
