Automatic Speech Recognition

  • Chapter
Fundamentals of Artificial Intelligence

Abstract

There are basically two application modes for automatic speech recognition (ASR): using speech as spoken input, or using speech as a knowledge source. Spoken input addresses applications such as dictation systems and navigation (transactional) systems, while speech as a knowledge source has applications such as multimedia indexing systems. The chapter presents the stages of the speech recognition process, the resources of ASR, and the role and functions of the speech engine, such as the Julius speech recognition engine, voice-over-web resources, ASR algorithms, the language model, and acoustic models such as hidden Markov models (HMMs). Introductions to, and usage guidelines for, many open-source tools, such as the Kaldi speech recognition toolkit, CMU Sphinx, HTK, and Deep Speech, are also presented. These tools have interfaces to high-level languages such as C/C++ and Python. The chapter closes with a summary and a set of exercises.


Notes

  1. For example, the word “speech” can also be pronounced as “spee...ech”, repeating the sound of ‘e’, which creates a self-loop.
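In HMM terms, this prolongation corresponds to a state transitioning back to itself. A minimal sketch of such a self-loop (the loop probability and function names are illustrative, not from the chapter):

```python
import random

# Toy HMM fragment for one phoneme state: with probability 0.6 the state
# loops back to itself (prolonging the sound), otherwise it moves on.
SELF_LOOP_P = 0.6

def emit_phoneme(symbol, rng):
    """Emit a phoneme, repeating it for as long as the self-loop is taken."""
    frames = [symbol]
    while rng.random() < SELF_LOOP_P:
        frames.append(symbol)  # self-loop: same state, same sound
    return frames

rng = random.Random(0)
print("".join(emit_phoneme("e", rng)))  # e.g. "e" or "eee" -- duration varies per run
```

Because the number of loop iterations is random, the same state can account for phonemes of varying duration, which is exactly why HMM topologies for speech include self-loops.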

  2. Chanting of Aum (IPA: /ɐwm/) is a common practice during meditation and yoga, where the sound of ‘m’ is repeated.

  3. Lombard effect: the involuntary tendency of speakers to increase their vocal effort when speaking against a loud background, to enhance the audibility of their voice. Under the Lombard effect, not only does loudness increase, but so do other acoustic features such as pitch, speaking rate, and syllable duration. The effect also increases the signal-to-noise ratio of the speaker’s signal.
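The signal-to-noise ratio mentioned above is conventionally measured in decibels as ten times the base-10 logarithm of the signal-to-noise power ratio. A minimal sketch with synthetic sample values (not data from the chapter):

```python
import math

def snr_db(signal, noise):
    """SNR in dB: 10 * log10(signal power / noise power)."""
    p_sig = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_sig / p_noise)

# A louder (Lombard) voice over the same noise yields a higher SNR.
noise = [0.1, -0.1, 0.1, -0.1]
quiet = [0.5, -0.5, 0.5, -0.5]
loud  = [1.0, -1.0, 1.0, -1.0]
print(snr_db(quiet, noise))  # ~13.98 dB
print(snr_db(loud, noise))   # 20.0 dB
```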

References

  1. Hannun A et al (2014) Deep Speech: Scaling up end-to-end speech recognition. https://arxiv.org/abs/1412.5567. Accessed Dec 19, 2017

  2. http://julius.osdn.jp/book/Julius-3.2-book-e.pdf. Accessed Dec 19, 2017

  3. http://kaldi.sf.net/. Accessed Dec 19, 2017

  4. Padmanabhan M, Picheny M (2002) Large-vocabulary speech recognition algorithms. Computer 35(4):42–50


  5. Povey D et al (2011) The Kaldi speech recognition toolkit. In: IEEE workshop on automatic speech recognition and understanding (ASRU), Hawaii. IEEE Signal Processing Society


  6. Cole R et al (eds) (1997) Survey of the state of the art in human language technology. Studies in natural language processing. Cambridge University Press


  7. Srinivasan S, Brown E (2002) Is speech recognition becoming mainstream? Computer 35(4):38–41


  8. http://www.w3.org/Voice/. Accessed Dec 19, 2017


Author information


Corresponding author

Correspondence to K. R. Chowdhary.

Exercises

  1. Consider the alphabet \(\Sigma = \{a, b, c, d\}\). Construct finite automata (recognizers) for the following languages.

    a. All strings that start with the letter a.

    b. All strings that end with the letter d.

    c. All strings in which every c is immediately followed by the letter d.

    d. All strings with an odd number of c’s.
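As a hint on how such recognizers can be encoded programmatically, here is a minimal DFA simulator in Python (state names are illustrative); the machine shown solves only part (a):

```python
# A DFA as a transition function over named states; 'q_acc' marks acceptance.
# This machine recognizes strings over {a, b, c, d} that start with 'a'.
START, ACCEPT, DEAD = "q0", "q_acc", "q_dead"

def delta(state, ch):
    if state == START:
        return ACCEPT if ch == "a" else DEAD
    return state  # accept and dead states absorb all further input

def accepts(s):
    state = START
    for ch in s:
        state = delta(state, ch)
    return state == ACCEPT

print(accepts("abcd"))  # True
print(accepts("bacd"))  # False
print(accepts(""))      # False
```

The other parts differ only in the transition table and which states are accepting.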

  2. Answer the following briefly, giving suitable examples.

    a. What is the difference between a phoneme and a morpheme?

    b. What is the difference between a language and a dialect?

  3. Write an equation to compute the trigram probability.
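For orientation (not the answer itself), maximum-likelihood n-gram probabilities are estimated from corpus counts. A bigram version as a sketch, with the trigram case left to the exercise; the corpus and function name are illustrative:

```python
from collections import Counter

def bigram_prob(tokens, w1, w2):
    """MLE bigram probability P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bigrams[(w1, w2)] / unigrams[w1]

corpus = "the cat sat on the mat".split()
print(bigram_prob(corpus, "the", "cat"))  # 0.5: 'the' occurs twice, once before 'cat'
```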

  4. Text-processing algorithms are usually written in Python, while ASR algorithms, which produce the same text, are written in C/C++. Explain what could be the reason behind this.

  5. What is the fundamental difference between the language model and the acoustic model? Why are both needed in ASR?
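As a hint, recall the standard noisy-channel formulation of ASR decoding, in which the two models enter as separate factors:

```latex
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
```

Here \(A\) is the acoustic observation sequence and \(W\) a candidate word sequence: \(P(A \mid W)\) is supplied by the acoustic model and \(P(W)\) by the language model.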


Copyright information

© 2020 Springer Nature India Private Limited

About this chapter


Cite this chapter

Chowdhary, K.R. (2020). Automatic Speech Recognition. In: Fundamentals of Artificial Intelligence. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3972-7_20


  • DOI: https://doi.org/10.1007/978-81-322-3972-7_20

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-3970-3

  • Online ISBN: 978-81-322-3972-7

  • eBook Packages: Computer Science; Computer Science (R0)
