A Study on Speech Processing

Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 435)

Abstract

Speech is the most natural means of communication in human-to-human interaction. Automatic Speech Recognition (ASR) is the application of technology to building machines that can autonomously transcribe speech into text in real time. This paper presents a short review of ASR systems. Fundamentally, the design of a speech recognition system involves three major processes: feature extraction, acoustic modeling, and classification. Accordingly, emphasis is placed on describing the essential principles of the techniques employed in each of these processes. The paper also presents the milestones in speech-processing research to date.
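
As a concrete illustration of the feature-extraction stage named above, the sketch below computes MFCC features, the classic ASR front end, from scratch with NumPy and SciPy. It is a minimal sketch, not the authors' implementation: the 16 kHz sampling rate, 25 ms/10 ms framing, 26 mel filters, and 13 cepstral coefficients are common textbook defaults assumed here, not values taken from the paper.

    # Minimal MFCC front end; all parameters are illustrative assumptions.
    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr=16000, frame_len=0.025, frame_step=0.010,
             n_filters=26, n_ceps=13, n_fft=512):
        # Pre-emphasis boosts high frequencies attenuated in voiced speech.
        x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

        # Slice into overlapping frames and apply a Hamming window.
        flen, fstep = int(sr * frame_len), int(sr * frame_step)
        n_frames = 1 + max(0, (len(x) - flen) // fstep)
        idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
        frames = x[idx] * np.hamming(flen)

        # Per-frame power spectrum.
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

        # Triangular filters spaced evenly on the mel scale,
        # m = 2595 * log10(1 + f / 700).
        mel = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
        hz = 700 * (10 ** (mel / 2595) - 1)
        bins = np.floor((n_fft + 1) * hz / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
            fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

        # Log filterbank energies, decorrelated by a DCT; keep low-order terms.
        log_e = np.log(power @ fbank.T + 1e-10)
        return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]

Called on a one-dimensional array of 16 kHz samples, mfcc(audio) returns one 13-dimensional vector per 10 ms step: the kind of observation sequence fed to the acoustic models (HMM/GMM and their variants) that such reviews survey.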

Author information

Correspondence to J. Ujwala Rekha.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Ujwala Rekha, J., Shahu Chatrapati, K., Vinaya Babu, A. (2016). A Study on Speech Processing. In: Satapathy, S., Mandal, J., Udgata, S., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 435. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2757-1_22

  • DOI: https://doi.org/10.1007/978-81-322-2757-1_22

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2756-4

  • Online ISBN: 978-81-322-2757-1
