A Study on Speech Processing

Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 435)

Abstract

Speech is the most natural means of communication in human-to-human interaction. Automatic Speech Recognition (ASR) is the application of technology to building machines that can autonomously transcribe speech into text in real time. This paper presents a short review of ASR systems. Fundamentally, the design of a speech recognition system involves three major processes: feature extraction, acoustic modeling, and classification. Accordingly, emphasis is placed on describing the essential principles of the techniques employed in each of these processes. The paper also presents the milestones in speech-processing research to date.
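
As a concrete illustration of the feature-extraction stage named above, the sketch below computes MFCC features, the classic ASR front end, from scratch with NumPy and SciPy. It is a minimal sketch, not the authors' implementation: the 16 kHz sampling rate, 25 ms/10 ms framing, 26 mel filters, and 13 cepstral coefficients are common textbook defaults assumed here, not values taken from the paper.

    # Minimal MFCC front end; all parameters are illustrative assumptions.
    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr=16000, frame_len=0.025, frame_step=0.010,
             n_filters=26, n_ceps=13, n_fft=512):
        # Pre-emphasis boosts high frequencies attenuated in voiced speech.
        x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

        # Slice into overlapping frames and apply a Hamming window.
        flen, fstep = int(sr * frame_len), int(sr * frame_step)
        n_frames = 1 + max(0, (len(x) - flen) // fstep)
        idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
        frames = x[idx] * np.hamming(flen)

        # Per-frame power spectrum.
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

        # Triangular filters spaced evenly on the mel scale,
        # m = 2595 * log10(1 + f / 700).
        mel = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
        hz = 700 * (10 ** (mel / 2595) - 1)
        bins = np.floor((n_fft + 1) * hz / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
            fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

        # Log filterbank energies, decorrelated by a DCT; keep low-order terms.
        log_e = np.log(power @ fbank.T + 1e-10)
        return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]

Called on a one-dimensional array of 16 kHz samples, mfcc(audio) returns one 13-dimensional vector per 10 ms step: the kind of observation sequence fed to the acoustic models (HMM/GMM and their variants) that such reviews survey.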

Author information

Correspondence to J. Ujwala Rekha.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Ujwala Rekha, J., Shahu Chatrapati, K., Vinaya Babu, A. (2016). A Study on Speech Processing. In: Satapathy, S., Mandal, J., Udgata, S., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 435. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2757-1_22

  • DOI: https://doi.org/10.1007/978-81-322-2757-1_22

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2756-4

  • Online ISBN: 978-81-322-2757-1
