A review on speech processing using machine learning paradigm

  • Published in: International Journal of Speech Technology

Abstract

Speech processing plays a crucial role in many signal processing applications, and the last decade has brought enormous advances driven by the machine learning paradigm. Speech processing is closely related to computational linguistics, human–machine interaction, natural language processing, and psycholinguistics. This review article focuses on the feature extraction techniques and machine learning classifiers employed in speech processing and recognition tasks. The performance of several machine learning techniques is validated on the Berlin EmoDB database for speech emotion recognition. Finally, the article outlines the broad application areas of, and open challenges in, machine learning for speech processing.
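
To make the reviewed pipeline concrete, the sketch below pairs MFCC feature extraction with a support vector machine classifier for speech emotion recognition on the Berlin EmoDB corpus. This is a minimal illustration rather than the authors' exact experimental setup: it assumes librosa and scikit-learn are installed, and the dataset directory "emodb/wav" is a placeholder; EmoDB filenames encode the emotion label in their sixth character.

```python
# Hedged sketch: MFCC features + SVM for speech emotion recognition
# on Berlin EmoDB. Illustrative only; not the authors' exact setup.
import glob
import os

import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# EmoDB encodes the emotion in the sixth character of each filename,
# e.g. "03a01Wa.wav" -> "W" (anger).
EMOTIONS = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
            "F": "happiness", "T": "sadness", "N": "neutral"}

def utterance_features(path, n_mfcc=13):
    """Mean and std of MFCCs over time -> one fixed-length vector."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

files = sorted(glob.glob("emodb/wav/*.wav"))  # assumed dataset location
X = np.array([utterance_features(f) for f in files])
y = np.array([EMOTIONS[os.path.basename(f)[5]] for f in files])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Feature scaling matters for RBF-kernel SVMs, hence the pipeline.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_tr, y_tr)
print(f"Held-out accuracy: {clf.score(X_te, y_te):.3f}")
```

The same skeleton accommodates the other classifiers compared in the review (e.g., k-NN, naïve Bayes, or an MLP) by swapping the final estimator.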

References

  • Abbosovna, A. Z. (2020). Interactive games as a way to improve speech skills in foreign language lessons. Asian Journal of Multidimensional Research (AJMR), 9(6), 165–171.

  • Abdellah, K., Francis, G., Juan, R. O., & Jean, S. (2020). Principal component analysis of the spectrogram of the speech signal: Interpretation and application to dysarthric speech. Computer Speech & Language, 59, 114–122.

  • Afshan, A., Guo, J., Park, S. J., Ravi, V., Flint, J., & Alwan, A. (2018, September). Effectiveness of voice quality features in detecting depression. In Interspeech (pp. 1676–1680).

  • Akçay, M. B., & Oğuz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116, 56–76.

  • Alhargan, A., Cooke, N., & Binjammaz, T. (2017). Multimodal affect recognition in an interactive gaming environment using eye tracking and speech signals. In: Proceedings of the 19th ACM international conference on multimodal interaction, pp. 479–486.

  • Ali, L., Zhu, C., Zhang, Z., & Liu, Y. (2019). Automated detection of Parkinson’s disease based on multiple types of sustained phonations using linear discriminant analysis and genetically optimized neural network. IEEE Journal of Translational Engineering in Health and Medicine, 7, 1–10.

  • Alves, S. F., Silva, I. N., Ranieri, C. M., & Ferasoli Filho, H. (2014). Assisted robot navigation based on speech recognition and synthesis. In 5th ISSNIP-IEEE biosignals and biorobotics conference (2014): Biosignals and robotics for better and safer living (BRC), pp. 1–5.

  • Amberkar, A., Awasarmol, P., Deshmukh, G., & Dave, P. (2018). Speech recognition using recurrent neural networks. In: International conference on current trends towards converging technologies (ICCTCT), Coimbatore, pp. 1–4.

  • Anjana, J. S., & Poorna, S. S. (2018, March). Language identification from speech features using SVM and LDA. In: 2018 international conference on wireless communications, signal processing and networking (WiSPNET), pp. 1–4.

  • Anusuya, M. A., & Katti, S. K. (2009). Speech recognition by machine: A review. International Journal of Computer Science and Information Security, 6(3), 181–205.

  • Babaee, E., Anuar, N. B., Abdul Wahab, A. W., Shamshirband, S., & Chronopoulos, A. T. (2017). An overview of audio event detection methods from feature extraction to classification. Applied Artificial Intelligence, 31(9–10), 661–714.

  • Baig, M., Masud, S., & Awais, M. (2006). Support vector machine based voice activity detection. In: International symposium on intelligent signal processing and communications, Tottori, pp. 319–322.

  • Bakshi, A., & Kopparapu, S. K. (2019). Spoken Indian language classification using GMM supervectors and artificial neural networks. IEEE Bombay Section Signature Conference (IBSSC), Mumbai, India, pp. 1–6.

  • Barde, S., & Kaimal, V. (2020). Speech recognition technique for identification of raga. Cognitive Informatics, Computer Modelling, and Cognitive Science, 11, 101–117.

  • Barizão, A. H., Fermino, M. A., Dajer, M. E., Liboni, L. H. B., & Spatti, D. H. (2018). Voice disorder classification using MLP and wavelet packet transform. International joint conference on neural networks (IJCNN), Rio de Janeiro, pp. 1–8

  • Bavkar, S., & Sahare, S. (2013). PCA based single channel speech enhancement method for highly noisy environment. International conference on advances in computing, communications and informatics (ICACCI), pp. 1103–1107.

  • Bhakre, S. K., & Bang, A. (2016). Emotion recognition on the basis of audio signal using Naive Bayes classifier. International conference on advances in computing, communications and informatics (ICACCI), Jaipur, pp. 2363–2367.

  • Bhangale, K. B., et al. (2018). Synthetic speech spoofing detection using MFCC and SVM. IOSR Journal of Engineering (IOSRJEN), 8(6), 55–61.

  • Bhanja, C. C., Laskar, M. A., & Laskar, R. H. (2019). A pre-classification-based language identification for Northeast Indian languages using prosody and spectral features. Circuits, Systems, and Signal Processing, 38(5), 2266–2296.

  • Bharali, S. S., & Kalita, S. K. (2017). Speaker identification using vector quantization and I-vector with reference to Assamese language. In: International conference on wireless communications, signal processing and networking (WiSPNET), Chennai, pp. 164–168.

  • Bharath, K. P., & Kumar, R. M. (2019). Multitaper based MFCC feature extraction for robust speaker recognition system. In: Innovations in power and advanced computing technologies (i-PACT), Vellore, pp. 1–5.

  • Biswas, S., & Solanki, S. S. (2020). Speaker recognition: An enhanced approach to identify singer voice using neural network. International Journal of Speech Technology, 1, 1–13.

  • Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In Ninth European conference on speech communication and technology. Interspeech, (pp. 1517–1520).

  • Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4960–4964.

  • Chen, X., Li, H., Ma, L., Liu, X., & Chen, J. (2015). Teager Mel and PLP fusion feature based speech emotion recognition. In: Fifth international conference on instrumentation and measurement, computer, communication and control (IMCCC), Qinhuangdao, pp. 1109–1114.

  • Chittaragi, N. B., & Koolagudi, S. G. (2020). Sentence-based dialect identification system using extreme gradient boosting algorithm. Smart computing paradigms: New progresses and challenges (pp. 131–138). Singapore: Springer.

  • Chougala, M., & Kuntoji, S. (2016). Novel text independent speaker recognition using LPC based formants. In: International conference on electrical, electronics, and optimization techniques (ICEEOT), Chennai, pp. 510–513.

  • Cuiling, L. (2016). English speech recognition method based on hidden Markov model. In: International conference on smart grid and electrical automation (ICSGEA), Zhangjiajie, pp. 94–97.

  • Cumani, S., & Laface, P. (2014). Large-scale training of pairwise support vector machines for speaker recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(11), 1590–1600.

  • Dahmani, M., & Guerti, M. (2017). Vocal folds pathologies classification using Naïve Bayes networks. In: 6th international conference on systems and control (ICSC), Batna, pp. 426–432.

  • Dai, J., Vijayarajan, V., Peng, X., Tan, L. & Jiang, J. (2018). Speech recognition using sparse discrete wavelet decomposition feature extraction. In: IEEE international conference on electro/information technology (EIT), Rochester, MI, pp. 812–816.

  • Deka, B. K., & Das, P. (2019). An analysis of an isolated Assamese digit recognition using MFCC and DTW. 6th international conference on computing for sustainable global development (INDIACom), New Delhi, India, pp. 46–50.

  • Delic, V., et al. (2019). Speech technology progress based on new machine learning paradigm. Computational Intelligence and Neuroscience, 2019, 1–19.

  • Diez, M., Burget, L., Landini, F., & Černocký, J. (2020). Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 355–368.

  • Djamal, E. C., Nurhamidah, N., & Ilyas, R. (2017). Spoken word recognition using MFCC and learning vector quantization. In: 4th international conference on electrical engineering, computer science and informatics (EECSI), Yogyakarta, pp. 1–6.

  • Eronen, A. J., & Klapuri, A. P. (2010). Music tempo estimation with k-NN regression. IEEE Transactions on Audio, Speech and Language Processing, 11(1), 50–57.

  • Everest, F. A., & Pohlmann, K. C. (2009). Master handbook of acoustics (5th ed.). New York, NY: McGraw-Hill.

  • Fan, L., Ke, D., Fu, X., Lu, S., & Xu, B. (2012). Power-normalized PLP (PNPLP) feature for robust speech recognition. In: 8th international symposium on Chinese spoken language processing, Kowloon, pp. 224–228.

  • Garg, A., & Sharma, P. (2016). Survey on acoustic modeling and feature extraction for speech recognition. In: 3rd international conference on computing for sustainable global development (INDIACom), pp. 2291–2295.

  • Gillespie, S., Logan, Y. Y., Moore, E., Laures-Gore, J., Russell, S., & Patel, R. (2017, August). Cross-database models for the classification of dysarthria presence. In Interspeech (pp. 3127–3131).

  • Gonçalves, C., Rocha, T., Reis, A., & Barroso, J. (2017). AppVox: An application to assist people with speech impairments in their speech therapy sessions. In: World conference on information systems and technologies. Springer, pp. 581–591.

  • Guerchi, D., & Mohamed, E. E. (2012). LPC-based narrowband speech steganography. In: Benlamri, R. (Ed.), Networked digital technologies (NDT 2012), Communications in Computer and Information Science, 294, 277–288.

  • Guiming, D., Xia, W., Guangyan, W., Yan, Z., & Dan, L. (2016). Speech recognition based on convolutional neural networks. In: IEEE International Conference On Signal And Image Processing (ICSIP), Beijing, pp. 708–711.

  • Gupta, H., & Gupta, D. (2016). LPC and LPCC method of feature extraction in Speech Recognition System. 6th international conference - cloud system and big data engineering (Confluence), Noida, pp. 498–502.

  • Gupta, K., & Gupta, D. (2016). An analysis on LPC, RASTA and MFCC techniques in Automatic Speech recognition system. 2016 6th international conference - cloud system and big data engineering (confluence), Noida, pp. 493–497.

  • Han, E., & Cha, H. (2020). Adaptive feature generation for speech emotion recognition. IEIE Transactions on Smart Processing & Computing, 9(3), 185–192.

  • Hazrat, A., Ahmad, N., & Zhou, X. (2015). Automatic speech recognition of Urdu words using linear discriminant analysis. Journal of Intelligent and Fuzzy Systems, 28, 2369–2375.

  • Heck, P., & Chou, K. C. (1994). Gaussian mixture model classifiers for machine monitoring. Proceedings of ICASSP ‘94. IEEE international conference on acoustics, speech and signal processing, Adelaide, SA, Vol. 6, pp. 133–136.

  • Hidayat, R., Bejo, A., Sumaryono, S., & Winursito, A. (2018). Denoising speech for MFCC feature extraction using wavelet transformation in speech recognition system. 10th international conference on information technology and electrical engineering (ICITEE), Kuta, pp. 280–284.

  • Hsieh, H., Chien, J., Shinoda, K., & Furui, S. (2009). Independent component analysis for noisy speech recognition. IEEE international conference on acoustics, speech and signal processing, pp. 4369–4372.

  • Huang, X., Acero, A., Hon, H. W., & Reddy, R. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Upper Saddle River, NJ: Prentice Hall PTR.

  • Huang, X., Liu, Z., Lu, W., Liu, H., & Xiang, S. (2020). Fast and effective copy-move detection of digital audio based on auto segment. In: Digital forensics and forensic investigations: Breakthroughs in research and practice (pp. 127–142).

  • Huang, Y., Xiao, J., Tian, K., Wu, A., & Zhang, G. (2019). Research on robustness of emotion recognition under environmental noise conditions. IEEE Access, 7, 142009–142021.

  • Ing-Jr, D., & Ming, Y. H. (2014). An HMM-like dynamic time warping scheme for automatic speech recognition. Mathematical Problems in Engineering, 2014, 1–8.

  • Ishimoto, Y., Teraoka, T., & Enomoto, M. (2017). End-of-utterance prediction by prosodic features and phrase-dependency structure in spontaneous Japanese speech. Interspeech, pp. 1681–1685.

  • Jacob, A. (2016, April). Speech emotion recognition based on minimal voice quality features. In 2016 International conference on communication and signal processing (ICCSP) (pp. 0886–0890). IEEE.

  • Jena, B., & Singh, S. S. (2018). Analysis of stressed speech on Teager energy operator (TEO). International Journal of Pure and Applied Mathematics, 118(16), 667–680.

  • Jo, J., Yoo, H., & Park, I. (2016). Energy-efficient floating-point MFCC extraction architecture for speech recognition systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(2), 754–758.

  • Jung, J., & Kim, G. (2017). Machine learning based speech disorder detection system. Journal of Broadcast Engineering, 22(2), 253–256.

  • Kandpal, N., & Madhusudan, B. R. (2010). Implementation of PCA & ICA for voice recognition and separation of speech. IEEE international conference on advanced management science (ICAMS 2010), pp. 536–538.

  • Kanhe, A., & Aghila, G. (2018). A DCT–SVD-based speech steganography in voiced frames. Circuits, Systems, and Signal Processing, 37(11), 5049–5068.

  • Kathania, H. K., Shahnawazuddin, S., Adiga, N., & Ahmad, W. (2018, April). Role of prosodic features on children’s speech recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5519–5523.

  • Ke, X., Zhu, Y., Wen, L., & Zhang, W. (2018). Speech emotion recognition based on SVM and ANN. International Journal of Machine Learning and Computing, 8(3), 198–202.

  • Khan, A., & Roy, U. K. (2017). Emotion recognition using prosodic and spectral features of speech and Naïve Bayes classifier. In: International conference on wireless communications, signal processing and networking (WiSPNET), Chennai, pp. 1017–1021.

  • Khunarsa, P. (2017). Single-signal entity approach for sung word recognition with artificial neural network and time–frequency audio features. The Journal of Engineering, 2017(12), 634–645.

  • Kim, M., Kim, Y., Yoo, J., Wang, J., & Kim, H. (2017). Regularized speaker adaptation of KL-HMM for dysarthric speech recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(9), 1581–1591.

  • Kim, C., & Stern, R. M. (2016). Power-normalized cepstral coefficients (PNCC) for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(7), 1315–1329.

  • Koehler, J., Morgan, N., Hermansky, H., Hirsch, H. G., & Tong, G. (1994). Integrating RASTA-PLP into speech recognition. In: Proceedings of ICASSP ‘94. IEEE international conference on acoustics, speech and signal processing, Adelaide, SA, 1, pp. 421–424.

  • Kohler, M., Vellasco, M. M., & Cataldo, E. (2016). Analysis and classification of voice pathologies using glottal signal parameters. Journal of Voice, 30(5), 549–556.

  • Kohlschein, C., Schmitt, M., Schüller, B., Jeschke, S., & Werner, C. J. (2017). A machine learning based system for the automatic evaluation of aphasia speech. In: IEEE 19th international conference on e-health networking, applications and services, pp. 1–6.

  • Laleye, F. A. A., Ezin, E. C., & Motamed, C. (2014). Weighted combination of Naive Bayes and LVQ classifier for Fongbe phoneme classification. Tenth international conference on signal-image technology and internet-based systems, Marrakech, pp. 7–13.

  • Le, H., Oparin, I., Allauzen, A., Gauvain, J., & Yvon, F. (2013). Structured output layer neural network language models for speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 21(1), 197–206.

  • Lee, S. (2015). Hybrid Naïve Bayes K-nearest neighbor method implementation on speech emotion recognition. In: IEEE advanced information technology, electronic and automation control conference (IAEAC), Chongqing, pp. 349–353.

  • Lee, K., Moon, C., & Nam, Y. (2018). Diagnosing vocal disorders using cobweb clustering of the jitter, shimmer, and harmonics-to-noise ratio. KSII Transactions on Internet & Information Systems, 12(11), 5541–5554.

  • Lee, D., Park, H., Lim, M., & Kim, J. (2019). Dynamic time warping-based Korean spoken word detection system using Euclidean distance in intelligent personal assistants. IEEE 8th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, pp. 519–520.

  • Li, X., Tao, J., Johnson, M. T., Soltis, J., Savage, A., Leong, K. M., & Newman, J. D. (2007, April). Stress and emotion classification using jitter and shimmer features. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, 4, IV-1081.

  • Lin, J., & Zhang, B. (2018). A music retrieval method based on hidden Markov model. In: 2018 international conference on intelligent transportation, big data & smart city (ICITBS), pp. 732–735.

  • Liu, P., Li, S., & Wang, H. (2017). Steganography integrated into linear predictive coding for low bit-rate speech codec. Multimedia Tools and Applications, 76, 2837–2859.

  • Liu, L., & Yang, J. (2020). Study on feature complementarity of statistics, energy, and principal information for spoofing detection. IEEE Access, 8, 141170–141181.

  • Lovato, A., Bonora, C., Genovese, E., Amato, C., Maiolino, L., & de Filippis, C. (2020). A panel of jitter/shimmer may identify functional dysphonia at risk of failure after speech therapy. American Journal of Otolaryngology, 41, 102455.

  • Lovato, A., Colle, W. D., Giacomelli, L., Piacente, A., Righetto, L., Marioni, G., et al. (2016). Multi-dimensional voice program (MDVP) vs. Praat for assessing euphonic subjects: A preliminary study on the gender-discriminating power of acoustic analysis software. Journal of Voice, 30, 765.e1–765.e5.

  • Lu, L., & Renals, S. (2014a). Probabilistic linear discriminant analysis for acoustic modeling. IEEE Signal Processing Letters, 21, 702–706.

  • Lu, L., & Renals, S. (2014b). Tied probabilistic linear discriminant analysis for speech recognition. arXiv preprint arXiv:1411.0895, pp. 1–5.

  • Maghsoodi, N., Sameti, H., Zeinali, H., & Stafylakis, T. (2019). Speaker recognition with random digit strings using uncertainty normalized HMM-based i-vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(11), 1815–1825.

  • Malik, M., Malik, M. K., Mehmood, K., & Makhdoom, I. (2020). Automatic speech recognition: A survey. Multimedia Tools and Applications, 463, 1–47.

  • Manurung, D. B., Dirgantoro, B., & Setianingsih, C. (2018). Speaker recognition for digital forensic audio analysis using learning vector quantization method. In: IEEE international conference on internet of things and intelligence system (IOTAIS), Bali, pp. 221–226.

  • Mao, J., He, Y., & Liu, Z. (2018). Speech emotion recognition based on linear discriminant analysis and support vector machine decision tree. 37th Chinese Control Conference (CCC), Wuhan, pp. 5529–5533.

  • Mary, L. (2019). Extraction and representation of prosody for speaker, language, emotion, and speech recognition. In Extraction of prosody for automatic speaker, language, emotion and speech recognition (pp. 23–43). Cham: Springer.

  • Matza, A., & Bistritz, Y. (2014). Skew Gaussian mixture models for speaker recognition. IET Signal Processing, 8(8), 860–867.

  • Mesallam, T. A., Farahat, M., Malki, K. H., Alsulaiman, M., Ali, Z., Al-Nasheri, A., et al. (2017). Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms. Journal of Healthcare Engineering, 78, 1–14.

  • Mohanaprasad, K., & Arulmozhivarman, P. (2014). Wavelet based adaptive filtering algorithms for acoustic noise cancellation. International Review on Computers and Software, 9(10), 1675–1681.

  • Mohanaprasad, K., & Arulmozhivarman, P. (2015a). Wavelet based ICA using maximisation of non-Gaussianity for acoustic echo cancellation during double talk situation. Applied Acoustics, 97, 37–45.

  • Mohanaprasad, K., & Arulmozhivarman, P. (2015b). Wavelet-based ICA using maximum likelihood estimation and information-theoretic measure for acoustic echo cancellation during double talk situation. Circuits, Systems, and Signal Processing, 34(12), 3915–3931.

  • Mohanaprasad, K., & Sankarganesh, S. (2015). Speech separation using wavelet based independent component analysis. International Journal of Applied Engineering Research, 10(55), 1004–1008.

  • Mohanaprasad, K., Singh, A., Sinha, K., et al. (2019). Noise reduction in speech signals using adaptive independent component analysis (ICA) for hands free communication devices. International Journal of Speech Technology, 22, 169–177.

  • Murphy, P. J. (2000). Spectral characterization of jitter, shimmer, and additive noise in synthetically generated voice signals. The Journal of the Acoustical Society of America, 107, 978–988.

  • Narendra, N. P., & Alku, P. (2018). Dysarthric speech classification using glottal features computed from non-words, words and sentences. Interspeech, pp. 3403–3407.

  • Nath, M. K. (2009). Independent component analysis of real data. In: Seventh international conference on advances in pattern recognition, pp. 149–152.

  • Nayana, P. K., Mathew, D., & Thomas, A. (2017). Performance comparison of speaker recognition systems using GMM and i-Vector methods with PNCC and RASTA PLP features. In: International conference on intelligent computing, instrumentation and control technologies (ICICICT), Kannur, pp. 438–443.

  • Nehe, N. S., & Holambe, R. S. (2012). DWT and LPC based feature extraction methods for isolated word recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2012(7), 1–7.

  • Shabani, S., & Norouzi, Y. (2016). Speech recognition using principal components analysis and neural networks. IEEE 8th international conference on intelligent systems (IS), Sofia, pp. 90–95.

  • Nyodu, K., & Sambyo, K. (2018). Automatic identification of Arunachal language using K-nearest neighbor algorithm. In: International conference on advances in computing, communication control and networking (ICACCCN), Greater Noida (UP), India, pp. 213–216.

  • Perotin, L., Serizel, R., Vincent, E., & Guérin, A. (2018). Multichannel speech separation with recurrent neural networks from high-order ambisonics recordings. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 36–40.

  • Qian, G. (2019). A music retrieval approach based on hidden Markov model. 11th international conference on measuring technology and mechatronics automation (ICMTMA), Qiqihar, China, pp. 721–725.

  • Rabiner, L. R., & Schafer, R. W. (2007). Introduction to digital speech processing. Foundations and Trends in Signal Processing, 1(1–2), 1–194.

  • Ram, S., & Preeti, R. (2007). Spectral subtraction speech enhancement with RASTA filtering. In: Proceedings of the national conference on communications (NCC), pp. 1–5.

  • Ramaiah, V. S., & Rao, R. R. (2016). Multi-speaker activity detection using zero crossing rate. International conference on communication and signal processing (ICCSP), Melmaruvathur, pp. 23–26.

  • Ranny. (2016). Voice recognition using k nearest neighbor and double distance method. In: International conference on industrial engineering, management science and application (ICIMSA), Jeju, pp. 1–5.

  • Reddy, M. K., Alku, P., & Rao, K. S. (2020). Detection of specific language impairment in children using glottal source features. IEEE Access, 8, 15273–15279.

  • Ren, Y., Liu, J., Tan, X., Zhang, C., Qin, T., Zhao, Z., & Liu, T. Y. (2020). SimulSpeech: End-to-end simultaneous speech to text translation. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 3787–3796.

  • Reynolds, D. A., & Rose, R. C. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1), 72–83.

  • Rizwan, M., & Anderson, D. V. (2014) Using k-nearest neighbor and speaker ranking for phoneme prediction. In: 13th international conference on machine learning and applications, Detroit, MI, pp. 383–387.

  • Rossing, T. D. (2007). Springer handbook of acoustics. New York, NY: Springer Nature.

  • Rudresh, M. D., Latha, A. S., Suganya, J., & Nayana, C. G. (2017). Performance analysis of speech digit recognition using cepstrum and vector quantization. In: International conference on electrical, electronics, communication, computer, and optimization techniques (ICEECCOT), Mysuru, pp. 1–6.

  • Ruzanski, E., Hansen, J. H., Finan, D., Meyerhoff, J., Norris, W., & Wollert, T. (2005). Improved "TEO" feature-based automatic stress detection using physiological and acoustic speech sensors. In: Ninth European conference on speech communication and technology, pp. 2653–2656.

  • Sadaoki, F. (2005). 50 years of progress in speech and speaker recognition research. ECTI Transactions on Computer and Information Technology, 1(2), 64–74.

  • Sanchis, A., Juan, A., & Vidal, E. (2012). A word-based Naïve Bayes classifier for confidence estimation in speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 20(2), 565–574.

  • Sangeetha, R., & Nalini, N. J. (2020). Singer identification using MFCC and CRP features with support vector machines. In Computational intelligence in pattern recognition (pp. 295–306). Springer, Singapore.

  • Sarfjoo, S. S., Demiroğlu, S., & King, S. (2017). Using eigenvoices and nearest-neighbors in HMM-based cross-lingual speaker adaptation with limited data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4), 839–851.

  • Sayed, W. S., Tolba, M. F., Radwan, A. G., & Abd-El-Hafiz, S. K. (2019). FPGA realization of a speech encryption system based on a generalized modified chaotic transition map and bit permutation. Multimedia Tools and Applications, 78(12), 16097–16127.

  • Selva, S. N., & Shantha, R. S. K. (2014). Text independent voice based students attendance system under noisy environment using RASTA-MFCC feature. International conference on communication and network technologies, Sivakasi, pp. 182–187.

  • Shahamiri, S. R., & Salim, S. S. B. (2014a). Artificial neural networks as speech recognisers for dysarthric speech: Identifying the best-performing set of MFCC parameters and studying a speaker-independent approach. Advanced Engineering Informatics, 28(1), 102–110.

  • Shahamiri, S. R., & Salim, S. S. B. (2014b). A multi-views multi-learners approach towards dysarthric speech recognition using multi-nets artificial neural networks. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22(5), 1053–1063.

  • Shahbakhi, M., Far, D. T., & Tahami, E. (2014). Speech analysis for diagnosis of Parkinson’s disease using genetic algorithm and support vector machine. Journal of Biomedical Science and Engineering, 2014, 1–13.

  • Solera, R. U., Garcia-Moral, A. I., Pelaez-Moreno, C., Martinez-Ramon, M., & Diaz-de-Maria, F. (2012). Real-time robust automatic speech recognition using compact support vector machines. IEEE Transactions on Audio, Speech and Language Processing, 20(4), 1347–1361.

  • Sonawane, A., Inamdar, M. U. & Bhangale, K. B. (2017). Sound based human emotion recognition using MFCC & multiple SVM. International conference on information, communication, instrumentation and control (ICICIC), Indore, pp. 1–4.

  • Song, P., Zheng, W., Liu, J., Li, J., & Xinran, Z. (2015). A novel speech emotion recognition method via transfer PCA and sparse coding. Chinese conference on biometric recognition, pp. 393–400.

  • Sreehari, V. R., & Mary, L. (2018). Automatic speaker recognition using stationary wavelet coefficients of LP residual. IEEE Region 10 Conference, Jeju, Korea (South), pp. 1595–1600.

  • Stratos, K., Collins, M., & Hsu, D. (2016). Unsupervised part-of-speech tagging with anchor hidden Markov models. Transactions of the Association for Computational Linguistics, 4, 245–257.

  • Su, R., Liu, X., & Wang, L. (2015). Automatic complexity control of generalized variable parameter HMMs for noise robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1), 102–114.

  • Sun, L., Fu, S., & Wang, F. (2019). Decision tree SVM model with Fisher feature selection for speech emotion recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2019(1), 2.

  • Sunita, D., & Yusuf, M. (2014). Speech processing: A review. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 3(8), 1275–1278.

  • Tadeusiewicz, R. (2010). Speech in human system interaction. In: 3rd international conference on human system interaction, Rzeszow, pp. 2–13.

  • Teixeira, J. P., & Fernandes, P. O. (2014). Jitter, shimmer and HNR classification within gender, tones and vowels in healthy voices. Procedia Technology, 16, 1228–1237.

  • Teixeira, J. P., Fernandes, P. O., & Alves, N. (2017). Vocal acoustic analysis–classification of dysphonic voices with artificial neural networks. Procedia Computer Science, 121, 19–26.

  • Teixeira, J. P., Oliveira, C., & Lopes, C. (2013). Vocal acoustic analysis – jitter, shimmer and HNR parameters. Procedia Technology, 9, 1112–1122.

  • Vacher, M., Lecouteux, B., Romero, J. S., Ajili, M., Portet, F., & Rossato, S. (2015). Speech and speaker recognition for home automation: Preliminary results. IEEE international conference on speech technology and human-computer dialogue (SpeD), pp. 1–10.

  • Vachhani, B. B., & Patil, H. A. (2013). Use of PLP cepstral features for phonetic segmentation. In: International conference on Asian language processing, Urumqi, pp. 143–146.

  • Varghese, D., & Mathew, D. (2016). Phoneme classification using Reservoirs with MFCC and Rasta-PLP features. International conference on computer communication and informatics (ICCCI), Coimbatore, pp. 1–6.

  • Velankar, M., Deshpande, A., & Kulkarni, P. (2018). Melodic pattern recognition in Indian classical music for raga identification. International Journal of Information Technology, 216, 1–8.

  • Wang, C. (2018). Interpreting neural network hate speech classifiers. In: Proceedings of the 2nd workshop on abusive language online (ALW2), pp. 86–92.

  • Wu, Z., & Ortega-Llebaria, M. (2017). Pitch shape modulates the time course of tone vs pitch-accent identification in Mandarin Chinese. The Journal of the Acoustical Society of America, 141(3), 2263–2276.

  • Wu, J., & Zhang, X. (2011). Efficient multiple kernel support vector machine based voice activity detection. IEEE Signal Processing Letters, 18(8), 466–469.

  • Xiao-chun, L., Jun-xun, Y., & Wei-ping, H. (2012). A text-independent speaker recognition system based on probabilistic principal component analysis. 3rd international conference on system science, engineering design and manufacturing informatization, pp. 255–260.

  • Xihao, S., & Miyanaga, Y. (2013). Dynamic time warping for speech recognition with training part to reduce the computation. In: International symposium on signals, circuits and systems, pp. 1–4.

  • Xue, Y., Mu, K., Wang, Y., Chen, Y., Zhong, P., & Wen, J. (2019). Robust speech steganography using differential SVD. IEEE Access, 7, 153724–153733.

  • Yaman, S., & Pelecanos, J. (2013). Using polynomial kernel support vector machines for speaker verification. IEEE Signal Processing Letters, 20(9), 901–904.

  • Yao, X., Xu, N., Gao, M., Jiang, A., & Liu, X. (2016, December). Comparison analysis of classifiers for speech under stress. In 2016 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), pp. 429–432.

  • Yurika, P., Erwin, H., & Erwin, P. A. (2019). Speech recognition using Dynamic Time Warping (DTW). Journal of Physics: Conference Series, 1366, 1–6.

  • Zaw, T. H., & War, N. (2017). The combination of spectral entropy, zero crossing rate, short time energy and linear prediction error for voice activity detection. 20th International conference of computer and information technology (ICCIT), Dhaka, pp. 1–5.

  • Zhang, Y., & Abdulla, W. H. (2007a). Robust speaker identification in noisy environment using cross diagonal GTF-ICA feature. In: 6th international conference on information, communications & signal processing, pp. 1–4.

  • Zhang, Y., & Abdulla, W. H. (2007b). Eigenanalysis applied to speaker identification using gammatone auditory filterbank and independent component analysis. 9th international symposium on signal processing and its applications, pp. 1–4.

  • Zhang, L., Qu, Y., Jin, B., Jing, L., Gao, Z., & Liang, Z. (2020). An intelligent mobile-enabled system for diagnosing Parkinson disease: Development and validation of a speech impairment detection system. JMIR Medical Informatics, 8(9), e18689.

  • Zhang, L., Zhao, Y., Zhang, P., Yan, K., & Zhang, W. (2015). Chinese accent detection research based on RASTA-PLP algorithm. Proceedings of 2015 international conference on intelligent computing and internet of things, Harbin, pp. 31–34.

  • Zhang, X., Zou, X., Sun, M., Zheng, T. F., Jia, C., & Wang, Y. (2019). Noise robust speaker recognition based on adaptive frame weighting in GMM for i-vector extraction. IEEE Access, 7, 27874–27882.

  • Zhu, J., Zhang, J., Chen, Q., & Tu, P. (2017). Speaker recognition based on the improved double-threshold endpoint algorithm and multistage vector quantization. IEEE 9th international conference on communication software and networks (ICCSN), Guangzhou, pp. 1056–1061.

Author information

Corresponding author

Correspondence to K. Mohanaprasad.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bhangale, K.B., Mohanaprasad, K. A review on speech processing using machine learning paradigm. Int J Speech Technol 24, 367–388 (2021). https://doi.org/10.1007/s10772-021-09808-0
