Abstract
Endpoint detection is one of the most important steps in speech recognition. In a high-SNR environment, an algorithm based on short-time energy and zero-crossing rate can be used, but at low SNR this method may not be accurate. Some researchers have proposed an algorithm based on the MFCC Euclidean distance, which performs better in noisy environments. However, that algorithm needs two thresholds to find the start and end points, and when the two threshold values are not suitable, the detection result can be extremely poor. In this paper, we propose an improved algorithm based on the MFCC cosine value. This method reduces errors because it needs only a single threshold. The benefit of the improved algorithm is that the detected segment is sure to contain the real voice component. According to the experimental data, the improved algorithm raises the speech recognition rate by 10% even in a noisy environment (SNR = 0). This indicates that the improved method is more robust.
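The abstract does not spell out how the cosine value is computed or thresholded, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each frame is already represented by an MFCC vector, that the first few frames contain background noise only, and that a frame counts as speech when its cosine to that noise reference falls below a single threshold. The function names, the `n_noise_frames` parameter, and the threshold of 0.8 are all hypothetical.

```python
import numpy as np

def cosine_to_reference(features, n_noise_frames=5):
    """Cosine between each frame vector and a noise reference vector.

    features: array of shape (n_frames, n_coeffs), e.g. one MFCC vector per frame.
    Assumption: the first n_noise_frames frames hold background noise only,
    so their mean serves as the reference.
    """
    features = np.asarray(features, dtype=float)
    reference = features[:n_noise_frames].mean(axis=0)
    dots = features @ reference
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(reference)
    # Guard against division by zero for all-zero frames.
    return dots / np.maximum(norms, 1e-12)

def detect_endpoints(features, threshold=0.8, n_noise_frames=5):
    """Return (start, end) frame indices of the speech region, or None.

    Assumption: a frame is speech when its cosine to the noise reference
    is below the single threshold; the endpoints are the first and last
    such frames.
    """
    cos = cosine_to_reference(features, n_noise_frames)
    speech = np.flatnonzero(cos < threshold)
    if speech.size == 0:
        return None
    return int(speech[0]), int(speech[-1])

# Synthetic stand-in for MFCC frames: noise, then "speech", then noise.
rng = np.random.default_rng(0)
noise_dir = np.array([1.0, 0.2, 0.1, 0.0])
speech_dir = np.array([0.1, 1.0, -0.5, 0.8])
frames = np.vstack([
    noise_dir + 0.01 * rng.standard_normal((10, 4)),
    speech_dir + 0.01 * rng.standard_normal((6, 4)),
    noise_dir + 0.01 * rng.standard_normal((10, 4)),
])
print(detect_endpoints(frames))
```

Because only one threshold is involved, the start and end points come from the same decision rule, so a badly chosen value shifts both boundaries together rather than producing two inconsistent ones. On real audio, the MFCC vectors would come from a feature extraction front end, and the threshold would need tuning for the noise conditions.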
Acknowledgements
This paper is supported by the National Natural Science Foundation of China (41471303), training project for outstanding young teachers of North China University of Technology, Special Research Foundation of North China University of Technology (PXM2017_014212_000014), Beijing Natural Science Foundation (4162022), and advantage disciplinary projects of North China University of Technology.
Ethics declarations
Conflict of interest
The authors confirm that the content of this article involves no conflicts of interest.
About this article
Cite this article
Cao, D., Gao, X. & Gao, L. An Improved Endpoint Detection Algorithm Based on MFCC Cosine Value. Wireless Pers Commun 95, 2073–2090 (2017). https://doi.org/10.1007/s11277-017-3958-0
DOI: https://doi.org/10.1007/s11277-017-3958-0