Abstract
Speech generally consists of quasi-repetitive patterns, called pitches, that represent the fundamental period and tonal information of the voice. Extraction of pitch information, which is crucial for many speech processing techniques, is usually hampered by noise and by interference from high-order harmonic components. This paper introduces a novel, noise-robust method for determining the speech fundamental frequency and performing pitch segmentation, based on a short-time energy waveform (SEW), defined as the moving average of the squared signal. When a moving average filter with a window size close to the fundamental period is applied, nearly repetitive patterns with fewer ripples, synchronized with the actual pitches, can clearly be observed in the SEW. The DC component in the SEW is removed using morphological top-hat and bottom-hat transforms. The fundamental frequency is determined as the frequency corresponding to the largest peak of the power spectrum of the DC-removed SEW. Finally, a time-domain window search is performed to locate the local extrema associated with pitches. Compared to traditional pitch detection techniques, the proposed technique yields pitch segmentation results with higher accuracy and greater noise robustness.
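The first three stages described in the abstract (SEW computation, DC removal, spectral peak picking) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the synthetic test tone, the window and structuring-element widths, and the 50–500 Hz search band are all assumptions made for illustration. The abstract does not state how the top-hat and bottom-hat transforms are combined, so the sketch assumes one common choice: subtracting the mean of the morphological opening and closing. The final time-domain window search is omitted.

```python
import numpy as np


def short_time_energy(x, win):
    """Moving average of the squared signal (the SEW as the abstract defines it)."""
    kernel = np.ones(win) / win
    return np.convolve(np.asarray(x, dtype=float) ** 2, kernel, mode="same")


def _sliding(x, width, reducer):
    """Apply reducer (np.min or np.max) over a centred window of odd width."""
    half = width // 2
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return reducer(windows, axis=1)


def remove_baseline(sew, width):
    """Subtract a morphological baseline (assumed: mean of opening and closing)."""
    if width % 2 == 0:
        width += 1  # keep the flat structuring element symmetric
    opening = _sliding(_sliding(sew, width, np.min), width, np.max)
    closing = _sliding(_sliding(sew, width, np.max), width, np.min)
    # Equivalent to 0.5 * (top_hat - bottom_hat), where
    # top_hat = sew - opening and bottom_hat = closing - sew.
    return sew - 0.5 * (opening + closing)


def estimate_f0(sew_ac, fs, fmin=50.0, fmax=500.0):
    """Frequency of the largest power-spectrum peak inside [fmin, fmax] (assumed band)."""
    power = np.abs(np.fft.rfft(sew_ac)) ** 2
    freqs = np.fft.rfftfreq(len(sew_ac), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return float(freqs[band][np.argmax(power[band])])


# Synthetic stand-in for voiced speech: three harmonics of 120 Hz plus noise.
fs = 8000
t = np.arange(4000) / fs
f0_true = 120.0
rng = np.random.default_rng(0)
x = (np.cos(2 * np.pi * f0_true * t)
     + 0.6 * np.cos(2 * np.pi * 2 * f0_true * t)
     + 0.3 * np.cos(2 * np.pi * 3 * f0_true * t))
x = x + 0.2 * rng.standard_normal(t.size)

# The window (41 samples) is shorter than the ~67-sample period: on a perfectly
# periodic test tone, a window of exactly one period would average the
# periodicity away. The structuring element (83 samples) is longer than one period.
sew = short_time_energy(x, win=41)
sew_ac = remove_baseline(sew, width=83)
f0_est = estimate_f0(sew_ac, fs)
print(f"estimated f0: {f0_est:.1f} Hz")
```

Restricting the peak search to a plausible pitch band is a design choice of this sketch rather than something the abstract specifies. It keeps residual low-frequency drift from being mistaken for the fundamental.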
Acknowledgements
Financial support from Uttaradit Rajabhat University. Computer Engineering Research and Development Group, Department of Computer Engineering, Faculty of Engineering, Khon Kaen University.
Cite this article
Wiriyarattanakul, S., Eua-anant, N. Pitch segmentation of speech signals based on short-time energy waveform. Int J Speech Technol 20, 907–917 (2017). https://doi.org/10.1007/s10772-017-9459-4
DOI: https://doi.org/10.1007/s10772-017-9459-4