Pitch segmentation of speech signals based on short-time energy waveform

Abstract

Speech generally consists of quasi-repetitive patterns, called pitches, that reflect the fundamental period and tonal information of the voice. Extraction of pitch information, which is crucial for many speech processing techniques, is usually hampered by noise and by interference from high-order harmonic components. This paper introduces a novel, noise-robust method for determining the speech fundamental frequency and for pitch segmentation, based on a short-time energy waveform (SEW), defined as the moving average of the squared signal. When a moving average filter with a window size close to the fundamental period is applied, nearly repetitive patterns with fewer ripples, synchronized with the actual pitches, can be clearly observed in the SEW. The DC component of the SEW is removed using morphological top-hat and bottom-hat transforms. The fundamental frequency is then taken as the frequency of the largest peak in the power spectrum of the DC-removed SEW. Finally, a time-domain window search is performed to locate the local extrema associated with pitches. Compared to traditional pitch detection techniques, the proposed technique yields pitch segmentation results with higher accuracy and greater noise robustness.
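
The pipeline described in the abstract can be illustrated with a minimal Python sketch using NumPy and SciPy. It is not the authors' implementation, and several details are assumptions: the function names (`short_time_energy`, `remove_dc`, `estimate_f0`, `pitch_marks`) are hypothetical, the DC removal is taken to be the top-hat minus the bottom-hat transform with a flat structuring element longer than the longest expected pitch period, the spectral peak search is restricted to an assumed 60 to 500 Hz band, and `scipy.signal.find_peaks` with a minimum peak distance stands in for the time-domain window search. The caller is assumed to supply a window length `win_len` that is a rough guess of the pitch period in samples.

```python
import numpy as np
from scipy.ndimage import white_tophat, black_tophat
from scipy.signal import find_peaks


def short_time_energy(x, win_len):
    """SEW: moving average of the squared signal over win_len samples."""
    kernel = np.ones(win_len) / win_len
    return np.convolve(np.asarray(x, dtype=float) ** 2, kernel, mode="same")


def remove_dc(sew, se_len):
    """Assumed DC removal: top-hat minus bottom-hat, flat element of se_len samples."""
    return white_tophat(sew, size=se_len) - black_tophat(sew, size=se_len)


def estimate_f0(sew_ac, fs, f_min=60.0, f_max=500.0):
    """Frequency of the largest power-spectrum peak inside the assumed band."""
    power = np.abs(np.fft.rfft(sew_ac)) ** 2
    freqs = np.fft.rfftfreq(len(sew_ac), d=1.0 / fs)
    band = (freqs >= f_min) & (freqs <= f_max)
    return freqs[band][np.argmax(power[band])]


def pitch_marks(x, fs, win_len, f_min=60.0, f_max=500.0):
    """Return (f0 in Hz, sample indices of SEW maxima taken as pitch marks)."""
    sew = short_time_energy(x, win_len)
    # Assumption: structuring element longer than the longest expected period.
    se_len = int(fs / f_min) + 1
    sew_ac = remove_dc(sew, se_len)
    f0 = estimate_f0(sew_ac, fs, f_min, f_max)
    period = fs / f0
    # Stand-in for the window search: keep maxima at least 0.7 periods apart.
    marks, _ = find_peaks(sew_ac, distance=max(1, int(0.7 * period)))
    return f0, marks
```

For a voiced segment `x` sampled at `fs` Hz, `pitch_marks(x, fs, win_len)` returns the estimated fundamental frequency and the indices of the SEW maxima. The window should not equal the true period exactly for a strictly periodic signal, because a moving average of exactly one period cancels every harmonic of the fundamental and leaves the SEW flat.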

Acknowledgements

Financial support from Uttaradit Rajabhat University. Computer Engineering Research and Development Group, Department of Computer Engineering, Faculty of Engineering, Khon Kaen University.

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering, Khon Kaen University, Khon Kaen, 40002, Thailand
Sopon Wiriyarattanakul & Nawapak Eua-anant

Authors

Sopon Wiriyarattanakul
Nawapak Eua-anant

Corresponding author

Correspondence to Sopon Wiriyarattanakul.

About this article

Cite this article

Wiriyarattanakul, S., Eua-anant, N. Pitch segmentation of speech signals based on short-time energy waveform. Int J Speech Technol 20, 907–917 (2017). https://doi.org/10.1007/s10772-017-9459-4

Received: 24 May 2017
Accepted: 12 September 2017
Published: 19 September 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10772-017-9459-4
