Abstract
Pitch is a crucial parameter of speech and music signals. However, severe noise, missing harmonics, and irregular physical vibration make accurate pitch determination a considerable challenge. In this paper, we propose a method for pitch estimation of speech and music sounds. The method is based on the fast Fourier transform (FFT) of the multi-scale product (MP) computed from the output of an auditory feature model of the sound signal. The auditory model simulates the spectral behaviour of the cochlea with a gammachirp filter bank, and the outer/middle ear filtering with a low-pass filter. For the two output channels, the FFT of the MP is computed frame by frame. The MP is formed by multiplying the wavelet transform coefficients of the speech or music signal at three scales. The experimental results show that our method estimates the pitch with high accuracy. Moreover, the proposed method outperforms several other pitch detection algorithms in both clean and noisy environments.
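The core of the pipeline described above (wavelet coefficients at three scales, their product, then an FFT per frame) can be illustrated with a minimal sketch. It is not the authors' implementation: the gammachirp filter bank and the low-pass ear filter are omitted, and the wavelet (first derivative of a Gaussian), the scale values, the Hann window, the pitch search range, and the function names `multiscale_product` and `estimate_pitch` are all assumptions made for illustration.

```python
import numpy as np

def multiscale_product(frame, scales=(1.0, 2.0, 4.0)):
    """Product of wavelet-transform coefficients at three scales.

    Assumption: the wavelet is the first derivative of a Gaussian and the
    scales are dyadic; the abstract does not specify either choice.
    """
    mp = np.ones(len(frame))
    for s in scales:
        half = int(np.ceil(4 * s))                  # truncate the kernel at about 4 sigma
        t = np.arange(-half, half + 1)
        wavelet = -t * np.exp(-t**2 / (2 * s**2))   # derivative-of-Gaussian shape
        wavelet /= np.sum(np.abs(wavelet))          # keep scales comparable in amplitude
        mp *= np.convolve(frame, wavelet, mode="same")
    return mp

def estimate_pitch(frame, fs, fmin=60.0, fmax=500.0):
    """Pick the strongest spectral peak of the MP inside [fmin, fmax] Hz.

    Assumption: a plain peak pick in a fixed search range stands in for
    whatever decision rule the paper actually uses.
    """
    mp = multiscale_product(frame)
    spectrum = np.abs(np.fft.rfft(mp * np.hanning(len(mp))))
    freqs = np.fft.rfftfreq(len(mp), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(spectrum[band])]

if __name__ == "__main__":
    fs = 8000                                       # sampling rate in Hz (illustrative)
    t = np.arange(800) / fs                         # one 100 ms frame
    frame = np.sin(2 * np.pi * 200.0 * t)           # synthetic 200 Hz tone
    print(estimate_pitch(frame, fs))                # expected to be close to 200 Hz
```

With an 800-sample frame at 8 kHz the FFT bins are 10 Hz apart, so the estimate can only be as fine as that grid; a real system would need longer frames, zero padding, or peak interpolation, plus the auditory front end described in the abstract.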
Cite this article
Ben Messaoud, M.A., Bouzid, A. Pitch estimation of speech and music sound based on multi-scale product with auditory feature extraction. Int J Speech Technol 19, 65–73 (2016). https://doi.org/10.1007/s10772-015-9325-1
DOI: https://doi.org/10.1007/s10772-015-9325-1