Simultaneous Estimation of Glottal Source Waveforms and Vocal Tract Shapes from Speech Signals Based on ARX-LF Model

Li, Yongwei; Sakakibara, Ken-Ichi; Akagi, Masato

doi:10.1007/s11265-019-01510-4

Published: 23 December 2019

Journal of Signal Processing Systems

Volume 92, pages 831–838 (2020)

Abstract

Glottal source waveforms and vocal tract shapes are typically estimated by first passing the speech signal through an inverse filter and then fitting a glottal source model to the residual signal. However, source-tract interactions reduce the accuracy of this approach. In this paper, we propose a method that estimates glottal source waveforms and vocal tract shapes simultaneously, using an analysis-by-synthesis approach with a source-filter model that combines an Auto-Regressive eXogenous (ARX) model with the Liljencrants-Fant (LF) model. Because optimizing many parameters at once makes simultaneous estimation difficult, we first initialize the glottal source parameters with the inverse filter method and then refine the parameters of the glottal source and the vocal tract shape jointly by analysis-by-synthesis. Experiments with synthetic and real speech signals showed that the proposed method achieves higher estimation accuracy than the inverse filter method.
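The joint estimation idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common ARX form s[n] = -sum_k a_k s[n-k] + b_0 u[n] + e[n], uses a hypothetical raised-sine pulse (`toy_pulse_train`) in place of the LF model, and replaces the paper's initialization and optimizer with a plain grid search over one source parameter. All function names are illustrative assumptions.

```python
import numpy as np

def toy_pulse_train(n_samples, period, open_quotient):
    """Hypothetical stand-in for a glottal flow derivative (NOT the LF model)."""
    u = np.zeros(n_samples)
    n_open = max(2, int(round(open_quotient * period)))
    t = np.arange(n_open) / n_open
    flow = np.sin(np.pi * t) ** 2           # smooth open-phase flow pulse
    deriv = np.diff(flow, prepend=0.0)      # its discrete derivative
    for start in range(0, n_samples - n_open, period):
        u[start:start + n_open] = deriv
    return u

def synthesize(u, a):
    """All-pole synthesis: s[n] = -sum_k a[k-1] * s[n-k] + u[n]."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        acc = u[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * s[n - k]
        s[n] = acc
    return s

def fit_arx(s, u, order):
    """Least-squares fit of a_1..a_p and b_0 for a fixed candidate source u."""
    cols = [-s[order - k:len(s) - k] for k in range(1, order + 1)]
    cols.append(u[order:])
    design = np.column_stack(cols)
    target = s[order:]
    theta, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ theta
    return theta[:order], theta[order], float(np.mean(residual ** 2))

def analysis_by_synthesis(s, period, order, candidates):
    """Try each candidate source parameter; keep the one with the lowest ARX error."""
    best = None
    for oq in candidates:
        u = toy_pulse_train(len(s), period, oq)
        a, b0, mse = fit_arx(s, u, order)
        if best is None or mse < best[3]:
            best = (oq, a, b0, mse)
    return best
```

For each candidate source, the vocal tract coefficients follow from a linear least-squares problem, so the search only has to cover the source parameters. That separation is one reason a good initialization of the source parameters matters when the search space grows.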

Acknowledgements

This study was supported by a Grant-in-Aid for Scientific Research (A) (No. 25240026), the JST-Mirai Program (JP-MJMI18D1), and the China Scholarship Council (CSC).

Author information

Authors and Affiliations

Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Yongwei Li & Masato Akagi
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
Yongwei Li
Department of Communication Disorders, Health Science University of Hokkaido, 1757 Kanazawa, Tobetsu-cho, Ishikari-gun, Hokkaido, 061-0293, Japan
Ken-Ichi Sakakibara

Corresponding author

Correspondence to Yongwei Li.

About this article

Cite this article

Li, Y., Sakakibara, KI. & Akagi, M. Simultaneous Estimation of Glottal Source Waveforms and Vocal Tract Shapes from Speech Signals Based on ARX-LF Model. J Sign Process Syst 92, 831–838 (2020). https://doi.org/10.1007/s11265-019-01510-4

Received: 14 February 2019
Revised: 25 September 2019
Accepted: 01 December 2019
Published: 23 December 2019
Issue Date: August 2020
DOI: https://doi.org/10.1007/s11265-019-01510-4
