Abstract
Voice conversion (VC) consists of modifying a source speaker’s voice so that it sounds like that of a target speaker. In this paper, we evaluate the performance of a GMM-based conversion system applied to the Arabic language, exploiting both pitch dynamics and spectral information. We study three approaches to obtaining a global conversion function for pitch and spectrum, all based on a joint probability model. In the first approach, pitch and spectrum are converted jointly. In the second, the pitch is converted by a linear transformation. In the third, we exploit the relationship between pitch and spectrum. For noise conversion, we introduce a technique that models the noise component of voiced and unvoiced frames with GMMs. Analysis/synthesis relies on the harmonic plus noise model (HNM), and the spectral envelope of the speech signal is estimated with a regularized discrete cepstrum.
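The joint probability model mentioned above can be illustrated with the classic joint-density GMM mapping: a GMM is fitted on stacked source/target feature vectors z = [x; y], and a source vector x is converted to the conditional expectation E[y | x]. The sketch below uses synthetic features; the dimensions, component count, and data are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dx, dy, n = 3, 3, 500  # illustrative source/target feature dimensions
X = rng.normal(size=(n, dx))                       # source features
Y = X @ rng.normal(size=(dx, dy)) + 0.1 * rng.normal(size=(n, dy))  # target features

# Fit a GMM on the joint vectors z = [x; y]
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(np.hstack([X, Y]))

def convert(x):
    """Conditional expectation E[y | x] under the joint GMM."""
    w = gmm.weights_
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    S_xx = gmm.covariances_[:, :dx, :dx]
    S_yx = gmm.covariances_[:, dx:, :dx]
    # Posterior responsibilities p(i | x) from the marginal GMM on x
    log_p = np.array([np.log(w[i]) +
                      multivariate_normal.logpdf(x, mu_x[i], S_xx[i])
                      for i in range(gmm.n_components)])
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    # Per-component regression: E[y | x, i] = mu_y + S_yx S_xx^{-1} (x - mu_x)
    return sum(p[i] * (mu_y[i] + S_yx[i] @ np.linalg.solve(S_xx[i], x - mu_x[i]))
               for i in range(gmm.n_components))

y_hat = convert(X[0])
```

Because the synthetic mapping is nearly linear, each component’s regression term recovers it and `y_hat` lands close to the true target vector; the same conversion function applies unchanged whether x holds spectral parameters alone or spectrum stacked with log-pitch, which is what distinguishes the joint approach from converting pitch separately.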
Acknowledgement
I thank the entire IRIT laboratory group, and especially the SAMOVA (Structuring, Analysis and Modeling of Video and Audio documents) team at the University of Toulouse III, for their support.
Cite this article
Guerid, A., Houacine, A., André-Obrecht, R. et al. Performance of new voice conversion systems based on GMM models and applied to Arabic language. Int J Speech Technol 15, 477–485 (2012). https://doi.org/10.1007/s10772-012-9145-5