
Circuits, Systems, and Signal Processing, Volume 37, Issue 5, pp 1935–1957

Part-Syllable Transformation-Based Voice Conversion with Very Limited Training Data

  • Mohammad Javad Jannati
  • Abolghasem Sayadiyan

Abstract

Voice conversion suffers from two drawbacks: it requires a large number of sentences from the target speaker, and (in concatenative methods) it introduces concatenation error. This research introduces the part-syllable transformation-based voice conversion (PST-VC) method, which performs voice conversion with very limited data from a target speaker while simultaneously reducing concatenation error. In this method, every syllable is segmented into three parts: a left transition, a vowel core, and a right transition. Using this new language unit, called the part-syllable (PS), PST-VC reduces concatenation error by moving segmentation and concatenation from the transition points to the relatively stable points of a syllable. Since the greatest amount of speaker-specific information is carried by the vowels, PST-VC uses this information to transform the vowels into all of the PSs of the language. In this approach, a series of transformations is trained that can generate all of the PSs of a target speaker's voice from a single vowel core as input. With all of the PSs available, any utterance of the target speaker can be imitated. PST-VC therefore reduces the required training data to a single-syllable word while also reducing concatenation error.
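The three-way split described above can be illustrated with a small sketch. Everything here is hypothetical: the paper does not specify how the vowel core is located, so this sketch simply picks the most stable window of feature frames (smallest frame-to-frame change) as a stand-in for the vowel core, with the frames before and after it forming the left and right transitions.

```python
# Illustrative sketch of the part-syllable (PS) split: a syllable's feature
# frames are cut into (left transition, vowel core, right transition).
# The stability heuristic and frame features are hypothetical, not the
# paper's actual segmentation procedure.

def frame_distance(a, b):
    """Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def split_part_syllables(frames, core_len=3):
    """Return (left_transition, vowel_core, right_transition).

    The vowel core is chosen as the window of `core_len` frames with the
    smallest total frame-to-frame change, i.e. the most stable region --
    a stand-in for the relatively stable vowel center the method exploits
    as a segmentation/concatenation point.
    """
    n = len(frames)
    if n <= core_len:
        return [], list(frames), []
    best_start, best_cost = 0, float("inf")
    for s in range(n - core_len + 1):
        cost = sum(frame_distance(frames[i], frames[i + 1])
                   for i in range(s, s + core_len - 1))
        if cost < best_cost:
            best_start, best_cost = s, cost
    end = best_start + core_len
    return frames[:best_start], frames[best_start:end], frames[end:]

# Toy syllable: frames 3-5 are nearly constant (the stable "vowel core").
syllable = [[0.0], [0.5], [1.0], [1.02], [1.01], [1.0], [0.4], [0.0]]
left, core, right = split_part_syllables(syllable, core_len=3)
```

Concatenating converted units at the stable core rather than at the transitions is what the abstract credits for the reduced concatenation error: the join happens where adjacent frames change least.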

Keywords

Voice conversion · Very limited training data · Part-syllable


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
  2. Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran
