Quality Improvement of Vietnamese HMM-Based Speech Synthesis System Based on Decomposition of Naturalness and Intelligibility Using Non-negative Matrix Factorization

Dinh, Anh-Tuan; Phan, Thanh-Son; Akagi, Masato

doi:10.1007/978-3-319-49073-1_53

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 538)

Included in the following conference series:

International Conference on Advances in Information and Communication Technology

Abstract

Hidden Markov model (HMM)-based synthesized speech is intelligible but not natural, especially under limited-data conditions. The goal of this study is to improve naturalness without sacrificing acceptable intelligibility by decomposing the naturalness and intelligibility of synthesized speech using a novel asymmetric bilinear model involving non-negative matrix factorization (NMF). Subjective evaluations carried out on Vietnamese data confirmed that the achieved synthesis quality is higher than that of other methods under limited-data conditions. The F0 contour is important for naturalness and intelligibility, especially in Vietnamese, and the proposed method is capable of modifying an over-smoothed F0 contour without destroying tonal information.
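
For readers unfamiliar with NMF, the following is a minimal sketch of plain non-negative matrix factorization using the standard multiplicative update rules for squared error. It is not the asymmetric bilinear model of the paper, whose details are not given on this page. The function names, the rank, the iteration count, and the toy matrix are all illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative V (n x m) as W (n x rank) times H (rank x m).

    Uses multiplicative updates, so W and H stay non-negative as long as
    V and the random initial factors are non-negative.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Toy non-negative matrix (hypothetical data, not speech features).
    toy = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
    w, h = nmf(toy, rank=2)
    print("reconstruction error:", frobenius_error(toy, w, h))
```

The non-negativity constraint is what distinguishes NMF from an unconstrained low-rank factorization: every entry of both factors remains zero or positive throughout the updates.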

Acknowledgement

This study was supported by the Grant-in-Aid for Scientific Research (A) (No. 25240026), SECOM Science and Technology Foundation and the JSPS A3 Foresight program.

Author information

Authors and Affiliations

Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, Japan
Anh-Tuan Dinh & Masato Akagi
Faculty of Information Technology, Telecommunications University, 101 Mai Xuan Thuong, Nha Trang, Khanh Hoa, Vietnam
Thanh-Son Phan

Corresponding author

Correspondence to Anh-Tuan Dinh.

Editor information

Editors and Affiliations

School of Information Science, Area of Human Life Design, Japan Advanced Institute of Science and Technology, Nomi-shi, Ishikawa, Japan
Masato Akagi
Department of Computer Science, VNU University of Engineering and Technology, Hanoi, Vietnam
Thanh-Thuy Nguyen
Faculty of Information Technology, Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam
Duc-Thai Vu
Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam
Trung-Nghia Phung
School of Knowledge Science, Area of Knowledge Management, Japan Advanced Institute of Science and Technology, Nomi-shi, Ishikawa, Japan
Van-Nam Huynh

About this paper

Cite this paper

Dinh, AT., Phan, TS., Akagi, M. (2017). Quality Improvement of Vietnamese HMM-Based Speech Synthesis System Based on Decomposition of Naturalness and Intelligibility Using Non-negative Matrix Factorization. In: Akagi, M., Nguyen, TT., Vu, DT., Phung, TN., Huynh, VN. (eds) Advances in Information and Communication Technology. ICTA 2016. Advances in Intelligent Systems and Computing, vol 538. Springer, Cham. https://doi.org/10.1007/978-3-319-49073-1_53

DOI: https://doi.org/10.1007/978-3-319-49073-1_53
Published: 12 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49072-4
Online ISBN: 978-3-319-49073-1
eBook Packages: Engineering, Engineering (R0)
