Speech synthesis for glottal activity region processing
The objective of this paper is to demonstrate the significance of combining the different features present in the glottal activity region for statistical parametric speech synthesis (SPSS). These features are broadly categorized as F0, system, and source features, which together characterize the quality of speech. The F0 feature is computed using the zero-frequency filter, and the system feature is derived from a 2-D Riesz transform. The source features comprise an aperiodicity component and a phase component: the aperiodicity component, which represents the amount of aperiodic energy present in a frame, is computed from the Riesz transform, while the phase component is obtained by modeling the integrated linear prediction residual. The combined features yield better quality than STRAIGHT-based SPSS in both objective and subjective evaluations. Further, the proposed method is extended to two Indian languages, Assamese and Manipuri, where it shows similar improvements in quality.
Keywords: Glottal activity region · Speech synthesis · Statistical parametric speech synthesis · Voicing decision · Riesz transform
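The F0 feature mentioned above comes from the zero-frequency filter (ZFF), which passes the differenced speech signal through a cascade of two zero-frequency resonators (implementable as cumulative sums) and then removes the resulting polynomial trend with a local-mean subtraction; positive-going zero crossings of the trend-removed output mark glottal closure instants (epochs), and F0 follows from their spacing. The following NumPy sketch illustrates this idea only — the function name, window length, and test signal are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def zff_epochs(x, fs, win_ms=10.0):
    """Zero-frequency-filter a signal and return (filtered signal, epoch indices)."""
    # Difference the signal to remove DC and low-frequency bias.
    s = np.diff(x, prepend=x[0])
    # Cascade of two zero-frequency resonators: two cumulative sums.
    y = np.cumsum(np.cumsum(s))
    # Remove the polynomial trend by subtracting a local mean, twice.
    # The window should be on the order of the average pitch period.
    w = int(fs * win_ms / 1000) | 1  # force an odd window length
    kernel = np.ones(w) / w
    for _ in range(2):
        y = y - np.convolve(y, kernel, mode="same")
    # Positive-going zero crossings of the trend-removed output mark epochs.
    epochs = np.nonzero((y[:-1] < 0) & (y[1:] >= 0))[0]
    return y, epochs

# Illustrative usage on a synthetic 100 Hz tone sampled at 8 kHz:
fs = 8000
t = np.arange(int(0.5 * fs)) / fs
x = np.sin(2 * np.pi * 100.0 * t)
y, epochs = zff_epochs(x, fs)
# F0 estimate from the median inter-epoch interval (interior epochs only,
# to avoid convolution edge effects).
interior = epochs[(epochs > 400) & (epochs < len(x) - 400)]
f0_hat = fs / np.median(np.diff(interior))
```

For voiced speech, the inter-epoch intervals vary frame to frame, so in practice F0 would be computed per analysis frame rather than globally as in this toy example.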