Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8854)

Abstract

Online multimedia repositories are growing rapidly, but language barriers remain difficult to overcome for many current and potential users. In this paper we describe a Spanish text-to-speech (TTS) system and apply it to the synthesis of transcribed and translated video lectures. We have developed a statistical parametric speech synthesis system in which the acoustic mapping is performed with either HMM-based or DNN-based acoustic models. To the best of our knowledge, this is the first time that a DNN-based TTS system has been implemented for the synthesis of Spanish. A comparative objective evaluation of both models has been carried out. Our results show that DNN-based systems can reconstruct speech waveforms more accurately than their HMM-based counterparts.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Piqueras, S., del-Agua, M.A., Giménez, A., Civera, J., Juan, A. (2014). Statistical Text-to-Speech Synthesis of Spanish Subtitles. In: Navarro Mesa, J.L., et al. Advances in Speech and Language Technologies for Iberian Languages. Lecture Notes in Computer Science, vol 8854. Springer, Cham. https://doi.org/10.1007/978-3-319-13623-3_5

  • DOI: https://doi.org/10.1007/978-3-319-13623-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13622-6

  • Online ISBN: 978-3-319-13623-3

  • eBook Packages: Computer Science, Computer Science (R0)
