Design of a Speech Corpus for Research on Cross-Lingual Prosody Transfer

  • Milan Sečujski
  • Branislav Gerazov
  • Tamás Gábor Csapó
  • Vlado Delić
  • Philip N. Garner
  • Aleksandar Gjoreski
  • David Guennec
  • Zoran Ivanovski
  • Aleksandar Melov
  • Géza Németh
  • Ana Stojković
  • György Szaszák
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)

Abstract

The prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude. Prosody models and prosody generation modules have therefore played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially in systems intended not only to sound natural but also to convey emotion or a particular speaker intention. Prosody transfer within speech-to-speech translation is a recent research area of increasing importance. One of its most important research topics is the detection and treatment of salient events, i.e. instances of prominence or focus that do not result from syntactic constraints but are instead products of semantic or pragmatic effects. This paper presents the design of, and guidelines for creating, a multilingual speech corpus of prosodically rich sentences, ultimately aimed at training statistical prosody models for multilingual prosody transfer in the context of expressive speech synthesis.

Keywords

Prosody, Speech corpus, Speech synthesis, Speech-to-speech translation

Acknowledgments

The authors would like to acknowledge the support of the Swiss National Science Foundation via the research project “SP2: SCOPES Project on Speech Prosody”.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Milan Sečujski (1)
  • Branislav Gerazov (2)
  • Tamás Gábor Csapó (3)
  • Vlado Delić (1)
  • Philip N. Garner (4)
  • Aleksandar Gjoreski (2)
  • David Guennec (5)
  • Zoran Ivanovski (2)
  • Aleksandar Melov (2)
  • Géza Németh (3)
  • Ana Stojković (2)
  • György Szaszák (3)

  1. Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
  2. Faculty of Electrical Engineering and Information Technologies, University of Ss. Cyril and Methodius, Skopje, Macedonia
  3. Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
  4. Idiap Research Institute, Martigny, Switzerland
  5. IRISA Research Institute, Rennes, France