An end-to-end model for cross-lingual transformation of paralinguistic information

  • Takatomo Kano
  • Shinnosuke Takamichi
  • Sakriani Sakti
  • Graham Neubig
  • Tomoki Toda
  • Satoshi Nakamura


Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition, machine translation and text-to-speech synthesis components, which share information only at the text level. However, spoken communication is different from written communication in that it uses rich acoustic cues such as prosody in order to transmit more information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to make a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. From the many different possible paralinguistic features to handle, in this paper we choose duration and power as a first step, proposing a method that can translate these features from input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.


Paralinguistic information Speech to speech translation Automatic speech recognition Machine translation Text to speech synthesis 



The funding was provided by Japan Society for the Promotion of Science (Grand Nos. 24240032 and 26870371).


  1. Abe M, Nakamura S, Shikano K, Kuwabara H (1988) Voice conversion through vector quantization. In: ICASSP-88, international conference on acoustics, speech, and signal processing, New York City, vol 1, pp 655–658Google Scholar
  2. Aguero PD, Adell J, Bonafonte A (2006) Prosody generation for speech-to-speech translation. In: 2006 IEEE international conference on acoustics speech and signal processing proceedings, Toulouse, FranceGoogle Scholar
  3. Anumanchipalli GK, Oliveira LC, Black AW (2012) Intent transfer in speech-to-speech machine translation. In: 2012 IEEE spoken language technology workshop (SLT), Miami, FL, pp 153–158Google Scholar
  4. Do QT, Sakti S, Neubig G, Toda T, Nakamura S (2015a) Improving translation of emphasis with pause prediction in speech-to-speech translation systems. In: IWSLT 2015: proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 204–208Google Scholar
  5. Do QT, Takamichi S, Sakti S, Neubig G, Toda T, Nakamura S (2015b) Preserving word-level emphasis in speech-to-speech translation using linear regression HSMMs. In: INTERSPEECH 2015, 16th annual conference of the international speech communication association, Dresden, pp 3665–3669Google Scholar
  6. Do QT, Sakti S, Nakamura S (2017) Toward expressive speech translation: a unified sequence-to-sequence LSTMs approach for translating words and emphasis. In: Interspeech 2017, 18th annual conference of the international speech communication association, Stockholm, Sweden, pp 2640–2644Google Scholar
  7. Dreyer M, Dong Y (2015) Apro: all-pairs ranking optimization for mt tuning. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, Denver, CO, pp 1018–1023Google Scholar
  8. Duong L, Anastasopoulos A, Chiang D, Bird S, Cohn T (2016) An attentional model for speech translation without transcription. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, San Diego, CA, pp 949–959Google Scholar
  9. Jiang J, Ahmed Z, Carson-Berndsen J, Cahill P, Way A (2011) Phonetic representation-based speech translation. In: Proceedings of machine translation summit XIII, Xiamen, China, pp 81–88Google Scholar
  10. Kano T, Sakti S, Takamichi S, Neubig G, Toda T, Nakamura S (2012) A method for translation of paralinguistic information. In: 2012 International workshop on spoken language translation, Hong Kong, pp 158–163Google Scholar
  11. Kano T, Takamichi S, Sakti S, Neubig G, Toda T, Nakamura S (2013) Generalizing continuous-space translation of paralinguistic information. In: INTERSPEECH 2013, 14th Annual conference of the international speech communication association, Lyon, France, pp 2614–2618Google Scholar
  12. Koehn P, Hoang H (2007) Factored translation models. In: EMNLP-CoNLL-2007: proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague, Czech Republic, pp 868–876Google Scholar
  13. Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, Prague, Czech Republic, pp 177–180Google Scholar
  14. Leonard R (1984) A database for speaker-independent digit recognition. In: ICASSP ’84. IEEE international conference on acoustics, speech, and signal processing, San Diego, CA, pp 328–331Google Scholar
  15. Morishima S, Nakamura S (2002) Multi-modal translation system and its evaluation. In: Proceedings of the fourth IEEE international conference on multimodal interfaces, Pittsburgh, PA, pp 241–246Google Scholar
  16. Neubig G, Duh K, Ogushi M, Kano T, Kiso T, Sakti S, Toda T, Nakamura S (2012) The NAIST machine translation system for IWSLT 2012. In: IWSLT-2012: 9th international workshop on spoken language translation, Hong Kong, pp 54–60Google Scholar
  17. Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Philadelphia, PA, pp 311–318Google Scholar
  18. Pearce D, Hirsch HG (2000) The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR2000—Automatic speech recognition: challenges for the new millenium, Paris, France, pp 181–188Google Scholar
  19. Sridhar VKR, Bangalore S, Narayanan S (2013) Enriching machine-mediated speech-to-speech translation using contextual information. Comput Speech Lang 27(2):492–508CrossRefGoogle Scholar
  20. Székely É, Steiner I, Ahmed Z, Carson-Berndsen J (2014) Facial expression-based affective speech translation. J Multimodal User Interfaces 8(1):87–96CrossRefGoogle Scholar
  21. Takezawa T, Morimoto T, Sagisaka Y, Campbell N, Iida H, Sugaya F, Yokoo A, Yamamoto S (1998) A Japanese-to-English speech translation system: ATR-MATRIX. In: 5th international conference on spoken language processing, ICSLP’98 proceedings, Sydney, Australia, pp 2779–2883Google Scholar
  22. Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235CrossRefGoogle Scholar
  23. Wahlster W (2001) Robust translation of spontaneous speech: a multi-engine approach. In: Proceedings of seventeenth international joint conference on artificial intelligence, invited papers, Seattle, WA, pp 19–28Google Scholar
  24. Weiss RJ, Chorowski J, Jaitly N, Wu Y, Chen Z (2017) Sequence-to-sequence models can directly transcribe foreign speech. arXiv:1703.08581
  25. Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51(11):1039–1064CrossRefGoogle Scholar
  26. Zhang J, Nakamura S (2003) An efficient algorithm to search for a minimum sentence set for collecting speech database. In: 15th international congress of phonetic sciences (ICPhS-15), Barcelona, Spain, pp 3145–3148Google Scholar

Copyright information

© Springer Science+Business Media B.V., part of Springer Nature 2018

Authors and Affiliations

  1. 1.Graduate School of Information ScienceNara Institute of Science and TechnologyKansai Science CityJapan

Personalised recommendations