
An end-to-end model for cross-lingual transformation of paralinguistic information

Machine Translation

Abstract

Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis components, which share information only at the text level. However, spoken communication differs from written communication in that it uses rich acoustic cues, such as prosody, to transmit additional information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. Of the many paralinguistic features that could be handled, in this paper we choose duration and power as a first step, proposing a method that translates these features from the input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
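The linear-regression variant of the mapping described above can be illustrated with a small sketch. This is not the authors' implementation; the function names and feature dimensions are hypothetical, and it assumes only that per-word duration/power features of the source utterance are mapped to target-side features by a least-squares linear transform with a bias term.

```python
import numpy as np

def train_paralinguistic_mapping(X, Y):
    """Fit a linear map W minimizing ||[X, 1] W - Y||^2,
    taking source-side duration/power feature vectors X
    (one row per training utterance) to target-side
    feature vectors Y. A bias column is appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias term
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def apply_mapping(W, x):
    """Map one source feature vector to the target side."""
    xb = np.append(x, 1.0)
    return xb @ W

# Toy data: 2-dim source features related to 2-dim target
# features by a noisy linear relation y = 1.5 x + 0.2.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
Y = X * 1.5 + 0.2 + rng.normal(0.0, 0.01, (50, 2))

W = train_paralinguistic_mapping(X, Y)
pred = apply_mapping(W, np.array([0.4, 0.6]))
```

The neural-network variant described in the paper would replace the single linear transform with a nonlinear regressor trained on the same source/target feature pairs.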


Notes

  1. Part of the content of this article is based on material previously published at IWSLT and Interspeech (Kano et al. 2012, 2013). In this paper we describe these methods using a unified formulation, add a more complete survey, and discuss the results in significantly more depth.

References

  • Abe M, Nakamura S, Shikano K, Kuwabara H (1988) Voice conversion through vector quantization. In: ICASSP-88, international conference on acoustics, speech, and signal processing, New York City, vol 1, pp 655–658

  • Aguero PD, Adell J, Bonafonte A (2006) Prosody generation for speech-to-speech translation. In: 2006 IEEE international conference on acoustics speech and signal processing proceedings, Toulouse, France

  • Anumanchipalli GK, Oliveira LC, Black AW (2012) Intent transfer in speech-to-speech machine translation. In: 2012 IEEE spoken language technology workshop (SLT), Miami, FL, pp 153–158

  • Do QT, Sakti S, Neubig G, Toda T, Nakamura S (2015a) Improving translation of emphasis with pause prediction in speech-to-speech translation systems. In: IWSLT 2015: proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 204–208

  • Do QT, Takamichi S, Sakti S, Neubig G, Toda T, Nakamura S (2015b) Preserving word-level emphasis in speech-to-speech translation using linear regression HSMMs. In: INTERSPEECH 2015, 16th annual conference of the international speech communication association, Dresden, pp 3665–3669

  • Do QT, Sakti S, Nakamura S (2017) Toward expressive speech translation: a unified sequence-to-sequence LSTMs approach for translating words and emphasis. In: Interspeech 2017, 18th annual conference of the international speech communication association, Stockholm, Sweden, pp 2640–2644

  • Dreyer M, Dong Y (2015) APRO: all-pairs ranking optimization for MT tuning. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, Denver, CO, pp 1018–1023

  • Duong L, Anastasopoulos A, Chiang D, Bird S, Cohn T (2016) An attentional model for speech translation without transcription. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, San Diego, CA, pp 949–959

  • Jiang J, Ahmed Z, Carson-Berndsen J, Cahill P, Way A (2011) Phonetic representation-based speech translation. In: Proceedings of machine translation summit XIII, Xiamen, China, pp 81–88

  • Kano T, Sakti S, Takamichi S, Neubig G, Toda T, Nakamura S (2012) A method for translation of paralinguistic information. In: 2012 International workshop on spoken language translation, Hong Kong, pp 158–163

  • Kano T, Takamichi S, Sakti S, Neubig G, Toda T, Nakamura S (2013) Generalizing continuous-space translation of paralinguistic information. In: INTERSPEECH 2013, 14th Annual conference of the international speech communication association, Lyon, France, pp 2614–2618

  • Koehn P, Hoang H (2007) Factored translation models. In: EMNLP-CoNLL-2007: proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague, Czech Republic, pp 868–876

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, Prague, Czech Republic, pp 177–180

  • Leonard R (1984) A database for speaker-independent digit recognition. In: ICASSP ’84. IEEE international conference on acoustics, speech, and signal processing, San Diego, CA, pp 328–331

  • Morishima S, Nakamura S (2002) Multi-modal translation system and its evaluation. In: Proceedings of the fourth IEEE international conference on multimodal interfaces, Pittsburgh, PA, pp 241–246

  • Neubig G, Duh K, Ogushi M, Kano T, Kiso T, Sakti S, Toda T, Nakamura S (2012) The NAIST machine translation system for IWSLT 2012. In: IWSLT-2012: 9th international workshop on spoken language translation, Hong Kong, pp 54–60

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Philadelphia, PA, pp 311–318

  • Pearce D, Hirsch HG (2000) The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR2000—Automatic speech recognition: challenges for the new millenium, Paris, France, pp 181–188

  • Sridhar VKR, Bangalore S, Narayanan S (2013) Enriching machine-mediated speech-to-speech translation using contextual information. Comput Speech Lang 27(2):492–508


  • Székely É, Steiner I, Ahmed Z, Carson-Berndsen J (2014) Facial expression-based affective speech translation. J Multimodal User Interfaces 8(1):87–96


  • Takezawa T, Morimoto T, Sagisaka Y, Campbell N, Iida H, Sugaya F, Yokoo A, Yamamoto S (1998) A Japanese-to-English speech translation system: ATR-MATRIX. In: 5th international conference on spoken language processing, ICSLP’98 proceedings, Sydney, Australia, pp 2779–2883

  • Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235


  • Wahlster W (2001) Robust translation of spontaneous speech: a multi-engine approach. In: Proceedings of seventeenth international joint conference on artificial intelligence, invited papers, Seattle, WA, pp 19–28

  • Weiss RJ, Chorowski J, Jaitly N, Wu Y, Chen Z (2017) Sequence-to-sequence models can directly transcribe foreign speech. arXiv:1703.08581

  • Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51(11):1039–1064


  • Zhang J, Nakamura S (2003) An efficient algorithm to search for a minimum sentence set for collecting speech database. In: 15th international congress of phonetic sciences (ICPhS-15), Barcelona, Spain, pp 3145–3148


Acknowledgements

This work was funded by the Japan Society for the Promotion of Science (Grant Nos. 24240032 and 26870371).

Author information


Corresponding author

Correspondence to Takatomo Kano.


About this article


Cite this article

Kano, T., Takamichi, S., Sakti, S. et al. An end-to-end model for cross-lingual transformation of paralinguistic information. Machine Translation 32, 353–368 (2018). https://doi.org/10.1007/s10590-018-9217-7
