A CycleGAN-Based Method for Translating Recordings of Interjections

  • Conference paper
  • Published in: Creativity in Intelligent Technologies and Data Science (CIT&DS 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1909)

Abstract

This article presents a new method for translating audio recordings of interjections between two domains: within a given voice pair, the original speaker's voice is transformed into the target one, and vice versa. Based on a review of related work, mel-frequency cepstral coefficients are selected as the main signal features. The method is implemented and validated on a book-reading training dataset and on sample interjections for a number of voice pairs, including a human-robotic one. Recommendations are given on the applicability of the method and on dataset recording. The obtained results show that the method overcomes two constraints of existing speech-synthesizing software: the limited inventory of interjection forms and the poor variety of intonations. The method is intended for filling reaction databases of dialogue systems designed for affective communication, such as the F-2 interlocutor robot: it enables developers of spoken-reaction corpora to record interjections of interest and translate them into the selected synthetic voice.
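
To make the pipeline concrete, below is a minimal sketch of its two central ingredients: MFCC feature extraction and the CycleGAN cycle-consistency objective. It assumes a librosa/PyTorch toolchain; the sampling rate, number of coefficients, and generator interfaces are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch: MFCC features plus the CycleGAN cycle-consistency
# loss for a voice pair (domain A = source voice, domain B = target voice).
# Parameter values are placeholders, not the paper's exact settings.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 24) -> torch.Tensor:
    """Load an interjection recording and compute its MFCC matrix."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc).float()  # shape: (n_mfcc, n_frames)

def cycle_consistency_loss(G_ab: nn.Module, G_ba: nn.Module,
                           real_a: torch.Tensor, real_b: torch.Tensor,
                           lam: float = 10.0) -> torch.Tensor:
    """L1 reconstruction error for A -> B -> A and B -> A -> B round trips.

    G_ab and G_ba are the two generators mapping features between domains;
    during training this term is added to the usual adversarial losses.
    """
    l1 = nn.L1Loss()
    rec_a = G_ba(G_ab(real_a))  # translate A to B, then back to A
    rec_b = G_ab(G_ba(real_b))  # translate B to A, then back to B
    return lam * (l1(rec_a, real_a) + l1(rec_b, real_b))
```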

Acknowledgements

This research is supported by the Russian Science Foundation, grant № 19-18-00547, https://rscf.ru/project/19-18-00547/.

We would like to thank our volunteer speakers, who helped us record the dataset, and our assessors, whose opinions are invaluable to projects of this kind when it comes to crowdsourcing (only the final CycleGAN results assessments are stated in this article; intermediate translations were evaluated thoroughly as well). We also express gratitude to Edward Klyshinsky for his repeated affirmative “mhm”, which led us to the answer to the puzzling “oohoo” question.

We would like to thank Alexey Yuryevich Popov from the IU-6 department, BMSTU, for his kind permission to use his high-performance computing devices to train our CycleGANs, and Kirill Leonidovich Tassov from the IU-7 department, BMSTU, for his invaluable remarks, which helped us improve our results.

Author information

Corresponding author

Correspondence to Liliya Volkova.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Polianskaya, L., Volkova, L. (2023). A CycleGAN-Based Method for Translating Recordings of Interjections. In: Kravets, A.G., Shcherbakov, M.V., Groumpos, P.P. (eds) Creativity in Intelligent Technologies and Data Science. CIT&DS 2023. Communications in Computer and Information Science, vol 1909. Springer, Cham. https://doi.org/10.1007/978-3-031-44615-3_6

  • DOI: https://doi.org/10.1007/978-3-031-44615-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44614-6

  • Online ISBN: 978-3-031-44615-3

  • eBook Packages: Computer Science, Computer Science (R0)
