Abstract
Co-speech gestures play a significant role in conveying information. For social agents, producing realistic and smooth gestures is crucial for natural interaction with humans, a challenging task that depends on many factors (e.g., speech audio, content, and the interacting person). In this paper, we tackle the cross-modal fusion problem with a novel fusion mechanism for end-to-end learning-based co-speech gesture generation. In particular, we employ parallel directional cross-modal transformers together with an interactive, cascaded 2D attention module to achieve selective fusion of gesture-related cues. In addition, we propose new metrics that evaluate gesture diversity and speech-gesture correspondence without requiring 3D pose annotations. Experiments on a public dataset show that the proposed method produces diverse, human-like poses and outperforms competitive state-of-the-art methods in both objective and subjective evaluations.
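To make the fusion idea concrete, below is a minimal PyTorch sketch of parallel directional cross-modal attention, where each modality's queries attend to the other modality's features before the two streams are merged. This is an illustrative reading of the abstract only, not the paper's exact architecture: all module names, dimensions, and the final concatenate-and-project fusion step are assumptions.

```python
# Illustrative sketch of parallel directional cross-modal fusion.
# Module names, dimensions, and the fusion step are assumptions,
# not the paper's exact design.
import torch
import torch.nn as nn


class DirectionalCrossModalBlock(nn.Module):
    """One direction of cross-modal attention: queries from the target
    modality attend to keys/values from the source modality."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # Target queries selectively pull gesture-relevant cues from source.
        fused, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + fused)
        return self.norm2(x + self.ff(x))


class ParallelCrossModalFusion(nn.Module):
    """Two parallel directional blocks (audio->text and text->audio),
    concatenated and projected as a stand-in for a selective fusion module."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_to_text = DirectionalCrossModalBlock(dim)
        self.text_to_audio = DirectionalCrossModalBlock(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        a = self.audio_to_text(audio, text)  # audio queries attend to text
        t = self.text_to_audio(text, audio)  # text queries attend to audio
        return self.proj(torch.cat([a, t], dim=-1))


if __name__ == "__main__":
    audio_feats = torch.randn(2, 34, 256)  # (batch, frames, feature dim)
    text_feats = torch.randn(2, 34, 256)
    fused = ParallelCrossModalFusion()(audio_feats, text_feats)
    print(fused.shape)  # torch.Size([2, 34, 256])
```

The directional split lets each modality decide which cues to borrow from the other, rather than forcing a single symmetric attention map over both streams; the fused representation would then drive a gesture decoder in a GAN-style generator.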
Acknowledgements
This research was supported by the National Natural Science Foundation of China (62306029, 62076024, 62006018, U22B2055), the National Key Research and Development Program of China (2020AAA0109700), the CCF-Tencent Rhino-Bird Open Research Fund, Special Projects in Key Areas of the Guangdong Provincial Department of Education (2023ZDZX1006), and the Science and Technology Program (Key R&D Program) of Guangzhou (2023B01J0004), China.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qian, X., Tang, H., Yang, J. et al. Dual-Path Transformer-Based GAN for Co-speech Gesture Synthesis. Int J of Soc Robotics (2024). https://doi.org/10.1007/s12369-024-01136-y