Abstract
Co-speech gestures play a significant role in conveying information. For social agents, producing realistic and smooth gestures is crucial for natural interaction with humans, a challenging task that depends on many factors (e.g., speech audio, content, and the interacting person). In this paper, we tackle the cross-modal fusion problem with a novel fusion mechanism for end-to-end learning-based co-speech gesture generation. In particular, we employ parallel directional cross-modal transformers together with an interactive, cascaded 2D attention module to achieve selective fusion of gesture-related cues. In addition, we propose new metrics that evaluate gesture diversity and speech-gesture correspondence without requiring 3D pose annotations. Experiments on a public dataset show that the proposed method produces diverse, human-like poses and outperforms competitive state-of-the-art methods in both objective and subjective evaluations.
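To make the fusion idea concrete, below is a minimal PyTorch sketch of parallel directional cross-modal attention, where each modality's queries attend to the other modality's features before the two streams are merged. This is an illustrative reading of the abstract only, not the paper's exact architecture: all module names, dimensions, and the final concatenate-and-project fusion step are assumptions.

```python
# Illustrative sketch of parallel directional cross-modal fusion.
# Module names, dimensions, and the fusion step are assumptions,
# not the paper's exact design.
import torch
import torch.nn as nn


class DirectionalCrossModalBlock(nn.Module):
    """One direction of cross-modal attention: queries from the target
    modality attend to keys/values from the source modality."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # Target queries selectively pull gesture-relevant cues from source.
        fused, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + fused)
        return self.norm2(x + self.ff(x))


class ParallelCrossModalFusion(nn.Module):
    """Two parallel directional blocks (audio->text and text->audio),
    concatenated and projected as a stand-in for a selective fusion module."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_to_text = DirectionalCrossModalBlock(dim)
        self.text_to_audio = DirectionalCrossModalBlock(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        a = self.audio_to_text(audio, text)  # audio queries attend to text
        t = self.text_to_audio(text, audio)  # text queries attend to audio
        return self.proj(torch.cat([a, t], dim=-1))


if __name__ == "__main__":
    audio_feats = torch.randn(2, 34, 256)  # (batch, frames, feature dim)
    text_feats = torch.randn(2, 34, 256)
    fused = ParallelCrossModalFusion()(audio_feats, text_feats)
    print(fused.shape)  # torch.Size([2, 34, 256])
```

The directional split lets each modality decide which cues to borrow from the other, rather than forcing a single symmetric attention map over both streams; the fused representation would then drive a gesture decoder in a GAN-style generator.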
Acknowledgements
This research was supported by the National Natural Science Foundation of China (62306029, 62076024, 62006018, U22B2055), the National Key Research and Development Program of China (2020AAA0109700), the CCF-Tencent Rhino-Bird Open Research Fund, Special Projects in Key Areas of the Guangdong Provincial Department of Education (2023ZDZX1006), and the Science and Technology Program (Key R&D Program) of Guangzhou (2023B01J0004), China.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qian, X., Tang, H., Yang, J. et al. Dual-Path Transformer-Based GAN for Co-speech Gesture Synthesis. Int J of Soc Robotics (2024). https://doi.org/10.1007/s12369-024-01136-y