Abstract
Generating stylized audio-driven gestures for robots and virtual avatars has attracted increasing attention recently. Existing methods require style labels (e.g., speaker identities) or complex data preprocessing to obtain style control parameters. In this paper, we propose a new end-to-end flow-based model that can generate audio-driven gestures of arbitrary styles without preprocessing or style labels. To achieve this goal, we introduce a global encoder and a gesture perceptual loss into the classic generative flow model to capture both global and local information. We conduct extensive experiments on two benchmark datasets: the TED Dataset and the Trinity Dataset. Both quantitative and qualitative evaluations show that the proposed model outperforms state-of-the-art models.
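The abstract does not describe the architecture in detail. As a rough illustration of what a "generative flow model" conditioned on audio involves, the sketch below implements one generic conditional affine coupling step of the kind used in Glow-style flows. It is not the authors' model: the conditioner is a toy linear map with fixed random weights, and every name in it (`scale_and_shift`, `coupling_forward`, `coupling_inverse`, `audio_and_style`) is hypothetical. The point it shows is that the transform is exactly invertible and has a cheap log-determinant, which is what allows likelihood training in one direction and sampling in the other.

```python
import math
import random

def make_weights(n_out, n_in, rng):
    # Fixed random weights standing in for a trained conditioner network (assumption).
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def scale_and_shift(x_a, cond, w_s, w_t):
    # Features: the untouched half of the pose vector plus the conditioning vector.
    h = x_a + cond
    # tanh keeps each log-scale in (-1, 1) so the transform stays well conditioned.
    log_s = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in w_s]
    t = [sum(w * v for w, v in zip(row, h)) for row in w_t]
    return log_s, t

def coupling_forward(x, cond, w_s, w_t):
    # Split the vector; the first half passes through unchanged.
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_and_shift(x_a, cond, w_s, w_t)
    y_b = [xb * math.exp(ls) + tt for xb, ls, tt in zip(x_b, log_s, t)]
    # The log-determinant of the Jacobian is the sum of the log-scales.
    return x_a + y_b, sum(log_s)

def coupling_inverse(y, cond, w_s, w_t):
    # The unchanged half lets us recompute the same scale and shift, then undo them.
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = scale_and_shift(y_a, cond, w_s, w_t)
    x_b = [(yb - tt) * math.exp(-ls) for yb, ls, tt in zip(y_b, log_s, t)]
    return y_a + x_b

rng = random.Random(0)
pose = [0.3, -1.2, 0.8, 0.1]          # toy 4-dimensional pose frame (assumption)
audio_and_style = [0.5, -0.4, 0.9]    # toy conditioning features (assumption)
w_s = make_weights(2, 2 + 3, rng)
w_t = make_weights(2, 2 + 3, rng)

z, log_det = coupling_forward(pose, audio_and_style, w_s, w_t)
recovered = coupling_inverse(z, audio_and_style, w_s, w_t)
max_error = max(abs(a - b) for a, b in zip(pose, recovered))
```

In a full flow, many such steps are stacked with permutations between them so that every dimension is eventually transformed. How the global encoder and the gesture perceptual loss named in the abstract enter the model is not specified here, so neither is sketched.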
Acknowledgments
This work was supported by the Natural Science Foundation of China (No. 61725204), the Tsinghua University Initiative Scientific Research Program, and the China Postdoctoral Science Foundation (No. 2021M701891).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ye, S. et al. (2022). Audio-Driven Stylized Gesture Generation with Flow-Based Model. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13665. Springer, Cham. https://doi.org/10.1007/978-3-031-20065-6_41
DOI: https://doi.org/10.1007/978-3-031-20065-6_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20064-9
Online ISBN: 978-3-031-20065-6
eBook Packages: Computer Science, Computer Science (R0)