
Style Transfer for Co-speech Gesture Animation: A Multi-speaker Conditional-Mixture Approach

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12363)

Abstract

How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent ‘A’ in the gesturing style of a target speaker ‘B’. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker. We call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker’s gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which allows for conditioning on the unique gesture style of each speaker. As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data and videos: http://chahuja.com/mix-stage.
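
To make the conditional-mixture idea concrete, the sketch below gives one possible reading of it in PyTorch: a shared speech (content) encoder, a learned style embedding per speaker, and a set of generator sub-networks whose outputs are combined with style-conditioned mixture weights, so that swapping the speaker identity swaps the gesturing style for the same speech input. All module names, dimensions, and the overall layout are illustrative assumptions, not the authors' released Mix-StAGE implementation (see the project page linked above for that).

```python
# Minimal, illustrative sketch of a multi-speaker conditional-mixture gesture
# generator. Every name and dimension here is an assumption for illustration.
import torch
import torch.nn as nn

class MixtureGestureSketch(nn.Module):
    def __init__(self, num_speakers, audio_dim=128, style_dim=64,
                 pose_dim=104, num_mixtures=8, hidden=256):
        super().__init__()
        self.style = nn.Embedding(num_speakers, style_dim)           # one style vector per speaker
        self.content_enc = nn.GRU(audio_dim, hidden, batch_first=True)  # shared speech/content encoder
        self.generators = nn.ModuleList([                             # mixture of pose generators
            nn.Linear(hidden + style_dim, pose_dim) for _ in range(num_mixtures)
        ])
        self.mix_weights = nn.Linear(style_dim, num_mixtures)         # style -> mixture weights

    def forward(self, audio_feats, speaker_id):
        # audio_feats: (batch, time, audio_dim); speaker_id: (batch,)
        content, _ = self.content_enc(audio_feats)                    # (batch, time, hidden)
        style = self.style(speaker_id)                                # (batch, style_dim)
        style_t = style.unsqueeze(1).expand(-1, content.size(1), -1)  # broadcast style over time
        h = torch.cat([content, style_t], dim=-1)
        poses = torch.stack([g(h) for g in self.generators], dim=-1)  # (batch, time, pose_dim, M)
        w = torch.softmax(self.mix_weights(style), dim=-1)            # (batch, M)
        return (poses * w[:, None, None, :]).sum(-1)                  # style-weighted pose sequence

# Style transfer then amounts to swapping the speaker id (and hence the style
# embedding) while keeping the same audio input.
model = MixtureGestureSketch(num_speakers=25)
audio = torch.randn(2, 64, 128)
poses_a = model(audio, torch.tensor([0, 0]))   # speech rendered in speaker A's style
poses_b = model(audio, torch.tensor([1, 1]))   # same speech, speaker B's style
```

The key design choice this sketch tries to convey is that content (speech) and style (speaker identity) enter the generator through separate pathways, which is what makes switching style embeddings at inference time possible without retraining.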

Keywords

Gesture animation · Style transfer · Co-speech gestures

Acknowledgements

This material is based upon work partially supported by the National Science Foundation (Awards #1750439 and #1722822), the National Institutes of Health, and the InMind project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the National Institutes of Health, and no official endorsement should be inferred.

Supplementary material

Supplementary material 1: 504473_1_En_15_MOESM1_ESM.pdf (7.3 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
  2. Seikei University, Musashino, Japan
