
A Shape-Aware Retargeting Approach to Transfer Human Motion and Appearance in Monocular Videos

Abstract

Transferring human motion and appearance between videos of human actors remains one of the key challenges in computer vision. Despite advances in recent image-to-image translation approaches, there are several transfer contexts where most end-to-end learning-based retargeting methods still perform poorly. Transferring human appearance from one actor to another is guaranteed only under a strict setup, generally built around the specificities of each method's training regime. In this work, we propose a shape-aware approach based on a hybrid image-based rendering technique that achieves visual retargeting quality competitive with state-of-the-art neural rendering approaches. The formulation incorporates the user's body shape into the retargeting while enforcing physical constraints on the motion in both 3D and the 2D image domain. We also present a new video retargeting benchmark dataset composed of different videos with annotated human motions for evaluating the task of synthesizing videos of people, which can serve as a common base for tracking progress in the field. The dataset and its evaluation protocols are designed to assess retargeting methods under more general and challenging conditions. Our method is validated in several experiments comprising publicly available videos of actors with different shapes, motion types, and camera setups. The dataset and retargeting code are publicly available to the community at: https://www.verlab.dcc.ufmg.br/retargeting-motion.




Acknowledgements

The authors thank CAPES, CNPq, and FAPEMIG for funding this work. We also thank NVIDIA for the donation of a Titan XP GPU used in this research.

Author information

Corresponding author

Correspondence to Thiago L. Gomes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Dima Damen.


About this article


Cite this article

Gomes, T.L., Martins, R., Ferreira, J. et al. A Shape-Aware Retargeting Approach to Transfer Human Motion and Appearance in Monocular Videos. Int J Comput Vis 129, 2057–2075 (2021). https://doi.org/10.1007/s11263-021-01471-x


Keywords

  • Motion retargeting
  • Human image synthesis
  • Human motion
  • Video-to-video translation
  • Image manipulation