CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and optimizing mesh style attributes. We build the text-driven motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor suggests a text-conforming human motion in a coarse-to-fine manner. Our novel zero-shot neural style optimization then detailizes and texturizes the recommended mesh sequence to conform to the prompt in a temporally consistent and pose-agnostic manner. This is distinctive in that prior work fails to generate plausible results when the pose of an artist-designed mesh does not conform to the text from the beginning. We further propose spatio-temporal view augmentation and mask-weighted embedding attention, which stabilize the optimization by leveraging multi-frame human motion and rejecting poorly rendered views. We demonstrate that CLIP-Actor produces plausible, human-recognizable stylized 3D human meshes in motion, with detailed geometry and texture, solely from a natural language prompt.

K. Youwang and K. Ji-Yeon contributed equally to this work.

T.-H. Oh is jointly affiliated with Yonsei University, Korea.



Acknowledgment

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-00164860, Development of Human Digital Twin Technology Based on Dynamic Behavior Modeling and Human-Object-Space Interaction; and No. 2021-0-02068, Artificial Intelligence Innovation Hub).

Author information

Correspondence to Kim Youwang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 7549 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Youwang, K., Ji-Yeon, K., Oh, TH. (2022). CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13663. Springer, Cham. https://doi.org/10.1007/978-3-031-20062-5_11

  • DOI: https://doi.org/10.1007/978-3-031-20062-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20061-8

  • Online ISBN: 978-3-031-20062-5

  • eBook Packages: Computer Science, Computer Science (R0)
