
InterGen: Diffusion-Based Multi-human Motion Generation Under Complex Interactions

International Journal of Computer Vision

Abstract

Diffusion models have recently made tremendous progress in generating realistic human motions, yet they largely disregard multi-human interactions. In this paper, we present InterGen, an effective diffusion-based approach that lets non-expert users customize high-quality two-person interaction motions with only text guidance. We first contribute a multimodal dataset, named InterHuman, which consists of about 107M frames of diverse two-person interactions with accurate skeletal motions and 23,337 natural-language descriptions. On the algorithm side, we carefully tailor the motion diffusion model to our two-person interaction setting. To handle the symmetry of human identities during interactions, we propose two cooperative transformer-based denoisers that explicitly share weights, with a mutual attention mechanism that further connects the two denoising processes. We then propose a novel representation for the motion input of our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame. We further introduce two novel regularization terms to encode spatial relations, equipped with a corresponding damping scheme during training. Extensive experiments validate the effectiveness of InterGen (https://tr3e.github.io/intergen-page/). Notably, it generates more diverse and compelling two-person motions than previous methods and enables various downstream applications for human interactions.
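A minimal sketch can make the cooperative denoiser design above concrete. The following PyTorch code is an illustrative assumption, not the authors' implementation: the module names (MutualAttentionBlock, CooperativeDenoiser), layer sizes, and normalization placement are ours, and conditioning on the diffusion timestep and the text embedding is omitted for brevity. It shows only the two ideas named in the abstract: a single set of weights applied symmetrically to both persons, and a mutual (cross-) attention that lets each denoising stream attend to the other.

    import torch
    import torch.nn as nn

    class MutualAttentionBlock(nn.Module):
        """One denoiser block: self-attention over a person's own motion
        tokens, then cross-attention over the partner's tokens (the
        'mutual attention' connecting the two denoising processes)."""

        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.norm3 = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor, partner: torch.Tensor) -> torch.Tensor:
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h)[0]
            # Mutual attention: queries come from this person's stream,
            # keys/values from the partner's stream.
            p = self.norm2(partner)
            x = x + self.cross_attn(self.norm2(x), p, p)[0]
            return x + self.ff(self.norm3(x))

    class CooperativeDenoiser(nn.Module):
        """A single weight-shared denoiser applied symmetrically to both
        persons; swapping the two inputs simply swaps the two outputs,
        which respects the identity symmetry of the interaction."""

        def __init__(self, dim: int = 512, depth: int = 8):
            super().__init__()
            self.blocks = nn.ModuleList(
                [MutualAttentionBlock(dim) for _ in range(depth)]
            )

        def forward(self, xa: torch.Tensor, xb: torch.Tensor):
            for blk in self.blocks:
                # The right-hand side is evaluated before assignment, so
                # both persons are updated from the same pre-update states.
                xa, xb = blk(xa, xb), blk(xb, xa)
            return xa, xb

    # Example: two streams of 64 motion tokens with feature dimension 512.
    # xa, xb = torch.randn(1, 64, 512), torch.randn(1, 64, 512)
    # ya, yb = CooperativeDenoiser()(xa, xb)

In the full method, each block would additionally be conditioned on the diffusion timestep and the text embedding, and training would include the spatial-relation regularization terms with their damping scheme; those pieces are deliberately left out of this sketch.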


Notes

  1. The captured skeletal motions and text annotations are available at https://tr3e.github.io/intergen-page/.


Author information


Corresponding author

Correspondence to Lan Xu.

Additional information

Communicated by Jean-Sébastien Franco.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, H., Zhang, W., Li, W. et al. InterGen: Diffusion-Based Multi-human Motion Generation Under Complex Interactions. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02042-6

