
InterGen: Diffusion-Based Multi-human Motion Generation Under Complex Interactions

International Journal of Computer Vision

Abstract

Diffusion models have recently made tremendous progress in generating realistic human motions, yet they largely disregard multi-human interactions. In this paper, we present InterGen, an effective diffusion-based approach that lets non-expert users customize high-quality two-person interaction motions with only text guidance. We first contribute a multimodal dataset, named InterHuman, which consists of about 107M frames of diverse two-person interactions with accurate skeletal motions and 23,337 natural-language descriptions. On the algorithm side, we carefully tailor the motion diffusion model to our two-person interaction setting. To handle the symmetry of human identities during interactions, we propose two cooperative transformer-based denoisers that explicitly share weights, with a mutual attention mechanism that further connects the two denoising processes. We then propose a novel representation for the motion input of our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame. We further introduce two novel regularization terms to encode spatial relations, equipped with a corresponding damping scheme during training. Extensive experiments validate the effectiveness of InterGen (https://tr3e.github.io/intergen-page/). Notably, it generates more diverse and compelling two-person motions than previous methods and enables various downstream applications for human interactions.
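A minimal sketch can make the cooperative denoiser design above concrete. The following PyTorch code is an illustrative assumption, not the authors' implementation: the module names (MutualAttentionBlock, CooperativeDenoiser), layer sizes, and normalization placement are ours, and conditioning on the diffusion timestep and the text embedding is omitted for brevity. It shows only the two ideas named in the abstract: a single set of weights applied symmetrically to both persons, and a mutual (cross-) attention that lets each denoising stream attend to the other.

    import torch
    import torch.nn as nn

    class MutualAttentionBlock(nn.Module):
        """One denoiser block: self-attention over a person's own motion
        tokens, then cross-attention over the partner's tokens (the
        'mutual attention' connecting the two denoising processes)."""

        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.norm3 = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor, partner: torch.Tensor) -> torch.Tensor:
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h)[0]
            # Mutual attention: queries come from this person's stream,
            # keys/values from the partner's stream.
            p = self.norm2(partner)
            x = x + self.cross_attn(self.norm2(x), p, p)[0]
            return x + self.ff(self.norm3(x))

    class CooperativeDenoiser(nn.Module):
        """A single weight-shared denoiser applied symmetrically to both
        persons; swapping the two inputs simply swaps the two outputs,
        which respects the identity symmetry of the interaction."""

        def __init__(self, dim: int = 512, depth: int = 8):
            super().__init__()
            self.blocks = nn.ModuleList(
                [MutualAttentionBlock(dim) for _ in range(depth)]
            )

        def forward(self, xa: torch.Tensor, xb: torch.Tensor):
            for blk in self.blocks:
                # The right-hand side is evaluated before assignment, so
                # both persons are updated from the same pre-update states.
                xa, xb = blk(xa, xb), blk(xb, xa)
            return xa, xb

    # Example: two streams of 64 motion tokens with feature dimension 512.
    # xa, xb = torch.randn(1, 64, 512), torch.randn(1, 64, 512)
    # ya, yb = CooperativeDenoiser()(xa, xb)

In the full method, each block would additionally be conditioned on the diffusion timestep and the text embedding, and training would include the spatial-relation regularization terms with their damping scheme; those pieces are deliberately left out of this sketch.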


Notes

  1. The captured skeletal motions and text annotations are available at https://tr3e.github.io/intergen-page/.


Author information


Corresponding author

Correspondence to Lan Xu.

Additional information

Communicated by Jean-Sébastien Franco.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, H., Zhang, W., Li, W. et al. InterGen: Diffusion-Based Multi-human Motion Generation Under Complex Interactions. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02042-6

