BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking

Henning, Dorian F.; Laidlow, Tristan; Leutenegger, Stefan

doi:10.1007/978-3-031-19842-7_38

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13699))

Included in the following conference series:

European Conference on Computer Vision

Abstract

Estimating human motion from video is an active research area due to its many potential applications. Most state-of-the-art methods predict human shape and posture estimates for individual images and do not leverage the temporal information available in video. Many “in the wild" sequences of human motion are captured by a moving camera, which adds the complication of conflated camera and human motion to the estimation. We therefore present BodySLAM, a monocular SLAM system that jointly estimates the position, shape, and posture of human bodies, as well as the camera trajectory. We also introduce a novel human motion model to constrain sequential body postures and observe the scale of the scene. Through a series of experiments on video sequences of human motion captured by a moving monocular camera, we demonstrate that BodySLAM improves estimates of all human body parameters and camera poses when compared to estimating these separately.

Research presented in this paper has been supported by Dyson Technology Ltd. and the Technical University of Munich.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
We encourage the reader to view our supplementary video available at:
https://youtu.be/0-SL3VeWEvU.

References

Agarwal, S., Mierle, K., et al.: Ceres Solver. http://ceres-solver.org
Akhter, I., Black, M.J.: Pose-conditioned joint angle limits for 3D human pose reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Google Scholar
Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. ACM Trans. Graph. (2005)
Google Scholar
Arnab, A., Doersch, C., Zisserman, A.: Exploiting temporal context for 3D human pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Google Scholar
Barnes, D., Maddern, W., Pascoe, G., Posner, I.: Driven to distraction: self-supervised distractor learning for robust monocular visual odometry in urban environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2018)
Google Scholar
Bârsan, I.A., Liu, P., Pollefeys, M., Geiger, A.: Robust dense mapping for large-scale dynamic environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2018)
Google Scholar
Bescos, B., Fácil, J.M., Civera, J., Neira, J.: DynaSLAM: tracking, mapping and inpainting in dynamic scenes. Technical report (2018)
Google Scholar
Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep It SMPL: automatic estimation of 3D human pose and shape from a single image. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 561–578. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_34
Chapter Google Scholar
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Google Scholar
Catalin Ionescu Fuxin Li, C.S.: Latent structured models for human pose estimation. In: Proceedings of the International Conference on Computer Vision (ICCV) (2011)
Google Scholar
Dabral, R., Mundhada, A., Kusupati, U., Afaque, S., Sharma, A., Jain, A.: Learning 3D human pose from structure and motion. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 679–696. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_41
Chapter Google Scholar
Dai, W., Zhang, Y., Li, P., Fang, Z., Scherer, S.: RGB-D SLAM in dynamic environments using point correlations. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 44, 373–389 (2020)
Google Scholar
Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Habibie, I., Xu, W., Mehta, D., Pons-Moll, G., Theobalt, C.: In the wild human pose estimation using explicit 2D features and intermediate 3D representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Google Scholar
Henein, M., Kennedy, G., Mahony, R., Ila, V.: Exploiting rigid body motion for SLAM in dynamic environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2018)
Google Scholar
Henein, M., Zhang, J., Mahony, R., Ila, V.: Dynamic SLAM: the need for speed. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2020)
Google Scholar
Henning, D., Guler, A., Leutenegger, S., Zafeiriou, S.: HPE3D: human pose estimation in 3D (2020). https://github.com/dorianhenning/hpe3d
Hossain, M.R.I., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 69–86. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_5
Chapter Google Scholar
Huang, Y., et al.: Towards accurate marker-less human shape and pose estimation over time. In: Proceedings of the International Conference on 3D Vision (3DV) (2017)
Google Scholar
Jaimez, M., Kerl, C., Gonzalez-Jimenez, J., Cremers, D.: Fast odometry and scene flow from RGB-D cameras based on geometric clustering. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2017)
Google Scholar
Ji, T., Wang, C., Xie, L.: Towards real-time semantic RGB-D SLAM in dynamic environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2021)
Google Scholar
Judd, K.M., Gammell, J.D., Newman, P.: Multimotion Visual Odometry (MVO): simultaneous estimation of camera and third-party motions. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2018)
Google Scholar
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Google Scholar
Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Google Scholar
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015)
Google Scholar
Kocabas, M., Athanasiou, N., Black, M.J.: VIBE: video inference for human body pose and shape estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Google Scholar
Kocabas, M., Huang, C.H.P., Tesch, J., Müller, L., Hilliges, O., Black, M.J.: SPEC: seeing people in the wild with an estimated camera. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
Google Scholar
Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the International Conference on Computer Vision (ICCV) (2019)
Google Scholar
Leutenegger, S., Chli, M., Siegwart, R.Y.: BRISK: binary robust invariant scalable keypoints. In: Proceedings of the International Conference on Computer Vision (ICCV) (2011)
Google Scholar
Leutenegger, S., Lynen, S., Bosse, M., Siegwart, R., Furgale, P.: Keyframe-based visual-inertial odometry using nonlinear optimization. Int. J. Robot. Res. (IJRR) 34, 314–334 (2015)
Google Scholar
Ling, H.Y., Zinno, F., Cheng, G., van de Panne, M.: Character controllers using motion vaes. ACM Trans. Graph. 39, 40 (2020)
Google Scholar
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34, 1–16 (2015)
Google Scholar
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: Proceedings of the International Conference on Computer Vision (ICCV) (2019)
Google Scholar
Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Mur-Artal, R., Tardos, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 33, 1255–1262 (2017)
Google Scholar
Osman, A.A.A., Bolkart, T., Black, M.J.: STAR: sparse trained articulated human body regressor. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 598–613. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_36
Chapter Google Scholar
Paszke, A., et al.: Automatic differentiation in PyTorch. In: Neural Information Processing Systems (NIPS) (2017)
Google Scholar
Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Google Scholar
Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Google Scholar
Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Google Scholar
Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Google Scholar
Qiu, Y., Wang, C., Wang, W., Henein, M., Scherer, S.: AirDOS: dynamic SLAM benefits from articulated objects. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2022)
Google Scholar
Ramakrishna, V., Kanade, T., Sheikh, Y.: Reconstructing 3D human pose from 2D image landmarks. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 573–586. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_41
Chapter Google Scholar
Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: HuMoR: 3D human motion model for robust pose estimation. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
Google Scholar
Rünz, M., Agapito, L.: Co-fusion: real-time segmentation, tracking and fusion of multiple objects. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2017)
Google Scholar
Rünz, M., Buffier, M., Agapito, L.: MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects. In: Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR) (2018)
Google Scholar
Scona, R., Jaimez, M., Petillot, Y.R., Fallon, M., Cremers, D.: StaticFusion: background reconstruction for dense RGB-D SLAM in dynamic environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2018)
Google Scholar
Valmadre, J., Lucey, S.: Deterministic 3D human pose estimation using rigid structure. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6313, pp. 467–480. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15558-1_34
Chapter Google Scholar
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https://github.com/facebookresearch/detectron2
Xu, B., Li, W., Tzoumanikas, D., Bloesch, M., Davison, A., Leutenegger, S.: MID-fusion: octree-based object-level multi-instance dynamic SLAM. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2019)
Google Scholar
Zhao, R., Wang, Y., Martinez, A.: A simple, fast and highly-accurate algorithm to recover 3D shape from 2D landmarks on a single image. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 40, 3059–3066 (2016)
Google Scholar

Download references

Author information

Authors and Affiliations

Imperial College London, London, UK
Dorian F. Henning, Tristan Laidlow & Stefan Leutenegger
Technische Universität München, Munich, Germany
Stefan Leutenegger

Authors

Dorian F. Henning
View author publications
You can also search for this author in PubMed Google Scholar
Tristan Laidlow
View author publications
You can also search for this author in PubMed Google Scholar
Stefan Leutenegger
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dorian F. Henning .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Supplementary material 1 (mp4 5364 KB)

Supplementary material 2 (pdf 505 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Henning, D.F., Laidlow, T., Leutenegger, S. (2022). BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_38

Download citation

DOI: https://doi.org/10.1007/978-3-031-19842-7_38
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking