Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video

  • Benjamin Biggs
  • Thomas Roddick
  • Andrew Fitzgibbon
  • Roberto Cipolla
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11365)


We present a system that recovers the 3D shape and motion of a wide variety of quadrupeds from video. The system comprises a machine learning front-end that predicts candidate 2D joint positions, a discrete optimization that finds kinematically plausible joint correspondences, and an energy minimization stage that fits a detailed 3D model to the image. To overcome the limited availability of motion capture training data for animals and the difficulty of generating realistic synthetic training images, the system is designed to work on silhouette data. The joint candidate predictor is trained on synthetically generated silhouette images; at test time, deep learning methods or standard video segmentation tools extract silhouettes from real data. Tested on animal videos from several species, the system produces accurate reconstructions of 3D shape and pose.
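
As a rough illustration of what "candidate 2D joint positions" could look like in practice, the sketch below picks several local maxima from a per-joint score map, so that a later stage can choose among them. This is not the authors' implementation: the score-map representation, the function name candidate_peaks, the threshold and the toy data are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: one 2D grid of scores per joint.
Heatmap = List[List[float]]


def candidate_peaks(heatmap: Heatmap, threshold: float = 0.5,
                    max_candidates: int = 3) -> List[Tuple[int, int, float]]:
    """Return up to max_candidates local maxima as (row, col, score), best first."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Compare against the 8-connected neighbourhood, clipped at the border.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score > n for n in neighbours):
                peaks.append((r, c, score))
    # Strongest candidates first, then truncate.
    peaks.sort(key=lambda p: p[2], reverse=True)
    return peaks[:max_candidates]


# Toy score map with two plausible locations for the same joint (assumed data).
toy = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.7, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
]
candidates = candidate_peaks(toy)
```

Keeping more than one candidate per joint leaves the ambiguity (for example, left versus right limbs in a silhouette) to be resolved by a later selection step rather than by the detector alone.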



The authors would like to thank GlaxoSmithKline for sponsoring this work.

Supplementary material

Supplementary material 1 (pdf 1056 KB)

Supplementary material 2 (mp4 88929 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Engineering, University of Cambridge, Cambridge, UK
  2. Microsoft Research, Cambridge, UK
