Learning to disentangle latent physical factors of deformable faces

  • Original article
The Visual Computer

Abstract

We propose a monocular image disentanglement framework based on a compositional model. The model decomposes an input image into its constituent components: albedo, depth, deformation, pose, and illumination. Rather than relying on handcrafted priors, we train a deep neural network to learn the physical meaning of each component by mimicking real-world operations, which allows it to reconstruct images in a self-supervised manner. Trained on multi-frame images of each subject, the model gains a better understanding of the objects without requiring any supervision or strong model assumptions. A deformation-free canonical space aligns the multi-frame images, so that information from all frames of a subject can be interpreted in one common space. Our experiments show that the approach accurately disentangles the physical elements of deformable faces from in-the-wild images with wide variations.
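
The compositional idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' network or renderer: it assumes a simple Lambertian image formation model with one directional light, and it leaves out the pose and deformation components entirely. The function names (`normals_from_depth`, `shade`, `compose`, `reconstruction_loss`) and the `ambient`/`diffuse` weights are hypothetical, chosen only to show how an image could be recomposed from albedo, depth, and illumination and compared with the input.

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate unit surface normals from an (H, W) depth map by finite differences."""
    dz_dy, dz_dx = np.gradient(depth)  # derivatives along rows (y) and columns (x)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def shade(normals, light_dir, ambient=0.2, diffuse=0.8):
    """Lambertian shading: ambient term plus clamped cosine to the light direction."""
    light_dir = np.asarray(light_dir, dtype=float)
    light_dir = light_dir / np.linalg.norm(light_dir)
    lambert = np.clip(normals @ light_dir, 0.0, None)
    return ambient + diffuse * lambert

def compose(albedo, depth, light_dir):
    """Recompose an (H, W, 3) image from albedo, depth, and light (assumed model)."""
    shading = shade(normals_from_depth(depth), light_dir)
    return albedo * shading[..., None]

def reconstruction_loss(image, albedo, depth, light_dir):
    """Mean absolute error between the input image and its recomposition."""
    return float(np.mean(np.abs(image - compose(albedo, depth, light_dir))))

if __name__ == "__main__":
    albedo = np.full((4, 4, 3), 0.5)
    flat = np.zeros((4, 4))                    # a fronto-parallel plane
    image = compose(albedo, flat, [0.0, 0.0, 1.0])
    print(reconstruction_loss(image, albedo, flat, [0.0, 0.0, 1.0]))
```

In a self-supervised setting, a loss of this kind would be minimized with respect to the predicted components rather than evaluated on known ones; the example only shows the forward composition under the stated assumptions.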

Data availability

The data that support the findings of this study are openly available: VoxCeleb2 [9] at www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html and the Basel Face Model [37] at https://faces.dmi.unibas.ch/bfm.

References

  1. Abrevaya, V.F., Boukhayma, A., Torr, P.H., Boyer, E.: Cross-modal deep face normals with deactivable skip connections. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4979–4989 (2020)

  2. Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: International Conference on Computer Vision (2015)

  3. Barron, J.T., Malik, J.: Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell. 37(8), 1670–1687 (2015)

  4. Barrow, H.: Recovering intrinsic scene characteristics from images. Comput. Vis. Syst. pp. 3–26 (1978)

  5. Bell, S., Bala, K., Snavely, N.: Intrinsic images in the wild. ACM Trans. Graph. (2014). https://doi.org/10.1145/2601097.2601206

  6. Blanz, V., Basso, C., Poggio, T., Vetter, T.: Reanimating faces in images and video. Comput. Graph. Forum (2003). https://doi.org/10.1111/1467-8659.t01-1-00712

  7. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Annual Conference on Computer Graphics and Interactive Techniques (Proc. SIGGRAPH 1999), pp. 187–194 (1999)

  8. Burkov, E., Pasechnik, I., Grigorev, A., Lempitsky, V.: Neural head reenactment with latent pose descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition (2020)

  9. Chung, J.S., Nagrani, A., Zisserman, A.: VoxCeleb2: deep speaker recognition. In: INTERSPEECH (2018)

  10. Daněček, R., Black, M.J., Bolkart, T.: Emoca: Emotion driven monocular face capture and animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20311–20322 (2022)

  11. Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613 (2017)

  12. Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: Revisiting deep intrinsic image decompositions. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8944–8952 (2018). https://doi.org/10.1109/CVPR.2018.00932

  13. Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans. Graph. (ToG) 40(4), 1–13 (2021)

  14. Geiger, A., Ziegler, J., Stiller, C.: StereoScan: dense 3D reconstruction in real-time. In: IEEE Intelligent Vehicles Symposium (IV), pp. 963–968 (2011)

  15. Georgoulis, S., Rematas, K., Ritschel, T., Gavves, E., Fritz, M., Van Gool, L., Tuytelaars, T.: Reflectance and natural illumination from single-material specular objects using deep learning. IEEE Trans. Pattern Anal. Mach. Intell. 40(8), 1932–1947 (2018). https://doi.org/10.1109/TPAMI.2017.2742999

  16. Goel, S., Kanazawa, A., Malik, J.: Shape and viewpoint without keypoints. In: European Conference on Computer Vision (2020)

  17. Henderson, P., Ferrari, V.: Learning to generate and reconstruct 3D meshes with only 2D supervision. arXiv preprint arXiv:1807.09259 (2018)

  18. Horn, B.K.P.: Obtaining shape from shading information. In: Winston, P.H. (ed.) The Psychology of Computer Vision. McGraw-Hill (1975)

  19. Insafutdinov, E., Dosovitskiy, A.: Unsupervised learning of shape and pose with differentiable point clouds. In: Advances in Neural Information Processing Systems (2018)

  20. Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (eds.) Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XV, Lecture Notes in Computer Science, vol. 11219, pp. 386–402. Springer (2018). https://doi.org/10.1007/978-3-030-01267-0_23

  21. Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)

  22. Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)

  23. Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Niessner, M., Pérez, P., Richardt, C., Zollhöfer, M., Theobalt, C.: Deep video portraits. ACM Trans. Graph. (Proc. SIGGRAPH 2018) 37(4), 1–14 (2018)

  24. Kim, H., Zollhöfer, M., Tewari, A., Thies, J., Richardt, C., Theobalt, C.: Inversefacenet: Deep monocular inverse face rendering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  25. Kovacs, B., Bell, S., Snavely, N., Bala, K.: Shading annotations in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 850–859 (2017). https://doi.org/10.1109/CVPR.2017.97

  26. Liu, F., Liu, X.: 2D gans meet unsupervised single-view 3D reconstruction. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I, pp. 497–514. Springer (2022)

  27. Liu, S., Li, T., Chen, W., Li, H.: Soft rasterizer: a differentiable renderer for image-based 3D reasoning. In: The IEEE International Conference on Computer Vision (ICCV) (2019)

  28. Lombardi, S., Nishino, K.: Reflectance and illumination recovery in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 129–141 (2016). https://doi.org/10.1109/TPAMI.2015.2430318

  29. Meka, A., Haene, C., Pandey, R., Zollhoefer, M., Fanello, S., Fyffe, G., Kowdle, A., Yu, X., Busch, J., Dourgarian, J., Denny, P., Bouaziz, S., Lincoln, P., Whalen, M., Harvey, G., Taylor, J., Izadi, S., Tagliasacchi, A., Debevec, P., Theobalt, C., Valentin, J., Rhemann, C.: Deep reflectance fields—high-quality facial reflectance field inference from color gradient illumination. ACM Trans. Graph. (Proceedings SIGGRAPH) 38(4), 1–12 (2019). https://doi.org/10.1145/3306346.3323027

  30. Meka, A., Maximov, M., Zollhoefer, M., Chatterjee, A., Seidel, H.P., Richardt, C., Theobalt, C.: Lime: Live intrinsic material estimation. In: Proceedings of Computer Vision and Pattern Recognition (CVPR) (2018). http://gvv.mpi-inf.mpg.de/projects/LIME/

  31. Mobahi, H., Liu, C., Freeman, W.T.: A compositional model for low-dimensional image set representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)

  32. Nestmeyer, T., Lalonde, J.F., Matthews, I., Lehrmann, A.: Learning physics-guided face relighting under directional light. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  33. Novotny, D., Larlus, D., Vedaldi, A.: Learning 3D object categories by looking around them. In: International Conference on Computer Vision (2017)

  34. Ondrúška, P., Kohli, P., Izadi, S.: MobileFusion: real-time volumetric surface reconstruction and dense tracking on mobile phones. IEEE Trans. Vis. Comput. Graph. 21(11), 1251–1258 (2015)

  35. Pan, X., Dai, B., Liu, Z., Loy, C.C., Luo, P.: Do 2D Gans know 3D shape? unsupervised 3D shape reconstruction from 2D image Gans. In: International Conference on Learning Representations (2021)

  36. Pan, X., Dai, B., Liu, Z., Loy, C.C., Luo, P.: Do 2D gans know 3D shape? unsupervised 3d shape reconstruction from 2D image Gans. In: International Conference on Learning Representations (2021)

  37. Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301. IEEE (2009)

  38. Ramamoorthi, R., Hanrahan, P.: An efficient representation for irradiance environment maps. ACM Trans. Graph. (Proc. SIGGRAPH 2001) 20(3), 497–500 (2001)

  39. Sengupta, S., Kanazawa, A., Castillo, C.D., Jacobs, D.W.: SfSNet: Learning shape, reflectance and illuminance of faces ‘in the wild’. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6296–6305 (2018)

  40. Shang, J., Shen, T., Li, S., Zhou, L., Zhen, M., Fang, T., Quan, L.: Self-supervised monocular 3D face reconstruction by occlusion-aware multi-view geometry consistency. arXiv preprint arXiv:2007.12494 (2020)

  41. Shu, Z., Sahasrabudhe, M., Güler, R.A., Samaras, D., Paragios, N., Kokkinos, I.: Deforming autoencoders: unsupervised disentangling of shape and appearance. In: Proceedings of the European conference on computer vision, pp. 650–665 (2018)

  42. Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 5444–5453. IEEE Computer Society (2017). https://doi.org/10.1109/CVPR.2017.578

  43. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)

  44. Sun, T., Barron, J.T., Tsai, Y.T., Xu, Z., Yu, X., Fyffe, G., Rhemann, C., Busch, J., Debevec, P., Ramamoorthi, R.: Single image portrait relighting. ACM Trans. Graph. (2019). https://doi.org/10.1145/3306346.3323008

  45. Tewari, A., Bernard, F., Garrido, P., Bharaj, G., Elgharib, M., Seidel, H.P., Pérez, P., Zollhöfer, M., Theobalt, C.: Fml: Face model learning from videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10812–10822 (2019)

  46. Tewari, A., Zollhofer, M., Kim, H., Garrido, P., Bernard, F., Perez, P., Theobalt, C.: Mofa: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops (2017)

  47. Tran, L., Liu, X.: Nonlinear 3d face morphable model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  48. Tran, L., Liu, X.: Nonlinear 3D face morphable model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  49. Tran, L., Liu, X.: On learning 3D face morphable model from in-the-wild images. IEEE Trans. Pattern Anal. Mach. Intell. 43, 157–171 (2019)

  50. Tran, L., Liu, X.: On learning 3D face morphable model from in-the-wild images. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 157–171 (2021). https://doi.org/10.1109/TPAMI.2019.2927975

  51. Tulsiani, S., Efros, A.A., Malik, J.: Multi-view consistency as supervisory signal for learning shape and pose prediction. In: Computer Vision and Pattern Recognition (CVPR) (2018)

  52. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: Computer Vision and Pattern Recognition (CVPR) (2017)

  53. Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: DeMoN: Depth and motion network for learning monocular stereo. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 5038–5047 (2017)

  54. Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

  55. Wen, Y., Liu, W., Raj, B., Singh, R.: Self-supervised 3D face reconstruction via conditional estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13289–13298 (2021)

  56. Wiles, O., Gkioxari, G., Szeliski, R., Johnson, J.: Synsin: end-to-end view synthesis from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7467–7477 (2020)

  57. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Opt. Eng. 19(1), 139–144 (1980)

  58. Wu, S., Rupprecht, C., Vedaldi, A.: Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10 (2020)

  59. Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In: Advances in Neural Information Processing systems, pp. 1696–1704 (2016)

  60. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: International Conference on Computer Vision (2019)

  61. Zhang, K., Zhang, Z., Li, Z., Yu, Q.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)

  62. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from shading: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 21(8), 690–706 (1999). https://doi.org/10.1109/34.784284

  63. Zhang, Z., Ge, Y., Tai, Y., Cao, W., Chen, R., Liu, K., Tang, H., Huang, X., Wang, C., Xie, Z., et al.: Physically-guided disentangled implicit rendering for 3D face modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20353–20363 (2022)

  64. Zhou, H., Hadap, S., Sunkavalli, K., Jacobs, D.W.: Deep single-image portrait relighting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  65. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: IEEE Conference On Computer Vision And Pattern Recognition (2017)

Author information

Corresponding author

Correspondence to Sung-eui Yoon.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ha, I., Chang, H.S., Son, M. et al. Learning to disentangle latent physical factors of deformable faces. Vis Comput 39, 3481–3494 (2023). https://doi.org/10.1007/s00371-023-02948-1

  • DOI: https://doi.org/10.1007/s00371-023-02948-1