Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition

Abstract

Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world. Concurrently, image completion has aimed to create plausible appearance for invisible regions, but requires a manual mask as input. In this work, we propose a higher-level scene understanding system that tackles both the visible and invisible parts of objects and backgrounds in a given scene. In particular, we build a system that decomposes a scene into individual objects, infers their underlying occlusion relationships, and automatically learns which parts of the objects are occluded and need to be completed. To disentangle the occlusion relationships of all objects in a complex scene, we use the fact that an unoccluded front object is easy to identify, detect, and segment. Our system interleaves the two tasks of instance segmentation and scene completion over multiple iterations, solving for objects layer by layer. We first provide a thorough experiment using a new, realistically rendered dataset with ground truths for all invisible regions. To bridge the domain gap to real imagery, where ground truths are unavailable, we then train another model with pseudo-ground-truths generated by our model trained on the synthetic data. We demonstrate results on a wide variety of datasets and show significant improvement over the state of the art.
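
The layer-by-layer idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `decompose`, `segment_front`, and `complete` are hypothetical names, and the two callables stand in for whatever instance segmentation and completion models are used. The sketch only shows the control flow of interleaving the two tasks, assuming the segmenter returns masks for objects that are currently unoccluded.

```python
from typing import Callable, List, Tuple

import numpy as np

# Hypothetical interfaces (assumptions for illustration only):
#   Segmenter: image -> list of boolean masks of currently unoccluded objects
#   Completer: (image, hole mask) -> image with the hole pixels filled in
Segmenter = Callable[[np.ndarray], List[np.ndarray]]
Completer = Callable[[np.ndarray, np.ndarray], np.ndarray]


def decompose(
    image: np.ndarray,
    segment_front: Segmenter,
    complete: Completer,
    max_layers: int = 10,
) -> Tuple[List[List[np.ndarray]], np.ndarray]:
    """Peel a scene into layers, front to back.

    Each iteration segments the unoccluded objects, records them as one
    layer, removes them, and asks the completer to fill the resulting hole
    so that the next layer of objects becomes fully visible.
    """
    layers: List[List[np.ndarray]] = []
    current = image.copy()
    for _ in range(max_layers):
        masks = segment_front(current)
        if not masks:
            break  # nothing left but background
        layers.append(masks)
        # Union of all masks in this layer is the region to be completed.
        hole = np.logical_or.reduce(masks)
        current = complete(current, hole)
    return layers, current
```

The index of the layer in which an object is found gives a coarse front-to-back order, and the image returned at the end is the completed background under these assumptions.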

Notes

  1. We used the heuristic from PCNet (Zhan et al. 2020): larger masks are ordered in front for KINS, and behind for COCOA and CSD.

  2. See details in Zhan et al. (2020), where the visible ground-truth masks \(\text{V}_{gt}\) are used for ordering.
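
The area heuristic in Note 1 can be sketched as follows. This is a simplified illustration, not the PCNet code: the function name `order_by_area` and the lookup table are hypothetical, the sort is global over all masks, and the original heuristic may be applied differently (for example, per pair of neighboring instances).

```python
from typing import Dict, List

import numpy as np

# Per Note 1: larger masks go in front for KINS, behind for COCOA and CSD.
LARGER_IN_FRONT: Dict[str, bool] = {"KINS": True, "COCOA": False, "CSD": False}


def order_by_area(masks: List[np.ndarray], dataset: str) -> List[int]:
    """Return mask indices sorted front to back by the area heuristic.

    Assumption: a single global sort by pixel count stands in for the
    ordering procedure; which masks are measured is left to the caller.
    """
    areas = [int(np.count_nonzero(m)) for m in masks]
    larger_first = LARGER_IN_FRONT[dataset]
    return sorted(range(len(masks)), key=lambda i: areas[i], reverse=larger_first)
```

With three masks of areas 1, 4, and 2 pixels, the sketch places the 4-pixel mask first for KINS and last for COCOA or CSD.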

References

  1. Amer, M. R., Yousefi, S., Raich, R., & Todorovic, A. (2015). Monocular extraction of 2.1D sketch using constrained convex optimization. International Journal of Computer Vision, 112(1), 23–42.

  2. Autodesk (2019) Autodesk Maya. https://www.autodesk.com/products/maya/overview

  3. Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12), 2481–2495.

  4. Burgess, C.P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., Lerchner, A. (2019) MONet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390

  5. Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al. (2019) Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4974–4983

  6. Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  7. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3213–3223

  8. Dai, J., He, K., & Sun, J. (2016) Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3150–3158

  9. Dhamo, H., Navab, N., & Tombari, F. (2019) Object-driven multi-layer scene decomposition from a single image. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)

  10. Dinh, L., Krueger, D., Bengio, Y. (2014) Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516

  11. Dinh, L., Sohl-Dickstein, J., Bengio, S. (2017) Density estimation using real nvp. In: International Conference on Learning Representations

  12. Ehsani, K., Mottaghi, R., Farhadi, A. (2018) SeGAN: Segmenting and generating the invisible. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6144–6153

  13. Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

  14. Follmann, P., König, R., Härtinger, P., Klostermann, M., Böttger, T. (2019) Learning to see the invisible: End-to-end trainable amodal instance segmentation. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp 1328–1336

  15. Gao, R.X., Wu, T.F., Zhu, S.C., Sang, N. (2007) Bayesian inference for layer representation with mixed markov random field. In: International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, pp 213–224

  16. Geiger, A., Lenz, P., Urtasun, R. (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 3354–3361

  17. Girshick, R. (2015) Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1440–1448

  18. Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 580–587

  19. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp 2672–2680

  20. Gould, S., Fulton, R., Koller, D. (2009) Decomposing a scene into geometric and semantically consistent regions. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1–8

  21. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C. (2017) Improved training of wasserstein gans. In: Advances in Neural Information Processing Systems, pp 5767–5777

  22. Guo, R., Hoiem, D. (2012) Beyond the line of sight: labeling the underlying surfaces. In: European Conference on Computer Vision, Springer, pp 761–774

  23. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

  24. He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017) Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2961–2969

  25. Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

  26. Hu, Y.T., Chen, H.S., Hui, K., Huang, J.B., Schwing, A.G. (2019) Sail-vos: Semantic amodal instance level video object segmentation-a synthetic dataset and baselines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3105–3115

  27. Iizuka, S., Simo-Serra, E., & Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4), 107.

  28. Johnson, J., Alahi, A., Fei-Fei, L. (2016) Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision, pp 694–711

  29. Kar, A., Tulsiani, S., Carreira, J., Malik, J. (2015) Amodal completion and size constancy in natural scenes. In: Proceedings of the IEEE International Conference on Computer Vision, pp 127–135

  30. Karras, T., Laine, S., Aila, T. (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4401–4410

  31. Kingma, D.P., Dhariwal, P. (2018) Glow: Generative flow with invertible 1x1 convolutions. In: Advances in Neural Information Processing Systems, pp 10215–10224

  32. Kingma, D.P., Welling, M. (2014) Auto-encoding variational bayes. In: Proceedings of the International Conference on Learning Representations (ICLR)

  33. Li, K., Malik, J. (2016) Amodal instance segmentation. In: Proceedings of the European Conference on Computer Vision, Springer, pp 677–693

  34. Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y. (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2359–2367

  35. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: Common objects in context. In: European conference on computer vision, Springer, pp 740–755

  36. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2117–2125

  37. Ling, H., Acuna, D., Kreis, K., Kim, S.W., Fidler, S. (2020) Variational amodal object completion. Advances in Neural Information Processing Systems 33

  38. Liu, C., Kohli, P., Furukawa, Y. (2016) Layered scene decomposition via the Occlusion-CRF. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 165–173

  39. Long, J., Shelhamer, E., Darrell, T. (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3431–3440

  40. Nitzberg, M., Mumford, D. (1990) The 2.1-D sketch. In: ICCV, pp 138–144

  41. Mirza, M., Osindero, S. (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784

  42. Silberman, N., Hoiem, D., Kohli, P., Fergus, R. (2012) Indoor segmentation and support inference from RGBD images. In: Proceedings of the European Conference on Computer Vision

  43. Nitzberg, M., Mumford, D., & Shiota, T. (1993). Filtering, segmentation and depth (Vol. 662). Springer.

  44. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A. (2016) Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2536–2544

  45. Pinheiro, P.O., Collobert, R., Dollár, P. (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998

  46. Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P. (2016) Learning to refine object segments. In: Proceedings of the European Conference on Computer Vision, Springer, pp 75–91

  47. Qi, L., Jiang, L., Liu, S., Shen, X., Jia, J. (2019) Amodal instance segmentation with kins dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3014–3023

  48. Ren, S., He, K., Girshick, R., Sun, J. (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp 91–99

  49. Shade, J., Gortler, S., He, L.W., Szeliski, R. (1998) Layered depth images. In: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp 231–242

  50. Silberman, N., Hoiem, D., Kohli, P., Fergus, R. (2012) Indoor segmentation and support inference from rgbd images. In: European conference on computer vision, Springer, pp 746–760

  51. Simonyan, K., Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  52. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T. (2017) Semantic scene completion from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1746–1754

  53. Sun, D., Sudderth, E.B., Black, M.J. (2010) Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In: Advances in Neural Information Processing Systems, pp 2226–2234

  54. Tighe, J., Niethammer, M., Lazebnik, S. (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3748–3755

  55. Vahdat, A., Kautz, J. (2020) NVAE: A deep hierarchical variational autoencoder. In: Neural Information Processing Systems (NeurIPS)

  56. Van Den Oord, A., Vinyals, O., et al. (2017) Neural discrete representation learning. In: Advances in Neural Information Processing Systems, pp 6306–6315

  57. Winn, J., Shotton, J. (2006) The layout consistent random field for recognizing and segmenting partially occluded objects. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), IEEE, vol 1, pp 37–44

  58. Yan, X., Wang, F., Liu, W., Yu, Y., He, S., Pan, J. (2019) Visualizing the invisible: Occluded vehicle segmentation and recovery. In: Proceedings of the IEEE International Conference on Computer Vision, pp 7618–7627

  59. Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., Li, H. (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol 1, p 3

  60. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C. (2010) Layered object detection for multi-class segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3113–3120

  61. Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. C. (2011). Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1731–1743.

  62. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S. (2018) Generative image inpainting with contextual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5505–5514

  63. Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S. (2018) Taskonomy: Disentangling task transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3712–3722

  64. Zhan, X., Pan, X., Dai, B., Liu, Z., Lin, D., Loy, C.C. (2020) Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3784–3792

  65. Zhang, Z., Schwing, A.G., Fidler, S., Urtasun, R. (2015) Monocular object instance segmentation and depth ordering with cnns. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2614–2622

  66. Zheng, C., Cham, T.J., Cai, J. (2019) Pluralistic image completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1438–1447

  67. Zhu, Y., Tian, Y., Metaxas, D., Dollár, P. (2017) Semantic amodal segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1464–1472

Acknowledgements

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU). This research is also supported by the Monash FIT Start-up Grant.

Author information

Corresponding author

Correspondence to Chuanxia Zheng.

Additional information

Communicated by S.-C. Zhu.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1259 KB)

About this article

Cite this article

Zheng, C., Dao, DS., Song, G. et al. Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition. Int J Comput Vis (2021). https://doi.org/10.1007/s11263-021-01517-0

Keywords

  • Layered scene decomposition
  • Scene completion
  • Amodal instance segmentation
  • Instance depth order
  • Scene recomposition