
Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition

Abstract

Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world. Concurrently, image completion has aimed to create plausible appearance for invisible regions, but requires a manual mask as input. In this work, we propose a higher-level scene understanding system that tackles both the visible and invisible parts of objects and backgrounds in a given scene. In particular, we build a system that decomposes a scene into individual objects, infers their underlying occlusion relationships, and automatically learns which occluded parts of the objects need to be completed. To disentangle the occlusion relationships of all objects in a complex scene, we use the fact that an unoccluded front object is easy to identify, detect, and segment. Our system interleaves the two tasks of instance segmentation and scene completion through multiple iterations, solving for objects layer by layer. We first provide a thorough experiment using a new realistically rendered dataset with ground truths for all invisible regions. To bridge the domain gap to real imagery, where ground truths are unavailable, we then train another model with pseudo-ground-truths generated from our trained synthesis model. We demonstrate results on a wide variety of datasets and show significant improvement over the state of the art.
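The layer-by-layer idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the learned instance segmentation and scene completion networks are replaced by an oracle over a synthetic scene with known full masks and depths, and all names (`rect`, `render`, `decompose`, the scene contents) are hypothetical. In each iteration, the objects that are fully visible are peeled off, and "completion" is simulated by re-rendering the scene without them.

```python
def rect(x0, y0, x1, y1):
    """Full (amodal) mask of an axis-aligned rectangle as a set of pixels."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def render(objects):
    """Return, per object, the set of pixels where it is the nearest surface."""
    top = {}
    for name, (mask, depth) in objects.items():
        for px in mask:
            if px not in top or depth < objects[top[px]][1]:
                top[px] = name
    visible = {name: set() for name in objects}
    for px, name in top.items():
        visible[name].add(px)
    return visible


def decompose(scene):
    """Peel off fully visible objects one layer at a time (toy oracle version)."""
    remaining = dict(scene)
    layers = []
    while remaining:
        visible = render(remaining)
        # Stand-in for the segmentation step: an object counts as "front"
        # when its visible region equals its full mask.
        front = [n for n, (mask, _) in remaining.items() if visible[n] == mask]
        if not front:
            break  # no fully visible object left (cannot happen with a strict depth order)
        layers.append(sorted(front))
        # Stand-in for the completion step: removing the front objects
        # reveals whatever they were hiding in the next iteration.
        for n in front:
            del remaining[n]
    return layers


# Hypothetical scene: name -> (full mask, depth); smaller depth is closer.
scene = {
    "cup": (rect(0, 0, 2, 2), 0),
    "lamp": (rect(4, 4, 6, 6), 0),
    "book": (rect(1, 1, 4, 4), 1),
    "table": (rect(0, 0, 6, 6), 2),
}
layers = decompose(scene)
print(layers)  # [['cup', 'lamp'], ['book'], ['table']]
```

In the real system the full masks are unknown and must be predicted, so the "fully visible" test and the revealed content both come from learned models rather than from ground truth as in this sketch.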


Notes

  1. We used the heuristic in PCNet (Zhan et al. 2020): larger masks are ordered in front for KINS, and behind for COCOA and CSD.

  2. See Zhan et al. (2020) for details, where the visible ground-truth masks \(\text{V}_{gt}\) are used for ordering.
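The area heuristic in Note 1 can be sketched as follows. This is a simplified illustration, not the PCNet code: it sorts all instances globally by mask area, and the names `LARGER_IN_FRONT`, `order_by_area`, and the example areas are assumptions made for this sketch.

```python
# Per-dataset rule from Note 1: is the larger mask assumed to be in front?
LARGER_IN_FRONT = {"KINS": True, "COCOA": False, "CSD": False}


def order_by_area(mask_areas, dataset):
    """Return instance ids from front to back using the area heuristic.

    mask_areas maps an instance id to its mask area in pixels (hypothetical input).
    """
    larger_first = LARGER_IN_FRONT[dataset]
    return sorted(mask_areas, key=mask_areas.get, reverse=larger_first)


# Hypothetical mask areas in pixels.
areas = {"a": 120, "b": 45, "c": 300}
print(order_by_area(areas, "KINS"))   # ['c', 'a', 'b']
print(order_by_area(areas, "COCOA"))  # ['b', 'a', 'c']
```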

References

  • Amer, M. R., Yousefi, S., Raich, R., & Todorovic, A. (2015). Monocular extraction of 2.1 d sketch using constrained convex optimization. International Journal of Computer Vision, 112(1), 23–42.

  • Autodesk (2019) Autodesk Maya. https://www.autodesk.com/products/maya/overview

  • Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12), 2481–2495.

  • Burgess, C.P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., Lerchner, A. (2019) MONet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390

  • Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al. (2019) Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4974–4983

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3213–3223

  • Dai, J., He, K., & Sun, J. (2016) Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3150–3158

  • Dhamo, H., Navab, N., & Tombari, F. (2019) Object-driven multi-layer scene decomposition from a single image. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)

  • Dinh, L., Krueger, D., Bengio, Y. (2014) Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516

  • Dinh, L., Sohl-Dickstein, J., Bengio, S. (2017) Density estimation using real nvp. In: International Conference on Learning Representations

  • Ehsani, K., Mottaghi, R., Farhadi, A. (2018) SeGAN: Segmenting and generating the invisible. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6144–6153

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.


  • Follmann, P., König, R., Härtinger, P., Klostermann, M., Böttger, T. (2019) Learning to see the invisible: End-to-end trainable amodal instance segmentation. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp 1328–1336

  • Gao, R.X., Wu, T.F., Zhu, S.C., Sang, N. (2007) Bayesian inference for layer representation with mixed markov random field. In: International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, pp 213–224

  • Geiger, A., Lenz, P., Urtasun, R. (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 3354–3361

  • Girshick, R. (2015) Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1440–1448

  • Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 580–587

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp 2672–2680

  • Gould, S., Fulton, R., Koller, D. (2009) Decomposing a scene into geometric and semantically consistent regions. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1–8

  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C. (2017) Improved training of wasserstein gans. In: Advances in neural information processing systems, pp 5767–5777

  • Guo, R., Hoiem, D. (2012) Beyond the line of sight: labeling the underlying surfaces. In: European Conference on Computer Vision, Springer, pp 761–774

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.


  • He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017) Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2961–2969

  • Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.

  • Hu, Y.T., Chen, H.S., Hui, K., Huang, J.B., Schwing, A.G. (2019) Sail-vos: Semantic amodal instance level video object segmentation-a synthetic dataset and baselines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3105–3115

  • Iizuka, S., Simo-Serra, E., & Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4), 107.

  • Johnson, J., Alahi, A., Fei-Fei, L. (2016) Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision, pp 694–711

  • Kar, A., Tulsiani, S., Carreira, J., Malik, J. (2015) Amodal completion and size constancy in natural scenes. In: Proceedings of the IEEE International Conference on Computer Vision, pp 127–135

  • Karras, T., Laine, S., Aila, T. (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4401–4410

  • Kingma, D.P., Dhariwal, P. (2018) Glow: Generative flow with invertible 1x1 convolutions. In: Advances in neural information processing systems, pp 10215–10224

  • Kingma, D.P., Welling, M. (2014) Auto-encoding variational bayes. In: Proceedings of the International Conference on Learning Representations (ICLR)

  • Li, K., Malik, J. (2016) Amodal instance segmentation. In: Proceedings of the European Conference on Computer Vision, Springer, pp 677–693

  • Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y. (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2359–2367

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: Common objects in context. In: European conference on computer vision, Springer, pp 740–755

  • Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2117–2125

  • Ling, H., Acuna, D., Kreis, K., Kim, S.W., Fidler, S. (2020) Variational amodal object completion. Advances in Neural Information Processing Systems 33

  • Liu, C., Kohli, P., Furukawa, Y. (2016) Layered scene decomposition via the Occlusion-CRF. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 165–173

  • Long, J., Shelhamer, E., Darrell, T. (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3431–3440

  • Nitzberg, M., Mumford, D. (1990) The 2.1-d sketch. In: ICCV, pp 138–144

  • Mirza, M., Osindero, S. (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784

  • Silberman, N., Hoiem, D., Kohli, P., Fergus, R. (2012) Indoor segmentation and support inference from RGBD images. In: Proceedings of the European Conference on Computer Vision

  • Nitzberg, M., Mumford, D., & Shiota, T. (1993). Filtering, segmentation and depth, (Vol. 662). Springer.

  • Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A. (2016) Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2536–2544

  • Pinheiro, P.O., Collobert, R., Dollár, P. (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998

  • Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P. (2016) Learning to refine object segments. In: Proceedings of the European Conference on Computer Vision, Springer, pp 75–91

  • Qi, L., Jiang, L., Liu, S., Shen, X., Jia, J. (2019) Amodal instance segmentation with kins dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3014–3023

  • Ren, S., He, K., Girshick, R., Sun, J. (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp 91–99

  • Shade, J., Gortler, S., He, Lw., Szeliski, R. (1998) Layered depth images. In: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp 231–242

  • Silberman, N., Hoiem, D., Kohli, P., Fergus, R. (2012) Indoor segmentation and support inference from rgbd images. In: European conference on computer vision, Springer, pp 746–760

  • Simonyan, K., Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  • Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T. (2017) Semantic scene completion from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1746–1754

  • Sun, D., Sudderth, E.B., Black, M.J. (2010) Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In: Advances in Neural Information Processing Systems, pp 2226–2234

  • Tighe, J., Niethammer, M., Lazebnik, S. (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3748–3755

  • Vahdat, A., Kautz, J. (2020) NVAE: A deep hierarchical variational autoencoder. In: Neural Information Processing Systems (NeurIPS)

  • Van Den Oord, A., Vinyals, O., et al. (2017) Neural discrete representation learning. In: Advances in Neural Information Processing Systems, pp 6306–6315

  • Winn, J., Shotton, J. (2006) The layout consistent random field for recognizing and segmenting partially occluded objects. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), IEEE, vol 1, pp 37–44

  • Yan, X., Wang, F., Liu, W., Yu, Y., He, S., Pan, J. (2019) Visualizing the invisible: Occluded vehicle segmentation and recovery. In: Proceedings of the IEEE International Conference on Computer Vision, pp 7618–7627

  • Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., Li, H. (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol 1, p 3

  • Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C. (2010) Layered object detection for multi-class segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3113–3120

  • Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. C. (2011). Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1731–1743.


  • Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S. (2018) Generative image inpainting with contextual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5505–5514

  • Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S. (2018) Taskonomy: Disentangling task transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3712–3722

  • Zhan, X., Pan, X., Dai, B., Liu, Z., Lin, D., Loy, C.C. (2020) Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3784–3792

  • Zhang, Z., Schwing, A.G., Fidler, S., Urtasun, R. (2015) Monocular object instance segmentation and depth ordering with cnns. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2614–2622

  • Zheng, C., Cham, T.J., Cai, J. (2019) Pluralistic image completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1438–1447

  • Zhu, Y., Tian, Y., Metaxas, D., Dollár, P. (2017) Semantic amodal segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1464–1472


Acknowledgements

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU). This research is also supported by the Monash FIT Start-up Grant.

Author information

Corresponding author

Correspondence to Chuanxia Zheng.

Additional information

Communicated by S.-C. Zhu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zheng, C., Dao, DS., Song, G. et al. Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition. Int J Comput Vis 129, 3195–3215 (2021). https://doi.org/10.1007/s11263-021-01517-0

  • DOI: https://doi.org/10.1007/s11263-021-01517-0

Keywords

  • Layered scene decomposition
  • Scene completion
  • Amodal instance segmentation
  • Instance depth order
  • Scene recomposition