Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition

Zheng, Chuanxia; Dao, Duy-Son; Song, Guoxian; Cham, Tat-Jen; Cai, Jianfei

doi:10.1007/s11263-021-01517-0

Published: 28 September 2021

International Journal of Computer Vision, Volume 129, pages 3195–3215 (2021)

Chuanxia Zheng ORCID: orcid.org/0000-0002-3584-9640¹,
Duy-Son Dao²,
Guoxian Song¹,
Tat-Jen Cham¹ &
Jianfei Cai²

Abstract

Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world. Concurrently, image completion has aimed to create plausible appearance for invisible regions, but requires a manual mask as input. In this work, we propose a higher-level scene understanding system that tackles both the visible and invisible parts of objects and backgrounds in a given scene. In particular, we built a system that decomposes a scene into individual objects, infers their underlying occlusion relationships, and automatically learns which parts of the objects are occluded and need to be completed. To disentangle the occlusion relationships of all objects in a complex scene, we exploit the fact that an unoccluded front object is easy to identify, detect, and segment. Our system interleaves the two tasks of instance segmentation and scene completion over multiple iterations, solving for objects layer by layer. We first provide a thorough experiment using a new realistically rendered dataset with ground truths for all invisible regions. To bridge the domain gap to real imagery, where ground truths are unavailable, we then train another model with pseudo-ground-truths generated by our trained synthesis model. We demonstrate results on a wide variety of datasets and show significant improvement over the state of the art.
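The layer-by-layer idea in the abstract can be illustrated with a toy sketch. This is not the authors' network: it assumes the complete (amodal) masks and depths of all objects are known, and uses them as an oracle for both steps. The "segmentation" step picks the objects that are fully visible in the current composite, and the "completion" step simply re-renders the scene without them. The names `render` and `decompose`, and the toy scene, are hypothetical.

```python
import numpy as np

def render(full_masks, depth, names, shape):
    """Composite the named objects into a label map (-1 is background)."""
    labels = np.full(shape, -1, dtype=int)
    # Paint far-to-near so that nearer objects overwrite farther ones.
    for idx in sorted(names, key=lambda n: depth[n], reverse=True):
        labels[full_masks[idx]] = idx
    return labels

def decompose(full_masks, depth, shape):
    """Peel the scene into layers of fully visible objects, front layer first."""
    remaining = set(full_masks)
    layers = []
    while remaining:
        labels = render(full_masks, depth, remaining, shape)
        # Oracle "segmentation": an object is in the front layer if every
        # pixel of its complete mask shows its own label in the composite.
        # The nearest remaining object always qualifies, so the loop ends.
        front = sorted(i for i in remaining
                       if np.all(labels[full_masks[i]] == i))
        layers.append(front)
        # Oracle "completion": dropping the front layer and re-rendering
        # reveals whatever those objects were hiding.
        remaining -= set(front)
    return layers

# Toy scene: object 0 overlaps object 1 at one pixel; object 2 is far away
# but touches nothing, so it is also fully visible in the first pass.
shape = (6, 6)
full_masks = {i: np.zeros(shape, dtype=bool) for i in range(3)}
full_masks[0][0:3, 0:3] = True
full_masks[1][2:5, 2:5] = True
full_masks[2][5:6, 0:2] = True
depth = {0: 0, 1: 1, 2: 2}  # smaller value means closer to the camera

layers = decompose(full_masks, depth, shape)
print(layers)  # [[0, 2], [1]]
```

In this toy scene the first pass selects objects 0 and 2, and object 1 becomes fully visible only after object 0 is removed, which mirrors the interleaving of segmentation and completion described above.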

Notes

We used the heuristic in PCNet (Zhan et al. 2020): larger masks are ordered in front for KINS, and behind for COCOA and CSD.
See details in (Zhan et al. 2020), where the visible ground-truth masks \(\text{V}_{gt}\) are used for ordering.
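A minimal sketch of the area heuristic described in these notes, applied to one pair of visible masks. It is an illustration under stated assumptions rather than the PCNet code: the function name `a_occludes_b`, the lookup table `LARGER_IN_FRONT`, and the handling of equal areas are all hypothetical.

```python
import numpy as np

# Per-dataset convention taken from the note above; the table itself is
# an illustrative construct, not part of any library.
LARGER_IN_FRONT = {"KINS": True, "COCOA": False, "CSD": False}

def a_occludes_b(visible_a, visible_b, dataset):
    """Guess whether object A is in front of object B from visible mask areas."""
    area_a = int(np.count_nonzero(visible_a))
    area_b = int(np.count_nonzero(visible_b))
    if area_a == area_b:
        return False  # assumption: ties are left undecided
    a_is_larger = area_a > area_b
    # A is in front when its relative size matches the dataset convention.
    return a_is_larger == LARGER_IN_FRONT[dataset]

# Two toy visible masks: a 4x4 block (16 pixels) and a 2x2 block (4 pixels).
big = np.ones((4, 4), dtype=bool)
small = np.zeros((4, 4), dtype=bool)
small[:2, :2] = True

print(a_occludes_b(big, small, "KINS"))   # True
print(a_occludes_b(big, small, "COCOA"))  # False
print(a_occludes_b(small, big, "CSD"))    # True
```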

References

Amer, M. R., Yousefi, S., Raich, R., & Todorovic, S. (2015). Monocular extraction of 2.1D sketch using constrained convex optimization. International Journal of Computer Vision, 112(1), 23–42.
Autodesk (2019) Autodesk Maya. https://www.autodesk.com/products/maya/overview
Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.
Burgess, C.P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., Lerchner, A. (2019) MONet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390
Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al. (2019) Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4974–4983
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3213–3223
Dai, J., He, K., & Sun, J. (2016) Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3150–3158
Dhamo, H., Navab, N., & Tombari, F. (2019) Object-driven multi-layer scene decomposition from a single image. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)
Dinh, L., Krueger, D., Bengio, Y. (2014) Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516
Dinh, L., Sohl-Dickstein, J., Bengio, S. (2017) Density estimation using real nvp. In: International Conference on Learning Representations
Ehsani, K., Mottaghi, R., Farhadi, A. (2018) SeGAN: Segmenting and generating the invisible. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6144–6153
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Follmann, P., König, R., Härtinger, P., Klostermann, M., Böttger, T. (2019) Learning to see the invisible: End-to-end trainable amodal instance segmentation. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp 1328–1336
Gao, R.X., Wu, T.F., Zhu, S.C., Sang, N. (2007) Bayesian inference for layer representation with mixed markov random field. In: International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, pp 213–224
Geiger, A., Lenz, P., Urtasun, R. (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 3354–3361
Girshick, R. (2015) Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1440–1448
Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 580–587
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp 2672–2680
Gould, S., Fulton, R., Koller, D. (2009) Decomposing a scene into geometric and semantically consistent regions. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1–8
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C. (2017) Improved training of wasserstein gans. In: Advances in Neural Information Processing Systems, pp 5767–5777
Guo, R., Hoiem, D. (2012) Beyond the line of sight: labeling the underlying surfaces. In: European Conference on Computer Vision, Springer, pp 761–774
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.
He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017) Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2961–2969
Hoiem, D., Efros, A. A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3), 328–346.
Hu, Y.T., Chen, H.S., Hui, K., Huang, J.B., Schwing, A.G. (2019) Sail-vos: Semantic amodal instance level video object segmentation-a synthetic dataset and baselines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3105–3115
Iizuka, S., Simo-Serra, E., & Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4), 107.
Johnson, J., Alahi, A., Fei-Fei, L. (2016) Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision, pp 694–711
Kar, A., Tulsiani, S., Carreira, J., Malik, J. (2015) Amodal completion and size constancy in natural scenes. In: Proceedings of the IEEE International Conference on Computer Vision, pp 127–135
Karras, T., Laine, S., Aila, T. (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4401–4410
Kingma, D.P., Dhariwal, P. (2018) Glow: Generative flow with invertible 1x1 convolutions. In: Advances in Neural Information Processing Systems, pp 10215–10224
Kingma, D.P., Welling, M. (2014) Auto-encoding variational bayes. In: Proceedings of the International Conference on Learning Representations (ICLR)
Li, K., Malik, J. (2016) Amodal instance segmentation. In: Proceedings of the European Conference on Computer Vision, Springer, pp 677–693
Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y. (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2359–2367
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: Common objects in context. In: European conference on computer vision, Springer, pp 740–755
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Ling, H., Acuna, D., Kreis, K., Kim, S.W., Fidler, S. (2020) Variational amodal object completion. Advances in Neural Information Processing Systems 33
Liu, C., Kohli, P., Furukawa, Y. (2016) Layered scene decomposition via the Occlusion-CRF. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 165–173
Long, J., Shelhamer, E., Darrell, T. (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3431–3440
Nitzberg, M., Mumford, D. (1990) The 2.1-D sketch. In: ICCV, pp 138–144
Mirza, M., Osindero, S. (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
Silberman, N., Kohli, P., Hoiem, D., Fergus, R. (2012) Indoor segmentation and support inference from RGBD images. In: Proceedings of the European Conference on Computer Vision
Nitzberg, M., Mumford, D., & Shiota, T. (1993). Filtering, segmentation and depth, (Vol. 662). Springer.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A. (2016) Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2536–2544
Pinheiro, P.O., Collobert, R., Dollár, P. (2015) Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp 1990–1998
Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P. (2016) Learning to refine object segments. In: Proceedings of the European Conference on Computer Vision, Springer, pp 75–91
Qi, L., Jiang, L., Liu, S., Shen, X., Jia, J. (2019) Amodal instance segmentation with kins dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3014–3023
Ren, S., He, K., Girshick, R., Sun, J. (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp 91–99
Shade, J., Gortler, S., He, L.W., Szeliski, R. (1998) Layered depth images. In: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp 231–242
Silberman, N., Hoiem, D., Kohli, P., Fergus, R. (2012) Indoor segmentation and support inference from rgbd images. In: European conference on computer vision, Springer, pp 746–760
Simonyan, K., Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T. (2017) Semantic scene completion from a single depth image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1746–1754
Sun, D., Sudderth, E.B., Black, M.J. (2010) Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In: Advances in Neural Information Processing Systems, pp 2226–2234
Tighe, J., Niethammer, M., Lazebnik, S. (2014) Scene parsing with object instances and occlusion ordering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3748–3755
Vahdat, A., Kautz, J. (2020) NVAE: A deep hierarchical variational autoencoder. In: Neural Information Processing Systems (NeurIPS)
Van Den Oord, A., Vinyals, O., et al. (2017) Neural discrete representation learning. In: Advances in Neural Information Processing Systems, pp 6306–6315
Winn, J., Shotton, J. (2006) The layout consistent random field for recognizing and segmenting partially occluded objects. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), IEEE, vol 1, pp 37–44
Yan, X., Wang, F., Liu, W., Yu, Y., He, S., Pan, J. (2019) Visualizing the invisible: Occluded vehicle segmentation and recovery. In: Proceedings of the IEEE International Conference on Computer Vision, pp 7618–7627
Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., Li, H. (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol 1, p 3
Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C. (2010) Layered object detection for multi-class segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3113–3120
Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. C. (2011). Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1731–1743.
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S. (2018) Generative image inpainting with contextual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5505–5514
Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S. (2018) Taskonomy: Disentangling task transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3712–3722
Zhan, X., Pan, X., Dai, B., Liu, Z., Lin, D., Loy, C.C. (2020) Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3784–3792
Zhang, Z., Schwing, A.G., Fidler, S., Urtasun, R. (2015) Monocular object instance segmentation and depth ordering with cnns. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2614–2622
Zheng, C., Cham, T.J., Cai, J. (2019) Pluralistic image completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1438–1447
Zhu, Y., Tian, Y., Metaxas, D., Dollár, P. (2017) Semantic amodal segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1464–1472

Acknowledgements

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU). This research is also supported by the Monash FIT Start-up Grant.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Chuanxia Zheng, Guoxian Song & Tat-Jen Cham
Department of Data Science & AI, Monash University, Melbourne, Australia
Duy-Son Dao & Jianfei Cai

Corresponding author

Correspondence to Chuanxia Zheng.

Additional information

Communicated by S.-C. Zhu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zheng, C., Dao, DS., Song, G. et al. Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition. Int J Comput Vis 129, 3195–3215 (2021). https://doi.org/10.1007/s11263-021-01517-0

Received: 23 January 2021
Accepted: 13 August 2021
Published: 28 September 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s11263-021-01517-0
