Abstract
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world. Meanwhile, image completion aims to create a plausible appearance for invisible regions, but requires a manual mask as input. In this work, we propose a higher-level scene understanding system that tackles both the visible and invisible parts of objects and backgrounds in a given scene. In particular, we build a system that decomposes a scene into individual objects, infers their underlying occlusion relationships, and automatically learns which parts of the objects are occluded and need to be completed. To disentangle the occlusion relationships of all objects in a complex scene, we use the fact that an unoccluded front object is easy to identify, detect, and segment. Our system interleaves the two tasks of instance segmentation and scene completion over multiple iterations, solving for objects layer by layer. We first provide a thorough experiment using a new, realistically rendered dataset with ground truth for all invisible regions. To bridge the domain gap to real imagery, where ground truth is unavailable, we then train another model with pseudo-ground-truth generated from our trained synthesis model. We demonstrate results on a wide variety of datasets and show significant improvement over the state of the art.
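The layer-by-layer idea can be illustrated with a minimal toy sketch. It is not the system described here: the learned instance segmentation and completion networks are replaced by hypothetical stand-ins that read known object geometry, and the names `visible_part`, `peel_layers`, and `toy_scene` are assumptions made only for illustration. Each object is a set of pixels plus a depth, where a smaller depth means nearer to the camera.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
# Hypothetical toy representation: name -> (full object mask, depth).
Scene = Dict[str, Tuple[Set[Pixel], int]]


def visible_part(name: str, scene: Scene) -> Set[Pixel]:
    """Pixels of `name` not covered by any object nearer to the camera."""
    mask, depth = scene[name]
    covered: Set[Pixel] = set()
    for other, (other_mask, other_depth) in scene.items():
        if other != name and other_depth < depth:
            covered |= other_mask
    return mask - covered


def peel_layers(scene: Scene, max_layers: int = 10) -> List[List[str]]:
    """Repeatedly take the fully visible objects, then remove them."""
    remaining = dict(scene)
    layers: List[List[str]] = []
    for _ in range(max_layers):
        if not remaining:
            break
        # Stand-in for instance segmentation: keep only objects whose
        # visible part equals their full mask (nothing occludes them).
        front = sorted(
            n for n in remaining
            if visible_part(n, remaining) == remaining[n][0]
        )
        layers.append(front)
        # Stand-in for scene completion: removing the front objects
        # reveals the objects behind them for the next iteration.
        for n in front:
            del remaining[n]
    return layers


toy_scene: Scene = {
    "cup": ({(0, 0), (0, 1)}, 1),
    "lamp": ({(5, 5)}, 2),
    "book": ({(0, 1), (0, 2)}, 2),
    "table": ({(0, 0), (0, 1), (0, 2), (0, 3)}, 3),
}

print(peel_layers(toy_scene))
```

In this toy scene the cup and lamp are unoccluded and form the first layer, the book becomes fully visible once the cup is removed, and the table is recovered last. The order in which objects are peeled gives their occlusion ordering.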
Acknowledgements
This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU). This research is also supported by the Monash FIT Start-up Grant.
Additional information
Communicated by S.-C. Zhu.
Cite this article
Zheng, C., Dao, DS., Song, G. et al. Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition. Int J Comput Vis 129, 3195–3215 (2021). https://doi.org/10.1007/s11263-021-01517-0
DOI: https://doi.org/10.1007/s11263-021-01517-0