Abstract
In this paper, we rethink the problem of scene reconstruction from the perspective of an embodied agent: whereas the classic view focuses on reconstruction accuracy, our perspective emphasizes the underlying functions and constraints of the reconstructed scene that provide actionable information for simulating interactions with agents. We address this challenging problem by reconstructing a functionally equivalent and interactive scene from RGB-D data streams: objects within the scene are segmented by a dedicated 3D volumetric panoptic mapping module and subsequently replaced by part-based articulated CAD models that afford finer-grained robot interactions. Object functionality and contextual relations are further organized in a graph-based scene representation that can be readily incorporated into robots' action specifications and task definitions, facilitating their long-term task and motion planning in the scenes. In the experiments, we demonstrate that (i) our panoptic mapping module outperforms previous state-of-the-art methods in recognizing and segmenting scene entities, (ii) our geometric and physical reasoning procedure matches, aligns, and replaces object meshes with the best-fitting CAD models, and (iii) the reconstructed functionally equivalent and interactive scenes are physically plausible and naturally afford actionable interactions; without any manual labeling, they can be seamlessly imported into ROS-based robot simulators and VR environments for simulating complex robot interactions.
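The abstract describes a graph-based scene representation in which objects are nodes and contextual relations (e.g., support) are edges, forming a tree rooted at the scene. The following is a minimal, illustrative sketch of that idea only; all class and field names (`SceneNode`, `cad_id`, `add_supported`) are hypothetical and do not reflect the authors' actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceneNode:
    """One scene entity: a semantic label plus an (assumed) CAD model id."""
    label: str
    cad_id: Optional[str] = None          # hypothetical CAD-model identifier
    children: List["SceneNode"] = field(default_factory=list)

    def add_supported(self, child: "SceneNode") -> "SceneNode":
        """Attach a child node supported by this node (a contact relation)."""
        self.children.append(child)
        return child

def to_indented(node: SceneNode, depth: int = 0) -> List[str]:
    """Flatten the support tree into indented lines for inspection."""
    lines = ["  " * depth + node.label]
    for c in node.children:
        lines.extend(to_indented(c, depth + 1))
    return lines

# Build a tiny scene: the room supports a cabinet, which supports a mug.
room = SceneNode("room")
cabinet = room.add_supported(SceneNode("cabinet", cad_id="cabinet_01"))
mug = cabinet.add_supported(SceneNode("mug", cad_id="mug_03"))
print("\n".join(to_indented(room)))
```

A representation of this shape maps naturally onto a kinematic tree, which is why such a scene graph can be exported to ROS-based simulators (e.g., as URDF links and joints) without manual labeling.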
Notes
Additional results are available online at https://sites.google.com/view/ijcv2022-reconstruction. Code can be found at https://github.com/hmz-15/Interactive-Scene-Reconstruction.
Additional information
Communicated by Akihiro Sugimoto.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Han, M., Zhang, Z., Jiao, Z. et al. Scene Reconstruction with Functional Objects for Robot Autonomy. Int J Comput Vis 130, 2940–2961 (2022). https://doi.org/10.1007/s11263-022-01670-0