LocaliseBot: Multi-view 3D Object Localisation with Differentiable Rendering for Robot Grasping

  • Conference paper
  • Part of the book: Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Abstract

Robotic grasping typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera’s extrinsic parameters at those viewpoints, and 3D CAD models of the objects. In the first step, a standard deep learning backbone (FCN ResNet) estimates the object label, a semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. The approach is purely vision-based, avoiding the need for other modalities such as point clouds or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose achieves 99.65% grasp accuracy with respect to the ground-truth grasp candidates on the Object Clutter Indoor Dataset (OCID) Grasp dataset, computed using standard practice.
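To make the refinement step concrete, the sketch below shows how a coarse pose can be refined by gradient descent through a differentiable silhouette renderer, here using PyTorch3D. It is illustrative only, not the paper's actual pipeline: the mesh file `model.obj`, the four-view camera rig, the synthetic target silhouettes, and all hyperparameters are placeholder assumptions.

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    BlendParams, FoVPerspectiveCameras, MeshRasterizer, MeshRenderer,
    RasterizationSettings, SoftSilhouetteShader, look_at_view_transform,
)
from pytorch3d.transforms import axis_angle_to_matrix

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical CAD model; any mesh file works for the sketch.
mesh = load_objs_as_meshes(["model.obj"], device=device)

# Known extrinsics for V = 4 views (the method assumes these are given).
V = 4
R, T = look_at_view_transform(dist=2.7, elev=10.0, azim=torch.linspace(0, 270, V))
cameras = FoVPerspectiveCameras(R=R, T=T, device=device)

# Soft rasterisation so that silhouette edges carry gradients.
raster_settings = RasterizationSettings(
    image_size=128, blur_radius=1e-4, faces_per_pixel=50
)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(blend_params=BlendParams(sigma=1e-4, gamma=1e-4)),
)

def render_silhouettes(rot_vec, trans):
    """Apply a rigid transform to the mesh and render the alpha (silhouette) channel."""
    R_obj = axis_angle_to_matrix(rot_vec[None])              # (1, 3, 3)
    verts = mesh.verts_padded() @ R_obj.transpose(1, 2) + trans
    return renderer(mesh.update_padded(verts).extend(V))[..., 3]

# Synthetic targets standing in for the backbone's segmentation masks.
with torch.no_grad():
    true_rot = torch.tensor([0.3, -0.2, 0.1], device=device)
    true_trans = torch.tensor([0.05, -0.03, 0.02], device=device)
    targets = render_silhouettes(true_rot, true_trans)

# Coarse pose estimate (here: identity) refined by gradient descent.
rot_vec = torch.zeros(3, device=device, requires_grad=True)
trans = torch.zeros(3, device=device, requires_grad=True)
optimiser = torch.optim.Adam([rot_vec, trans], lr=0.01)

for step in range(200):
    optimiser.zero_grad()
    loss = torch.nn.functional.mse_loss(render_silhouettes(rot_vec, trans), targets)
    loss.backward()
    optimiser.step()
```

In the paper the loss would be driven by the backbone's predicted masks over the real multi-view images rather than synthetic renders; the point of the sketch is that the pose parameters receive gradients directly from the image-space discrepancy, with no depth or point-cloud input.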

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. CMMI 1826258.

Author information

Correspondence to Sujal Vijayaraghavan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vijayaraghavan, S., Alqasemi, R., Dubey, R., Sarkar, S. (2023). LocaliseBot: Multi-view 3D Object Localisation with Differentiable Rendering for Robot Grasping. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13806. Springer, Cham. https://doi.org/10.1007/978-3-031-25075-0_47

  • DOI: https://doi.org/10.1007/978-3-031-25075-0_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25074-3

  • Online ISBN: 978-3-031-25075-0

  • eBook Packages: Computer Science, Computer Science (R0)
