LocaliseBot: Multi-view 3D Object Localisation with Differentiable Rendering for Robot Grasping

  • Conference paper
  • Part of the book: Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Abstract

Robotic grasping typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera’s extrinsic parameters at those viewpoints, and 3D CAD models of the objects. In the first step, a standard deep learning backbone (FCN ResNet) estimates the object label, a semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. The approach is purely vision-based, avoiding the need for other modalities such as point clouds or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose achieves 99.65% grasp accuracy with respect to the ground-truth grasp candidates on the Object Clutter Indoor Dataset (OCID) Grasp dataset, computed using standard practice.
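To make the refinement step concrete, the sketch below shows how a coarse pose can be refined by gradient descent through a differentiable silhouette renderer, here using PyTorch3D. It is illustrative only, not the paper's actual pipeline: the mesh file `model.obj`, the four-view camera rig, the synthetic target silhouettes, and all hyperparameters are placeholder assumptions.

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    BlendParams, FoVPerspectiveCameras, MeshRasterizer, MeshRenderer,
    RasterizationSettings, SoftSilhouetteShader, look_at_view_transform,
)
from pytorch3d.transforms import axis_angle_to_matrix

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical CAD model; any mesh file works for the sketch.
mesh = load_objs_as_meshes(["model.obj"], device=device)

# Known extrinsics for V = 4 views (the method assumes these are given).
V = 4
R, T = look_at_view_transform(dist=2.7, elev=10.0, azim=torch.linspace(0, 270, V))
cameras = FoVPerspectiveCameras(R=R, T=T, device=device)

# Soft rasterisation so that silhouette edges carry gradients.
raster_settings = RasterizationSettings(
    image_size=128, blur_radius=1e-4, faces_per_pixel=50
)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(blend_params=BlendParams(sigma=1e-4, gamma=1e-4)),
)

def render_silhouettes(rot_vec, trans):
    """Apply a rigid transform to the mesh and render the alpha (silhouette) channel."""
    R_obj = axis_angle_to_matrix(rot_vec[None])              # (1, 3, 3)
    verts = mesh.verts_padded() @ R_obj.transpose(1, 2) + trans
    return renderer(mesh.update_padded(verts).extend(V))[..., 3]

# Synthetic targets standing in for the backbone's segmentation masks.
with torch.no_grad():
    true_rot = torch.tensor([0.3, -0.2, 0.1], device=device)
    true_trans = torch.tensor([0.05, -0.03, 0.02], device=device)
    targets = render_silhouettes(true_rot, true_trans)

# Coarse pose estimate (here: identity) refined by gradient descent.
rot_vec = torch.zeros(3, device=device, requires_grad=True)
trans = torch.zeros(3, device=device, requires_grad=True)
optimiser = torch.optim.Adam([rot_vec, trans], lr=0.01)

for step in range(200):
    optimiser.zero_grad()
    loss = torch.nn.functional.mse_loss(render_silhouettes(rot_vec, trans), targets)
    loss.backward()
    optimiser.step()
```

In the paper the loss would be driven by the backbone's predicted masks over the real multi-view images rather than synthetic renders; the point of the sketch is that the pose parameters receive gradients directly from the image-space discrepancy, with no depth or point-cloud input.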

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. CMMI 1826258.

Author information

Correspondence to Sujal Vijayaraghavan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vijayaraghavan, S., Alqasemi, R., Dubey, R., Sarkar, S. (2023). LocaliseBot: Multi-view 3D Object Localisation with Differentiable Rendering for Robot Grasping. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13806. Springer, Cham. https://doi.org/10.1007/978-3-031-25075-0_47

  • DOI: https://doi.org/10.1007/978-3-031-25075-0_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25074-3

  • Online ISBN: 978-3-031-25075-0

  • eBook Packages: Computer Science, Computer Science (R0)
