YOLOPose: Transformer-Based Multi-object 6D Pose Estimation Using Keypoint Regression

  • Conference paper
  • First Online:
Intelligent Autonomous Systems 17 (IAS 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 577)

Abstract

6D object pose estimation is a crucial prerequisite for autonomous robot manipulation. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Recently, Transformers, an architecture originally proposed for natural language processing, have been achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures that learn object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression. In contrast to the standard heatmap-based approach to predicting keypoints in an image, we regress the keypoints directly. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Together with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods.
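The abstract does not spell out how the learnable orientation module represents rotations. As a hedged illustration only (not the paper's code), a common choice for regression-based orientation modules is the continuous 6D rotation parameterization, which maps a 6-vector to a rotation matrix via Gram-Schmidt orthonormalization; the function name and the sample input below are assumptions for the sketch:

```python
import numpy as np

def rotation_from_6d(v):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    A continuous parameterization often used when a network head
    regresses orientation directly (illustrative sketch, not the
    YOLOPose implementation).
    """
    a1, a2 = np.asarray(v[:3], dtype=float), np.asarray(v[3:], dtype=float)
    # First basis vector: normalize the first 3D half of the output.
    b1 = a1 / np.linalg.norm(a1)
    # Second basis vector: remove the component along b1, then normalize.
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    # Third basis vector: cross product guarantees a right-handed frame.
    b3 = np.cross(b1, b2)
    # Stack as columns: R is orthonormal with determinant +1.
    return np.stack([b1, b2, b3], axis=1)

# Hypothetical raw 6D output from an orientation head.
R = rotation_from_6d([1.0, 0.2, 0.0, 0.0, 1.0, 0.3])
```

Because every step above is differentiable, such a mapping can sit at the end of an end-to-end trainable pipeline, which is consistent with the differentiability claim in the abstract.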

Arash Amini and Arul Selvam Periyasamy contributed equally.



Acknowledgment

This work has been funded by the German Ministry of Education and Research (BMBF), grant no. 01IS21080, project “Learn2Grasp: Learning Human-like Interactive Grasping based on Visual and Haptic Feedback”.

Author information

Corresponding author

Correspondence to Arul Selvam Periyasamy.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Amini, A., Selvam Periyasamy, A., Behnke, S. (2023). YOLOPose: Transformer-Based Multi-object 6D Pose Estimation Using Keypoint Regression. In: Petrovic, I., Menegatti, E., Marković, I. (eds) Intelligent Autonomous Systems 17. IAS 2022. Lecture Notes in Networks and Systems, vol 577. Springer, Cham. https://doi.org/10.1007/978-3-031-22216-0_27
