Abstract
Estimating the 6D object pose from a monocular RGB image remains challenging due to factors such as texture-less surfaces and occlusion. Although convolutional neural network (CNN)-based methods have made remarkable progress, they are inefficient at capturing global dependencies and often suffer from information loss caused by downsampling operations. To extract robust feature representations, we propose a Transformer-based 6D object pose estimation approach (Trans6D). Specifically, we first build and compare two strong Transformer-based baselines: a pure Transformer following ViT (Trans6D-pure) and a hybrid Transformer integrating CNNs with Transformers (Trans6D-hybrid). Furthermore, we propose two novel modules that make Trans6D-pure more accurate and robust: (i) a patch-aware feature fusion module, which decreases the number of tokens without information loss via shifted-window, cross-attention, and token-pooling operations and is used to predict dense 2D-3D correspondence maps; and (ii) a pure Transformer-based pose refinement module (Trans6D+), which refines the estimated poses iteratively. Extensive experiments show that the proposed approach achieves state-of-the-art performance on two datasets.
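To make the token-reduction idea behind the patch-aware feature fusion module concrete, the following PyTorch sketch shows one way to shrink a set of ViT patch tokens via token pooling and cross-attention, with the pooled tokens attending to the full patch set so that information from all patches can still flow into the reduced representation. This is a minimal illustrative sketch under assumed names, dimensions, and pooling strategy; it omits the shifted-window step and is not the authors' implementation.

# Hypothetical sketch of patch-aware token reduction (assumed design, not the paper's code).
import torch
import torch.nn as nn

class PatchAwareFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8, pool_ratio=2):
        super().__init__()
        # Token pooling: halve the token count along the sequence dimension.
        self.pool = nn.AvgPool1d(kernel_size=pool_ratio, stride=pool_ratio)
        # Cross-attention: pooled tokens (queries) attend to all patch tokens (keys/values).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, N, C) patch embeddings from a ViT-style encoder.
        pooled = self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, N // pool_ratio, C)
        q = self.norm_q(pooled)
        kv = self.norm_kv(tokens)
        fused, _ = self.attn(q, kv, kv)   # reduced tokens gather context from every patch
        return pooled + fused             # fewer tokens, residual connection keeps pooled content

# Usage example: reduce 196 patch tokens (14 x 14 ViT grid) to 98 fused tokens.
x = torch.randn(2, 196, 256)
print(PatchAwareFusion()(x).shape)        # torch.Size([2, 98, 256])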
Acknowledgements
This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00537, Visual common sense through self-supervised learning for restoration of invisible parts in images). ZQZ was supported by the China Scholarship Council (CSC) under Grant No. 202208060266. AL was supported in part by the EPSRC (grant number EP/S032487/1).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Chen, W., Zheng, L., Leonardis, A., Chang, H.J. (2023). Trans6D: Transformer-Based 6D Object Pose Estimation and Refinement. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_7
DOI: https://doi.org/10.1007/978-3-031-25085-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25084-2
Online ISBN: 978-3-031-25085-9
eBook Packages: Computer Science, Computer Science (R0)