Abstract
Estimating the 6D object pose from a monocular RGB image remains challenging due to factors such as texture-less surfaces and occlusion. Although convolutional neural network (CNN)-based methods have made remarkable progress, they are inefficient at capturing global dependencies and often suffer from information loss caused by downsampling operations. To extract robust feature representations, we propose a Transformer-based 6D object pose estimation approach (Trans6D). Specifically, we first build and compare two strong Transformer-based baselines: a pure Transformer following ViT (Trans6D-pure) and a hybrid Transformer integrating CNNs with Transformers (Trans6D-hybrid). Furthermore, we propose two novel modules that make Trans6D-pure more accurate and robust: (i) a patch-aware feature fusion module, which decreases the number of tokens without information loss via shifted-window, cross-attention, and token-pooling operations and is used to predict dense 2D-3D correspondence maps; and (ii) a pure Transformer-based pose refinement module (Trans6D+), which refines the estimated poses iteratively. Extensive experiments show that the proposed approach achieves state-of-the-art performance on two datasets.
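To make the token-reduction idea behind the patch-aware feature fusion module concrete, the following PyTorch sketch shows one way to shrink a set of ViT patch tokens via token pooling and cross-attention, with the pooled tokens attending to the full patch set so that information from all patches can still flow into the reduced representation. This is a minimal illustrative sketch under assumed names, dimensions, and pooling strategy; it omits the shifted-window step and is not the authors' implementation.

# Hypothetical sketch of patch-aware token reduction (assumed design, not the paper's code).
import torch
import torch.nn as nn

class PatchAwareFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8, pool_ratio=2):
        super().__init__()
        # Token pooling: halve the token count along the sequence dimension.
        self.pool = nn.AvgPool1d(kernel_size=pool_ratio, stride=pool_ratio)
        # Cross-attention: pooled tokens (queries) attend to all patch tokens (keys/values).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, N, C) patch embeddings from a ViT-style encoder.
        pooled = self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, N // pool_ratio, C)
        q = self.norm_q(pooled)
        kv = self.norm_kv(tokens)
        fused, _ = self.attn(q, kv, kv)   # reduced tokens gather context from every patch
        return pooled + fused             # fewer tokens, residual connection keeps pooled content

# Usage example: reduce 196 patch tokens (14 x 14 ViT grid) to 98 fused tokens.
x = torch.randn(2, 196, 256)
print(PatchAwareFusion()(x).shape)        # torch.Size([2, 98, 256])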
Acknowledgements
This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00537, Visual common sense through self-supervised learning for restoration of invisible parts in images). ZQZ was supported by the China Scholarship Council (CSC) under Grant No. 202208060266. AL was supported in part by the EPSRC (grant number EP/S032487/1).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Chen, W., Zheng, L., Leonardis, A., Chang, H.J. (2023). Trans6D: Transformer-Based 6D Object Pose Estimation and Refinement. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_7
DOI: https://doi.org/10.1007/978-3-031-25085-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25084-2
Online ISBN: 978-3-031-25085-9
eBook Packages: Computer Science, Computer Science (R0)