
Unreal mask: one-shot multi-object class-based pose estimation for robotic manipulation using keypoints with a synthetic dataset


Abstract

Object pose estimation is a prerequisite for many robotic applications. Preparing a dataset for network training is a challenging part of most pose estimation approaches, and in most of them the network can detect only the objects it was trained on. Synthetic data are a promising way to obtain large amounts of pre-labeled training data for deep neural networks in robotic manipulation, since such data can be generated safely. We investigate the reality gap in the pose estimation of intra-category objects from a single RGB-D image using keypoints. The approach proposed in this paper provides a fast and simple procedure for training a deep neural network to identify an object and its keypoints, based on a synthetic dataset and an auto-labeling program. To our knowledge, this is the first deep network trained only on synthetic data that can find keypoints of intra-category objects for pose estimation purposes. The speed of training and the simplicity of the method make it very easy to add a new class of objects to the system, which is the main advantage of this approach. Using this approach, we demonstrate a near-real-time system that estimates object poses with sufficient accuracy for real-world semantic grasping and manipulation of intra-category objects in clutter by a real robot.



Notes

  1. It is also called a Kovsh or Ladle.


Author information


Corresponding author

Correspondence to S. H. Zabihifar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Learning rate

The final learning rate is set by the user, but the optimization algorithm ramps it up from a very small value to this final value over the course of training (Figs. 18, 19 and 20; Tables 5, 6 and 7).
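This warmup behaviour is straightforward to reproduce. Below is a minimal sketch of such a linear warmup schedule in Python; the function name, warmup length and learning-rate values are illustrative assumptions, not the exact settings used for training.

```python
def warmup_lr(iteration: int,
              final_lr: float = 0.02,     # user-defined final learning rate (illustrative value)
              warmup_iters: int = 1000,   # assumed warmup length, not the paper's setting
              start_factor: float = 1e-3) -> float:
    """Linearly ramp the learning rate from a very small value up to final_lr."""
    if iteration >= warmup_iters:
        return final_lr
    alpha = iteration / warmup_iters
    # Interpolate between start_factor * final_lr and final_lr.
    return final_lr * (start_factor + (1.0 - start_factor) * alpha)

# Example: learning rate at a few iterations during and after warmup.
for it in (0, 250, 500, 1000, 2000):
    print(f"iter {it:5d}: lr = {warmup_lr(it):.6f}")
```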

PR curves

A PR curve plots precision against recall at different detection thresholds for each object class. Multiple precision–recall pairs are obtained by varying the threshold, and there is always a trade-off between the two: the lower the threshold, the higher the recall and the lower the precision; the higher the threshold, the lower the recall and the higher the precision. AP is the area under the PR curve after interpolation to a stepped line, and mAP is the AP averaged over all object classes. Precision and recall are defined as follows:

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively. The area under the PR curve is defined as:

$$ \text{AP} = \int_{0}^{1} p\left( r \right) \, \mathrm{d}r $$
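For concreteness, here is a minimal Python sketch that builds the precision–recall pairs by sweeping the detection threshold and computes AP with the step (maximum-precision) interpolation described above; the function name and the exact interpolation convention are our assumptions rather than the authors' evaluation code.

```python
import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    """PR pairs from sweeping the detection threshold, plus the AP value.

    scores : confidence score of each detection.
    is_tp  : 1 if the detection matches a ground-truth object, else 0.
    num_gt : total number of ground-truth objects (i.e. TP + FN).
    """
    is_tp = np.asarray(is_tp, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))  # high scores first
    tp = np.cumsum(is_tp[order])        # cumulative true positives
    fp = np.cumsum(1.0 - is_tp[order])  # cumulative false positives
    precision = tp / (tp + fp)          # Precision = TP / (TP + FP)
    recall = tp / num_gt                # Recall    = TP / (TP + FN)
    # Step interpolation: at each recall level, take the maximum precision
    # achieved at that recall or any higher recall.
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    # AP is the area under the interpolated, stepped PR curve.
    ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * interp)
    return precision, recall, ap

# Toy example: five detections of one class against four ground-truth objects.
_, _, ap = precision_recall_ap(scores=[0.9, 0.8, 0.7, 0.6, 0.5],
                               is_tp=[1, 1, 0, 1, 0],
                               num_gt=4)
print(f"AP = {ap:.4f}")  # averaging AP over classes would give mAP
```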
Fig. 18 Learning rate per iteration

Fig. 19 Precision–recall curves

Fig. 20 Other unseen objects

Table 5 Comparison of AP for object detection between synthetic and real data
Table 6 Comparison of AP for keypoint detection between synthetic and real data
Table 7 Cross-validation results


Cite this article

Zabihifar, S.H., Semochkin, A.N., Seliverstova, E.V. et al. Unreal mask: one-shot multi-object class-based pose estimation for robotic manipulation using keypoints with a synthetic dataset. Neural Comput & Applic 33, 12283–12300 (2021). https://doi.org/10.1007/s00521-020-05644-6
