
Object Detection and Pose Estimation from RGB and Depth Data for Real-Time, Adaptive Robotic Grasping

Conference paper in Advances in Computer Vision and Computational Biology

Abstract

Object detection and pose estimation have recently gained significant attention in robotic vision applications. Identifying objects of interest and estimating their pose are essential capabilities for robots to provide effective assistance in applications ranging from household tasks to industrial manipulation. The problem is particularly challenging because objects are heterogeneous, with different and potentially complex shapes, and because of background clutter and partial occlusions between objects. As the main contribution of this work, we propose a system that performs real-time object detection and pose estimation for dynamic robot grasping. The robot is pre-trained to perform a small set of canonical grasps from a few fixed poses for each object. When presented with one of these objects in an arbitrary, previously unseen pose, the proposed approach allows the robot to identify the object, estimate its actual pose, and adapt a canonical grasp to that new pose. For training, the system defines a canonical grasp by capturing the relative pose of an object with respect to the gripper attached to the robot’s wrist. During testing, once a new pose is detected, a canonical grasp for the object is selected and dynamically adapted by adjusting the robot arm’s joint angles so that the gripper can grasp the object in its new pose. We conducted experiments with a humanoid PR2 robot and showed that the proposed framework can detect well-textured objects and provide accurate pose estimation in the presence of tolerable amounts of out-of-plane rotation. The performance is further illustrated by the robot successfully grasping objects from a wide range of arbitrary poses.
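
To make the grasp-adaptation step described in the abstract concrete, the sketch below shows one way the idea could be implemented; it is an illustrative assumption, not the authors’ released code. A canonical grasp is stored as the gripper pose expressed relative to the object, and once the detector reports the object’s new pose in the robot base frame, the stored relative transform is composed with that pose to obtain the new gripper target. The frame and function names (T_base_obj, record_canonical_grasp, adapt_grasp) are hypothetical; in a real system the resulting target would be handed to an inverse-kinematics solver for the PR2 arm to obtain joint angles.

```python
import numpy as np

# Illustrative sketch only (not the authors' released code): adapting a stored
# canonical grasp to a newly estimated object pose. All frame, variable, and
# function names below are assumptions made for this example.

def pose_to_matrix(position, quaternion):
    """Build a 4x4 homogeneous transform from a position and a unit quaternion (x, y, z, w)."""
    x, y, z, w = quaternion
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = position
    return T

def record_canonical_grasp(T_base_obj, T_base_gripper):
    """Training time: store the gripper pose expressed relative to the object."""
    return np.linalg.inv(T_base_obj) @ T_base_gripper

def adapt_grasp(T_base_obj_new, T_obj_gripper):
    """Test time: compose the detected object pose with the stored relative grasp."""
    return T_base_obj_new @ T_obj_gripper

if __name__ == "__main__":
    # Object and gripper poses recorded while demonstrating the canonical grasp
    # (made-up numbers, expressed in the robot base frame).
    T_base_obj_train = pose_to_matrix([0.6, 0.0, 0.8], [0.0, 0.0, 0.0, 1.0])
    T_base_gripper_train = pose_to_matrix([0.6, 0.0, 0.95], [0.0, 1.0, 0.0, 0.0])
    T_obj_gripper = record_canonical_grasp(T_base_obj_train, T_base_gripper_train)

    # Later, the detector reports the same object translated and rotated 45 degrees about z.
    T_base_obj_new = pose_to_matrix([0.5, 0.2, 0.8], [0.0, 0.0, 0.3827, 0.9239])
    T_base_gripper_target = adapt_grasp(T_base_obj_new, T_obj_gripper)

    # The 4x4 target pose would then be passed to an IK solver to obtain joint angles.
    print(np.round(T_base_gripper_target, 3))
```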

Acknowledgements

This work has been supported in part by the Office of Naval Research award N00014-16-1-2312 and US Army Research Laboratory (ARO) award W911NF-20-2-0084.

Author information

Correspondence to Shuvo Kumar Paul.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Paul, S.K., Chowdhury, M.T., Nicolescu, M., Nicolescu, M., Feil-Seifer, D. (2021). Object Detection and Pose Estimation from RGB and Depth Data for Real-Time, Adaptive Robotic Grasping. In: Arabnia, H.R., Deligiannidis, L., Shouno, H., Tinetti, F.G., Tran, QN. (eds) Advances in Computer Vision and Computational Biology. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-71051-4_10


  • DOI: https://doi.org/10.1007/978-3-030-71051-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71050-7

  • Online ISBN: 978-3-030-71051-4

