
A new benchmark for pose estimation with ground truth from virtual reality


Abstract

The development of programming paradigms for industrial assembly is currently gaining fresh impetus from approaches based on human demonstration and programming-by-demonstration. Major low- and mid-level prerequisites for machine vision and learning in these intelligent robotic applications are pose estimation, stereo reconstruction and action recognition. As a basis for the machine vision and learning involved, pose estimation derives object positions and orientations and thus target frames for robot execution. Our contribution introduces and applies a novel benchmark for typical multi-sensor setups and algorithms in the field of demonstration-based automated assembly. The benchmark platform is equipped with a multi-sensor setup consisting of stereo cameras and depth scanning devices (see Fig. 1). The dimensions and capabilities of the platform were chosen to reflect typical manual assembly tasks. Following the eRobotics methodology, a simulatable 3D representation of this platform was modelled in virtual reality. Based on a detailed camera and sensor simulation, we generated a set of benchmark images and point clouds with controlled levels of noise as well as ground truth data such as object positions and time stamps. We demonstrate the application of the benchmark by evaluating our latest developments in pose estimation, stereo reconstruction and action recognition, and we publish the benchmark data for the objective comparison of sensor setups and algorithms in industry.
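Pose estimation results are scored against the object poses recorded as ground truth in the virtual reality simulation. As a minimal sketch of how such a comparison can be set up (not taken from the paper; the function name, the 4x4 homogeneous-matrix convention and the choice of error metrics are illustrative assumptions), the following Python snippet reports the translation and rotation error between an estimated and a ground-truth object pose:

```python
# Minimal sketch (assumed metrics, not the paper's evaluation code):
# compare an estimated 4x4 object pose against the VR ground-truth pose.
import numpy as np

def pose_errors(T_est, T_gt):
    """Return (translation error, rotation error in radians) between two 4x4 poses."""
    # Translation error: Euclidean distance between the two frame origins.
    t_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    # Rotation error: angle of the relative rotation R_gt^T * R_est,
    # recovered from its trace and clipped against numerical drift.
    R_rel = T_gt[:3, :3].T @ T_est[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.arccos(cos_angle)

# Example: an estimate offset by roughly 11 mm from a ground-truth pose at the origin.
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[:3, 3] = [0.01, 0.0, -0.005]
print(pose_errors(T_est, T_gt))  # -> (~0.0112, 0.0)
```

In a benchmark of this kind, such per-object errors would typically be aggregated over all scenes and noise levels to compare algorithms and sensor setups.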


Notes

  1. It is well-known that multiple Kinect sensors sharing a common field of view will cause IR interference, resulting in poor depth reconstructions. A known solution, which our platform also incorporates, is the use of vibrating motors mounted on the Kinect sensors [9]. This method has been shown to effectively blur out the noisy contributions of external sensors, while maintaining a high depth reconstruction quality.

References

  1. Aksoy EE, Abramov A, Wörgötter F, Dellen B (2010) Categorizing object-action relations from semantic scene graphs. In: IEEE international conference on robotics and automation (ICRA), pp 398–405

  2. Aksoy EE, Abramov A, Dörr J, Ning K, Dellen B, Wörgötter F (2011) Learning the semantics of object-action relations by observation. Int J Rob Res 30(10):1229–1249


  3. Aldoma A, Tombari F, Di Stefano L, Vincze M (2012) A global hypotheses verification method for 3d object recognition. In: European conference on computer vision (ECCV), Springer, pp 511–524

  4. Badler N (1975) Temporal scene analysis: conceptual descriptions of object movements. PhD thesis, University of Toronto, Canada

  5. Besl PJ, McKay ND (1992) A method for registration of 3-d shapes. IEEE Trans Pattern Anal Mach Intell 14(2):239–256


  6. Billard A, Calinon S, Guenter F (2006) Discriminative and adaptive imitation in uni-manual and bi-manual tasks. Rob Auton Syst 54(5):370–384


  7. Bradski G, Kaehler A (2008) Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media

  8. Buch AG, Kraft D, Kamarainen JK, Petersen HG, Krüger N (2013) Pose estimation using local structure-specific shape and appearance context. In: IEEE international conference on robotics and automation (ICRA), pp 2080–2087

  9. Butler DA, Izadi S, Hilliges O, Molyneaux D, Hodges S, Kim D (2012) Shake’n’sense: reducing interference for overlapping structured light depth cameras. In: Proceedings of the 2012 ACM annual conference on human factors in computing systems, ACM, pp 1933–1936

  10. Collins K, Palmer AJ, Rathmill K (1985) The development of a European benchmark for the comparison of assembly robot programming systems. In: Robot technology and applications (Robotics Europe Conference), pp 187–199

  11. Coppelia Robotics (2014) V-REP. http://www.coppeliarobotics.com/

  12. Cos S, Uwaerts D, Hermans L (2006) Evaluation of STAR250 and STAR1000 CMOS image sensors. In: ESA international conference of guidance, navigation and control systems, pp 1–6

  13. El-Laithy RA, Huang J, Yeh M (2012) Study on the use of Microsoft Kinect for robotics applications. In: IEEE symposium on position location and navigation (PLANS), pp 1280–1288

  14. Emde M, Rossmann J (2013) Validating a simulation of a single ray based laser scanner used in mobile robot applications. In: International symposium on robotic and sensors environments (ROSE), pp 55–60

  15. Farrell K, Okincha M, Parmar M (2008) Sensor calibration and simulation. In: SPIE 6817, Digital Photography IV, pp 1–9

  16. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24(6):381–395


  17. Glover J, Popovic S (2013) Bingham procrustean alignment for object detection in clutter. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 2158–2165

  18. Gupta A, Davis LS (2007) Objects in action: An approach for combining action understanding and object perception. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1–8

  19. Hartley RI, Zisserman A (2004) Multiple view geometry in computer vision. Cambridge University Press, Cambridge


  20. Herrera C, Kannala J (2012) Joint depth and color camera calibration with distortion correction. IEEE Trans Pattern Anal Mach Intell 34(10):2058–2064


  21. Hue C, Le Cadre JP, Perez P (2002) Tracking multiple objects with particle filtering. IEEE Trans Aerosp Electron Syst 38(3):791–812


  22. Ikeuchi K, Suehiro T (1994) Toward an assembly plan from observation, part I: task recognition with polyhedral objects. IEEE Trans Rob Autom 10(3):368–385


  23. Isard M, Blake A (1998) CONDENSATION: conditional density propagation for visual tracking. Int J Comput Vis 29(1):5–28


  24. Khan Z, Balch T, Dellaert F (2005) MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Trans Pattern Anal Mach Intell 27(11):1805–1819


  25. Koppula HS, Gupta R, Saxena A (2013) Learning human activities and object affordances from RGB-D videos. Int J Rob Res 32(8):951–970


  26. Lai K, Bo L, Ren X, Fox D (2011) A large-scale hierarchical multi-view RGB-D object dataset. In: IEEE international conference on robotics and automation (ICRA), pp 1817–1824

  27. Martinez D, Alenya G, Jimenez P, Torras C, Rossmann J, Wantia N, Aksoy EE, Haller S, Piater J (2014) Active learning of manipulation sequences. In: IEEE international conference on robotics and automation (ICRA)

  28. Mian AS, Bennamoun M, Owens R (2006) Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Trans Pattern Anal Mach Intell 28(10):1584–1601


  29. Microsoft (2012) Kinect for windows SDK version 1.5. http://www.kinectforwindows.org/

  30. Microsoft (2014) Microsoft robotics developer studio 4. http://www.microsoft.com/en-us/download/details.aspx?id=29081

  31. Open Source Robotics Foundation (2014) GazeboSim. http://gazebosim.org/

  32. Papon J, Kulvicius T, Aksoy EE, Wörgötter F (2013) Point cloud video object segmentation using a persistent supervoxel world-model. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 3712–3718

  33. Pardowitz M, Knoop S, Dillmann R, Zöllner RD (2007) Incremental learning of tasks from user demonstrations, past experiences, and vocal comments. IEEE Trans Syst Man Cybern B Cybern 37(2):322–332


  34. Point Grey (2011) Bumblebee2 stereo camera. http://ww2.ptgrey.com/stereo-vision/bumblebee-2

  35. Roßmann J (2012) eRobotics: the symbiosis of advanced robotics and virtual reality technologies. In: ASME international design in engineering technical conference and computers and information in engineering (IDETC/CIE), vol 2, pp 1395–1402

  36. Roßmann J, Ruf H, Schlette C (2009) Model-based programming ’by demonstration’— fast setup of robot systems. In: Advances in robotics research: Theory, implementation, application. Springer, pp 159–168

  37. Roßmann J, Schlette C, Ruf H (2010) A tool kit of new model-based methods for programming industrial robots. In: IASTED international conference on robotics and applications (RA), pp 379–385

  38. Roßmann J, Hempe N, Emde M, Steil T (2012a) A real-time optical sensor simulation framework for development and testing of industrial and mobile robot applications. In: German conference on robotics (ROBOTIK), pp 1–6

  39. Roßmann J, Schlette C, Wantia N (2012b) Virtual reality providing ground truth for machine learning and programming by demonstration. In: ASME international design in engineering technical conference and computers and information in engineering (IDETC/CIE), pp 1501–1508

  40. Roßmann J, Steil T, Springer M (2012c) Validating the camera and light simulation of a virtual space robotics testbed by means of physical mockup data. In: International symposium on artificial intelligence, robotics and automation in space (i-SAIRAS), pp 1–6

  41. Rusu RB, Cousins S (2011) 3d is here: Point Cloud Library (PCL). In: IEEE international conference on robotics and automation (ICRA)

  42. Schou C, Carøe CF, Hvilshøj M, Damgaard JS, Bøgh S, Madsen O (2012) Human assisted instructing of autonomous industrial mobile manipulator and its qualitative assessment. In: AAU workshop on human-centered robotics, pp 22–28

  43. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: International conference on pattern recognition (ICPR), vol 3, pp 32–36

  44. Sridhar M, Cohn AG, Hogg D (2008) Learning functional object-categories from a relational spatio-temporal representation. In: European conference on artificial intelligence (ECAI), pp 606–610

  45. Summers-Stay D, Teo CL, Yang Y, Fermüller C, Aloimonos Y (2012) Using a minimal action grammar for activity understanding in the real world. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 4104–4111

  46. Thomas U, Wahl FM (2001) A system for automatic planning, evaluation and execution of assembly sequences for industrial robots. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 1458–1464

  47. Tombari F, Salti S, Di Stefano L (2010) Unique signatures of histograms for local surface description. In: European conference on computer vision (ECCV), Springer, pp 356–369

  48. Vermaak J, Godsill S, Perez P (2005) Monte Carlo filtering for multi target tracking and data association. IEEE Trans Aerosp Electron Syst 41(1):309–332


  49. Yang Y, Fermüller C, Aloimonos Y (2013) Detection of manipulation action consequences (MAC). In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2563–2570

  50. Zhang Z (2012) Microsoft Kinect sensor and its effect. IEEE MultiMed 19(2):4–10



Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 (Specific Programme Cooperation, Theme 3, Information and Communication Technologies) under Grant Agreement No. 269959, IntellAct.

Author information


Corresponding author

Correspondence to Christian Schlette.


About this article


Cite this article

Schlette, C., Buch, A.G., Aksoy, E.E. et al. A new benchmark for pose estimation with ground truth from virtual reality. Prod. Eng. Res. Devel. 8, 745–754 (2014). https://doi.org/10.1007/s11740-014-0552-0

