Multimedia Tools and Applications, Volume 76, Issue 3, pp 4445–4469

Semi-direct tracking and mapping with RGB-D camera for MAV



In this paper we present a novel semi-direct tracking and mapping (SDTAM) approach for RGB-D cameras that inherits the advantages of both direct and feature-based methods and consequently achieves high efficiency, accuracy, and robustness. Input RGB-D frames are tracked with a direct method, and keyframes are refined by minimizing a proposed measurement residual function that takes both geometric and depth information into account. A local optimization refines the local map, while a global optimization detects and corrects loops using an appearance-based bag of words and a co-visibility weighted pose graph. Our method achieves higher accuracy in both trajectory tracking and surface reconstruction than state-of-the-art frame-to-frame or frame-to-model approaches. We test our system on challenging sequences with motion blur, fast pure rotation, and large moving objects; the results demonstrate that it still obtains highly accurate results. Furthermore, the proposed approach runs in real time using only part of the CPU's computation power, so it can be applied to embedded devices such as phones, tablets, or micro aerial vehicles (MAVs).
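To make the idea of direct RGB-D tracking concrete, the following is a minimal sketch of a generic per-pixel alignment residual: a reference pixel is back-projected with its depth, moved by a candidate pose, re-projected into the current frame, and compared in intensity and depth. This is not the measurement residual function proposed in the paper, whose exact form is not given in the abstract. The function names, the nearest-neighbour lookup, the pinhole intrinsics tuple `K = (fx, fy, cx, cy)`, and the weight `lam` are all assumptions made for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(R, t, p):
    """Rigid motion p' = R p + t, with R a 3x3 nested list and t a 3-tuple."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p, fx, fy, cx, cy):
    """3D point in the camera frame -> pixel coordinates (pinhole model)."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy


def pixel_residuals(ref_gray, ref_depth, cur_gray, cur_depth, u, v, R, t, K):
    """Return (photometric, depth) residuals for one reference pixel,
    or None if the pixel has no valid depth or leaves the current image."""
    fx, fy, cx, cy = K
    d = ref_depth[v][u]
    if d <= 0:
        return None
    p_cur = transform(R, t, backproject(u, v, d, fx, fy, cx, cy))
    if p_cur[2] <= 0:
        return None
    u2, v2 = project(p_cur, fx, fy, cx, cy)
    ui, vi = int(round(u2)), int(round(v2))  # nearest-neighbour lookup (assumption)
    if not (0 <= vi < len(cur_gray) and 0 <= ui < len(cur_gray[0])):
        return None
    if cur_depth[vi][ui] <= 0:
        return None
    r_photo = cur_gray[vi][ui] - ref_gray[v][u]   # intensity difference
    r_depth = cur_depth[vi][ui] - p_cur[2]        # measured minus predicted depth
    return r_photo, r_depth


def alignment_cost(ref_gray, ref_depth, cur_gray, cur_depth, R, t, K, lam=1.0):
    """Sum of squared photometric residuals plus lam times squared depth
    residuals over all valid pixels. lam is a hypothetical weighting."""
    total = 0.0
    for v in range(len(ref_gray)):
        for u in range(len(ref_gray[0])):
            res = pixel_residuals(ref_gray, ref_depth, cur_gray, cur_depth,
                                  u, v, R, t, K)
            if res is not None:
                total += res[0] ** 2 + lam * res[1] ** 2
    return total
```

A tracker built on such a cost would search for the pose `(R, t)` that minimizes it, typically with an iterative solver such as Gauss-Newton over a coarse-to-fine image pyramid; the sketch only evaluates the cost for a given pose.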


RGB-D SLAM · Localization · Tracking · Mapping · Reconstruction · Real-time



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Northwestern Polytechnical University, Xi’an, China
  2. Information Engineering University, Zhengzhou, China
