Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking

Abstract

Applications for real-time visual tracking can be found in many areas, including visual odometry and augmented reality. Interest point detection and feature description form the basis of feature-based tracking, and a variety of algorithms for these tasks have been proposed. In this work, we present (1) a carefully designed dataset of video sequences of planar textures with ground truth, which includes various geometric changes, lighting conditions, and levels of motion blur, and which may serve as a testbed for a variety of tracking-related problems, and (2) a comprehensive quantitative evaluation of detector-descriptor-based visual camera tracking based on this testbed. We evaluate the impact of individual algorithm parameters, compare algorithms for both detection and description in isolation, as well as all detector-descriptor combinations as a tracking solution. In contrast to existing evaluations, which aim at different tasks such as object recognition and have limited validity for visual tracking, our evaluation is geared towards this application in all relevant factors (performance measures, testbed, candidate algorithms). To our knowledge, this is the first work that comprehensively compares these algorithms in this context, and in particular, on video streams.

Keywords

Interest point detectors, Feature descriptors, Visual tracking, Dataset, Evaluation

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Steffen Gauglitz (1)
  • Tobias Höllerer (1)
  • Matthew Turk (1)
  1. Dept. of Computer Science, University of California, Santa Barbara, Santa Barbara, USA
