
Generic 3D Representation via Pose Estimation and Matching

  • Amir R. Zamir
  • Tilman Wekel
  • Pulkit Agrawal
  • Colin Wei
  • Jitendra Malik
  • Silvio Savarese
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9907)

Abstract

Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross modality pose estimation).

In the context of the core supervised tasks, we demonstrate that our representation achieves state-of-the-art wide baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learnt features). We also show 6DOF camera pose estimation given a pair of local image patches. The accuracy of both supervised tasks is comparable to that of humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion on the learned representation and open research questions.
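The abstract describes joint supervision over two proxy tasks: 6DOF relative camera pose regression and wide-baseline patch matching. A minimal sketch of such a combined objective is below; the loss forms, the contrastive margin, and the weighting factor `lam` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pose_loss(pred_pose, gt_pose):
    """L2 regression loss on a 6DOF relative camera pose
    (3 translation + 3 rotation components), as a stand-in
    for the pose-estimation supervision."""
    return float(np.mean((np.asarray(pred_pose) - np.asarray(gt_pose)) ** 2))

def matching_loss(feat_a, feat_b, is_match, margin=1.0):
    """Contrastive loss on patch embeddings: pull matching pairs
    together, push non-matching pairs beyond a margin."""
    d = float(np.linalg.norm(np.asarray(feat_a) - np.asarray(feat_b)))
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2

def joint_loss(pred_pose, gt_pose, feat_a, feat_b, is_match, lam=0.5):
    """Multi-task objective: pose regression plus weighted matching term."""
    return pose_loss(pred_pose, gt_pose) + lam * matching_loss(feat_a, feat_b, is_match)
```

A multi-task network trained this way would minimize `joint_loss` over pairs of patches, sharing one encoder between both heads; it is that shared internal representation which the paper evaluates on novel 3D tasks.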

Keywords

Generic vision · Representation · Descriptor learning · Pose estimation · Wide-baseline matching · Street view

Notes

Acknowledgement

MURI (1186514-1-TBCJE), Nissan (1188371-1-UDARQ).

Supplementary material

Supplementary material 1 (PDF, 36.8 MB)


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Amir R. Zamir (1, email author)
  • Tilman Wekel (1)
  • Pulkit Agrawal (2)
  • Colin Wei (1)
  • Jitendra Malik (2)
  • Silvio Savarese (1)

  1. Stanford University, Stanford, USA
  2. University of California, Berkeley, USA
