International Journal of Computer Vision, Volume 80, Issue 2, pp 189–210

Modeling the World from Internet Photo Collections

Article

Abstract

There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like “Notre Dame” or “Trevi Fountain.” This approach, which we call Photo Tourism, has enabled reconstructions of numerous well-known world sites. This paper presents these algorithms and results as a first step towards 3D modeling of the world’s well-photographed sites, cities, and landscapes from Internet imagery, and discusses key open problems and challenges for the research community.

Keywords

Structure from motion · 3D scene analysis · Internet imagery · Photo browsers · 3D navigation

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Noah Snavely (1)
  • Steven M. Seitz (1)
  • Richard Szeliski (2)

  1. University of Washington, Seattle, USA
  2. Microsoft Research, Redmond, USA
