Abstract

There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like “Notre Dame” or “Trevi Fountain.” This approach, which we call Photo Tourism, has enabled reconstructions of numerous well-known world sites. This paper presents these algorithms and results as a first step towards 3D modeling of the world’s well-photographed sites, cities, and landscapes from Internet imagery, and discusses key open problems and challenges for the research community.

Author information

Corresponding author

Correspondence to Noah Snavely.

About this article

Cite this article

Snavely, N., Seitz, S.M. & Szeliski, R. Modeling the World from Internet Photo Collections. Int J Comput Vis 80, 189–210 (2008). https://doi.org/10.1007/s11263-007-0107-3

  • DOI: https://doi.org/10.1007/s11263-007-0107-3
