Modeling the World from Internet Photo Collections

Abstract

There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like “Notre Dame” or “Trevi Fountain.” This approach, which we call Photo Tourism, has enabled reconstructions of numerous well-known world sites. This paper presents these algorithms and results as a first step towards 3D modeling of the world’s well-photographed sites, cities, and landscapes from Internet imagery, and discusses key open problems and challenges for the research community.

Corresponding author

Correspondence to Noah Snavely.

About this article

Cite this article

Snavely, N., Seitz, S.M. & Szeliski, R. Modeling the World from Internet Photo Collections. Int J Comput Vis 80, 189–210 (2008). https://doi.org/10.1007/s11263-007-0107-3

Keywords

  • Structure from motion
  • 3D scene analysis
  • Internet imagery
  • Photo browsers
  • 3D navigation