Bringing 3D Models Together: Mining Video Liaisons in Crowdsourced Reconstructions

  • Ke Wang
  • Enrique Dunn
  • Mikel Rodriguez
  • Jan-Michael Frahm
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10114)


Recent advances in large-scale scene modeling have enabled the automatic 3D reconstruction of landmark sites from crowdsourced photo collections. Here, we address the challenge of leveraging crowdsourced video collections to identify connecting visual observations that enable the alignment, and subsequent aggregation, of disjoint 3D models. We denote these connecting image sequences as video liaisons and develop a data-driven framework for their fully unsupervised extraction and exploitation. To this end, we represent the content of each video as a histogram over the iconic imagery contained within existing 3D models obtained from a photo collection. We then use this representation to efficiently identify and prioritize the analysis of individual videos within a large-scale video collection, in an effort to determine camera motion trajectories connecting different landmarks. Results on crowdsourced data illustrate the efficiency and effectiveness of our proposed approach.
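The prioritization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_votes` threshold, the ranking key, and the assumption that each video frame has already been matched to at most one iconic image are all hypothetical choices made for illustration. Each video is summarized as a histogram over iconic images, and videos whose histograms support two or more distinct 3D models are ranked first as liaison candidates.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple


def iconic_histogram(frame_matches: Sequence[Optional[int]],
                     num_iconics: int) -> List[int]:
    """Count how many frames matched each iconic image (None = no match)."""
    hist = [0] * num_iconics
    for idx in frame_matches:
        if idx is not None:
            hist[idx] += 1
    return hist


def models_covered(hist: Sequence[int],
                   iconic_to_model: Sequence[str],
                   min_votes: int = 2) -> set:
    """Return the 3D models whose iconics collect at least min_votes frames."""
    votes = Counter()
    for iconic_idx, count in enumerate(hist):
        votes[iconic_to_model[iconic_idx]] += count
    return {model for model, v in votes.items() if v >= min_votes}


def rank_liaison_candidates(
    videos: Dict[str, Sequence[Optional[int]]],
    iconic_to_model: Sequence[str],
    min_votes: int = 2,
) -> List[Tuple[str, List[str], int]]:
    """Keep videos that touch >= 2 models; rank by models covered, then votes."""
    scored = []
    for name, matches in videos.items():
        hist = iconic_histogram(matches, len(iconic_to_model))
        covered = models_covered(hist, iconic_to_model, min_votes)
        if len(covered) >= 2:  # a liaison must connect at least two models
            scored.append((name, sorted(covered), sum(hist)))
    # Hypothetical priority: more models first, then more matched frames.
    scored.sort(key=lambda t: (-len(t[1]), -t[2], t[0]))
    return scored


# Toy data: four iconic images belonging to three models A, B, C.
iconic_to_model = ["A", "A", "B", "C"]
videos = {
    "v1": [0, 0, None, 2, 2],    # sees models A and B
    "v2": [0, 1, 1, None],       # sees only model A
    "v3": [0, 1, 2, 2, 3, 3],    # sees models A, B and C
}
ranking = rank_liaison_candidates(videos, iconic_to_model)
```

In this toy setting, `v2` is discarded because it only observes one model, while `v3` is ranked ahead of `v1` because it covers more models. A real system would obtain `frame_matches` from image retrieval and would still need geometric verification before aligning any models.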


Keywords: Video segment, Video summarization, Image cluster, Photo collection, Video dataset



Supported in part by NSF grants No. IIS-1349074 and No. CNS-1405847. Partially funded by MITRE Corp.

Supplementary material

Supplementary material 1 (pdf 597 KB)

Supplementary material 2 (mp4 9781 KB)



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ke Wang (1)
  • Enrique Dunn (2)
  • Mikel Rodriguez (3)
  • Jan-Michael Frahm (1)

  1. Department of Computer Science, University of North Carolina, Chapel Hill, USA
  2. Department of Computer Science, Stevens Institute of Technology, Hoboken, USA
  3. MITRE Corporation, McLean, USA
