Placepedia: Comprehensive Place Understanding with Multi-faceted Annotations

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)


Place is an important element in visual understanding. Given a photo of a building, people can often tell its functionality, e.g. a restaurant or a shop, its cultural style, e.g. Asian or European, as well as its economic type, e.g. industry oriented or tourism oriented. While place recognition has been widely studied in previous work, there remains a long way to go towards comprehensive place understanding, which goes far beyond categorizing a place from an image and requires information on multiple aspects. In this work, we contribute Placepedia\(^{1}\), a large-scale place dataset with more than 35M photos from 240K unique places. Besides the photos, each place also comes with massive multi-faceted information, e.g. GDP, population, etc., and labels at multiple levels, including function, city, country, etc. This dataset, with its large amount of data and rich annotations, allows various studies to be conducted. In particular, in our studies, we develop 1) PlaceNet, a unified framework for multi-level place recognition, and 2) a method for city embedding, which can produce a vector representation for a city that captures both visual and multi-faceted side information. Such studies not only reveal key challenges in place understanding, but also establish connections between visual observations and underlying socioeconomic/cultural implications. (\(^{1}\)The dataset is available at:



This work is partially supported by the SenseTime Collaborative Grant on Large-scale Multi-modality Analysis (CUHK Agreement No. TS1610626 & No. TS1712093) and the General Research Fund (GRF) of Hong Kong (No. 14203518 & No. 14205719). My gratitude also goes to Yuqi Zhang. As an equal contributor, she spent a great deal of time collecting and organizing data.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The Chinese University of Hong Kong, Hong Kong, China
