One net to rule them all: efficient recognition and retrieval of POI from geo-tagged photos

  • Pai Peng
  • Xiaoling Gu
  • Suguo Zhu
  • Lidan Shou
  • Gang Chen


In this work, we present DeepCamera, a novel framework that combines visual recognition and spatial recognition to identify places-of-interest (POIs) in smartphone photos. The framework exploits both the deep visual features and the geographic features of images. For visual recognition, we first design the HashNet model, which extends an ordinary convolutional neural network (ConvNet) by adding a “hash layer” after the last fully connected layer. We then compress multiple pre-trained deep HashNets into a single shallow hashing network, “SHNet”, which outputs semantic labels and compact hash codes simultaneously. This compression significantly reduces time and memory consumption during POI recognition. For spatial recognition, a new layer called the Spatial Layer is appended to a ConvNet to capture spatial information. Finally, plugging the Spatial Layer into SHNet lets both visual and spatial knowledge contribute to a hybrid probability distribution over all possible POI candidates. Notably, the proposed SHNet model can also be used for general visual recognition and retrieval. Experiments on real-world datasets and on classic datasets (MNIST and CIFAR-10) demonstrate the competitive accuracy and run-time performance of the proposed framework.
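The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the hash layer binarizes its outputs or how the Spatial Layer combines with the visual prediction, so everything below is an assumption made for illustration: sigmoid units thresholded at 0.5 to produce hash bits, Hamming distance for retrieval, a Gaussian distance falloff as the spatial score, and an element-wise product to form the hybrid distribution. All function names and parameters (`hash_layer`, `spatial_prior`, `fuse`, `sigma`) are hypothetical and not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hash_layer(features, weights, biases):
    """Assumed hash layer: one sigmoid unit per bit on top of FC features."""
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


def binarize(activations, threshold=0.5):
    """Assumed binarization: threshold each activation to get a hash bit."""
    return [1 if a >= threshold else 0 for a in activations]


def hamming(code_a, code_b):
    """Number of differing bits; smaller means more similar images."""
    return sum(a != b for a, b in zip(code_a, code_b))


def spatial_prior(photo_xy, poi_xy, sigma=100.0):
    """Assumed spatial score: Gaussian falloff with distance to each POI."""
    scores = []
    for px, py in poi_xy:
        d2 = (photo_xy[0] - px) ** 2 + (photo_xy[1] - py) ** 2
        scores.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(scores)
    if total == 0.0:  # every POI is too far away: fall back to uniform
        return [1.0 / len(poi_xy)] * len(poi_xy)
    return [s / total for s in scores]


def fuse(visual_probs, spatial_probs):
    """Assumed fusion: normalized element-wise product of both distributions."""
    joint = [v * s for v, s in zip(visual_probs, spatial_probs)]
    total = sum(joint)
    if total == 0.0:  # no overlap: keep the visual prediction
        return list(visual_probs)
    return [j / total for j in joint]


if __name__ == "__main__":
    features = [1.0, -2.0]                        # toy FC-layer output
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    code = binarize(hash_layer(features, weights, [0.0, 0.0, 0.0]))
    print("hash code:", code)
    print("distance to stored code:", hamming(code, [1, 1, 0]))

    visual = softmax([2.0, 1.0])                  # two candidate POIs
    spatial = spatial_prior((0.0, 0.0), [(0.0, 0.0), (300.0, 0.0)])
    print("hybrid distribution:", fuse(visual, spatial))
```

In this toy setup the visual scores mildly favor the first POI, and the spatial score sharpens that preference because the photo was taken at its location. A trained model would learn the layer weights rather than fix them by hand.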


Places-of-interest, Image recognition, Image retrieval, Deep hashing



The project was supported by the National Basic Research Program (973 Program, Grant No. 2015CB352400) and the National Science Foundation of China (Grant Nos. 61802100, 61672455, 61528207 and 61472348). The project was also supported by the Natural Science Foundation of Zhejiang Province of China (Grant No. LY18F020005).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Pai Peng (1)
  • Xiaoling Gu (2)
  • Suguo Zhu (2)
  • Lidan Shou (3)
  • Gang Chen (3)

  1. YoutuLab, Tencent Technology (Shanghai) Co., Shanghai, China
  2. Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
  3. Database Laboratory, College of Computer Science and Technology, Zhejiang University, Hangzhou, China
