## Abstract

The aim of this work is to localize a query photograph by finding other images depicting the same place in a large geotagged image database. This is a challenging task due to changes in viewpoint, imaging conditions and the large size of the image database. The contribution of this work is two-fold. First, we cast the place recognition problem as a classification task and use the available geotags to train a classifier for each location in the database in a similar manner to per-exemplar SVMs in object recognition. Second, as only one or a few positive training examples are available for each location, we propose two methods to calibrate all the per-location SVM classifiers without the need for additional positive training data. The first method relies on p-values from statistical hypothesis testing and uses only the available negative training data. The second method performs an affine calibration by appropriately normalizing the learnt classifier hyperplane and does not need any additional labelled training data. We test the proposed place recognition method with the bag-of-visual-words and Fisher vector image representations suitable for large scale indexing. Experiments are performed on three datasets: 25,000 and 55,000 geotagged street view images of Pittsburgh, and the 24/7 Tokyo benchmark containing 76,000 images with varying illumination conditions. The results show improved place recognition accuracy of the learnt image representation over direct matching of raw image descriptors.

## Notes

1. The notion most commonly used in statistics is in fact the p-value. The p-value associated with a score \(s\) is the quantity \(\alpha (s)\) defined by \(\alpha (s)=1-F_0(s)\); so the more significant the score is, the closer to 1 the cdf value is, and the closer to 0 the p-value is. To keep the presentation simple, we avoid the formulation in terms of p-values and only talk of the probabilistic calibrated values obtained from the cdf \(F_0\).

2. When the calibration by re-normalization method is used, \(\mathbf {w}_j\) contains the re-normalized weights and the bias \(b_j\) is zero. However, to cover both calibration methods, we include the bias term in the derivations in this section.
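The relation between the cdf \(F_0\) and the p-value in note 1 can be illustrated with a minimal Python sketch. It assumes that \(F_0\) is estimated by the empirical cdf of one classifier's scores on negative examples; the function names and the toy scores are hypothetical and only serve the illustration.

```python
from bisect import bisect_right


def make_calibrator(negative_scores):
    """Return a function s -> F_0(s), the empirical cdf of the negative scores.

    Assumption: F_0 is estimated as the fraction of negative scores
    that are less than or equal to s.
    """
    sorted_scores = sorted(negative_scores)
    n = len(sorted_scores)

    def calibrate(score):
        # Number of negative scores <= score, divided by the total count.
        return bisect_right(sorted_scores, score) / n

    return calibrate


def p_value(calibrate, score):
    # alpha(s) = 1 - F_0(s), as defined in note 1.
    return 1.0 - calibrate(score)


# Toy scores of one per-location classifier on negative examples.
negatives = [-1.8, -1.2, -1.1, -0.9, -0.7, -0.4, -0.3, 0.1]
calibrate = make_calibrator(negatives)

calibrated = calibrate(-0.5)      # 5 of the 8 negative scores are <= -0.5
alpha = p_value(calibrate, -0.5)  # 1 - 5/8
```

A score that exceeds every negative score gets the calibrated value 1 and the p-value 0, which matches the statement that more significant scores have cdf values closer to 1.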

## Acknowledgments

This work was supported by the MSR-INRIA laboratory, the EIT-ICT labs, Google, the ERC project LEAP and the EC project RVO13000 - Conceptual development of research organization. Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory, contract FA8650-12-C-7212. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL or the U.S. Government.

## Additional information

Communicated by Svetlana Lazebnik.

## Appendix

In Sect. 8 we show that the simple calibration by normalization often results in surprisingly good place recognition performance without the need for any additional positive or negative calibration data. In this appendix, we give a possible explanation of why this simple calibration works. We focus on the case of a single positive training example, i.e. when the training set is \(\mathcal P = \mathbf {x}^{+}\), which is the typical case for place recognition, where only one positive example is available for each place. The analysis also holds for the case of multiple expanded positive examples because, in our case, the positive examples come from the same database of Street View images and hence have very similar statistics (illumination, capturing conditions, the same camera, etc.).

In particular, we first analyze the SVM objective and show that the learnt hyperplane \(\mathbf {w}\) can be interpreted as a new descriptor \(\mathbf {x}^{*}\) that replaces the original positive example \(\mathbf {x}^{+}\) and is re-weighted to increase its separation from the negative data. Second, we show that when \(\mathbf {x}^{*}\) is normalized, i.e. \(\mathbf {x}^{*} = \frac{\mathbf {w}}{||\mathbf {w}||}\), the dot-product \(\mathbf {q}^T\mathbf {x}^*\) corresponds to measuring the cosine of the angle between the (normalized) query descriptor \(\mathbf {q}\) and the new descriptor \(\mathbf {x}^{*}\), which was found to work well in the literature for descriptor matching, as discussed in Sect. 5.2. The two steps are given next.

### Analysis of Per-Exemplar SVM Objective

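For a single positive example \(\mathcal P = \mathbf {x}^{+}\), the per-exemplar SVM objective (3) can be written as

\[
\Omega (\mathbf {w}, b) = ||\mathbf {w}||^{2} + C_1\, h\left( \mathbf {w}^T \mathbf {x}^+ + b\right) + C_2 \sum _{\mathbf {x} \in \mathcal N} h\left( -\mathbf {w}^T \mathbf {x} - b\right) . \tag{13}
\]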

In the following, we analyze the objective (13) and provide intuition why the *re-normalized* weight vector \(\mathbf {w}\) can be interpreted as a new descriptor. In particular, we first show that when the weight \(C_2\) of the negative data in objective (13) goes to zero, the learnt normalized \(\mathbf {\widetilde{w}}\) is identical to the original positive training data point \(\mathbf {x}^+\). Second, when \(C_2>0\), the learnt vector \(\mathbf {\widetilde{w}}\) moves away from the positive vector \(\mathbf {x}^+\) to increase its separation from the negative data. The two cases are detailed next.

*Case I*
\(C_2\rightarrow 0\). The goal is to show that when the weight \(C_2\) of the negative data in objective (13) goes to zero, the resulting hyperplane vector \(\mathbf {w}\) is parallel to the positive training descriptor \(\mathbf {x}^+\). When \(\mathbf {w}\) is normalized to have unit L2 norm, the two vectors are identical. First, let us decompose \(\mathbf {w}\) into parts parallel and orthogonal to the positive training data point \(\mathbf {x}^+\), i.e. \(\mathbf {w} = \mathbf {w}^{\perp } + \mathbf {w}^{||}\), where \((\mathbf {w}^{\perp })^T \mathbf {x}^+ = 0\). Next, we observe that when the weight of the negative data diminishes (\(C_2\rightarrow 0\)), any non-zero component \(\mathbf {w}^{\perp }\) increases the value of the objective. As a result, for \(C_2\rightarrow 0\) the objective is minimized by \(\mathbf {w}^{||}\), i.e. the optimal \(\mathbf {w}\) is parallel to \(\mathbf {x}^+\).
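The decomposition of \(\mathbf {w}\) used in this argument can be sketched in a few lines of Python. This is only an illustration under the assumption that the parallel part is obtained by orthogonal projection onto \(\mathbf {x}^+\); the helper names and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decompose(w, x_pos):
    """Split w into parts parallel and orthogonal to x_pos.

    The parallel part is the orthogonal projection of w onto x_pos;
    the orthogonal part is the remainder, so w = w_par + w_perp.
    """
    scale = dot(w, x_pos) / dot(x_pos, x_pos)
    w_par = [scale * a for a in x_pos]
    w_perp = [a - b for a, b in zip(w, w_par)]
    return w_par, w_perp


# Toy example: x_pos is the first coordinate axis.
w = [3.0, 1.0, 2.0]
x_pos = [1.0, 0.0, 0.0]
w_par, w_perp = decompose(w, x_pos)

# The orthogonal part does not contribute to the score on x_pos ...
score_full = dot(w, x_pos)
score_par = dot(w_par, x_pos)

# ... but it does add to the squared norm of w.
norm_sq_full = dot(w, w)
norm_sq_par = dot(w_par, w_par)
```

In this toy example the score on \(\mathbf {x}^+\) is unchanged when \(\mathbf {w}^{\perp }\) is dropped, while the squared norm shrinks from 14 to 9, which is the effect the argument relies on.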

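In detail, for \(\mathbf {w} = \mathbf {w}^{\perp }+\mathbf {w}^{||}\), the objective (13) can be written as

\[
\Omega (\mathbf {w}, b) = ||\mathbf {w}^{\perp }+\mathbf {w}^{||}||^{2} + C_1\, h\left( (\mathbf {w}^{\perp }+\mathbf {w}^{||})^T \mathbf {x}^+ + b\right) + C_2 \sum _{\mathbf {x} \in \mathcal N} h\left( -(\mathbf {w}^{\perp }+\mathbf {w}^{||})^T \mathbf {x} - b\right) . \tag{14}
\]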

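Note that the orthogonal part \(\mathbf {w}^{\perp }\) does not change the value of the second term in (14) because \((\mathbf {w}^{\perp }+\mathbf {w}^{||})^T \mathbf {x}^+ = (\mathbf {w}^{||})^T \mathbf {x}^+\), and hence (14) reduces to

\[
\Omega (\mathbf {w}, b) = ||\mathbf {w}^{\perp }+\mathbf {w}^{||}||^{2} + C_1\, h\left( (\mathbf {w}^{||})^T \mathbf {x}^+ + b\right) + C_2 \sum _{\mathbf {x} \in \mathcal N} h\left( -(\mathbf {w}^{\perp }+\mathbf {w}^{||})^T \mathbf {x} - b\right) . \tag{15}
\]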

In the limit \(C_2 \rightarrow 0\), any non-zero component \(\mathbf {w}^{\perp }\) increases the value of the objective (15). To see this, note that the third term vanishes as \(C_2 \rightarrow 0\), so the objective is dominated by the first two terms. Further, the second term in (15) is independent of \(\mathbf {w}^{\perp }\). Finally, the first term increases for any non-zero value of \(\mathbf {w}^{\perp }\), as \(||\mathbf {w}^{\perp }+\mathbf {w}^{||}||^{2} > ||\mathbf {w}^{||}||^{2}\) for any \(\mathbf {w}^{\perp }\ne 0\).
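The bound on the first term follows from the orthogonality of the two components. Spelled out, using only that \(\mathbf {w}^{||}\) is a multiple of \(\mathbf {x}^+\) and that \((\mathbf {w}^{\perp })^T \mathbf {x}^+ = 0\), so that \((\mathbf {w}^{\perp })^T \mathbf {w}^{||} = 0\):

\[
||\mathbf {w}^{\perp }+\mathbf {w}^{||}||^{2} = ||\mathbf {w}^{\perp }||^{2} + 2\,(\mathbf {w}^{\perp })^T \mathbf {w}^{||} + ||\mathbf {w}^{||}||^{2} = ||\mathbf {w}^{\perp }||^{2} + ||\mathbf {w}^{||}||^{2} > ||\mathbf {w}^{||}||^{2} \quad \text {for } \mathbf {w}^{\perp } \ne 0.
\]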

As a result, in the limit \(C_2 \rightarrow 0\) the optimal \(\mathbf {w}\) is parallel to \(\mathbf {x}^+\). Note also that when \(C_2\) is exactly equal to zero, the optimal \(\mathbf {w}\) vanishes, i.e. the objective (15) is minimized by the trivial solution \(||\mathbf {w}|| = 0\) and \(b = -1\). The effect of decreasing the parameter \(C_2\) is illustrated in Fig. 10.

*Case II*
\(C_2>0\). When the weight \(C_2\) of the negative data in the objective (15) increases, the direction of the optimal \(\mathbf {w}\) will differ from that of \(\mathbf {w}^{||}\) and will change to take into account the loss on the negative data points. Writing out the hinge loss \(h(x) = \max (1-x,0)\) in the last term of (15), we see that \(\mathbf {w}\) will move in the direction that reduces \(\sum _{\mathbf {x} \in \mathcal N}\max \left( 1+\mathbf {w}^T \mathbf {x} + b ,0 \right) \), i.e. that reduces the dot product \(\mathbf {w}^T \mathbf {x}\) on the negative examples that are active (support vectors).
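The effect described in Case II can be illustrated with a single subgradient step on the negative-data term alone. This is a hedged sketch, not the solver used in the experiments: the step size, the toy descriptors and the function names are hypothetical, and the other two terms of the objective are deliberately left out to isolate the pull of the negatives.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def negative_hinge(w, b, negatives):
    """Sum over the negatives of max(1 + w^T x + b, 0)."""
    return sum(max(1.0 + dot(w, x) + b, 0.0) for x in negatives)


def step_on_negatives(w, b, negatives, c2, lr):
    """One subgradient step on c2 * negative_hinge with respect to w only.

    Only active negatives (1 + w^T x + b > 0) contribute; each of them
    contributes its own descriptor x to the subgradient.
    """
    active = [x for x in negatives if 1.0 + dot(w, x) + b > 0.0]
    grad = [c2 * sum(x[i] for x in active) for i in range(len(w))]
    w_new = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w_new, active


# Toy setting: w starts at the positive example; negatives are unit vectors.
w = [1.0, 0.0]
b = -1.5
negatives = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

w_new, active = step_on_negatives(w, b, negatives, c2=1.0, lr=0.1)

loss_before = negative_hinge(w, b, negatives)
loss_after = negative_hinge(w_new, b, negatives)
```

In this toy setting two of the three negatives are active. After the step, \(\mathbf {w}^T \mathbf {x}\) is smaller on both of them and the hinge loss on the negatives decreases, while \(\mathbf {w}\) is no longer parallel to its starting direction.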

### The Need for Normalization of \(\mathbf{w}\)

Above we have shown that the learnt hyperplane \(\mathbf {w}\) moves away from the positive example \(\mathbf {x}^+\) in a manner that reduces the loss on the negative data. The aim is to use this learnt vector \(\mathbf {w}\) as a new descriptor \(\mathbf {x}^*\) replacing the original positive example \(\mathbf {x}^+\). However, we wish to measure the cosine of the angle between the new descriptor \(\mathbf {x}^*\) and the query image descriptor \(\mathbf {q}\). This is equivalent to the normalized dot product, hence the vector \(\mathbf {w}\) needs to be normalized.
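A small Python sketch shows why the normalization matters. It assumes that the query descriptor has unit L2 norm and that the bias is zero, as stated for the re-normalization method; the two weight vectors are hypothetical and differ only in scale.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def normalized_score(q, w):
    """q^T w / ||w||, the cosine of the angle between q and w when ||q|| = 1."""
    return dot(q, l2_normalize(w))


# Unit-norm query at 45 degrees to the first axis.
q = l2_normalize([1.0, 1.0])

# Two hypothetical learnt hyperplanes with the same direction but different norms.
w_small = [0.5, 0.0]
w_large = [50.0, 0.0]

raw_small = dot(q, w_small)
raw_large = dot(q, w_large)

cos_small = normalized_score(q, w_small)
cos_large = normalized_score(q, w_large)
```

The raw scores differ by a factor of 100 although both hyperplanes point in the same direction, whereas the normalized scores both equal the cosine of 45 degrees and can therefore be compared directly.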

## About this article

### Cite this article

Gronát, P., Sivic, J., Obozinski, G. *et al.* Learning and Calibrating Per-Location Classifiers for Visual Place Recognition. *Int J Comput Vis* **118**, 319–336 (2016). https://doi.org/10.1007/s11263-015-0878-x

### Keywords

- Place recognition
- Exemplar SVM
- Geo-localization
- Classifier calibration