Automatic POI Matching Using an Outlier Detection Based Approach

  • Alexandre Almeida
  • Ana Alves
  • Rui Gomes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11191)


Points of Interest (POIs) are widely used in many applications, mainly because of the increasing amount of related data available online, notably from volunteered geographic information (VGI) sources. Connecting these data across sources is useful for validating and correcting records and for removing duplicates from a database. However, there is no standard way to identify the same POI across different sources, and doing it manually can be very expensive. Automatic POI matching has therefore been an attractive research topic. In this work, we propose a novel data-driven machine learning approach, based on an outlier detection algorithm, to match POIs automatically. Surprisingly, the works presented so far do not use data-driven machine learning approaches. One possible reason is that such approaches need a training dataset, which must be built by manually matching some POIs. To mitigate this, we took advantage of the Crosswalk API, available at the time we started our project, which allowed us to retrieve already matched POI data from different sources within US territory. We trained and tested our model on a dataset containing Factual, Facebook and Foursquare POIs from New York City and then successfully applied it to another dataset of Facebook and Foursquare POIs from Porto, Portugal, finding matches with an accuracy of around 95%. These encouraging results confirm our approach as an effective way to address the problem of automatically matching POIs. They also show that such a model can be trained with data available from multiple sources and applied to datasets from locations other than those used in training. Furthermore, as a data-driven machine learning approach, the model can be continuously improved by adding new validated data to its training dataset.
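
To make the idea of matching by outlier detection more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate pair is described by two hypothetical features (a name similarity from Python's difflib and a haversine distance in metres) and swaps in a toy per-feature z-score detector for whatever outlier detection algorithm the paper actually uses. The detector is fitted only on pairs already known to match, and a candidate pair is accepted when it does not look like an outlier relative to them. All POI names, coordinates, the `poi` helper and the threshold of 3.0 are illustrative assumptions.

```python
import math
from difflib import SequenceMatcher
from statistics import mean, stdev

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def poi(name, lat, lon):
    """Hypothetical minimal POI record (not a schema from the paper)."""
    return {"name": name, "lat": lat, "lon": lon}


def pair_features(a, b):
    """Assumed feature vector for a candidate pair: (name similarity, distance in m)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    return (name_sim, dist)


class ZScoreOutlierDetector:
    """Toy one-class model: learn the spread of known matches, flag pairs far from it."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed cut-off, in standard deviations

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        # Guard against a zero standard deviation in a constant feature.
        self.stds = [max(stdev(c), 1e-9) for c in cols]
        return self

    def score(self, row):
        """Largest absolute z-score over all features."""
        return max(abs(x - m) / s for x, m, s in zip(row, self.means, self.stds))

    def is_match(self, row):
        """A pair is treated as a match when it is not an outlier."""
        return self.score(row) <= self.threshold


# Made-up pairs standing in for already matched training data.
matched_pairs = [
    (poi("Joe's Pizza", 40.7306, -74.0021), poi("Joes Pizza", 40.7307, -74.0021)),
    (poi("Central Park Zoo", 40.7678, -73.9718), poi("Central Park Zoo", 40.7681, -73.9718)),
    (poi("Blue Note Jazz Club", 40.7309, -74.0007), poi("Blue Note", 40.73095, -74.0007)),
    (poi("Strand Bookstore", 40.7333, -73.9910), poi("The Strand Book Store", 40.7335, -73.9910)),
]
detector = ZScoreOutlierDetector().fit([pair_features(a, b) for a, b in matched_pairs])

# Candidate pairs from a different city than the training data.
same = pair_features(poi("Cafe Majestic", 41.1470, -8.6065),
                     poi("Cafe Majestic", 41.14715, -8.6065))
different = pair_features(poi("Cafe Majestic", 41.1470, -8.6065),
                          poi("Livraria Lello", 41.1970, -8.6065))
print(detector.is_match(same), detector.is_match(different))
```

Training on matches only mirrors the one-class setting described in the abstract: no manually labelled non-matches are needed, and newly validated pairs can simply be appended to `matched_pairs` before refitting. A realistic system would likely use more features and a stronger detector than this z-score stand-in.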


Machine learning, Outlier detection, Point-Of-Interest, GIS



The authors would like to thank the URBY.SENSE project (POCI-01-0145-FEDER-016848) for its funding. URBY.SENSE is co-financed by COMPETE 2020, Portugal 2020 - Programa Operacional Competitividade e Internacionalização (POCI), Fundo Europeu de Desenvolvimento Regional (FEDER) and Fundação para a Ciência e a Tecnologia (FCT).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Universidade de Coimbra, CISUC, Coimbra, Portugal
  2. Instituto Politécnico de Coimbra, ISEC, Coimbra, Portugal
