Using machine learning methods for disambiguating place references in textual documents

GeoJournal

Abstract

This paper presents a machine learning method for disambiguating place references in text. Solving this task can have important applications in the digital humanities and computational social sciences, because it supports the geospatial analysis of large document collections. The method draws candidate disambiguations from a knowledge base containing geospatial coordinates and textual descriptions for places all around the world. It then combines multiple features that capture the similarity between the candidate disambiguations, the place references, and the context in which the place references occur, and uses these features to rank the candidates and choose among them. The method was evaluated on English corpora used in previous work in this area, and also on a subset of the English Wikipedia. The experimental results show that the method is effective, and that out-of-the-box learning algorithms with relatively simple features can achieve high accuracy on this task.
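
To make the ranking idea concrete, the following is a minimal Python sketch of scoring candidate disambiguations with two simple similarity features. It is an illustration under stated assumptions, not the method evaluated in the paper: the Candidate class, the two Jaccard-based features, the fixed weights, and the example entries are all hypothetical, and the fixed weights stand in for what a learned ranking model would fit from annotated data.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical knowledge-base entry: name, description, coordinates."""
    name: str
    description: str
    lat: float
    lon: float


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def char_bigrams(text: str) -> set:
    """Set of lowercase character bigrams in a string."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def tokens(text: str) -> set:
    """Set of lowercase alphabetic word tokens in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def features(reference: str, context: str, cand: Candidate) -> list:
    """Two assumed features: name similarity and context similarity."""
    return [
        jaccard(char_bigrams(reference), char_bigrams(cand.name)),
        jaccard(tokens(context), tokens(cand.description)),
    ]


# Illustrative weights only (assumption); a learned ranker would replace them.
WEIGHTS = [0.5, 0.5]


def rank(reference: str, context: str, candidates: list) -> list:
    """Sort candidates by weighted feature score, best first."""
    def score(cand: Candidate) -> float:
        feats = features(reference, context, cand)
        return sum(w * f for w, f in zip(WEIGHTS, feats))
    return sorted(candidates, key=score, reverse=True)


# Made-up knowledge-base entries with approximate coordinates.
candidates = [
    Candidate("Paris", "Capital city of France on the river Seine",
              48.8566, 2.3522),
    Candidate("Paris", "City in Lamar County, Texas, United States",
              33.6609, -95.5555),
]
ranked = rank("Paris", "The rodeo in Paris drew crowds from across Texas",
              candidates)
print(ranked[0].description)
```

In this toy case both candidates match the reference name equally well, so the overlap between the surrounding context and each candidate's description decides the order. A real system would use many more features and would learn how to combine them rather than fixing the weights by hand.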

Notes

  1. http://nlp.stanford.edu/software/CRF-NER.html

  2. http://dumps.wikimedia.org/index.html

  3. http://dbpedia.org/index.html

  4. http://lucene.apache.org/index.html

  5. http://people.cs.umass.edu/~vdang/ranklib.html

Author information

Corresponding author

Correspondence to João Santos.

About this article

Cite this article

Santos, J., Anastácio, I. & Martins, B. Using machine learning methods for disambiguating place references in textual documents. GeoJournal 80, 375–392 (2015). https://doi.org/10.1007/s10708-014-9553-y

  • DOI: https://doi.org/10.1007/s10708-014-9553-y
