Things and Strings: Improving Place Name Disambiguation from Short Texts by Combining Entity Co-Occurrence with Topic Modeling
Place name disambiguation is the task of correctly identifying a place from a set of places sharing a common name. It contributes to tasks such as knowledge extraction, query answering, geographic information retrieval, and automatic tagging. Disambiguation quality relies on the ability to correctly identify and interpret contextual clues, complicating the task for short texts. Here we propose a novel approach to the disambiguation of place names from short texts that integrates two models: entity co-occurrence and topic modeling. The first model uses Linked Data to identify related entities to improve disambiguation quality. The second model uses topic modeling to differentiate places based on the terms used to describe them. We evaluate our approach using a corpus of short texts, determine the suitable weight between models, and demonstrate that a combined model outperforms benchmark systems such as DBpedia Spotlight and Open Calais in terms of F1-score and Mean Reciprocal Rank.
Keywords: Place name disambiguation · Natural language processing · LDA · Wikipedia · DBpedia · Linked Data
Similar to other named entities, including persons, organizations, and events, place names can be ambiguous: a single place name can be shared among multiple places. To give a concrete example, Washington is a place name for more than 43 populated places in the United States alone. Although most of these Washingtons can be accurately located by adding the proper state or county name, they are all simply referred to as Washington in daily conversations, (social) media, photo annotations, and so forth. Figure 1 depicts the distribution of the most common place names for U.S. cities, towns, villages, boroughs, and census-designated places. As shown on the map, these places are distributed across the U.S., indicating that the ambiguity of place names is a widespread phenomenon. It is worth noting that places which share a common name can be of the same or of different types, e.g., the state of Washington and the city of Washington, Pennsylvania. The situation is even more difficult on a global scale, where a place name may appear more than 100 times. For example, it takes merely a 45-minute car ride to get from Berlin to East London, both located in South Africa. Thus, it is important to devise effective computational approaches to address the disambiguation problem.
Given the wide availability of digital gazetteers, i.e., place name dictionaries, such as GeoNames, the Getty Thesaurus of Geographic Names, the Alexandria Digital Library Gazetteer, and Google Places, we assume that the places to be disambiguated are known, i.e., that there is a candidate list of places for any given place name. After all, unknown places cannot be disambiguated. Thus, we define the task of place name disambiguation as follows: given a short text which contains a place name, and given a list of candidate places that share this name, determine to which specific place the text refers.
Humans are very good at detecting and interpreting contextual clues in texts to disambiguate place names. Thus, as an extension of named entity recognition, place name disambiguation has been tackled using computational approaches that aim at utilizing these contextual clues as well [5, 7]. This context typically stems from the terms surrounding the place name under consideration. Short texts from social media, news headlines (and abstracts), captions, and so forth, however, offer fewer contextual clues and thus negatively impact disambiguation quality. Consequently, new approaches have to be developed that can extract and interpret other contextual clues.
One such approach is to focus on the detection of surrounding entities and use these as contextual clues. Besides the place itself, these entities may include other places, actors, objects, organizations, and events. Examples of such associated entities are landmarks, sports teams, well-known figures such as politicians or celebrities, and nearby places that share a common administrative unit. Intuitively, when a text mentions Washington along with Redskins, an American football team based in Washington, D.C., it is very likely that the Washington in the text refers to Washington, D.C., rather than another place sharing the same toponym. It has been shown that such a co-occurrence model increases disambiguation quality [11, 18].
In addition to entities, implicit thematic information buried in the text can also provide contextual evidence to disambiguate place names. Similar to entities, particular thematic topics are more likely to be mentioned along with a place that is characterized by them. Topic modeling makes it possible to discover topics in text and to match texts with similar topics. Thus, given topics learned from a corpus of texts about candidate places, and the topics discovered in the short text under consideration, computing a similarity score between the topics representative of the text and of each candidate place can provide additional contextual clues. For example, when people talk about Washington, DC, political topics featuring terms such as conservative, policy, and liberal are more likely to be mentioned than when they talk about the (small) city of Washington, Pennsylvania.
Our contributions are as follows:
- We apply topic modeling to place name disambiguation, an approach that has not been taken before.
- We integrate this topic-based model with a reworked version of our previous entity-based co-occurrence model [11] and learn the appropriate weights for the integrated model.
- We compare the integrated model against three well-known systems (TextRazor, DBpedia Spotlight, and Open Calais) as baselines and demonstrate that our model outperforms all of them.
2 Related Work
As an extension of named entity disambiguation, place name disambiguation can be conducted using the general approaches from that field. Wikipedia, as a valuable source of ground truth descriptions of named entities, has been used in a number of studies. For example, Bunescu and Pasca [5] trained a vector space model to host the contextual and categorical terms derived from Wikipedia, and employed TF-IDF to determine the importance of these terms. Milne and Witten [17] describe a method for augmenting unstructured text with links to Wikipedia articles. For ambiguous links, the authors proposed a machine learning approach and trained several models based on Wikipedia data. Two named entity disambiguation modules were introduced by Mihalcea and Csomai [16]: one measured the overlaps between context and candidate descriptions, and the other trained a supervised learning model based on manually assigned links in Wikipedia articles.
Among studies specifically focusing on place name disambiguation, Jones and Purves discussed using related places to resolve place ambiguity. Machado et al. proposed an ontological gazetteer which records the semantic relations between places to help disambiguate place names based on related places and alternative place names. In a similar approach, Spitz et al. [22] constructed a network of place relatedness based on English Wikipedia articles. Zhang and Gelernter [24] proposed a supervised machine learning approach to rank candidate places for ambiguous toponyms in Twitter messages that relies on the metadata of tweets and, to a limited extent, on context. In previous work, we leveraged the structured Linked Data in DBpedia for place name disambiguation and demonstrated that a combination of Wikipedia and DBpedia data leads to generally better performance [11].
The work at hand differs from these previous studies. We apply topic modeling for place name disambiguation and integrate the trained topic model with an entity-based model which captures the co-occurrence relations. Thereby we combine a things-based perspective with a strings-based perspective.
3 Methods

In the following, we assume that the surface forms of place names have been extracted prior to disambiguation, so the primary task of place name disambiguation is to identify the place to which a surface form refers. To accomplish this, a list of candidate entities, i.e., places, is selected. In prior work, knowledge bases such as Wikipedia, DBpedia, and WordNet have been used to obtain candidate entities [6, 10, 15]; here we employ DBpedia as the source of candidate entities. Once a set of candidate places has been identified, the likelihood that the surface form refers to each entity is measured, and the disambiguation result is returned if the computed score exceeds a given threshold.
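As a minimal sketch of this candidate-selection step, consider mapping a surface form to the places that share that name. The hard-coded mini-gazetteer below is an illustrative stand-in for the DBpedia lookup used in our approach:

```python
# Minimal sketch of candidate selection: map an ambiguous surface form
# to the candidate places sharing that name. The mini-gazetteer is a
# hard-coded stand-in for a real DBpedia query.
from typing import Dict, List

GAZETTEER: Dict[str, List[str]] = {
    "Washington": [
        "dbr:Washington,_D.C.",
        "dbr:Washington_(state)",
        "dbr:Washington,_Pennsylvania",
        "dbr:Washington,_Louisiana",
    ],
}

def candidate_places(surface_form: str) -> List[str]:
    """Return the candidate place entities for an ambiguous surface form."""
    return GAZETTEER.get(surface_form, [])

print(candidate_places("Washington"))
```

Each candidate returned here would then be scored by the models described below.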
3.1 Entity-Based Co-Occurrence Model
In this section we describe the entity-based co-occurrence model. Wikipedia and DBpedia are used as the sources to train it. We define the entities from Wikipedia as those words or phrases on a Wikipedia page of a candidate place which link to another page about these entities. The entities from DBpedia are either subjects or objects of those RDF triples which contain the candidate place entities. Not all RDF triples are selected, but only those that fall under the DBpedia namespaces, i.e., with prefixes dbp and dbo. While dbo provides a cleaner and better structured mapping-based dataset, it does not provide complete coverage of the original properties and types from the Wikipedia infoboxes. In order to avoid data bias we use both dbo and dbp. Literals were excluded. We treat the subject or object of a triple as a whole, i.e., as an individual entity, instead of further tokenizing it into terms. The harvested entities differ greatly. They include related places (of different types), time zone information, known figures who were born or died at the given place, events that took place there, companies, organizations, sports teams, as well as representative landmarks such as buildings or other physical objects.
Table 1 shows some sample entities for Washington, Louisiana, derived from Wikipedia and DBpedia. It should be noted that there is considerable overlap between the place data extracted from Wikipedia and DBpedia. Moreover, some properties, such as population density in Wikipedia, occur for most or even all candidate places. Such entities, which appear frequently but contribute little to uniquely identifying a place, will not play a crucial role in disambiguating place names.
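The intuition that frequently shared entities should matter less can be sketched with an IDF-style weighting over entity overlap. The entity sets and the weighting scheme below are illustrative assumptions, not the harvested data or the exact scoring function of our model:

```python
import math
from typing import Dict, Set

# Sketch of entity co-occurrence scoring: each candidate place is scored
# by the entities it shares with the text, and entities that occur for
# many candidates (e.g. "population density") are down-weighted with an
# IDF-style weight. Entity sets are illustrative, not harvested data.
CANDIDATE_ENTITIES: Dict[str, Set[str]] = {
    "Washington, D.C.": {"Redskins", "United States", "population density"},
    "Washington, Louisiana": {"St. Landry Parish", "Opelousas", "population density"},
}

def cooccurrence_scores(text_entities: Set[str]) -> Dict[str, float]:
    n = len(CANDIDATE_ENTITIES)
    all_entities = set().union(*CANDIDATE_ENTITIES.values())
    # document frequency of each entity across the candidate places
    df = {e: sum(e in ents for ents in CANDIDATE_ENTITIES.values())
          for e in all_entities}
    return {place: sum(math.log(1 + n / df[e]) for e in ents & text_entities)
            for place, ents in CANDIDATE_ENTITIES.items()}

print(cooccurrence_scores({"Redskins", "population density"}))
```

A text mentioning Redskins ranks Washington, D.C. first, while the shared "population density" entity contributes equally (and little) to both candidates.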
Table 1. Sample entities for Washington, LA, from Wikipedia and DBpedia

Wikipedia — St. Landry Parish; Opelousas; Eunice; population density; median household income; American Civil War; Connecticut; cattle; cow; corn...

DBpedia — United States; Central Time Zone; St. Landry Parish, Louisiana; John M. Parker; KNEX-FM; Louisiana Highway 10...
3.2 Topic-Based Model
In this section we introduce the topic-based model. It makes use of the fact that text is geo-indicative [1] even without containing any direct geographic references. Hence, even everyday language should be able to provide additional evidence for place name disambiguation. For example, terms such as humid, hot, festival, poverty, and even American Civil War are more likely to be uttered when referring to Washington, Louisiana than to Washington, Maine. The latter rarely experiences hot and humid weather, does not host a popular festival, has substantially fewer poverty problems than its namesake, and did not play a notable role in the civil war. Here we use Latent Dirichlet Allocation (LDA) for topic modeling. LDA is a popular unsupervised machine learning algorithm used to discover topics in a large document collection [23]. Each document is modeled as a probability vector over a set of topics, providing a dimensionally reduced representation of the documents in the corpus.
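Once topic distributions are available, matching a text against candidate places reduces to comparing probability vectors. The sketch below uses cosine similarity over toy three-topic vectors as placeholders for the 512-topic LDA output described later; the actual similarity measure and distributions of our model may differ:

```python
import math
from typing import Dict, List

# Sketch of the topic-based score: cosine similarity between the topic
# distribution inferred for a short text and the distribution associated
# with each candidate place. The three-topic vectors are toy placeholders
# for real LDA output.
PLACE_TOPICS: Dict[str, List[float]] = {
    "Washington, D.C.":      [0.70, 0.20, 0.10],  # politics-heavy
    "Washington, Louisiana": [0.10, 0.60, 0.30],  # rural/agricultural
}

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def topic_scores(text_topics: List[float]) -> Dict[str, float]:
    return {place: cosine(text_topics, t) for place, t in PLACE_TOPICS.items()}

# A text dominated by the first ("politics") topic ranks D.C. first.
print(topic_scores([0.8, 0.1, 0.1]))
```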
We use the geo-referenced text from the English Wikipedia as the source material for discovering these thematic patterns. We start with the idea that a collection of texts describing various features in a local region, such as museums, parks, mountains, and architectural landmarks, gives us a foundation for differentiating places referenced in other texts based on thematic, non-geographically specific terms. For this we need a systematic way to associate the training documents in Wikipedia with well-defined regions. Because administrative regions vary widely in area, they do not provide a good mechanism for aggregation. Instead, our solution is to aggregate the geo-referenced texts in Wikipedia based on an equal-area grid over the Earth. This means that articles with point-based geo-references are binned together if they spatially intersect with a grid cell, while text related to areal features (such as national parks) can be associated with multiple grid cells.
Once we have identified all articles whose geo-references spatially intersect with a grid cell, we can combine all their text to create a grid document. For the English Wikipedia, the geo-referenced articles intersect with 63,473 grid cells at Fuller level 7. The resulting 63,473 grid documents serve as the training data for LDA topic modeling. We utilized the MALLET implementation of LDA with hyperparameter optimization, which allows topics to vary in importance across the corpus, and trained the topic model with 512 topics.
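The binning step can be sketched as follows. For simplicity, a fixed-degree latitude/longitude grid stands in for the equal-area Fuller (icosahedral) grid used above, and the articles, coordinates, and cell size are illustrative; point-referenced articles fall into exactly one cell:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CELL_DEG = 1.0  # hypothetical cell size in degrees

def cell_of(lat: float, lon: float) -> Tuple[int, int]:
    """Map a point geo-reference to its grid cell (floor division bins)."""
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

def grid_documents(articles: List[Tuple[float, float, str]]) -> Dict[Tuple[int, int], str]:
    """Concatenate the text of all articles falling in the same cell."""
    docs: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for lat, lon, text in articles:
        docs[cell_of(lat, lon)].append(text)
    return {cell: " ".join(texts) for cell, texts in docs.items()}

articles = [
    (30.61, -92.06, "Washington, Louisiana article text ..."),
    (30.53, -92.08, "Opelousas article text ..."),  # same cell as above
    (38.90, -77.04, "Washington, D.C. article text ..."),
]
docs = grid_documents(articles)
print(len(docs))  # the two Louisiana articles share one grid document
```

The resulting grid documents would then be fed to the LDA trainer in place of individual articles.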
3.3 Integrated Model (ETM)
The first model makes use of the co-occurrence of entities as a contextual clue to disambiguate place names, while the second model puts emphasis on linguistic aspects, namely co-occurring topics. As argued in the introduction, applying a single model, which extracts only partial contextual clues, is often not sufficient to differentiate place names in short texts. Thus, we combine the entity-based model and the string-based topic model into an integrated approach called ETM (Entity & Topic Model).
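Concretely, ETM scores each candidate place as a convex combination of the two model scores; the notation below is ours, with \(\lambda\) denoting the weight learned for the entity-based model (the evaluation reports \(\lambda = 0.48\) as the best-performing value):

```latex
% ETM score of a candidate place p as a convex combination of the
% entity-based score s_ent and the topic-based score s_top:
s_{\mathrm{ETM}}(p) \;=\; \lambda \, s_{\mathrm{ent}}(p) \;+\; (1 - \lambda) \, s_{\mathrm{top}}(p),
\qquad \lambda \in [0, 1]
```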
4 Experiments

In this section we evaluate the performance of the proposed ETM and describe how we gathered the test corpus and which metrics we employed for the evaluation.
4.1 Preparing the Test Corpus
Table 2. Three example records of the test corpus extracted from websites
Oxford, Wisconsin — Located in Marquette County in south-central ~~Wisconsin~~, just minutes west of Interstate 39, Oxford invites you to experience our small town charm along with the area’s many year-round outdoor attractions.

Jackson, Montana — The tiny town of Jackson, ~~Montana~~ has made a name for itself as a winter sports destination for the adventurous.

Dayton, Nevada — Since the Native-American tribes in the area were nomadic, this made Dayton the first and oldest permanent non-native settlement in ~~Nevada~~.
To construct the corpus, we first derive ambiguous place names from a list of the most common U.S. place names on Wikipedia. As the list also presents the full place names, which can be used to identify the place of interest, we feed the full place names into the Bing Search API, which returns a list of websites related to each place along with their URLs. URLs containing “Wikipedia” are filtered out. We then visit the selected websites and extract sentences which contain the full place name. The auxiliary part of the full place name (the state or county name) is removed, so the remaining place name is ambiguous. The result of this approach is a set of real-world, i.e., not synthetic, sentences containing ambiguous place names. These sentences comprise our ground truth data.
Sample ground truth sentences are shown in Table 2. The full place name and test sentence are separated by an em-dash, and the auxiliary part of the full place name is removed (shown as stricken text for illustration). The resulting data contains noise: some sentences contain no meaningful entities or terms that can be categorized into topics, while others appear to be automatically generated from templates. This noise, however, helps evaluate the robustness of our models. In total, the test corpus consists of 5,500 sentences. The average length of a test sentence is 22.54 words, with a median of 19. Note that stop words count towards these statistics, while the auxiliary parts of place names do not.
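The step of removing the auxiliary part of a full place name can be sketched as follows; `make_record` is a hypothetical helper that only handles the simple "City, State" case where the full name appears verbatim in the sentence:

```python
# Sketch of turning a full place name plus a harvested sentence into a
# test record: strip the auxiliary part (state or county name) so the
# remaining mention is ambiguous, while keeping the full name as gold.
def make_record(full_name: str, sentence: str) -> dict:
    city, _, _state = full_name.partition(", ")
    ambiguous = sentence.replace(full_name, city)
    return {"gold": full_name, "text": ambiguous}

rec = make_record("Oxford, Wisconsin",
                  "Oxford, Wisconsin invites you to experience our small town charm.")
print(rec["text"])  # "Oxford invites you to experience our small town charm."
```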
In this section, we present the results of our evaluation and compare our approach to well-recognized named entity disambiguation systems as baselines.
DBpedia Spotlight, TextRazor, and Open Calais were selected as baseline systems to be compared to ETM. DBpedia Spotlight is based on DBpedia’s rich knowledge base of structured data [15], which is also employed by our proposed model. Two endpoints of the DBpedia Spotlight Web Service (V. 0.7) were used for testing, namely Annotate and Candidates. The Candidates endpoint returns a ranked list of candidates for each recognized entity and concept, while Annotate simply returns the best candidate according to the context. TextRazor and Open Calais are two commercial Web services for named entity recognition and disambiguation. Both services offer application programming interfaces (APIs). The TextRazor API returns only one candidate for each entity recognized in the test sentence. Experiments comparing several named entity disambiguation systems, including DBpedia Spotlight (V. 0.6, confidence = 0, support = 0) and TextRazor, have been conducted [20]; in those experiments, TextRazor demonstrated the best performance in terms of F-score. The Open Calais API also returns only one candidate for each recognized entity, and additionally provides social tags for each test text instance.
Figure 3 also shows how the F-score and MRR change across percentiles. Note that 0.9 on the x-axis refers to the 90th percentile, meaning that the candidate places with the top 10% of scores are selected as the disambiguation result. As shown in the plots, as the percentile increases, the F-scores of both individual models increase very slightly until the 60th percentile, after which the scores start increasing dramatically. The MRR of the entity co-occurrence model follows a similar trend to the F-score across percentiles, while the MRR of the topic model drops when fewer candidate places are selected.
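Mean Reciprocal Rank over the ranked candidate lists can be computed as in this sketch; the rankings shown are illustrative, not actual evaluation data:

```python
from typing import Dict, List

# Sketch of Mean Reciprocal Rank: for each test sentence, take 1/rank of
# the correct place in the system's ranked candidate list (0 if absent)
# and average over all sentences.
def mean_reciprocal_rank(results: List[Dict]) -> float:
    total = 0.0
    for r in results:
        ranked, gold = r["ranked"], r["gold"]
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(results)

results = [
    {"ranked": ["Washington, D.C.", "Washington, PA"], "gold": "Washington, D.C."},  # 1/1
    {"ranked": ["Jackson, MS", "Jackson, MT"], "gold": "Jackson, MT"},               # 1/2
]
print(mean_reciprocal_rank(results))  # 0.75
```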
Table 3. Comparison of systems at best performance in terms of Precision, Recall, F1-score, and Mean Reciprocal Rank (MRR). DBpedia Spotlight was run with Confidence = 0.2 and Support = 0; ETM with λ = 0.48 at the 94th percentile.
5 Conclusions and Further Work
In this paper we proposed a novel approach to the challenging task of disambiguating place names in short texts. Place name disambiguation is an important part of knowledge extraction and a core component of geographic information retrieval systems. We have presented two models driven by different perspectives, namely an entity-based co-occurrence model and a topic-based model. The first model focuses on the semantic connections between entities, and thereby on things, while the second model works on the linguistic level by investigating topics associated with places, and thereby takes a strings-based perspective. The integration of both models (called ETM) shows substantially better performance than the baseline systems used, with respect to both F-score and MRR.
Nonetheless, there is room for future improvements. For the entity-based model, properties outside the dbo and dbp namespaces have been filtered out, as have literals. Both could be added to a future version of ETM, although the literals would require more work on the similarity functions used, and a better alignment would be needed to ensure that properties from different namespaces are not mere duplicates. In our work, ETM is realized as a convex combination of the entity-based co-occurrence model and the topic-based model; other combination approaches could be investigated as well. We have used LDA for topic modeling, but it is not the only possible choice, and other approaches will be tested in the future.
As for the experiment, although the place entities in our test corpus have highly ambiguous place names, these places are all administrative divisions (i.e., cities, towns, villages, etc.) located within the United States. A potential improvement would be to seek ambiguous place names of other place types and from outside the United States.
Acknowledgments. The authors would like to acknowledge partial support by the National Science Foundation (NSF) under award 1440202, EarthCube Building Blocks: Collaborative Proposal: GeoLink Leveraging Semantics and Linked Data for Data Sharing and Discovery in the Geosciences.
References

- 1. Adams, B., Janowicz, K.: On the geo-indicativeness of non-georeferenced text. In: International AAAI Conference on Web and Social Media (ICWSM), pp. 375–378 (2012)
- 2. Adams, B., McKenzie, G., Gahegan, M.: Frankenplace: interactive thematic mapping for ad hoc exploratory search. In: Proceedings of the 24th International Conference on World Wide Web, pp. 12–22. ACM (2015)
- 5. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL, vol. 6, pp. 9–16 (2006)
- 6. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: EMNLP-CoNLL, vol. 7, pp. 708–716 (2007)
- 7. Fader, A., Soderland, S., Etzioni, O., Center, T.: Scaling Wikipedia-based named entity disambiguation to arbitrary web text. In: Proceedings of the IJCAI Workshop on User-contributed Knowledge and Artificial Intelligence: An Evolving Synergy, Pasadena, CA, USA, pp. 21–26 (2009)
- 10. Han, X., Zhao, J.: Structural semantic relatedness: a knowledge-based method to named entity disambiguation. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 50–59. Association for Computational Linguistics (2010)
- 11. Hu, Y., Janowicz, K., Prasad, S.: Improving Wikipedia-based place name disambiguation in short texts using structured data from DBpedia. In: Proceedings of the 8th Workshop on Geographic Information Retrieval, p. 8. ACM (2014)
- 12. Janowicz, K., Hitzler, P.: The digital earth as knowledge engine. Semant. Web 3(3), 213–221 (2012)
- 15. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia Spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems, pp. 1–8. ACM (2011)
- 16. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 233–242. ACM (2007)
- 17. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 509–518. ACM (2008)
- 20. Rizzo, G., van Erp, M., Troncy, R.: Benchmarking the extraction and disambiguation of named entities on the semantic web. In: LREC, pp. 4593–4600 (2014)
- 22. Spitz, A., Geiß, J., Gertz, M.: So far away, yet so close: augmenting toponym disambiguation and similarity with text-based networks. In: Proceedings of the Third International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data (GeoRich 2016), pp. 2:1–2:6. ACM, New York (2016)
- 23. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Handbook of Latent Semantic Analysis, vol. 427(7), pp. 424–440 (2007)
- 24. Zhang, W., Gelernter, J.: Geocoding location expressions in Twitter messages: a preference learning method. J. Spat. Inf. Sci. 2014(9), 37–70 (2014)