# Clustering-based disambiguation of fine-grained place names from descriptions


## Abstract

Everyday place descriptions often contain place names of fine-grained features, such as buildings or businesses, that are more difficult to disambiguate than names referring to larger places, for example cities or natural geographic features. Fine-grained places are often significantly more frequent and more similar to each other, and disambiguation heuristics developed for larger places, such as those based on population or containment relationships, are often not applicable in these cases. In this research, we address the disambiguation of fine-grained place names from everyday place descriptions. For this purpose, we evaluate the performance of different existing clustering-based approaches, since clustering approaches require no knowledge other than the locations of the ambiguous place names. We consider not only approaches developed specifically for place name disambiguation, but also clustering algorithms developed for general data mining that could potentially be leveraged. We compare these methods with a novel algorithm, and show that the novel algorithm outperforms the other algorithms in terms of disambiguation precision and distance error over several tested datasets.

## Keywords

Place description, Place name disambiguation, Fine-grained place, Clustering

## 1 Introduction

Interpreting place names in text involves two challenges: recognizing place names in a document, and linking each recognized name to the geographic location it refers to; together, the two are known as *toponym resolution* [24]. This research focuses on the second challenge, i.e., place name disambiguation, with everyday place descriptions as the target document source.

Everyday place descriptions often contain place names of fine-grained features (e.g., names of streets, buildings and local points of interest). Most studies in the field of *toponym resolution* focus on larger geographic features such as populated places (e.g., cities or towns) or natural geographic features (e.g., rivers or mountains). For these features, disambiguation heuristics can leverage the size, population, or containment relationships of candidate places, possibly based on external knowledge bases (e.g., WordNet or Wikipedia). Such heuristics quickly fail when dealing with the fine-grained places in everyday place descriptions, which are often significantly more frequent and more similar to each other than larger (natural or political) gazetteered places. Even disambiguation approaches based on machine-learning techniques are difficult to apply to fine-grained places, due to the lack of good-quality training data as well as the challenge of locating previously-unseen place names.

In this research we use map-based clustering approaches that have been developed for place name disambiguation. Map-based approaches should be relatively robust for fine-grained places, as they only require knowledge of the locations of ambiguous candidate entries. However, it remains to be seen whether these algorithms are suitable for the task of this research: some are defined for large geographic features and may not perform equally well on fine-grained places, while others are parameter-sensitive and require manual input, and thus substantial pre-knowledge of the data. Therefore, we also consider more generic clustering algorithms from fields such as statistics, pattern recognition, and machine learning. In particular, we compare existing clustering algorithms with a novel algorithm that is designed to be robust, parameter- and granularity-independent. We will show that the new algorithm, despite being parameter-independent, achieves state-of-the-art disambiguation precision and minimum distance error on several tested datasets.

The contributions of this paper are threefold:

- 1.
a comparison of different clustering algorithms for disambiguating fine-grained place names extracted from everyday place descriptions;

- 2.
an in-depth analysis of algorithms from five categories (ad-hoc, density-based, hierarchical-based, partitioning relocation-based, and others) in terms of their performance, the reasons behind it, and the relative suitability of each for the task; and

- 3.
a new clustering algorithm which out-performs the other tested algorithms for the collected datasets.

The new algorithm has three desirable properties:

- 1.
it does not require manual input of parameter values and works well for data with different contexts, i.e., sizes of spatial coverage, distances between places, and levels of granularity (parameter-independent);

- 2.
it achieves the highest average disambiguation precision and has overall minimal distance errors for the tested datasets, compared to other algorithms even with their best-performing parameter values. Note that these values are typically hard to determine without pre-knowledge of the data; and

- 3.
its performance is robust for descriptions with different contexts. Compared to other algorithms, it has low variation in both precision and distance error for different input data.

The remainder of the paper is structured as follows: in Section 2 a review of relevant clustering algorithms is given. Section 3 proposes a new algorithm. Section 4 explains the input dataset as well as the experiment. Section 5 presents the obtained results as well as the corresponding discussions. Section 6 concludes this paper.

## 2 Related work

In the following section, related work on disambiguating place names from text, as well as relevant clustering algorithms, is introduced.

### 2.1 Place name disambiguation

The goal of toponym resolution is to recognize place names in textual documents (e.g., web texts or historical documents) and link them to geographic locations; the process often involves an external source of knowledge, typically a gazetteer. A gazetteer is often regarded as a geospatial dictionary of geographic names and typically contains three core components: place name, place type, and footprints (usually points given by latitude and longitude) [16, 20].

Place name disambiguation, as the subtask of toponym resolution that is considered in this research, focuses on disambiguating a place name with multiple corresponding gazetteer entries. For example, GeoNames^{1} lists 14 populated places named ‘Melbourne’ world-wide. Various approaches have been proposed in past years, mainly in the context of Geographic Information Retrieval (GIR), in order to georeference place names in text or to geotag whole documents. Typically, place name disambiguation is done by considering context place names, i.e., other place names occurring in the same document (discourse), and computing the likelihood of each candidate gazetteer entry being the one the place name refers to. The likelihood is computed as a score given some available knowledge of the context place names as well as the place name to be disambiguated, such as their locations or spatial containment relationships. For example, if ‘Melbourne’ and ‘Florida’ occur together in a document, then the place name ‘Melbourne’ is more likely to correspond to the gazetteer entry ‘Melbourne, Florida, United States’ than to ‘Melbourne, Victoria, Australia’. There are also more recent language modeling approaches based on machine-learning techniques that consider not only context place names, but also other non-geographical words (e.g., [12, 32, 38]). Many geotagging systems – systems that determine the geo-focus of an entire document for geographic information retrieval purposes (e.g., [26, 35]) – rely heavily on toponym resolution.

Depending on the knowledge used, disambiguation approaches can generally be classified into map-, knowledge-, and machine learning-based [6]. Map-based approaches rely mainly on the locations of the gazetteer entries of place names from a document, and use heuristics such as minimum point-wise distance, minimum convex hull, or closeness to the centroid of all entry locations for disambiguation (e.g., [2, 33]). Previous studies that focus on disambiguating fine-grained places (e.g., [13, 28, 29]) are largely based on map-based approaches as well. Knowledge-based methods leverage external knowledge of places, such as containment relationships, population, or prominence (e.g., [1, 8]). Machine learning-based approaches have the advantage of using non-geographical context words, such as events, person names, or organization names, to assist disambiguation, by creating models from training data that represent the likelihood of seeing each of these context words associated with a place [31, 34]. The selection of the disambiguation approach is usually task- and data source-dependent [6], and it is also common for different approaches to be combined in hybrid manners.

### 2.2 Relevant clustering algorithms

Clustering is a division of data into meaningful groups of objects. A variety of algorithms exist; e.g., a review of clustering algorithms for data mining is given by Berkhin [5]. In this section, we introduce clustering algorithms from two categories: those that have been used for place name disambiguation before (including ad-hoc ones), and selected ones from the data mining community. These algorithms will be compared to the newly developed algorithm later in this paper. For the task of place name disambiguation, the input to these algorithms is the set of locations of all ambiguous candidate gazetteer entries of all place names from a document, in the form of a point cloud.

#### 2.2.1 Clustering algorithms used for place name disambiguation

The *overall minimum distance* heuristic aims at selecting gazetteer entries such that they are as geographically close to each other as possible. Closeness is typically measured either by the average pairwise distance between the locations, or by the area of their convex hull. An illustration of the algorithm is given in Fig. 2 (left): for each combination of ambiguous place name entries (one entry per place name), a cluster is created; then, the minimum cluster according to one of the measures is chosen as the disambiguated locations. This algorithm has been used in [2, 18, 25] and generates only one cluster.
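The heuristic can be sketched in a few lines. The snippet below is an illustrative implementation, not the cited systems' code: function names are our own, and planar Euclidean distances stand in for geodesic distances over gazetteer coordinates. It enumerates every combination of candidate entries and keeps the one with minimal average pairwise distance.

```python
from itertools import combinations, product
from math import dist

def overall_minimum_distance(candidates):
    """candidates: one list of (x, y) entry locations per place name.
    Returns the combination (one entry per name) with the minimal
    average pairwise distance."""
    best_combo, best_score = None, float("inf")
    for combo in product(*candidates):
        pairs = list(combinations(combo, 2))
        score = sum(dist(a, b) for a, b in pairs) / len(pairs)
        if score < best_score:
            best_combo, best_score = combo, score
    return best_combo

# two place names with two candidate entries each; the geographically
# close pair of entries is selected
entries = [[(0, 0), (100, 100)], [(1, 0), (-80, 50)]]
print(overall_minimum_distance(entries))  # -> ((0, 0), (1, 0))
```

Note that the number of enumerated combinations grows exponentially with the number of place names, which is consistent with the scalability problems of this method reported in Section 5.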

The *centroid based* heuristic is explained in Fig. 2 (right). The algorithm first computes the geographic focus (centroid) of all ambiguous entry locations, and calculates the distance of each entry location to it. Then, two standard deviations of the calculated distances are used as a threshold to exclude entry locations that are too far away from the centroid. Next, the centroid is recalculated based on the remaining entry locations. Finally, for each place name, select the entry that is closest to the centroid for disambiguation. The algorithm is used in [9, 33] and will also derive only one cluster.
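A compact sketch of the centroid-based heuristic follows, under the same simplifying assumptions (planar coordinates, illustrative function names). We read "two standard deviations of the calculated distances" as a cutoff of 2*σ* on the distance to the centroid; this is one plausible reading, not necessarily the cited papers' exact rule.

```python
from math import dist
from statistics import mean, pstdev

def centroid(points):
    """Mean x and y of a list of (x, y) points."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))

def centroid_disambiguate(candidates):
    """candidates: one list of (x, y) entry locations per place name."""
    all_pts = [p for entries in candidates for p in entries]
    c = centroid(all_pts)
    dists = [dist(p, c) for p in all_pts]
    cutoff = 2 * pstdev(dists)          # "two standard deviations" threshold
    kept = [p for p, d in zip(all_pts, dists) if d <= cutoff] or all_pts
    c = centroid(kept)                  # recompute centroid on remaining points
    # for each place name, keep the entry closest to the refined centroid
    return [min(entries, key=lambda p: dist(p, c)) for entries in candidates]

picked = centroid_disambiguate([[(0, 0), (500, 500)], [(1, 1), (-400, 300)]])
print(picked)  # -> [(0, 0), (1, 1)]
```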

The *minimum distance to unambiguous referents* heuristic consists of two steps. It first identifies unambiguous place names, i.e., place names with only one gazetteer entry, or ones that can easily be disambiguated by some heuristic (e.g., when the method is used in conjunction with knowledge-based methods). Then, a scoring function is used to disambiguate the remaining ambiguous entries, for example based on the average minimum distance to the unambiguous entry locations, or a weighted average distance accounting for occurrence counts in the document or textual distance. The method appears in [7, 33] and again generates one cluster.
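The two-step heuristic can be sketched as follows (illustrative names, planar distances). The scoring function here is the plain average distance to the unambiguous locations, without the occurrence-count or textual-distance weighting mentioned above.

```python
from math import dist

def disambiguate_by_unambiguous(ambiguous, unambiguous):
    """ambiguous: one candidate-entry list per ambiguous place name;
    unambiguous: locations of names with a single gazetteer entry.
    Keeps, per name, the candidate with the smallest average distance
    to the unambiguous locations."""
    def score(p):
        return sum(dist(p, u) for u in unambiguous) / len(unambiguous)
    return [min(entries, key=score) for entries in ambiguous]

anchors = [(0, 0), (2, 0)]          # already-resolved referents
result = disambiguate_by_unambiguous([[(1, 0), (90, 90)]], anchors)
print(result)  # -> [(1, 0)]
```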

The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) is a density-based method that relies on two parameters: the neighborhood distance threshold *ε*, and the minimum number of points to form a cluster, *MinPts*. There is no straightforward way to fit these parameters without pre-knowledge of the data. Moncla *et al.* use DBSCAN for the purpose of place name disambiguation [28]; the parameters in their case were adjusted empirically, since the authors had a good understanding of the spatial coverage of their input data (hiking itineraries). The original DBSCAN paper [15] proposes a heuristic to estimate the parameter values based on the *k-dist graph*, a line plot of the distances to the *k*-th nearest neighbor of each point. However, detecting the threshold from this plot is not trivial: it requires choosing a value of *k* as well as knowing the percentage of noise within the data.
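The *k*-dist graph itself is easy to compute. The sketch below (planar coordinates, illustrative names) returns the sorted distances to each point's *k*-th nearest neighbor; the non-trivial part, as noted above, is then identifying the "knee" in this curve that separates noise from cluster points.

```python
from math import dist

def k_dist(points, k):
    """Sorted (descending) distances to each point's k-th nearest
    neighbour -- the values plotted in a k-dist graph."""
    graph = []
    for p in points:
        neighbours = sorted(dist(p, q) for q in points if q is not p)
        graph.append(neighbours[k - 1])
    return sorted(graph, reverse=True)

# two tight clusters plus one outlier: the outlier shows up as the
# single large value at the head of the k-dist graph
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
graph = k_dist(pts, k=2)
```

In the example, the outlier (50, 50) yields one value above 50 at the head of the list, while all cluster points stay below 2; choosing *ε* in that gap would let DBSCAN discard the outlier as noise.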

#### 2.2.2 General clustering algorithms for data mining

This section introduces clustering algorithms from four groups: density-based, hierarchical-based, partitioning relocation-based, and uncategorized ones.

Using DBSCAN requires a-priori knowledge of the input data to determine the parameters. Some data, such as the everyday descriptions in this research, have potentially diverse conversational contexts, and thus varying distances between the places mentioned. The OPTICS algorithm (Ordering Points To Identify the Clustering Structure) [4] addresses the problem by building an augmented ordering of the data which is consistent with DBSCAN, but covers a spectrum of all *ε*^{′}≤ *ε*. The OUTCLUST algorithm exploits local density to find clusters that deviate most from the overall population (clustering by exceptions) [3], given *k*, the number of nearest neighbors for computing local densities, as well as *f*, a frequency threshold for detecting outliers.

Hierarchical clustering algorithms typically build cluster hierarchies and flexibly partition data at different granularity levels. Their main disadvantage is the vagueness of when to terminate the iterative process of merging or dividing subclusters. CURE (Clustering Using REpresentatives) [17] samples an input dataset and uses an agglomeration process to produce the requested number of clusters. CHAMELEON [21] leverages a dynamic modelling method for cluster aggregation based on a *k*-nearest-neighbor connectivity graph. HDBSCAN [10] extends DBSCAN by excluding border points from the clusters and follows the definition of density levels.

Partitioning relocation clustering divides data into several subsets, and certain greedy heuristics are then used for iterative optimization. The KMeans algorithm [19] divides the data into *k* clusters, starting from random initial samples and iteratively updating the centroids of the clusters until convergence. A Gaussian Mixture Model (GMM) [11] attempts to find a mixture of probability distributions that best models the input dataset, through methods such as the Expectation-Maximization (EM) algorithm. KMeans is often regarded as a special case of GMM.
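As a reference point for the comparison later, a bare-bones Lloyd iteration for KMeans might look as follows. This is an illustrative sketch only: seeding is deterministic and coordinates are planar, whereas real implementations seed randomly (often several times).

```python
from math import dist
from statistics import mean

def kmeans(points, k, iters=20):
    """Plain Lloyd iteration: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]   # deterministic seeding (sketch only)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

Note that every point is forced into some cluster; unlike density-based methods, KMeans has no notion of noise, which matters for the disambiguation task discussed in Section 3.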

There are other algorithms that do not belong to the previous three categories. The SNN (Shared Nearest Neighbours) algorithm [14] blends a density-based approach by first constructing a linkage matrix representing the similarity (e.g., distance) among shared nearest neighbors based on *k*-nearest neighbors (KNN); the remaining part of the algorithm is similar to DBSCAN. Spectral clustering relies on the eigenvalues of a similarity matrix (e.g., KNN) of the data and partitions the data into the required number of clusters. Compared to KMeans, spectral clustering emphasizes connectivity rather than compactness (e.g., geometrical proximity). Kohonen’s Self-Organizing Maps (SOM) [23] are an artificial neural network-based clustering technique applying competitive learning on a grid of neurons; they can perform dimensionality reduction, mapping high-dimensional data to a (typically) two-dimensional representation.

## 3 A new robust, parameter-independent algorithm

The task of this research is the following: given a place description *D* with *i* gazetteered place names already extracted, {*p*_{1},*p*_{2},…,*p*_{i}}, each name has a set of (one or more) corresponding gazetteer entries \(\{{p_{i}^{1}},{p_{i}^{2}},\ldots ,{p_{i}^{j}}\}\) that it can be matched to. For example, the place name *Melbourne* (*p*_{1}) may refer to multiple places on earth, such as *Melbourne, Victoria, Australia* (\({p_{1}^{1}}\)), or *Melbourne, Florida, US* (\({p_{1}^{2}}\)). In order to disambiguate each place name and link it to the entry that it actually refers to (e.g., *p*_{1} to \({p_{1}^{1}}\)), clustering algorithms can be used either to minimize the geographic distances between the disambiguated entries according to some objective function (e.g., minimal average pairwise distance), or to derive high-density clusters that are likely to represent the geographic extents in which the original descriptions are embedded. The input to such a clustering algorithm is a 2-dimensional point cloud with the locations of all ambiguous entries \({p_{m}^{n}}\), 1 ≤ *m* ≤ *i*.

The task is to select clusters by these objectives, rather than to classify the input data into several clusters. Such clusters are then used for disambiguation, since they are expected to capture the true entries that the place names actually refer to; points not captured by these clusters are regarded as noise. Therefore, certain clustering algorithms seem more suitable for this task than others, e.g., DBSCAN over KMeans. Furthermore, algorithms that are not parameter-sensitive, or that require no parameters, are preferable, as place descriptions may have various spatial coverages, distances between places, and levels of granularity, so no pre-knowledge can be assumed. In this section, we propose a novel density-based clustering algorithm, *DensityK*. The algorithm is robust, parameter-independent, and consists of three steps.

### 3.1 Step one: computing point-wise distance matrix

In the first step, the algorithm computes all point-wise distances of an input point cloud, with time complexity *O*(*n*^{2}) (*n* is the number of input points). The number of distance computations can be reduced to (*n*^{2} − *n*)/2 with a distance dictionary that avoids re-computation (at the cost of *O*(*n*^{2}) memory). The worst-case time complexity is equal to that of DBSCAN, both without any indexing mechanism for neighborhood queries. In practice, DBSCAN is expected to be faster, since it requires a predefined distance threshold *ε* and only considers point-wise distances at or below that value. With an index, e.g., an R-Tree, the computation time can be reduced. *O*(*n*^{2}) is also the worst-case time complexity for other algorithms that require computing neighborhood distances, e.g., OUTCLUST, SNN, and HDBSCAN. Similarly, a distance upper bound can be enforced for DensityK as an optional parameter to reduce processing time, with an indexing approach similar to DBSCAN's.
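Step one amounts to filling a triangular distance dictionary, e.g. (planar coordinates and naming are illustrative):

```python
from math import dist

def pairwise_distances(points):
    """Upper-triangle distance dictionary: (n^2 - n) / 2 entries,
    so no pair is measured twice."""
    return {
        (i, j): dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    }

pts = [(0, 0), (3, 4), (0, 8)]
m = pairwise_distances(pts)
print(len(m), m[(0, 1)])  # -> 3 5.0
```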

### 3.2 Step two: deriving cluster distance

In the second step, DensityK analyzes the computed point-wise distances, and derives a cluster distance automatically. The cluster distance is similar to the parameter *ε* in DBSCAN, and will be used in the next step for generating clusters.

The *DensityK function* is computed from the point-wise distances obtained in the first step, as shown in Function (1):

\(K(d) = \frac{1}{n} {\sum}_{i=1}^{n} \frac{count(p \in region(p_{i}, (d-{\Delta} d, d]))}{\pi d^{2} - \pi (d-{\Delta} d)^{2}}\)     (1)

*K*(*d*) represents the average point density for points within a given distance interval (*d* − Δ*d*, *d*], i.e., within an annular search region. We apply an annular search region for computing point density instead of a circular one (i.e., Δ*d* = *d*) because we found that the former leads to better clustering results; a comparison of the two search regions is given later in this section. In Function (1), the expression *count*(*p* ∈ *region*(*p*_{i}, (*d* − Δ*d*, *d*])) represents the number of points at a distance between *d* − Δ*d* and *d* (including *d*) from point *p*_{i}. If there is no point within any of the search regions for a distance interval (*d*_{j} − Δ*d*, *d*_{j}], the algorithm skips to the next interval ((*d*_{j}, *d*_{j} + Δ*d*]); thus, *K*(*d*) is always positive. The denominator in Function (1) is the area of the annular region. Δ*d* discretizes the function and is set to 100 *m* in this research; the resulting cluster distance threshold will be an integer multiple of Δ*d*. We will demonstrate below that the clustering result is largely insensitive to the value of Δ*d*.
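Under the same assumptions as before (planar coordinates, illustrative naming), the discretized DensityK function of Function (1) can be sketched as:

```python
from math import dist, pi

def densityk(points, delta=100.0, d_max=None):
    """Discretized DensityK function: a list of (d, K(d)) tuples, where
    K(d) is the average number of neighbours per point in the annular
    bin (d - delta, d], divided by the area of the annulus."""
    n = len(points)
    dists = [dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    d_max = d_max if d_max is not None else max(dists)
    kfunc, d = [], delta
    while d <= d_max + delta:
        # each pair contributes a neighbour to both of its endpoints
        cnt = 2 * sum(1 for x in dists if d - delta < x <= d)
        if cnt:  # empty intervals are skipped, so K(d) stays positive
            area = pi * d ** 2 - pi * (d - delta) ** 2
            kfunc.append((d, cnt / n / area))
        d += delta
    return kfunc

# two tight pairs 1000 m apart: density is high in the first bin and
# drops sharply for the larger annuli
pts = [(0, 0), (50, 0), (1000, 0), (1050, 0)]
kf = densityk(pts, delta=100.0)
```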

The approach is inspired by Ripley’s *K* function [30], which was originally designed to assess the degree of departure of a point set from complete spatial randomness, ranging from spatial homogeneity to a clustered pattern. Ripley’s *K* function can be used to derive neither clusters nor cluster distances, yet the idea of measuring point density as a function of a distance threshold meets our interest: the goal here is to derive a cluster distance threshold that leads to clusters with significantly large point densities. DensityK is a new algorithm with a different purpose than Ripley’s *K* function, although Ripley’s *K* function can be regarded as a cumulative version of the DensityK function. If the point-wise distances from the last step are sorted, the time complexity of computing the DensityK function is *O*(*n*), as it makes at most *n* comparisons regarding different values of *d*.

The cluster distance is a value of *d* below which points show significantly large densities. Two illustrative examples are given in Fig. 3a and b with different input data. For each of the two sample functions, *K*(*d*) starts at a non-zero value for the first *d*, 100 *m* (the value of Δ*d*), which means that some points lie within 100 *m* of other points in the input point cloud. As *d* grows, the value of *K*(*d*) continues to decrease. For different input data, it is also possible that *K*(*d*) starts from a low value, then increases until a maximum is reached, after which it starts to decrease again.

Next, the mean *μ* and standard deviation *σ* of all *K*(*d*) values (a finite set, since the function is discretized by Δ*d*) are calculated. Then, the 2*σ* rule is applied, and the minimum value of *d* with *d* > *d*_{0} is selected as the cluster distance, where *d*_{0} = argmax_{*d*} *K*(*d*) and *K*(*d*) = *μ* + 2*σ*. The derived cluster distances are also shown in Fig. 3a, b. Intuitively, the cluster distance is the value of *d* at the ‘valley’ of a DensityK function: a visually identifiable (at least roughly) x-value where the pace of decrease of *K*(*d*) changes dramatically, leading to values close to zero. It is found that the resulting cluster distances always sit somewhere at the ‘valley’ of the functions (in terms of *K*(*d*) values) for different input data, and the clusters derived afterwards match quite well the actual *spatial contexts* (the spatial extents where the descriptions are embedded).
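The selection rule can be sketched as follows. Since the prose admits more than one reading, this snippet encodes one plausible interpretation (the first *d* past the peak at which *K*(*d*) falls below *μ* + 2*σ*) and is not necessarily the authors' exact procedure:

```python
from statistics import mean, pstdev

def cluster_distance(kfunction):
    """kfunction: list of (d, K(d)) tuples. Returns the first d past
    the peak of K(d) where the curve drops below mu + 2*sigma."""
    ks = [k for _, k in kfunction]
    threshold = mean(ks) + 2 * pstdev(ks)
    d0 = max(kfunction, key=lambda t: t[1])[0]   # argmax_d K(d)
    for d, k in kfunction:
        if d > d0 and k < threshold:
            return d
    return kfunction[-1][0]   # fall back to the largest d

# a sharp peak at d = 100 followed by a long near-zero tail
kf = [(100, 10.0)] + [(100 * i, 0.1) for i in range(2, 21)]
print(cluster_distance(kf))  # -> 200
```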

A comparison of annular and circular (replacing all Δ*d* by *d* in Function (1)) search regions is shown in Fig. 3c and d, with the same input data as in a and b respectively. When tested on sample data, it is found that when applying annular regions, the derived clusters are always more constrained (as the computed cluster distances are smaller) and closer to the actual spatial contexts than those derived from circular regions. Such more constrained clusters are preferred as they are more likely to exclude ambiguous entries. It is found that they lead to higher disambiguation precision on the tested data as well. This phenomenon is most likely because when applying annular regions, the DensityK functions are more sensitive to the change of local density. In comparison, applying circular regions results in smoother density functions and possibly much larger cluster distances derived.

The clustering result is largely insensitive to the value of Δ*d*. As shown in Fig. 4, the DensityK function plots generated for the same input data with three different Δ*d* values (100, 250, and 500 *m*) are similar, and the derived cluster distances are the same. Δ*d* should be set to a constant, small number (e.g., the values in Fig. 4) for all input data, purely for the purpose of discretization. Such a small number works well for various input data, even those with large cluster distances. Note that there is no single optimal cluster distance for disambiguation: different cluster distances between 2500 *m* and 3500 *m* may lead to the same disambiguation result for a given input, whereas a cluster distance of 25000 *m* for the same input may increase or reduce the disambiguation precision, depending on the distances between the actual locations of the place names.

Algorithm 1 gives the pseudocode of this step. It computes *K*(*d*) for different *d* values, and stores tuples of (*d*, *K*(*d*)) in the list variable *KFunction*. The cluster distance is then derived from *KFunction*.

### 3.3 Step three: deriving clusters and disambiguation

In this step, clusters are generated using the derived cluster distance and are ranked by the number of points they contain. For each place name, the highest-ranked cluster containing at least one of its candidate entries is regarded as the *top-cluster* for this place name. For example, if an entry of a place name appears in the cluster with the largest number of points, that entry will be selected for disambiguation. If no corresponding entry of the place name is found in the first cluster, the next cluster is considered, until one entry is found. Thus, the worst-case time complexity of this step is *O*(*nm*) (*m* is the number of clusters derived). In practice, as most place names are expected to be located within the first cluster, the time complexity is close to *O*(*n*). We consider multiple derived clusters instead of only the first because the input place names may have multiple spatial foci, i.e., the locations of some of the named places may be relatively far apart. In such cases, these isolated place names would be missed by the first cluster and thus could not be disambiguated correctly. The complete disambiguation procedure of this step is given in Algorithm 2.
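The top-cluster selection can be sketched as follows (illustrative names; clusters are assumed to be pre-sorted by size, largest first, and the case where a top-cluster contains multiple candidates of the same name is treated as unresolved):

```python
def disambiguate(place_entries, ranked_clusters):
    """place_entries: {name: [(x, y), ...]} candidate entry locations;
    ranked_clusters: sets of points sorted by size, largest first.
    Walks the ranked clusters and resolves each name to the single
    candidate found in its top-cluster; multiple candidates in the
    same top-cluster (or none anywhere) leave the name unresolved."""
    resolved = {}
    for name, entries in place_entries.items():
        for cluster in ranked_clusters:
            hits = [e for e in entries if e in cluster]
            if hits:
                resolved[name] = hits[0] if len(hits) == 1 else None
                break
        else:
            resolved[name] = None   # no cluster captured this name
    return resolved

clusters = [{(0, 0), (1, 1), (2, 2)}, {(50, 50)}]
names = {"A": [(0, 0), (50, 50)], "B": [(50, 50)],
         "C": [(9, 9)], "D": [(0, 0), (1, 1)]}
resolved = disambiguate(names, clusters)
```

Here "A" resolves in the largest cluster, "B" falls through to the second cluster (a secondary spatial focus), "C" is captured by no cluster, and "D" has two candidates in its top-cluster and so cannot be disambiguated.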

## 4 Experiment on comparison of the clustering algorithms

This section describes the input datasets, preprocessing procedure, used gazetteer and parser, and the final input to the algorithm to be tested. Then, the experiment settings in terms of algorithms and values used for their parameters are introduced.

### 4.1 Dataset and preprocessing

Two sets of place descriptions are used in the experiment. The first one contains 42 descriptions submitted by graduate students about the University of Melbourne campus, which are relatively focused in spatial extent [36]. The second one was harvested from web texts (e.g., Wikipedia, tourist sites, and blogs) about places around and inside Melbourne, Australia [22]. The two datasets cover more than 1000 distinct gazetteered places. Two example descriptions from the two datasets are shown below respectively, with gazetteered place names highlighted:

“... If you go into the *Old Quad*, you will reach a square courtyard and at the back of the courtyard. You can either turn left to go to the *Arts Faculty Building*, or turn right into the *John Medley Building* and *Wilson Hall* [...] If you continue walk along the road on the right side where you’re facing *Union House*, you can see the *Beaurepaire* and Swimming Pool. There will also be a sport tracks and the *University Oval* behind it ...”

“... *St Margaret’s School* is an independent, non-denominational day school with a co-educational primary school to Year 4 and for girls from Year 5 to Year 12. The school is located in *Berwick*, a suburb of *Melbourne* [...] In 2006, St Margaret’s School Council announced its decision to establish a brother school for St Margaret’s. This school opened in 2009 named *Berwick Grammar School*, currently catering for boys in Years 5 to 12, with one year level being added each year ... ”

Place name recognition is outside the scope of this research, and therefore we use place names extracted by a previously-developed parser from each of the description datasets [27]. The parser applies sentence and word tokenisation, POS tagging, as well as full-text chunk parsing. It trains a Conditional Random Field (CRF) model on a manually annotated corpus, and, to enhance its performance, exploits external resources including gazetteers. Since place name extraction precision and recall are not the focus of this research, all extracted place names were cleaned manually beforehand, and incorrectly extracted noise such as person names was stripped off.

The candidate gazetteer entries of each extracted place name were retrieved from three sources,^{2} including the GoogleV3 geocoder^{3} and GeoNames.^{4} For example, the name *St Margaret’s School* has a total of 11 corresponding entries from the three gazetteers. The retrieved entries from the three sources were then synthesized, and duplicate entries referring to the same places were removed. The numbers of ambiguous gazetteer entries retrieved are shown in Fig. 5, representing the ambiguity of these place names.

### 4.2 Experiment setup

The following algorithms are compared: the ad-hoc heuristics introduced in Section 2 (overall minimum distance, centroid-based, and minimum distance to unambiguous referents), DBSCAN (including its *k*-dist variant), OPTICS, OUTCLUST, CURE, CHAMELEON, HDBSCAN, KMeans, GMM, SNN, Spectral clustering, SOM, and DensityK. For *k*-dist, the authors did not give a straightforward way to determine a threshold; therefore, we use the 2*σ* rule in the same way as it is used in DensityK (Algorithm 1), to enable a fair comparison. For algorithms that have not been used for place name disambiguation before (i.e., from *k*-dist to SOM), Algorithm 2 is applied to the generated clusters for disambiguation. If a top-cluster of a place name contains more than one gazetteer entry for that name, the place name cannot be disambiguated and the case is regarded as a failure. Different parameter values for the algorithms are tested, as shown in Table 1.

Parameter configurations of algorithms to be tested for place name disambiguation

Parameter | Notion | Values | Algorithms
---|---|---|---
Distance threshold (meters) | *ε* | 200, 2000, 20000 | DBSCAN
No. of nearest neighbors | *k* | 5, 10, 25 | OUTCLUST, SNN, CHAMELEON, Spectral
No. of clusters to derive | *c* | 3, 5, 10, 20 | OPTICS, CURE, KMeans, GMM, Spectral
Minimum points in cluster | *MinPts* | 1, 5, 10 | DBSCAN
Frequency threshold | *f* | 0.1, 0.2, 0.5 | OUTCLUST
Weighting coefficient | | 0.1, 1, 10 | CHAMELEON
SOM dimension | | (5, 5), (10, 10), (20, 20) | SOM

There are a number of algorithmic features that are important in the place name disambiguation task. The first is robustness: an algorithm should ideally work on different input datasets with minimum variance in precision and distance error. The next feature is minimum parameter-dependency: a parameter-free algorithm, or one whose parameters can be determined automatically, is desirable, because for place name disambiguation no pre-knowledge such as distances between places or the extent of the space should be assumed for an input. Lastly, an algorithm should ideally be parameter-insensitive, i.e., modifying parameter values should not lead to significantly different results. The degree to which each algorithm satisfies these features when used for fine-grained place name disambiguation will be discussed.

## 5 Clustering algorithm performance results

Average precision of each algorithm with the best-performing parameters on the tested datasets

Category | Algorithm | Precision
---|---|---
Ad-hoc | OMD | 76.7%
 | Centroid | 57.2%
 | DTUR | 69.3%
Density-based | DBSCAN | 81.5%
 | DBSCAN (*k*-dist) | 75.4%
 | OPTICS | 73.2%
 | OUTCLUST | 70.6%
Hierarchical-based | CURE | 78.9%
 | CHAMELEON | 58.3%
 | HDBSCAN | 75.7%
Partitioning relocation-based | KMeans | 73.4%
 | GMM | 80.8%
Others | SNN | 70.5%
 | Spectral | 74.4%
 | SOM | 73.1%
The new algorithm | DensityK | 83.1%

The overall minimum distance method is vulnerable to *noise place names*: place names whose actual locations are not captured by the gazetteer. For example, the place name *Union House* refers to a building on the University of Melbourne campus. Its true location has no corresponding gazetteer entry, and the ambiguous gazetteer entries retrieved for this place name in the input point cloud lie elsewhere around the world under the same name. Such cases are common for fine-grained place names, while prominent place names (e.g., natural or political) are less likely to be missing from a gazetteer. Another disadvantage of the overall minimum distance method is scalability, as its time cost is significantly larger (over ten times) than the other algorithms for most of the datasets tested, particularly for documents with a large number of place names and high ambiguity. The centroid-based method performs badly because the input point cloud is spread over the earth, and the centroid lies somewhere in the middle, far from the actual focus of the groundtruth locations.

DBSCAN is robust against noise place names, as it can capture the spatial context (the highlighted red region shown in Fig. 6) of the original description and neglect entries outside of it. For the example point cloud, when the parameter *ε* is set to 2000*m*, the resulting disambiguation precision is higher than with the other values selected from Table 1. More ground truth entries are missed by the cluster generated with a value of 200*m*, and more ambiguous entries are included with a value of 20000*m*. For the clusters generated by the *k*-dist method, the value of *ε* determined in this case is roughly 300*km*, significantly larger than the most suitable value (somewhere between 1000 and 2000*m*). Consequently, *k*-dist performs badly in this case.
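The sensitivity to *ε* can be reproduced with a minimal DBSCAN sketch. The implementation below follows Ester et al.'s procedure on toy planar data with plain Euclidean distances (a real geocoder would use great-circle distances in metres); it is an illustration of the parameter effect, not the evaluation setup used in this paper.

```python
# Minimal DBSCAN sketch (after Ester et al. 1996) on planar toy data,
# illustrating how the choice of eps changes which entries are clustered.
import math

def dbscan(points, eps, min_pts):
    """Return a label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:   # j is a core point: expand
                queue.extend(nb)
    return labels

# One dense group (the "spatial context") plus two isolated homonyms.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (100, 100), (-80, 50)]
tight = dbscan(pts, eps=2.0, min_pts=3)
loose = dbscan(pts, eps=500.0, min_pts=3)
print(tight)  # → [0, 0, 0, 0, 0, -1, -1]: distant entries are noise
print(loose)  # → [0, 0, 0, 0, 0, 0, 0]: everything merged into one cluster
```

With a small *ε* the two far-away homonyms are labelled noise; with a very large *ε* they are absorbed into a single cluster, mirroring the over-merging seen at 20000*m* above.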

The parameter *NumberOfClusters* (*c*) of OPTICS is problematic to define. Nevertheless, it is found that setting the value to 10 generally leads to optimal results regardless of input. OUTCLUST has the same drawback of merging nearby points from the spatial context, and its behavior is determined by both parameters *k* and *f*. These two parameters are more sensitive to input data than *c* of OPTICS, and there is no straightforward method to determine their values either. A large input *k* value will result in few clusters, as more data points will be regarded as neighbors, and vice versa. Compared to OPTICS, OUTCLUST focuses more on relative density, by considering nearest neighbors, rather than absolute density; thus, boundary points that are relatively close to some clusters while isolated from others are more likely to be merged.
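The effect of *k* on a nearest-neighbor density measure can be sketched directly. The snippet below scores each point by its average distance to its *k* nearest neighbors, the kind of relative-density signal OUTCLUST builds on; it is an illustration of the parameter's effect, not the published OUTCLUST procedure, and the data are made up.

```python
# Sketch of the k-nearest-neighbour density idea behind relative-density
# clustering: a point's density is judged against its k neighbours, so
# the outcome depends strongly on k. Illustrative data only.
import math

def avg_knn_dist(points, i, k):
    """Mean distance from points[i] to its k nearest neighbours."""
    d = sorted(math.dist(points[i], p) for p in points if p is not points[i])
    return sum(d[:k]) / k

pts = [(0, 0), (0.2, 0), (0, 0.2), (5, 5), (5.2, 5), (12, 0)]

# With k=2 the lone point (12, 0) stands out as low-density.
scores_k2 = [avg_knn_dist(pts, i, 2) for i in range(len(pts))]
# With k=5 every neighbourhood spans the whole cloud, and the contrast
# between dense and isolated points largely disappears.
scores_k5 = [avg_knn_dist(pts, i, 5) for i in range(len(pts))]

print(max(scores_k2) / min(scores_k2))  # large ratio: outlier visible
print(max(scores_k5) / min(scores_k5))  # ratio shrinks as k grows
```
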

CURE requires the number of clusters *c* as input, similar to OPTICS, and the clusters it derives are generally similar to those of OPTICS. CHAMELEON is more parameter-sensitive than CURE, and its resulting disambiguation precision is not as good as CURE's, even with the best-performing parameters. As for HDBSCAN, although it does not require any mandatory input parameters, its resulting precision for some input data is only slightly worse than DensityK's. However, HDBSCAN is not robust against different input data: it performs quite well for some data, but significantly worse for others. We discuss this in more detail later in this section.

KMeans partitions the input points into *k* clusters. As a partition-based algorithm, it is not expected to perform well on fine-grained place name disambiguation, which is not a classification problem, and its resulting average precision is worse than that of HDBSCAN and CURE. For some input data, GMM performs well and achieves the same precision as DensityK, or as DBSCAN with the best-performing parameter values. Its performance is generally good (measured by average precision) and robust (e.g., compared to HDBSCAN, which is discussed later). In addition, for most input data, setting different values of *c*, once larger than 10, makes little difference to the clustering, compared to algorithms such as KMeans or CURE. Still, there is no easy way to automatically determine the value of *c*, and a single value does not always lead to the highest precision for different input data.
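Why partitioning algorithms struggle here can be seen in a minimal Lloyd's k-means sketch: every point, including a noisy homonym entry, must be assigned to one of the *c* clusters, so noise is absorbed rather than discarded. The data and the naive deterministic seeding below are illustrative assumptions.

```python
# Minimal Lloyd's k-means sketch on toy planar data. Partitioning
# algorithms force every point into one of c clusters, so noisy
# homonym entries are always absorbed somewhere.
import math

def kmeans(points, c, iters=20):
    centers = list(points[:c])           # naive deterministic seeding
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(c)]
        for p in points:
            j = min(range(c), key=lambda j: math.dist(p, centers[j]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:                        # keep old centre if group empty
                centers[j] = (sum(p[0] for p in g) / len(g),
                              sum(p[1] for p in g) / len(g))
    return centers, groups

pts = [(0, 0), (0.1, 0.1), (0.2, 0), (10, 10), (10.1, 10)]
centers, groups = kmeans(pts, c=2)
print(sorted(len(g) for g in groups))  # → [2, 3]
```

Here the isolated pair forms its own cluster; with a forced *c*, even a single noise point would claim a centre of its own instead of being rejected.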

SNN requires the parameter *k*, the number of nearest neighbors to consider, and different *k* values often result in significantly different clustering results, as shown in the figure. A large *k* value tends to result in only a few large, well-separated clusters, with small local variations in density having little impact. Similar to OUTCLUST, there is no easy way to determine a suitable, meaningful number of nearest neighbors to consider. Spectral clustering also suffers from parameter sensitivity, for both *c* and *k*. Its precision is almost always worse than that of algorithms such as DBSCAN, CURE, and GMM, even with the best-performing parameter values. The clusters generated by SOM are often similar in pattern to those derived by CURE or KMeans, but the average precision is much lower (even lower than Spectral clustering). One advantage of SOM is that the SOM dimension can easily be set to large numbers, which typically leads to higher precision than adopting small values such as (5,5). Once it is set to more than (20,20), further increasing the values makes minimal difference to the resulting clusters, as well as to precision.
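The *k*-sensitivity of SNN follows directly from its similarity measure: two points are similar if their *k*-nearest-neighbor lists overlap. The sketch below computes just that overlap on made-up data, to show how the similarity graph changes with *k*; it is not the full algorithm of Ertöz et al.

```python
# Sketch of the shared-nearest-neighbour (SNN) similarity: the overlap
# of two points' k-nearest-neighbour lists. The overlap, and hence the
# resulting similarity graph, changes markedly with k.
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i]."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return set(order[:k])

def snn_similarity(points, a, b, k):
    return len(knn(points, a, k) & knn(points, b, k))

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (8, 8), (9, 8), (8, 9)]

# Points 0 and 4 sit in different groups: with a small k they share no
# neighbours, but a large k links everything to everything.
print(snn_similarity(pts, 0, 4, k=2))  # → 0
print(snn_similarity(pts, 0, 4, k=6))  # → 5
```
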

The clusters generated by DensityK for this particular input resemble those of DBSCAN with *ε* set to 2000*m*, as shown in Fig. 7. Compared to the results generated by the other algorithms, shown in Figs. 8, 9, 10 and 11, it can be seen that the first-ranking cluster (the purple circles) generated by DensityK is the most focused, and most similar to the highlighted ground truth spatial context shown in Fig. 6.

Some of these parameters can be fixed to a single value that works regardless of input (e.g., *c* for OPTICS, CURE, and GMM). In comparison, parameters such as *k* or *ε* are more sensitive to input, and cannot easily be determined each time a new input is given. Here we further evaluate the robustness of the five algorithms over different input data, in terms of variation in precision and average distance error, i.e., the average distance between the ground truth locations of place names and the entries selected by these algorithms. We randomly select documents from our dataset, and the results are shown in Fig. 13. DensityK almost always has the highest precision, as well as low variation compared to the other algorithms, particularly HDBSCAN and OPTICS. In terms of distance error, DensityK has both the least variance and the overall minimum distance error.
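The distance-error metric used above can be made concrete: for each place name, take the great-circle distance between its ground-truth location and the gazetteer entry the algorithm selected, then average. The coordinates below are illustrative (lat, lon in degrees); the haversine formula is standard.

```python
# Sketch of the distance-error metric: average great-circle distance
# between ground-truth locations and the selected gazetteer entries.
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def avg_distance_error_km(truth, selected):
    return sum(haversine_km(t, s)
               for t, s in zip(truth, selected)) / len(truth)

truth = [(-37.80, 144.96), (-37.81, 144.97)]
picked = [(-37.80, 144.96), (40.71, -74.01)]  # second pick: wrong homonym

# A single wrongly selected homonym on another continent dominates the
# average, which is why robust disambiguation keeps this error low.
print(avg_distance_error_km(truth, picked))
```
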

## 6 Conclusions

Place descriptions in everyday communication provide a rich source of knowledge about places. In order to utilize such knowledge in information systems, an important step is to locate the places being referred to. The problem of locating place names from text sources is often called toponym resolution, which consists of two tasks: place name identification from text, and place name disambiguation. This research looks at the second task, and more specifically, disambiguating fine-grained place names extracted from place descriptions. We focus on clustering-based disambiguation approaches, as clustering approaches require minimal pre-knowledge of the place names to be disambiguated, compared to knowledge-based and machine learning-based approaches.

For this purpose, we first select clustering algorithms that have been used for place name disambiguation in the literature, or that come from other communities (e.g., data mining) and are regarded as promising for this task. We evaluate and compare the performance of these algorithms on two different datasets using precision and distance error. For algorithms that require parameters, different values of each parameter are tested in a grid-search manner. We then analyze each algorithm's performance and its causes, its parameter-dependency and parameter-sensitivity, and its robustness (in terms of the variance of its performance over different input data), and discuss the suitability of each algorithm for fine-grained place name disambiguation based on these criteria.

Furthermore, a new clustering algorithm, DensityK, is presented. It is designed to overcome several identified limitations of the previous algorithms, outperforming the other tested algorithms and achieving state-of-the-art disambiguation precision on the test datasets. The algorithm analyzes the local densities of an input point cloud, which consists of all ambiguous gazetteer entries corresponding to the place names extracted from an input document, and derives a density threshold for determining clusters whose densities are significantly larger than those of other points. Compared to the other algorithms, DensityK is parameter-independent and robust against input data with various spatial extents, densities, and granularities, which makes it the most desirable for the task of this research. This is reflected in its consistently higher precision and overall minimum distance error compared to the other competitive algorithms. The worst-case time complexity of the algorithm is the same as that of DBSCAN (*O*(*n*^{2})), when both are considered without any indexing mechanism for neighborhood queries, and better than that of algorithms such as overall minimum distance clustering.
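The mechanism just described, estimating local densities, deriving a threshold from their distribution, and keeping only significantly dense points, can be sketched as follows. This is an illustration under assumptions (a fixed neighborhood radius and the mean density as threshold), not the published DensityK algorithm, which determines its threshold automatically.

```python
# Illustrative density-threshold clustering in the spirit described
# above: estimate each point's local density, derive a threshold from
# the density distribution, keep significantly dense points, and
# connect them into clusters. Assumed choices: a fixed radius, and the
# mean density as the threshold.
import math

def local_density(points, i, radius):
    """Number of other points within `radius` of points[i]."""
    return sum(1 for j, p in enumerate(points)
               if j != i and math.dist(points[i], p) <= radius)

def density_clusters(points, radius):
    dens = [local_density(points, i, radius) for i in range(len(points))]
    threshold = sum(dens) / len(dens)       # assumed: mean as threshold
    dense = [i for i in range(len(points)) if dens[i] > threshold]
    # connect dense points lying within `radius` of each other
    clusters, seen = [], set()
    for i in dense:
        if i in seen:
            continue
        comp, stack = [], [i]
        while stack:
            a = stack.pop()
            if a in seen:
                continue
            seen.add(a)
            comp.append(a)
            stack.extend(b for b in dense
                         if b not in seen
                         and math.dist(points[a], points[b]) <= radius)
        clusters.append(sorted(comp))
    return clusters

pts = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (50, 50), (-60, 10)]
print(density_clusters(pts, radius=1.0))  # → [[0, 1, 2, 3]]
```

The two isolated homonyms fall below the density threshold and are dropped rather than forced into a cluster, which is the behavior that distinguishes density-threshold clustering from partitioning approaches.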

The focus of this research is to provide recommendations for selecting appropriate methods of clustering-based disambiguation for fine-grained place names from place descriptions. We have not yet considered further optimizing the developed algorithm, although we briefly explained in Section 3.1 how indexing and optional parameters can be used to reduce processing time. Optimization is important for applications such as processing streaming data for goals such as geographic information retrieval. Finally, a clustering algorithm for this purpose can be used in conjunction with other knowledge- or machine learning-based approaches to enhance precision, which is beyond the scope of this research.


## References

- 1. Adelfio MD, Samet H (2013) Structured toponym resolution using combined hierarchical place categories. In: Proceedings of the 7th workshop on geographic information retrieval, pp 49–56
- 2. Amitay E, Har’El N, Sivan R, Soffer A (2004) Web-a-where: geotagging web content. In: Proceedings of SIGIR ’04 conference on research and development in information retrieval, pp 273–280
- 3. Angiulli F (2006) Clustering by exceptions. In: Proceedings of the national conference on artificial intelligence, pp 312–317
- 4. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the ACM SIGMOD conference, Philadelphia, pp 49–60
- 5. Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data. Springer, Berlin, pp 25–71
- 6. Buscaldi D (2011) Approaches to disambiguating toponyms. SIGSPATIAL Special 3(2):16–19
- 7. Buscaldi D, Magnini B (2010) Grounding toponyms in an Italian local news corpus. In: Proceedings of the 6th workshop on geographic information retrieval, pp 70–75
- 8. Buscaldi D, Rosso P (2008) A conceptual density-based approach for the disambiguation of toponyms. Int J Geogr Inf Sci 22(3):301–313
- 9. Buscaldi D, Rosso P (2008) Map-based vs. knowledge-based toponym disambiguation. In: Proceedings of the 2nd international workshop on geographic information retrieval, pp 19–22
- 10. Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G (eds) Advances in knowledge discovery and data mining. Springer, Berlin, pp 160–172
- 11. Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332
- 12. Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geo-locating twitter users. In: Proceedings of the 19th ACM international conference on information and knowledge management, pp 759–768
- 13. Derungs C, Palacio D, Purves RS (2012) Resolving fine granularity toponyms: evaluation of a disambiguation approach. In: Proceedings of the 7th international conference on geographic information science, pp 1–5
- 14. Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM international conference on data mining, pp 47–58
- 15. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), Portland, pp 226–231
- 16. Goodchild MF, Hill LL (2008) Introduction to digital gazetteer research. Int J Geogr Inf Sci 22(10):1039–1044. https://doi.org/10.1080/13658810701850497
- 17. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record, vol 27. ACM, pp 73–84
- 18. Habib MB, van Keulen M (2012) Improving toponym disambiguation by iteratively enhancing certainty of extraction. In: Proceedings of the international conference on knowledge discovery and information retrieval, KDIR 2012, Barcelona, pp 399–410
- 19. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc 28(1):100–108
- 20. Hill LL (2000) Core elements of digital gazetteers: placenames, categories, and footprints. In: Research and advanced technology for digital libraries. Springer, pp 280–290. https://doi.org/10.1007/3-540-45268-0_26
- 21. Karypis G, Han EH, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75
- 22. Kim J, Vasardani M, Winter S (2015) Harvesting large corpora for generating place graphs. In: Workshop on cognitive engineering for spatial information processes, COSIT 2015, pp 20–26
- 23. Kohonen T (1998) The self-organizing map. Neurocomputing 21(1–3):1–6
- 24. Leidner JL (2008) Toponym resolution in text: annotation, evaluation and applications of spatial grounding of place names. Universal-Publishers
- 25. Leidner JL, Sinclair G, Webber B (2003) Grounding spatial named entities for information extraction and question answering. In: Proceedings of the HLT-NAACL 2003 workshop on analysis of geographic references, pp 31–38
- 26. Lieberman MD, Samet H, Sankaranarayanan J, Sperling J (2007) STEWARD: architecture of a spatio-textual search engine. In: Samet H, Shahabi C, Schneider M (eds) Proceedings of the 15th annual ACM international symposium on advances in geographic information systems, Seattle, pp 186–193
- 27. Liu F, Vasardani M, Baldwin T (2014) Automatic identification of locative expressions from social media text: a comparative analysis. In: Proceedings of the 4th international workshop on location and the web. ACM, pp 9–16
- 28. Moncla L, Renteria-Agualimpia W, Nogueras-Iso J, Gaio M (2014) Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems, pp 183–192
- 29. Palacio D, Derungs C, Purves R (2015) Development and evaluation of a geographic information retrieval system using fine grained toponyms. J Spat Inf Sci 2015(11):1–29
- 30. Ripley BD (1976) The second-order analysis of stationary point processes. J Appl Probab 13(2):255–266
- 31. Roberts K, Bejan CA, Harabagiu SM (2010) Toponym disambiguation using events. In: FLAIRS conference, vol 10, p 1
- 32. Roller S, Speriosu M, Rallapalli S, Wing B, Baldridge J (2012) Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp 1500–1510
- 33. Smith DA, Crane G (2001) Disambiguating geographic names in a historical digital library. In: International conference on theory and practice of digital libraries. Springer, pp 127–136
- 34. Smith DA, Mann GS (2003) Bootstrapping toponym classifiers. In: Proceedings of the HLT-NAACL 2003 workshop on analysis of geographic references. Association for Computational Linguistics, pp 45–49
- 35. Teitler BE, Lieberman MD, Panozzo D, Sankaranarayanan J, Samet H, Sperling J (2008) NewsStand: a new view on news. In: Aref WG, Mokbel MF, Schneider M (eds) Proceedings of the 16th ACM SIGSPATIAL international conference on advances in geographic information systems, pp 144–153
- 36. Vasardani M, Timpf S, Winter S, Tomko M (2013) From descriptions to depictions: a conceptual framework. In: Tenbrink T, Stell J, Galton A, Wood Z (eds) Spatial information theory: 11th international conference, COSIT 2013. Springer, pp 299–319
- 37. Vasardani M, Winter S, Richter KF (2013) Locating place names from place descriptions. Int J Geogr Inf Sci 27(12):2509–2532
- 38. Wing B, Baldridge J (2014) Hierarchical discriminative classification for text-based geolocation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 336–348