Abstract
With the growth of the Linked Data Web, time-efficient link discovery frameworks have become indispensable for implementing the fourth Linked Data principle, i.e., the provision of links between data sources. Due to the sheer size of the Data Web, detecting links even when using trivial link specifications based on a single property can be time-demanding. Moreover, non-trivial link discovery tasks require complex link specifications and are consequently even more challenging to optimize with respect to runtime. In this paper, we present a hybrid approach to link discovery that combines time-efficient algorithms, each specialized on a specific data type. In particular, we present the HYPPO algorithm, which processes numeric data efficiently. These algorithms are combined by using original insights on the translation of complex link specifications into combinations of atomic specifications via a series of operations on sets and filters. We show in nine experiments that our approach outperforms SILK 2.5.1 with respect to runtime by up to four orders of magnitude.
Keywords
Knowledge Base · Linked Data · Link Specification · Formal Grammar · Atomic Measure

1 Introduction
The Linked Data Web has evolved from 12 knowledge bases in May 2007 to 295 knowledge bases in September 2011^{1} [1]. While the number of RDF triples available on the Linked Data Web has now surpassed 31 billion, the number of links still stagnates around 500 million. Consequently, less than 2 % of these triples are links between knowledge bases [17, 19]. In addition, most knowledge bases are linked to only one other knowledge base.^{2} Yet, links between knowledge bases play a key role in important tasks such as cross-ontology question answering [14], large-scale inferences [26] and data integration [3]. Given the enormous amount of information available on the Linked Data Web, time-efficient link discovery (LD) frameworks have become indispensable for implementing the fourth Linked Data principle, i.e., the provision of links between data sources [17, 27]. These frameworks rely on link specifications, which explicate the conditions for computing new links between entities across knowledge bases. Due to the sheer size of the Web of Data, detecting links even when using trivial specifications can be time-demanding. Moreover, non-trivial LD tasks require complex link specifications for discovering accurate links between instances and are consequently even more challenging to optimize with respect to runtime. Our approach is based on original insights on the distribution of property domains and ranges on the Web of Data. Based on these insights, we deduce requirements for efficient LD frameworks. We then use these requirements to specify the time-efficient approaches that underlie our framework, LIMES version 0.5.^{3} We show that our framework outperforms the state of the art by orders of magnitude with respect to runtime while abiding by the restriction of not losing recall.^{4} Our main contributions are as follows:
 1. We present a formal grammar for link specifications that encompasses the functionality of state-of-the-art frameworks for LD.
 2. Based on this grammar, we present a time-efficient approach for LD that is based on translating complex link specifications into a combination of atomic specifications via a concatenation of operations on sets and filter operations.
 3. We use this method to enable the PPJoin+ [31] and EDJoin [30] algorithms to process complex link specifications.
 4. We specify and evaluate HYPPO, a novel LD approach designed to operate on numeric values in metric spaces.
 5. We evaluate our approach against SILK (version 2.5.1) in nine experiments and show that we outperform it by up to four orders of magnitude with respect to runtime while abiding by the constraint of not losing recall. Note that we chose SILK because it is the only freely available framework that supports specifications with similarity measures whose complexity is similar to those supported by our approach.

This article extends our previous work [16] in the following respects:

 - We present a broader motivation for our approach, including a study of property types across several knowledge bases.
 - We extend the specification of the grammar underlying our framework and include a corresponding example.
 - Moreover, we explicate the inner workings of LIMES by presenting its current architecture and graphical user interface.
 - The experiments and results sections are completely new. All experiments presented in [16] were repeated with the (at the time of writing) newest release of SILK (version 2.5.1). In addition, six new experiments were designed to compare the runtime of HYPPO with that of SILK's numeric processing algorithm.
2 Related Work
Current frameworks for LD on the Web of Data can be subdivided into two categories: domain-specific and universal frameworks [17]. Domain-specific LD frameworks were developed with the aim of discovering links between knowledge bases from a particular domain. For example, RKB's Consistent Reference Service (RKB-CRS) [9] is a service that aims to compute URI equivalence within the domain of academia. It applies string similarity functions to properties such as publication titles to detect initial equivalences. It then uses this knowledge to infer the equivalence of other entities such as authors, places of work and conferences. Another domain-specific tool is GNAT [23], which was designed especially for the music domain. To compute the similarity of resources, GNAT relies on audio fingerprinting. This approach can also combine the similarity of resources with that of their neighbors to compute owl:sameAs links. Further simple or domain-specific approaches can be found in [6, 10, 21, 22, 25].
Universal LD frameworks are designed to carry out mapping tasks independently of the domain of the source and target knowledge bases. For example, RDF-AI [24] implements a five-step approach that comprises the preprocessing, matching, fusion, interlinking and postprocessing of data sets. These modules can be configured by means of XML files. The SILK framework [11] implements an LD approach dubbed MultiBlock. In contrast to several other blocking approaches, MultiBlock is guaranteed to be lossless, which means that given a link specification, it is guaranteed to generate all triples that abide by the specification. To achieve this goal, MultiBlock maps the different similarities included in complex link specifications to a multidimensional space. The coordinates of the resources that are to be linked are then computed by means of an elaborate indexing scheme. The computation of links is finally achieved by computing overlapping blocks and carrying out similarity computations within these blocks only. Like LIMES, SILK can be configured using an XML-based language. The original LIMES approach [17] is a time-efficient and lossless approach for LD, which presupposes that the data sets to link are in a metric space. Exploiting the characteristics of metric spaces, it begins by computing exemplars, which are prototypical points for portions of the space. The approach then uses the triangle inequality to compute pessimistic approximations of distances. Based on these approximations, it can discard a large number of computations without losing links.
LD is closely related to record linkage [7, 29] and deduplication [4], topics on which a wealth of literature has been written (see, e.g., [5] for a survey). Link discovery builds upon this research but goes beyond these two tasks by aiming to provide the means to link entities via any of the relations available on the Linked Data Web. For example, the LIMES framework has been used to link drugs with their active moieties and inactive ingredients as well as to link houses in Oxford with nearby geospatial entities [13].^{5} Different blocking techniques such as standard blocking, sorted neighborhood, bigram indexing, canopy clustering and adaptive blocking have been developed by the database community to address the quadratic time complexity of brute-force comparison [12]. In addition, time-efficient approaches have been proposed to compute string similarities for record linkage, including AllPairs [2], PPJoin and PPJoin+ [31], EDJoin [30] and TrieJoin [28]. The most time-efficient string matching algorithms can only deal with simple link specifications (i.e., they can only compare entities by means of one pair of property values), which is mostly insufficient when computing links between large knowledge bases. In this paper, we show how we can harness time-efficient approaches by combining them in a framework that enables their use with complex configurations. We integrate PPJoin+ and EDJoin into our framework. We also present the novel Hypersphere Approximation algorithm (HYPPO), which ensures that our framework can deal efficiently with numeric values and consequently with the whole diversity of data types found on the Web of Data.
3 Preliminaries
3.1 Problem Definition
The goal of LD is to discover the set of pairs of instances \((s,t)\in S \times T\) that are related by a relation \(R\), where \(S\) and \(T\) are two not necessarily distinct sets of instances. One way to automate this discovery is to compare each \(s \in S\) and \(t \in T\) based on their properties using a (in general complex) similarity metric. Two entities are then considered to be linked via \(R\) if their similarity is greater than or equal to a threshold \(\theta \). We are aware that several other categories of approaches can be envisaged for discovering links between instances, for example using formal inferences or semantic similarity functions. Throughout this paper, we consider LD via properties. This is the most common definition of instance-based LD [17, 27], which translates into the following formal definition:
Definition 1
(Link discovery). Given two sets \(S\) (source) and \(T\) (target) of instances, a (complex) similarity measure \(\sigma \) over the properties of \(s \in S\) and \(t \in T\) and a similarity threshold \(\theta \in [0,1]\), the goal of LD is to compute the set of pairs of instances \((s,t)\in S\times T\) such that \(\sigma (s, t)\ge \theta \).
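Read literally, Definition 1 yields a brute-force algorithm that compares every pair in \(S \times T\). The following sketch (in Python, with a hypothetical similarity measure of our own) makes the quadratic cost explicit; it is exactly this cost that the approaches presented later aim to reduce:

```python
from itertools import product

def brute_force_ld(S, T, sigma, theta):
    # Compare every (s, t) pair and keep those whose similarity
    # reaches the threshold theta: O(|S| * |T|) measure evaluations.
    return {(s, t) for s, t in product(S, T) if sigma(s, t) >= theta}

# Illustrative measure: exact match on lowercased labels.
sigma = lambda s, t: 1.0 if s.lower() == t.lower() else 0.0
links = brute_force_ld(["Berlin", "Paris"], ["berlin", "London"], sigma, 0.5)
# links == {("Berlin", "berlin")}
```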
This problem can be expressed equivalently as follows:
Definition 2
(Link discovery on distances). Given two sets \(S\) and \(T\) of instances, a (complex) distance measure \(\delta \) over the properties of \(s \in S\) and \(t \in T\) and a distance threshold \(\tau \in [0,\infty [\), the goal of LD is to compute the set of pairs of instances \((s,t)\in S\times T\) such that \(\delta (s, t)\le \tau \).
Note that a distance function \(\delta \) can always be transformed into a normed similarity function \(\sigma \) by setting \(\sigma (x,y)=(1+\delta (x, y))^{-1}\). Hence, the distance threshold \(\tau \) can be transformed into a similarity threshold \(\theta \) by means of the equation \(\theta =(1+\tau )^{-1}\). Consequently, distances and similarities are used interchangeably within our framework.
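The conversion between the two formulations can be sketched in a few lines (Python used for illustration; the function names are ours):

```python
def sim_from_dist(delta):
    # sigma(x, y) = (1 + delta(x, y))^(-1) maps [0, inf) onto (0, 1].
    return lambda x, y: 1.0 / (1.0 + delta(x, y))

def sim_threshold(tau):
    # The matching similarity threshold: theta = (1 + tau)^(-1).
    return 1.0 / (1.0 + tau)

# delta(x, y) <= tau holds iff sigma(x, y) >= theta, so both
# formulations of LD select exactly the same pairs.
delta = lambda x, y: abs(x - y)
sigma = sim_from_dist(delta)
tau, theta = 3.0, sim_threshold(3.0)  # theta == 0.25
for x, y in [(2, 4), (0, 10), (1, 1)]:
    assert (delta(x, y) <= tau) == (sigma(x, y) >= theta)
```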
Although it is sometimes sufficient to define atomic similarity functions (i.e., similarity functions that operate on exactly one property pair) for LD, many LD problems demand the specification of complex similarity functions to return accurate links. For example, while the names of bands can be used for detecting duplicate bands across different knowledge bases, linking cities from different knowledge bases requires taking more properties into consideration (e.g., the different names of the cities as well as their latitude and longitude) to compute links accurately. The same holds for movies, where similarity functions based on properties such as the label and length of a movie as well as the name of its director are necessary to achieve high-accuracy link discovery. Consequently, linking on the Data Web demands frameworks that support complex link specifications.
3.2 Categorization of Approaches to Link Discovery
Three main categories of approaches can be envisaged for dealing with complex link specifications. The first type of approach, which we dub multidimensional, addresses linking by mapping each instance to one or several points in a multidimensional (usually but not necessarily metric) space. Such approaches then use runtime reduction techniques, most commonly blocking [12], to discard comparisons that cannot lead to a similarity above the user-given threshold. An example of such an approach is SILK's MultiBlock [11]. The main advantage of such approaches is that they can exclude a large number of comparisons and that they are able to detect blocks that do not overlap significantly.
The second category of approaches, which we call mono-dimensional, relies on generating necessary constraints across single dimensions of the similarity space and using these constraints to extract linking candidates. These candidates are then merged and validated, i.e., checked as to whether they satisfy the linking condition specified by the user. The main advantage of such approaches is that they deal with only one dimension at a time, thus making runtime reduction computationally cheaper. On the other hand, discarding along only one dimension at a time is usually less accurate, since converting the user-given constraints to one dimension usually leads to necessary but not sufficient conditions. Consequently, mono-dimensional approaches generate more candidates that must be validated.
Hybrid approaches aim to make the best of both worlds: like multidimensional approaches, they use runtime reduction techniques on the fragments of the link specifications where the blocks do not overlap significantly to generate candidates; like mono-dimensional approaches, they then merge these candidates to generate the final list of links. Hybrid approaches thus aim to ensure that only the cheapest computations which discard a large percentage of non-matches are carried out. In this paper, we describe such an approach. The basic intuition behind it is that time-efficient linking frameworks designed for the Web of Data should provide dedicated algorithms for processing the most commonly used data types found on the Web of Data. These algorithms should make use of the intrinsic characteristics of the data types they process to operate as efficiently as possible. Our approach can efficiently process all property values that can be mapped efficiently to a metric space by applying the HYPPO algorithm in that space. In addition, it provides dedicated algorithms for processing data types that cannot (yet) be efficiently mapped to metric spaces. To determine the most common data types on the Web of Data, we carried out a short study of the distribution of property ranges across different knowledge bases and used it to specify our approach to LD.
3.3 Requirements to Link Discovery Frameworks
Table 1 Distribution of datatype property ranges on the Web of Data
Knowledge bases  #Datatype properties\(^*\)  #Data types  String  Numeric  Others 

LGD  1,001  3  0  1,001  0 
DBpedia  1,048  60  282  765  1 
DailyMed\(^{*}\)  17  1  17  0  0 
Jamendo\(^{*}\)  15  4  8  5  2 
DBLP\(^{*}\)  5  2  4  1  0 
Our framework thus implements a hybrid approach that takes the distribution of data types into account: it implements dedicated functionality for processing simple linking tasks on strings (e.g., comparing the names of two cities) and on combinations of numeric values (e.g., comparing the population and elevation of two cities), as well as for merging their results to carry out complex linking tasks (e.g., linking cities using their labels, population and elevation) efficiently. To achieve this goal, our framework implements a grammar for transforming complex linking tasks into a combination of simple linking tasks and set operations. In the following, we present this grammar and then present efficient approaches to carrying out simple linking tasks, which together enable the time-efficient completion of complex linking tasks.
4 Link Specifications as Operations on Sets
In stateoftheart LD frameworks, the condition for establishing links is usually expressed by using combinations of operations such as MAX (maximum), MIN (minimum) and linear combinations on binary similarity measures that compare property values of two instances \((s, t) \in S \times T\). Note that transformation operations may be applied to the property values (for example a lowercase transformation for strings) but do not affect our formal model. We present a formal grammar that encompasses complex link specifications as found in current LD frameworks (e.g., LIMES [19], SILK [11], KnoFuss [20]) and show how complex configurations resulting from this grammar can be translated into a sequence of set and filter operations on simple configurations. We use \(\rightsquigarrow \) to denote generation rules for metrics and specifications. The symbol \(\equiv \) denotes the equivalence of two specifications.
A complex similarity measure \(m\) is generated by the following rules, where \(metricOp\) denotes a binary metric operator such as MAX, MIN or a linear combination:

 1. \(m \rightsquigarrow atomicMeasure\)
 2. \(m \rightsquigarrow metricOp(m_1, m_2)\)
Complex specifications are built from simpler ones by combining two specifications via:

 1. specification operators \(specOp\) such as AND (the conditions of both specifications must be satisfied, equivalent to set intersection), OR (set union), XOR (symmetric set difference) or DIFF (set difference), and
 2. a filtering threshold.
This leads to the following generation rules for specifications:

 1. \(spec(m, \theta )\rightsquigarrow atomicSpec(m, \theta )\)
 2. \(spec(m, \theta )\rightsquigarrow specOp(spec(m_1,\theta _1), spec(m_2,\theta _2),\theta _3)\)
Complex specifications generated by this grammar can be translated into combinations of atomic specifications by means of the following equivalences (for similarity measures normed to \([0,1]\)):

 1. \(spec(MAX (m_1, m_2), \theta ) \equiv OR(spec (m_1, \theta ), spec(m_2, \theta ), 0)\)
 2. \(spec(MIN (m_1, m_2), \theta ) \equiv AND(spec (m_1, \theta ), spec(m_2, \theta ), 0)\)
 3. \(spec(\alpha m_1 + \beta m_2, \theta ) \equiv AND(spec (m_1, (\theta - \beta )/\alpha ), spec(m_2, (\theta - \alpha )/\beta ), \theta )\)

Note that \(MAX(m_1, m_2) \ge \theta \) holds exactly when \(m_1 \ge \theta \) or \(m_2 \ge \theta \) (hence OR), while \(MIN(m_1, m_2) \ge \theta \) requires both conditions (hence AND).
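Modelling a specification as the set of pairs it accepts makes such translations directly checkable. The following sketch (measures and data purely illustrative) verifies that, for similarities in \([0,1]\), MAX reduces to set union and MIN to set intersection, since \(\max(a,b)\ge \theta\) iff \(a\ge \theta\) or \(b\ge \theta\), and \(\min(a,b)\ge \theta\) iff both hold:

```python
def atomic_spec(m, theta, S, T):
    # Pairs accepted by the atomic specification (m, theta).
    return {(s, t) for s in S for t in T if m(s, t) >= theta}

# Two toy similarity measures on strings.
m1 = lambda s, t: 1.0 if s == t else 0.0          # exact match
m2 = lambda s, t: 1.0 if s[0] == t[0] else 0.0    # same first character
S, T, theta = ["ab", "cd"], ["ab", "ax"], 0.5

lhs_max = {(s, t) for s in S for t in T if max(m1(s, t), m2(s, t)) >= theta}
rhs_max = atomic_spec(m1, theta, S, T) | atomic_spec(m2, theta, S, T)
assert lhs_max == rhs_max                         # MAX == set union

lhs_min = {(s, t) for s in S for t in T if min(m1(s, t), m2(s, t)) >= theta}
rhs_min = atomic_spec(m1, theta, S, T) & atomic_spec(m2, theta, S, T)
assert lhs_min == rhs_min                         # MIN == set intersection
```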
5 Processing Simple Configurations
Our framework implements a hybrid approach to LD with two main types of matchers for processing simple configurations: string matchers and numeric matchers. In the following, we present the idea behind the string matching algorithms we employ and then present the novel HYPPO algorithm.
5.1 Processing Strings
The first category of matchers implemented in our framework deals exclusively with strings by harnessing the near-duplicate detection algorithms PPJoin+ [31] and EDJoin [30]. Instead of mapping strings to a vector space, PPJoin+ and EDJoin use a combination of three main insights to implement a time-efficient string comparison approach. First, they use the idea that two strings must share a certain number of characters in their prefix to be able to have a similarity beyond the user-specified threshold. A similar intuition governs the suffix filtering implemented by these algorithms. Finally, the algorithms make use of the position of each word \(w\) in the index to retrieve a lower and upper bound on the index of the terms with which \(w\) might be similar. By combining these three approaches, PPJoin+ and EDJoin can discard a large number of non-matches. The integration of these two algorithms into our framework ensures that we avoid the pitfall of the time-demanding transformation of strings to vector spaces as implemented by multidimensional approaches. The main drawback of PPJoin+ and EDJoin is that they can only operate on one dimension [12]. However, by applying the transformations of configurations specified above, we make these algorithms applicable to link discovery tasks with complex configurations. While mapping strings to a vector space demands some transformation steps and can thus be computationally demanding, all numeric values explicitly describe a vector space. The second approach implemented in our framework deals exclusively with numeric values and implements a novel algorithm dubbed HYPPO.
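To illustrate the first of these three insights, consider the toy prefix-filtering join below. It is a simplified sketch in the spirit of PPJoin, not the algorithm itself (positional and suffix filtering are omitted), and all names and data are illustrative. For a Jaccard threshold \(t\), two records can only match if they share a token within the first \(|x| - \lceil t\cdot |x|\rceil + 1\) tokens of a fixed global ordering, so pairs with disjoint prefixes are never compared:

```python
import math
from collections import defaultdict

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def prefix_join(records, t):
    # records: id -> token list, each sorted in one fixed global order.
    def prefix_len(n):
        # A pair with Jaccard >= t must share a token in this prefix.
        return n - math.ceil(t * n) + 1

    index = defaultdict(list)   # token -> ids whose prefix contains it
    candidates = set()
    for rid, toks in records.items():
        for tok in toks[:prefix_len(len(toks))]:
            candidates.update((other, rid) for other in index[tok])
            index[tok].append(rid)
    # Verification step: keep only candidates that truly match.
    return {(a, b) for a, b in candidates
            if jaccard(records[a], records[b]) >= t}

records = {"a": ["city", "new", "york"],
           "b": ["new", "york"],
           "c": ["angeles", "los"]}
# prefix_join(records, 0.6) == {("a", "b")}
```

Record "c" is never compared with anything: its one-token prefix shares nothing with the prefixes of "a" and "b", which is precisely the pruning effect described above.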
5.2 Processing Numeric Values
Current approaches to LD mostly focus on processing strings efficiently. Yet, as shown in Table 1, values that can be mapped to real numbers (e.g., elevations, temperatures, populations, etc.) play an important role on the Linked Data Web. We developed the HYPPO algorithm to address the efficient processing of such property values. HYPPO stands for HYpersphere aPPrOximation algorithm. It addresses the problem of efficiently matching instance pairs \((s,t) \in S \times T\) described exclusively by numeric values in an \(n\)-dimensional metric space. The approach assumes a distance metric \(\delta \) for measuring the distance between objects and returns all pairs such that \(\delta (s,t) \le \theta \), where \(\theta \) is a distance threshold. Let \(\omega =(\omega _1,\ldots ,\omega _n)\) and \(x=(x_1,\ldots ,x_n)\) be points in the \(n\)-dimensional space \(\Omega =S\cup T\). The observation behind HYPPO is that in spaces \((\Omega ,\delta )\) with orthogonal, i.e., uncorrelated dimensions, the most common distance metrics can be decomposed into a combination of functions \(\phi _{i,i\in \{1\ldots n\}}\), each of which operates on exactly one dimension of \(\Omega \): \(\delta =f(\phi _1, \ldots , \phi _n)\). For example, for Minkowski distances of order \(p>1\), \(\phi _i(x,\omega )=|x_i-\omega _i|\) for all values of \(i\) and \(\delta (x,\omega )=\root p \of {\sum \phi _i(x,\omega )^p}\). Note that the Euclidean distance is the Minkowski distance of order 2. The Minkowski distance can be extended further by weighting the different axes of \(\Omega \): in this case, \(\phi _i(x,\omega )={\gamma }_{ii}|x_i-\omega _i|\) and \(\delta (x,\omega )=\root p \of {\sum \phi _i(x,\omega )^p}\), where the \(\gamma _{ii}\) are the entries of a positive diagonal matrix.
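For concreteness, the decomposition for the unweighted Minkowski distance can be written as follows (a small Python sketch; the function names are ours). It also checks the per-dimension bound \(\phi _i(x,\omega ) \le \delta (x,\omega )\), which is what makes the hypercube approximation possible:

```python
def phi(xi, wi):
    # Per-dimension contribution phi_i(x, w) = |x_i - w_i|.
    return abs(xi - wi)

def minkowski(x, w, p=2):
    # delta(x, w) = (sum_i phi_i(x, w)^p)^(1/p); p = 2 is Euclidean.
    return sum(phi(xi, wi) ** p for xi, wi in zip(x, w)) ** (1.0 / p)

x, w = (0.0, 3.0), (4.0, 0.0)
d = minkowski(x, w)                     # sqrt(16 + 9) = 5.0
# Each phi_i is bounded by the distance itself, so delta(x, w) <= theta
# forces |x_i - w_i| <= theta in every single dimension.
assert all(phi(xi, wi) <= d for xi, wi in zip(x, w))
```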
The basic intuition behind HYPPO is that the hypersphere \(H(\omega , \theta )=\{x\in \Omega : \delta (x, \omega ) \le \theta \}\) is a subset of the hypercube \(V\) defined as \(V(\omega ,\theta )=\{x\in \Omega : \forall i \in \{1 \ldots n\}, \phi _i(x, \omega ) \le \theta \}\), since \(\phi _i(x,\omega ) \le \delta (x,\omega )\) holds for every dimension \(i\). Consequently, one can reduce the number of comparisons necessary to detect all elements of \(H(\omega , \theta )\) by discarding all elements that are not in \(V(\omega , \theta )\) as non-matches. HYPPO uses this intuition by implementing a two-step approach to LD. First, it divides \(\Omega \) into hypercubes of the same volume. Second, it compares each \(s \in S\) with those \(t \in T\) that lie in cubes at a distance below \(\theta \). These two steps differ from the steps followed by similar algorithms (such as blocking) in that we use not one but several hypercubes to approximate \(H(\omega ,\theta )\), whereas most blocking approaches rely on finding one block that contains the elements that are to be compared with \(\omega \) [12]. Note that in contrast to most blocking techniques, HYPPO is guaranteed to be lossless, as \(H\) is completely enclosed in \(V\).
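A minimal sketch of this two-step procedure is given below. It follows the tiling idea (cubes of edge length \(\theta /\alpha \), comparisons only across cubes within the threshold) but omits all of HYPPO's engineering; the names, the default granularity and the choice of the Euclidean distance are ours:

```python
from collections import defaultdict
from itertools import product

def hyppo_sketch(S, T, theta, alpha=1):
    # Euclidean distance as an example metric.
    delta = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    width = theta / alpha                        # edge length of a cube
    cube = lambda p: tuple(int(c // width) for c in p)
    index = defaultdict(list)
    for t in T:
        index[cube(t)].append(t)
    links = set()
    for s in S:
        c = cube(s)
        # Any t with delta(s, t) <= theta lies in a cube whose index
        # differs from c by at most alpha in every dimension.
        for off in product(range(-alpha, alpha + 1), repeat=len(s)):
            for t in index[tuple(ci + oi for ci, oi in zip(c, off))]:
                if delta(s, t) <= theta:         # verification step
                    links.add((s, t))
    return links

S = [(0.0, 0.0), (5.0, 5.0)]
T = [(0.5, 0.0), (3.0, 0.0), (5.0, 4.5)]
# Same result as brute force, but distant pairs are never compared:
assert hyppo_sketch(S, T, theta=1.0, alpha=2) == {
    ((0.0, 0.0), (0.5, 0.0)), ((5.0, 5.0), (5.0, 4.5))}
```

The closing assertion illustrates losslessness on toy data: every pair within the threshold is found, while the pair \(((0,0),(3,0))\) sits in a cube that is never enumerated.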
6 Implementation
7 Evaluation
We compared our approach with that implemented in SILK version 2.5.1. We chose SILK because (to the best of our knowledge) it is the only other LD framework that allows the specification of such complex linking experiments. We could not separate the fetching and the indexing of the data in SILK, as these two processes are intertwined. To ensure that our evaluation was not biased towards LIMES, we only measured the time needed by SILK to compute links. This was realized by allowing both tools to download all data necessary for the linking experiments onto the hard drive of our machine. Note that SILK indexes the data: it downloads and stores the data and the index locally on the hard drive. LIMES, on the other hand, simply downloads the data and serializes it into a file. We ran the experiments by allowing both systems to retrieve the data necessary for linking from the hard drive of the local machine. As all computations are done on the fly in LIMES (i.e., no pre-indexing or data segmentation is carried out during the download of the data), our measurements of LIMES' overall runtime comprise the sum of the tiling and link computation time, while the measurements for SILK reflect exclusively the time necessary for the computation of links.
We ran all experiments on the same computer running a Windows 7 Enterprise 64-bit installation on a 2.8 GHz i7 processor with 8 GB RAM. The JVM was allocated 4 GB RAM in the first series of experiments and 7.4 GB RAM in the second series. All experiments were carried out five times, except when stated otherwise. In all cases, we report the best runtimes. The a priori complexity of each experiment was computed as \(n|S||T|\), where \(n\) is the number of property pairs used during the experiment, \(|S|\) is the size of the source knowledge base and \(|T|\) is the size of the target knowledge base.
7.1 Experiments with HYPPO
In our first series of experiments, we aimed to determine the behavior of HYPPO on problems with a varying number of dimensions. Thus, we evaluated HYPPO in six use cases of 1, 2 or 3 dimensions and compared it with SILK. To ensure that we compared solely HYPPO with SILK, we designed experiments that aimed at deduplicating instances in DBpedia and executed solely the fragment of the specification that dealt with numeric values. We chose DBpedia because it contains a large amount of general knowledge. Note that all data sets were retrieved from a local copy of DBpedia 3.6. We ran all experiments with distance threshold (\(\theta \)) and granularity (\(\alpha \)) values between 1 and 16. In all experiments, we used the normed similarity based on the Euclidean distance.
Table 2 Summary of experimental setups for the experiments on HYPPO

Experiment  # Instances  A priori complexity  # Dimensions

Town  27,525  \(0.76 \times 10^9\)  1 
Books  14,714  \(0.22 \times 10^9\)  1 
Vacations  21,925  \(0.96 \times 10^9\)  2 
Actors  15,909  \(0.51 \times 10^9\)  2 
Series  4,841  \(0.07 \times 10^9\)  3 
Hydrology  19,095  \(1.09 \times 10^9\)  3 
7.2 Experiments with LIMES
We compared our whole framework with SILK 2.5.1 in three experiments of different complexity based on geographic data. We chose geographic data sets because they are large and require the use of several attributes for linking. Given the complexity of the data, having a time-efficient indexing scheme plays a central role in these experiments. Thus, we first measured the runtime without indexing (as in the previous experiments). In addition, we approximated the runtime necessary for the ARQ^{8} library (which is used in both tools) to fetch the data from the endpoints. By these means, we could approximate the total runtime of both approaches including indexing. In the first experiment, we computed links between villages in DBpedia and LinkedGeoData based on the rdfs:label and the population of instances. The link condition was twofold: (1) the difference in population had to be less than or equal to \(\theta \) and (2) the labels had to have a trigram similarity greater than or equal to \(\tau \). In the second experiment, we aimed to link towns and cities from DBpedia with populated places in GeoNames. We used the names (gn:name), alternate names (gn:alternateName) and population of cities as criteria for the comparison. Finally, we computed links between geolocations in LinkedGeoData and GeoNames using a combination of four criteria for comparing entities: their longitude (wgs84:long), latitude (wgs84:lat), preferred names and names.
Table 3 Summary of experimental setups for LIMES and SILK

Experiment  \(|S|\)  \(|T|\)  Dims  Complexity  Thresholds

Villages  26,717  103,175  2  \(5.5 \times 10^9\)  \(\tau _\mathrm{s}, \theta _\mathrm{p}\) 
Cities  36,877  39,800  3  \(4.4 \times 10^9\)  \(\tau _\mathrm{s}, \theta _\mathrm{p}\) 
GeoLocations  50,031  74,458  4  \(14.9 \times 10^9\)  \(\tau _\mathrm{s}, \theta _\mathrm{p}, \theta _\mathrm{l}\) 
8 Discussion and Future Work
In this paper, we presented and evaluated a novel hybrid approach to LD. We first presented a series of requirements for LD frameworks. Based on these requirements, we specified the characteristics of such frameworks. We then presented original insights for converting complex link specifications into simple link specifications. Based on these conversions, we inferred that efficient means for processing simple link specifications are the key to time-efficient linking. We then presented the key time-efficient approaches implemented in LIMES and showed how they can be combined for time-efficient linking. A thorough evaluation of our framework in nine experiments showed that we outperform SILK by up to 4.5 orders of magnitude while not losing a single link.
One of the central innovations of this paper is the HYpersphere aPPrOximation algorithm, HYPPO. Although it was defined for numeric values, HYPPO can easily be generalized to the efficient computation of pairs of entities that are totally ordered, i.e., to all sets of entities \(e=(e_1,\ldots ,e_n) \in E\) for which real functions \(f_i\) exist that preserve the order \(\succ \) on the \(i\)th dimension of \(E\), ergo \(\forall e,e^{\prime }\in E: e_i \succ e^{\prime }_i\rightarrow f_i(e_i)>f_i(e^{\prime }_i)\). Yet, it is important to note that such a function can be complex and thus lead to overheads that may nullify the time gain of HYPPO. In future work, we will aim to find such functions for different data types. In addition, we will aim to formulate an approach for determining the best value of \(\alpha \) for any given link specification. The new version of LIMES promises to be a stepping stone for the creation of a multitude of novel semantic applications, as it is time-efficient enough to make complex interactive scenarios for link discovery possible even at large scale [18].
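As a simple illustration of such an order-preserving embedding (a hypothetical example of ours, not taken from the evaluation), calendar dates are totally ordered and can be mapped to the reals via their ordinal day number, after which HYPPO-style tiling could in principle be applied to the images:

```python
from datetime import date

# f maps each date to an integer while preserving the order, i.e.
# a < b implies f(a) < f(b). Here f is cheap to compute; for other
# data types such an embedding may be costly enough to cancel the
# runtime gain, which is exactly the caveat raised above.
f = lambda d: d.toordinal()
a, b = date(2007, 5, 1), date(2011, 9, 1)
assert (a < b) == (f(a) < f(b))
assert f(b) - f(a) == (b - a).days
```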
Footnotes
 1.
 2. While it is clear that most knowledge bases should be linked to several other knowledge bases, determining the desirable proportion of links on the Linked Data Cloud remains work in progress.
 3. An online demo of the framework can be found at http://limes.sf.net.
 4. Not losing recall is used in the same sense as in [11] and means in this context that given a link specification, our approach is guaranteed to find all pairs of source and target instances that abide by the said specification.
 5. The corresponding link specifications are available for download at http://aksw.org/Projects/LIMES.
 6. Note that we consider numerical data to be data with a datatype such that there is a bijective mapping between the set of all elements of these datatypes and the real numbers.
 7. See http://limes.sf.net. The user manual available at the same page describes the architecture presented herein in more detail.
 8.
Acknowledgments
This work was supported by the Eurostars Project E!4604 SCMS and a research fellowship grant of the Research Unit Media Convergence of the University of Mainz.
References
 1. Auer S, Lehmann J, Ngonga Ngomo AC (2011) Introduction to linked data and its lifecycle on the web. In: Reasoning web, pp 1–75
 2. Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140
 3. Ben-David D, Domany T, Tarem A (2010) Enterprise data classification using semantic web technologies. In: ISWC
 4. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv 41(1):1–41
 5. Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
 6. Cudré-Mauroux P, Haghani P, Jost M, Aberer K, de Meer H (2009) idMesh: graph-based disambiguation of linked data. In: WWW, pp 591–600
 7. Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19:1–16
 8. Gale D, Shapley LS (1962) College admissions and the stability of marriage. Am Math Mon 69(1):9–15
 9. Glaser H, Millard IC, Sung WK, Lee S, Kim P, You BJ (2009) Research on linked data and co-reference resolution. University of Southampton, Technical Report
 10. Hogan A, Polleres A, Umbrich J, Zimmermann A (2010) Some entities are more equal than others: statistical methods to consolidate linked data. In: Workshop on new forms of reasoning for the semantic web: scalable and dynamic (NeFoRS2010)
 11. Isele R, Jentzsch A, Bizer C (2011) Efficient multidimensional blocking for link discovery without losing recall. In: WebDB
 12. Köpcke H, Thor A, Rahm E (2009) Comparative evaluation of entity resolution approaches with FEVER. Proc VLDB Endow 2(2):1574–1577
 13. Lehmann J, Furche T, Grasso G, Ngonga Ngomo AC, Schallhart C, Sellers A, Unger C, Bühmann L, Gerber D, Höffner K, Liu D, Auer S (2012) DEQA: deep web extraction for question answering. In: Proceedings of ISWC (to appear)
 14. Lopez V, Uren V, Sabou MR, Motta E (2009) Cross ontology query answering on the semantic web: an initial evaluation. In: K-CAP '09: proceedings of the fifth international conference on knowledge capture. ACM, New York, pp 17–24
 15. Manlove D, Irving R, Iwama K, Miyazaki S, Morita Y (2002) Hard variants of stable marriage. Theor Comput Sci 276(1–2):261–279
 16. Ngonga Ngomo AC (2011) A time-efficient hybrid approach to link discovery. In: Sixth international workshop on ontology matching at ISWC
 17. Ngonga Ngomo AC, Auer S (2011) LIMES: a time-efficient approach for large-scale link discovery on the web of data. In: Proceedings of the international joint conference on artificial intelligence
 18. Ngonga Ngomo AC, Lehmann J, Auer S, Höffner K (2011) RAVEN: active learning of link specifications. In: Proceedings of the sixth international ontology matching workshop
 19. Ngonga Ngomo AC, Lyko K (2012) EAGLE: efficient active learning of link specifications using genetic programming. In: Proceedings of ESWC
 20. Nikolov A, D'Aquin M, Motta E (2012) Unsupervised learning of data linking configuration. In: Proceedings of ESWC
 21. Nikolov A, Uren VS, Motta E, De Roeck AN (2009) Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution. In: ASWC, pp 332–346
 22. Papadakis G, Ioannou E, Niederée C, Palpanas T, Nejdl W (2011) Eliminating the redundancy in blocking-based entity resolution methods. In: JCDL
 23. Raimond Y, Sutton C, Sandler M (2008) Automatic interlinking of music datasets on the semantic web. In: Proceedings of the 1st workshop about linked data on the web
 24. Scharffe F, Liu Y, Zhou C (2009) RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In: Proceedings of IJCAI 2009 workshop on identity, reference, and knowledge representation (IRKR), Pasadena, CA, USA
 25. Sleeman J, Finin T (2010) Computing FOAF co-reference relations with rules and machine learning. In: Proceedings of the third international workshop on social data on the web
 26. Urbani J, Kotoulas S, Maassen J, van Harmelen F, Bal H (2010) OWL reasoning with WebPIE: calculating the closure of 100 billion triples. In: Proceedings of ESWC 2010
 27. Volz J, Bizer C, Gaedke M, Kobilarov G (2009) Discovering and maintaining links on the web of data. In: ISWC, pp 650–665
 28. Wang J, Li G, Feng J (2010) Trie-Join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1):1219–1230
 29. Winkler W (2006) Overview of record linkage and current research directions. Technical Report, US Bureau of the Census, Research Report Series
 30. Xiao C, Wang W, Lin X (2008) Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc VLDB Endow 1(1):933–944
 31. Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140