
1 Introduction

The Web of Linked Data has seen huge growth in the past few years. As of September 2011, the Linked Open Data Cloud had grown to a size of 31.6 billion triples. This includes a wide range of data sources belonging to the government (42 %), geographic (19.4 %), life sciences (9.6 %) and other domains (Footnote 1). A common way that the instances in these sources are linked to others is through the owl:sameAs property. Though the size of the Linked Data Cloud is increasing drastically (10 % over the 28.5 billion triples in 2010), inspection of the sources at the ontology level reveals that only a few of them (15 out of the 190 sources) have some mapping between vocabularies. For the success of the Semantic Web, it is important that these heterogeneous schemas be linked. As described in our previous papers on Linking and Building Ontologies of Linked Data [8] and Aligning Ontologies of Geospatial Linked Data [7], an extensional technique can be used to generate alignments between the ontologies behind these sources. In those papers, we introduced the concept of restriction classes: the set of instances that satisfy a conjunction of value restrictions on properties (property-value pairs).

Though our algorithm was able to identify a good number of alignments, it was unable to completely cover one source with the classes of the other. Upon closer inspection, we found that most of the alignments we missed did not have a corresponding restriction class in the other source and instead subsumed multiple restriction classes. While reviewing these subset relations, we discovered that in many cases the union of the smaller classes completely covered the larger class. In this paper, we describe how we extend our previous work to discover such concept coverings by introducing a more expressive set of class descriptions (unions of value restrictions) (Footnote 2). In most of these coverings, the smaller classes are also found to be disjoint. In addition, further analysis of the alignments of these coverings provides a powerful tool for discovering incorrect links in the Web of Linked Data, which can potentially be used to point out and rectify inconsistencies in the instance alignments.

This paper is organized as follows. First, we describe the Linked Open Data sources that we try to align in this paper. Second, we briefly review our alignment algorithm from [8] along with the limitations of the results it generated. Third, we describe our approach to finding alignments between unions of restriction classes. Fourth, we describe how outliers in these alignments help to identify inconsistencies and erroneous links. Fifth, we describe experimental results on union alignments over additional domains. Finally, we compare against related work and discuss our contributions and future work.

2 Sources Used for Alignments

Linked Data, by definition, links the instances of multiple sources. Often, sources conform to different, but related, ontologies that can also be meaningfully linked [8]. In this section we describe some of these sources from different domains that we try to align, whose instances are linked using an equivalence property like owl:sameAs.

Linking GeoNames with places in DBpedia: DBpedia (dbpedia.org) is a knowledge base that covers multiple domains, including around 526,000 places and other geographical features from the Geospatial domain. We try to align the concepts in DBpedia with GeoNames (geonames.org), a geographic source with about 7.8 million things. GeoNames uses a flat-file-like ontology in which all instances belong to a single concept, Feature. This makes the ontology rudimentary: the type data (e.g. mountains, lakes, etc.) about these geographical features is instead stored in the featureClass and featureCode properties.

Linking LinkedGeoData with places in DBpedia: We also try to find alignments between the ontologies behind LinkedGeoData (linkedgeodata.org) and DBpedia. LinkedGeoData is derived from the Open Street Map initiative with around 101,000 instances linked to DBpedia using the owl:sameAs property.

Linking species from Geospecies with DBpedia: The Geospecies (geospecies.org) knowledge base contains species belonging to plant, animal, and other kingdoms linked to species in DBpedia using the skos:closeMatch property. Since the instances in the taxonomies in both these sources are the same, the sources are ideal for finding the alignment between the vocabularies.

Linking genes from GeneID with MGI: The Bio2RDF (bio2rdf.org) project contains inter-linked life sciences data extracted from multiple data-sets that cover genes, chemicals, enzymes, etc. We consider two sources from the Genetics domain from Bio2RDF, GeneID (extracted from the National Center for Biotechnology Information database) and MGI (extracted from the Mouse Genome Informatics project), where the genes are marked equivalent.

Although we provide results for the four alignments mentioned above in Section 4, in the rest of this paper we explain our methodology using the alignment of GeoNames with DBpedia as an example.

3 Aligning Ontologies on the Web of Linked Data

First, we briefly describe our previous work on finding subset and equivalence alignments between restriction classes from two ontologies. Then, we describe how to use the subset alignments to find more expressive union alignments. Finally, we discuss how outliers in these union alignments often identify incorrect links in the Web of Linked Data.

3.1 Our Previous Work on Aligning Ontologies of Linked Data

In [8] we introduced the concept of restriction classes to align extensional concepts in two sources. A restriction class is a concept that is derived extensionally and defined by a conjunction of value restrictions on properties (called property-value pairs) in a source. Such a definition helps overcome the problem of aligning rudimentary ontologies with more sophisticated ones. For example, GeoNames only has a single concept (Feature) to which all of its instances belong, while DBpedia has a rich ontology. However, Feature has several properties that can be used to define more meaningful classes. For example, the set of instances in GeoNames with the value PPL in the property featureCode aligns nicely with the instances of City in DBpedia.
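As a concrete sketch of this idea (in Python, with toy data; the instance identifiers and property values below are illustrative placeholders, not actual GeoNames triples, and a single property-value pair is used for brevity — a conjunction would intersect several such sets):

```python
# Toy instance data: each instance maps properties to values.
# The identifiers are illustrative, not drawn from the actual sources.
instances = {
    "gn:100": {"featureCode": "PPL", "countryCode": "ES"},
    "gn:101": {"featureCode": "PPL", "countryCode": "US"},
    "gn:102": {"featureCode": "S.SCH", "countryCode": "US"},
}

def restriction_class(source, prop, value):
    """Extensionally derive the set of instances whose `prop` equals `value`."""
    return {uri for uri, props in source.items() if props.get(prop) == value}

# {featureCode=PPL}: the restriction class of populated places
ppl = restriction_class(instances, "featureCode", "PPL")
```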

Our algorithm explored the space of restriction classes from the two ontologies and was able to find equivalence and subset alignments between these restriction classes. Fig. 1 illustrates the instance sets considered to score an alignment hypothesis. We first find the instances belonging to the restriction class \(r_1\) from the first source and \(r_2\) from the second source. We then compute the image of \(r_1\) (denoted by \(I(r_1)\)), which is the set of instances from the second source linked to instances in \(r_1\) (dashed lines in the figure). By comparing \(r_2\) with the intersection of \(I(r_1)\) and \(r_2\) (shaded region), we can determine the relationship between \(r_1\) and \(r_2\). We defined two metrics, P and R, as the ratios of \(|I(r_1) \cap r_2|\) to \(|I(r_1)|\) and to \(|r_2|\) respectively, to quantify set-containment relations. For example, two classes are equivalent if \(P=R=1\). In order to allow a certain margin of error induced by the data-set, we used the relaxed versions \(P'\) and \(R'\) as part of our scoring mechanism. In this case, two classes were considered equivalent if \(P' > 0.9\) and \(R' > 0.9\). For example, consider the alignment between the restriction classes {lgd:gnis%3AST_alpha=NJ} from LinkedGeoData and {dbpedia:Place#type=http://dbpedia.org/resource/City_(New_Jersey)} from DBpedia. Based on the extension sets, our algorithm finds \(|I(r_1)|\) = 39, \(|r_2|\) = 40, \(|I(r_1) \cap r_2|\) = 39, \(R'\) = 0.97 and \(P'\) = 1.0. Based on our error margins, we assert this alignment as equivalent in an extensional sense. The exploration of the space of alignments and the scoring procedure are described in detail in [8].
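The two metrics can be sketched as plain set arithmetic (Python; the sets below are synthetic stand-ins sized to match the LinkedGeoData/DBpedia example, not the actual instance data):

```python
def containment_scores(img_r1, r2):
    """P = |I(r1) ∩ r2| / |I(r1)| and R = |I(r1) ∩ r2| / |r2|."""
    overlap = len(img_r1 & r2)
    return overlap / len(img_r1), overlap / len(r2)

# Synthetic sets sized like the example: |I(r1)| = 39, |r2| = 40,
# |I(r1) ∩ r2| = 39, giving P = 1.0 and R = 39/40 ≈ 0.97.
img_r1 = {f"i{k}" for k in range(39)}
r2 = img_r1 | {"i_extra"}
p, r = containment_scores(img_r1, r2)
equivalent = p > 0.9 and r > 0.9  # True under the relaxed thresholds
```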

Fig. 1. Comparing the linked instances from two ontologies.

Though the approach produced a large number of equivalence alignments, we were not able to find a complete coverage because some restriction classes did not have a corresponding equivalent restriction class and instead subsumed multiple smaller restriction classes. For example, in the GeoNames and DBpedia alignment, we found that {rdf:type=dbpedia:EducationalInstitution} from DBpedia subsumed {geonames:featureCode=S.SCH}, {geonames:featureCode=S.SCHC} and {geonames:featureCode=S.UNIV} (i.e. Schools, Colleges and Universities from GeoNames). We discovered that, taken together, the union of these three restriction classes completely defines {rdf:type=dbpedia:EducationalInstitution}. To find such previously undetected alignments, we decided to extend the expressivity of our restriction classes by introducing a disjunction operator, allowing us to detect concept coverings completely.

3.2 Identifying Union Alignments

In our current work, we use the subset and equivalence alignments generated by our previous work to try to align a larger class from one ontology with a union of smaller subsumed restriction classes in the other ontology. Since the problem of finding alignments over conjunctions and disjunctions of property-value pairs is combinatorial in nature, we focus only on subset and equivalence relations involving restriction classes with a single property-value pair. This helps us find the simplest definitions of concepts and also makes the problem tractable. Alignments generated by our previous work that satisfy the single property-value pair constraint are first grouped by the subsuming restriction class and then by the property of the smaller classes. Since restriction classes are constructed by forming a set of instances that have one of the properties restricted to a single value, aggregating the restriction classes in a group according to their properties builds a more intuitive definition of the union. We can now define the disjunction operator that constructs the union concept from the smaller restriction classes in these sub-groups. The disjunction operator is defined for restriction classes such that (i) the concept formed by the disjunction of the classes represents the union of their sets of instances, (ii) each of the aggregated classes contains only a single property-value pair, and (iii) the property in all those property-value pairs is the same. We then try to detect the alignment between the larger common restriction class and the union by using an extensional approach similar to that of our previous paper. We call such an alignment a hypothesis union alignment.
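A minimal sketch of this grouping step (Python; the first three alignment triples are the Educational Institution subsets discussed in the text, the last is an illustrative extra):

```python
from collections import defaultdict

# Subset alignments as (larger class, property of smaller class, value) triples.
subset_alignments = [
    ("dbpedia:EducationalInstitution", "geonames:featureCode", "S.SCH"),
    ("dbpedia:EducationalInstitution", "geonames:featureCode", "S.SCHC"),
    ("dbpedia:EducationalInstitution", "geonames:featureCode", "S.UNIV"),
    ("dbpedia:Country", "geonames:featureCode", "A.PCLI"),
]

def group_hypotheses(alignments):
    """Group single property-value subsets first by the subsuming class,
    then by the property of the smaller classes; each group's value set
    defines one hypothesis union alignment."""
    groups = defaultdict(set)
    for larger, prop, value in alignments:
        groups[(larger, prop)].add(value)
    return dict(groups)

hypotheses = group_hypotheses(subset_alignments)
```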

We define \(U_S\) as the union of the instance sets of the individual smaller restriction classes, \(U_S = \bigcup_i r_{2_i}\); \(U_L\) as the image of the larger class, \(U_L = Img(r_1)\); and \(U_A\) as the overlap between these sets, \(U_A = \bigcup_i (Img(r_1) \cap r_{2_i})\). We check whether the larger restriction class is equivalent to the union concept by using scoring functions analogous to \(P'\) and \(R'\) from our previous paper: \(P'_U = \frac{|U_A|}{|U_S|}\) and \(R'_U = \frac{|U_A|}{|U_L|}\), with the same relaxed scoring assumptions. To accommodate errors in the data-set, we consider the coverage complete when both scores exceed the relaxed threshold of 0.9; that is, the hypothesis union alignment is considered equivalent if \(P'_U > 0.9\) and \(R'_U > 0.9\). Since, by construction, each of the subsets already satisfies \(R' > 0.9\), we are assured that \(P'_U\) is always greater than 0.9. Thus, a union alignment is equivalent if \(R'_U > 0.9\).
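The check itself reduces to set arithmetic; a sketch with small synthetic sets (the instance names are placeholders, and the sizes are chosen so the hypothesis fails the threshold):

```python
def union_alignment_score(img_larger, smaller_classes):
    """R'_U = |U_A| / |U_L|, with U_L = Img(r1) and
    U_A the union of Img(r1) ∩ r2_i over the smaller classes r2_i."""
    u_s = set().union(*smaller_classes)  # U_S
    u_a = img_larger & u_s               # U_A
    return len(u_a) / len(img_larger)    # R'_U

# Five linked instances of the larger class; the two smaller classes
# cover four of them, so R'_U = 4/5 = 0.8 and the hypothesis is rejected.
img = {"x1", "x2", "x3", "x4", "x5"}
parts = [{"x1", "x2"}, {"x3", "x4", "x9"}]
r_u = union_alignment_score(img, parts)
equivalent = r_u > 0.9
```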

Figure 2 provides an example of the approach. Our previous algorithm finds that {geonames:featureCode = S.SCH}, {geonames:featureCode = S.SCHC}, {geonames:featureCode = S.UNIV} are subsets of {rdf:type=dbpedia:EducationalInstitution}. As can be seen in the Venn diagram in Fig. 2, \(U_L\) is Img({rdf:type = dbpedia:EducationalInstitution}), \(U_S\) is {geonames:featureCode = S.SCH} \(\cup \) {geonames:featureCode = S.SCHC} \(\cup \) {geonames:featureCode = S.UNIV}, and \(U_A\) is the intersection of the two. With the educational institutions example, \(R'_U\) for the alignment of dbpedia:EducationalInstitution to the union of S.SCH, S.SCHC & S.UNIV is 0.98. We can thus confirm the hypothesis and consider this union alignment equivalent. Section 4 shows additional examples of union alignments.

Fig. 2. Spatial covering of Educational Institutions from DBpedia.

3.3 Using Outliers in Union Mappings to Identify Linked Data Errors

The computation of union alignments allows for a margin of error in the subset computation. It turns out that the outliers, the instances that do not satisfy the restriction classes in the alignments, are often due to incorrect links. Thus, our algorithm also provides a novel method to curate the Web of Linked Data.

Consider the outlier found in the {dbpedia:country=Spain} \(\equiv \) {geonames:countryCode=ES} alignment. Of the 3918 instances of {dbpedia:country=Spain}, 3917 are linked to instances with geonames:countryCode=ES. The one instance without country code ES has an assertion of country code IT (Italy) in GeoNames. The algorithm would flag this situation as a possible linking error, since there is overwhelming support for ES being the country code of Spain. A more interesting case occurs in the alignment of {rdf:type=dbpedia:EducationalInstitution} to {geonames:featureCode \(\in \) {S.SCH, S.SCHC, S.UNIV}}. Of the 404 instances of {rdf:type=dbpedia:EducationalInstitution}, 396 were accounted for as having their geonames:featureCode as one of S.SCH, S.SCHC or S.UNIV. Of the 8 outliers, 1 does not have a geonames:featureCode property asserted. The other 7 have their feature codes as either S.BLDG (3 buildings), S.EST (1 establishment), S.HSP (1 hospital), S.LIBR (1 library) or S.MUS (1 museum). This case requires more sophisticated curation, and the outliers may indicate a case for multiple inheritance. For example, the hospital instance in GeoNames may be a medical college that could be classified as a university.
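A sketch of the flagging logic for the country-code case (Python; the counts mirror the Spain example from the text, the 0.9 support threshold matches the relaxed score, and the list of codes is synthetic):

```python
from collections import Counter

def flag_outliers(linked_values, support=0.9):
    """Flag minority values as suspect links when a single dominant
    value accounts for at least `support` of the observations."""
    counts = Counter(linked_values)
    dominant, n = counts.most_common(1)[0]
    if n / sum(counts.values()) < support:
        return []  # no overwhelming support; nothing to flag
    return [v for v in counts if v != dominant]

# Country codes linked from instances of dbpedia:country=Spain:
# 3917 support ES, one asserts IT -- the latter is a likely link error.
observed = ["ES"] * 3917 + ["IT"]
suspect = flag_outliers(observed)
```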

Our union alignment algorithm is able to detect other similar outliers, providing a powerful tool to quickly focus on links that require human curation, or that could be automatically flagged as problematic, along with evidence for the error.

4 Experimental Results

The results of the union alignment algorithm over the four pairs of sources we consider appear in Table 1. In total, the 7069 union alignments explained (covered) 77966 subset alignments, for a compression ratio of 90 %.

Table 1. Union alignments found in the 4 source pairs

The resulting alignments were intuitive. Some interesting examples appear in Tables 2, 3 and 4. In the tables, for each union alignment, column 2 describes the large restriction class from \(ontology_1\) and column 3 describes the union of the (smaller) classes from \(ontology_2\), with the corresponding property and value set. The score of the union is noted in column 4 (\(R'_U=\frac{|U_A|}{|U_L|}\)), followed by \(|U_A|\) and \(|U_L|\) in columns 5 and 6. Column 7 describes the outliers, i.e. values of \(v_2\) that form restriction classes that are not direct subsets of the larger restriction class. Each of these outliers also has a fraction giving the number of instances that belong to the intersection over the number of instances of the smaller restriction class (i.e. \(\frac{|Img(r_1) \cap r_2|}{|r_2|}\)). It can be seen that this fraction is less than our relaxed subset score; had it been greater than the relaxed subset score (i.e. 0.9), the set would have been included in column 3 instead. The last column notes how many of the total \(U_L\) instances we were able to explain using \(U_A\) and the outliers. For example, union alignment #1 of Table 2 is the Educational Institution example described before. It shows how educational institutions from DBpedia can be explained by schools, colleges and universities in GeoNames. Columns 4, 5 and 6 give the alignment score \(R'_U\) (0.98), the size of \(U_A\) (396) and the size of \(U_L\) (404). Outliers (S.BLDG, S.EST, S.LIBR, S.MUS, S.HSP) along with their \(P'\) fractions appear in column 7. We were able to explain 403 of the total 404 instances (see column 8).

Table 2. Example alignments from the GeoNames-DBpedia and LinkedGeoData-DBpedia source pairs.
Table 3. Example alignments from the LinkedGeoData-DBpedia and Geospecies-DBpedia source pairs.
Table 4. Example alignments from the GeneID-MGI source pair.

We find other interesting alignments, a representative few of which are shown in the tables. In some cases, the union alignments found were intuitive because of an underlying hierarchical nature of the concepts involved, especially in the case of alignments of administrative divisions in geospatial sources and alignments in the biological classification taxonomy. For example, #3 highlights alignments that reflect the containment properties of administrative divisions. Other interesting types of alignments were also found. For example, #7 maps two seemingly dissimilar concepts: it explains the license plate codes found in the state (Bundesland) of Saarland (Footnote 3). Due to lack of space, we explain the other union alignments directly in the tables. The complete set of alignments discovered by our algorithm is available on our group page (Footnote 4).

Outliers. In alignments that had inconsistencies, we identified three main reasons: (i) Incorrect instance alignments - outliers arising from possibly erroneous equivalence links between instances (e.g. #4, #8, etc.), (ii) Missing instance alignments - insufficient support for coverage due to missing links between instances or missing instances (e.g. #9, etc.), (iii) Incorrect values for properties - outliers arising from possibly erroneous property assertions (e.g. #5, #6, etc.). In the tables, we also mention the classes that these inconsistencies belong to, along with their support.

5 Related Work

Ontology alignment and schema matching have been well-explored areas of research since the early days of ontologies [1, 3] and have received renewed interest in recent years with the rise of the Semantic Web and Linked Data. Though most work on the Web of Linked Data addresses linking instances across different sources, an increasing number of authors have looked into aligning the source ontologies in the past couple of years. Jain et al. [4] describe the BLOOMS approach, which uses a central forest of concepts derived from topics in Wikipedia. An update to this is the BLOOMS+ approach [5], which aligns Linked Open Data ontologies with an upper-level ontology called Proton. BLOOMS and BLOOMS+ are unable to find alignments in cases like GeoNames, where the small number of classes have vague declarations. The advantage of our approach over these is that our use of restriction classes is able to find a large set of alignments in cases like aligning GeoNames with DBpedia, where Proton fails due to the rudimentary ontology. Cruz et al. [2] describe a dynamic ontology mapping approach called AgreementMaker that uses similarity measures along with a mediator ontology to find mappings using the labels of the classes. On the subset and equivalence alignments between GeoNames (10 concepts) and DBpedia (257 concepts), AgreementMaker was able to achieve a precision of 26 % and a recall of 68 %. We believe that since their approach does not consider unions of concepts, it would not have been able to find alignments like the Educational Institutions example (#1) using only the labels and the structure of the ontology, though a thorough comparison is not possible. In our work, we find equivalence relations between a concept on one side and a union of concepts on the other side. CSR [9] is similar to our work in that it tries to align a concept from one ontology to a union of concepts from the other ontology.
In their approach, the authors describe how the similarity of properties is used as a feature in predicting subsumption relationships. It differs from our approach in that it uses a statistical machine-learning approach for the detection of subsets rather than an extensional approach. An approach that uses statistical methods for finding alignments, similar to our work, has also been described by Völker et al. [10]. This work induces schemas for RDF data sources by generating OWL 2 axioms using an intermediate table of associations between instances and concepts (called transaction data-sets) and mining association rules from it.

6 Conclusions and Future Work

We described an approach to identifying union alignments in data sources on the Web of Linked Data from the Geospatial, Biological Classification and Genetics domains. By extending our definition of restriction classes with a disjunction operator, we are able to find alignments of union concepts from one source to larger concepts from the other source. Our approach produces coverings in which concepts at different levels in the ontologies of two sources can be mapped even when there is no direct equivalence. We are also able to find outliers that enable us to identify inconsistencies in the linked instances by examining the alignment patterns. The results provide deeper insight into the nature of the alignments of Linked Data.

As part of our future work, we want to find more complete descriptions for the sources. Our preliminary findings show that the results of this paper can be used to find patterns in the properties. For example, the countryCode property in GeoNames is closely associated with the country property in DBpedia, though their ranges are not exactly equal. We believe that an in-depth analysis of the alignment of source ontologies is warranted given the recent rise in the number of links in the Linked Data cloud. This is an extremely important step toward the grand Semantic Web vision.