Data-Driven RDF Property Semantic-Equivalence Detection Using NLP Techniques
DBpedia extracts most of its data from Wikipedia’s infoboxes. Manually created “mappings” link infobox attributes to DBpedia ontology properties (dbo properties), producing the most widely used DBpedia triples. However, infobox attributes without a mapping produce triples with properties in a different namespace (dbp properties). In this position paper we point out that (a) the number of triples containing dbp properties is significant compared to triples containing dbo properties for the DBpedia instances analyzed, (b) the SPARQL queries made by users rarely use dbp and dbo properties together, and (c) as an exploitation example, SPARQL queries can be enhanced automatically by using syntactic and semantic similarities between dbo properties and dbp properties.
Keywords: SPARQL query, Query enhancement, DBpedia, Spanish DBpedia, Property mapping
DBpedia is the central hub of the Linked Open Data (LOD) cloud because it provides a vast amount of information and most of the datasets in the LOD cloud link to DBpedia. The extraction process in DBpedia generates properties of two types: (1) properties in the DBpedia ontology (we name these dbo properties), and (2) properties not in the DBpedia ontology (we name these dbp properties). The dbp properties come from the attribute-value pairs found in Wikipedia infoboxes that have no manually created mappings. The analysis of the Spanish DBpedia (esDBpedia) found that, despite the high number of mappings (100+ classes), for every 4 triples containing a dbo property there is 1 triple containing a dbp property. In this work, we extend this analysis to the English and German DBpedia instances, with similar results. For instance, in the English DBpedia this ratio is almost one to one.
In this position paper we hypothesize that many triples are never accessed because most queries use only dbo properties. DBpedia defines around 2500 properties, but only 2 % of infobox fields are mapped to the DBpedia ontology. Thus, there are many dbp properties in DBpedia: 58,239 for the English version, 17,111 for the Spanish and 12,167 for the German. Therefore, users who query the DBpedia endpoint with SPARQL queries containing only dbo properties have no access to a significant number of triples and may get empty or incomplete results even if the relevant data is available in DBpedia.
In this work, we start by checking the assumption that users barely mix dbp and dbo properties in SPARQL queries. Later we provide a method to automatically identify the most similar dbp properties for a given dbo property. This method takes advantage of techniques from Natural Language Processing and Statistical Methods. The goal of the proposed method is to generate “automatic mappings” with a certain confidence level. These mappings can be manually approved by a specialist or through crowd-sourcing in a semi-automatic manner. Some examples point out that these mappings can enhance the SPARQL queries to generate better results by accessing more information in different DBpedia instances.
In this section, we explore two hypotheses addressed in this paper. On the one hand, we analyze the amount of information described by using dbp properties in 3 DBpedia instances. On the other hand, we analyze how dbo and dbp properties are used in SPARQL queries made to the English DBpedia.
Top-10 dbp properties for the English, Spanish and German DBpedia instances (2015-04 version).
Secondly, we analyzed a SPARQL query log to evaluate the assumption that users do not frequently use dbp properties in their SPARQL queries. We used the Linked SPARQL Queries Dataset, which provides an RDF model describing the SPARQL queries made to several endpoints. We explored the data from the English DBpedia to see how many queries use both dbp and dbo properties. Out of 1,208,762 distinct queries, only 2,328 use both dbo and dbp properties in the same query. We made a similar analysis for agents (IPs): out of 3,041 distinct agents, only 473 use both dbp and dbo properties in the same SPARQL query. This illustrates that the majority of the SPARQL queries miss some portion of the data. We argue that this information can be reached by enhancing the SPARQL queries with our proposed mappings between dbo and dbp properties.
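The check behind these counts, namely whether a single query mixes both namespaces, can be illustrated with a minimal sketch. It is not the procedure applied to the Linked SPARQL Queries Dataset (which is queried through its RDF model). It inspects raw query text, assumes the conventional dbo: and dbp: prefixes or full IRIs, and uses a hypothetical function name, namespaces_used. Note that the dbo namespace also contains classes, so a match on dbo: is only an approximation of "uses a dbo property".

```python
import re

# Namespace IRIs used by DBpedia for ontology terms and raw infobox properties.
DBO_NS = "http://dbpedia.org/ontology/"
DBP_NS = "http://dbpedia.org/property/"

# Match either a prefixed name (dbo:birthPlace) or a full IRI in angle brackets.
# Assumption: queries declare the prefixes as "dbo" and "dbp"; real logs also
# contain other prefix labels, which this sketch does not resolve.
DBO_PATTERN = re.compile(r"\bdbo:\w+|<" + re.escape(DBO_NS) + r"[^>]+>")
DBP_PATTERN = re.compile(r"\bdbp:\w+|<" + re.escape(DBP_NS) + r"[^>]+>")


def namespaces_used(query):
    """Return the subset of {"dbo", "dbp"} whose terms appear in the query text."""
    used = set()
    if DBO_PATTERN.search(query):
        used.add("dbo")
    if DBP_PATTERN.search(query):
        used.add("dbp")
    return used


def uses_both(query):
    """True when the query refers to dbo and dbp terms at the same time."""
    return namespaces_used(query) == {"dbo", "dbp"}
```

With the counts reported above, queries mixing both namespaces are roughly 0.19 % of the distinct queries, and agents issuing such queries are about 15.6 % of the distinct agents.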
3 An Approach for Automatically Enhancing SPARQL Queries
The second step involves processing each small group with Natural Language pre-processing, which includes tokenization and stemming/lemmatization. Many dbp properties are compound words (e.g. birthPlace \(\rightarrow \) (birth, place)), so they have to be tokenized before linguistic techniques can be applied to find syntactic and semantic similarity. For dbp properties that follow the camel case convention, tokenization is straightforward: the name is split at each case boundary. For the rest, for instance the dbp properties written entirely in lowercase (e.g. oldcode or testaverage) or entirely in capitals, dictionary tools that break compound words into separate tokens of known words can be used. We also used punctuation marks such as brackets (e.g. numEmployees(globally)) for tokenization when they were applicable. In addition, lemmatization can be used to find more results by normalizing variations such as the inconsistent use of singular and plural words (e.g. coachTeams \(\rightarrow \) (coach, team)).
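The tokenization described in this step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: all function names are hypothetical, the vocabulary passed to the dictionary splitter is a toy word list standing in for a real dictionary tool, and the plural stripping is a crude stand-in for a proper lemmatizer.

```python
import re


def split_camel_and_punct(name):
    """Split a property name at brackets/punctuation and camel case boundaries."""
    # Brackets and other non-alphanumeric characters act as separators.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", name)
    # Acronym runs, capitalized or lowercase words, and digit runs.
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", cleaned)
    return [t.lower() for t in tokens]


def split_by_dictionary(word, vocabulary):
    """Split a compound into known words, preferring the fewest pieces.

    Returns [word] unchanged when no full segmentation exists.
    """
    n = len(word)
    best = [None] * (n + 1)  # best[i] is a segmentation of word[:i], or None
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] else [word]


def naive_lemma(token):
    """Very rough plural normalization (assumption: English, regular plurals)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokenize_property(name, vocabulary=frozenset()):
    """Tokenize a dbp property local name into normalized word tokens."""
    tokens = []
    for token in split_camel_and_punct(name):
        for piece in split_by_dictionary(token, vocabulary):
            tokens.append(naive_lemma(piece))
    return tokens


if __name__ == "__main__":
    toy_vocabulary = {"old", "code", "test", "average"}  # hypothetical word list
    for prop in ["birthPlace", "coachTeams", "numEmployees(globally)", "oldcode"]:
        print(prop, tokenize_property(prop, toy_vocabulary))
```

The dictionary splitter prefers the segmentation with the fewest pieces so that a word already in the vocabulary is never broken up further. A real lemmatizer would be needed for irregular forms that the suffix rule mishandles.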
As the majority of the dbo properties only have labels in English, when non-English dbp properties are detected in DBpedia instances such as the Spanish one or the German, translation tools are used to convert the property into English for mapping with the dbo property (e.g., geburtsort \(\rightarrow \) birthPlace).
The third step comprises similarity techniques. The simplest is syntactic distance, which includes classical string distance metrics (e.g. Jaro-Winkler distance, Damerau-Levenshtein distance) and token-based techniques (e.g. Jaccard similarity, cosine similarity). Different techniques suit different types of variation in dbp properties; for instance, edit-distance measures such as Damerau-Levenshtein perform better at identifying typos, but they are sensitive to substring locality. Using syntactic techniques such as string similarity we can identify that dbp:birzPlace means dbo:birthPlace. Semantic techniques go a step further, and we have tested two ‘semantic similarity’ measures: (1) a dictionary-based method for synonyms and (2) a synset-based method using WordNet. Semantic similarity allows us to identify that dbp properties like dbp:birthLocation or dbp:cityOfBirth are similar to the dbo property dbo:birthPlace. Further studies will focus on finding the most accurate semantic-similarity methods for these tasks.
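Two of the measures named above can be sketched in a few lines. The following is a minimal illustration, not the configuration used in this work: the edit distance is the restricted Damerau-Levenshtein variant (optimal string alignment), the normalization by the longer string is one common choice among several, and the synonym table passed to semantic_jaccard is a hypothetical hand-made mapping standing in for a dictionary or WordNet lookup.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def string_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - osa_distance(a, b) / max(len(a), len(b))


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def semantic_jaccard(tokens_a, tokens_b, canonical):
    """Jaccard after replacing each token by its canonical synonym, if any."""
    norm_a = [canonical.get(t, t) for t in tokens_a]
    norm_b = [canonical.get(t, t) for t in tokens_b]
    return jaccard(norm_a, norm_b)


if __name__ == "__main__":
    print(string_similarity("birzplace", "birthplace"))
    synonyms = {"location": "place"}  # hypothetical synonym table
    print(jaccard(["birth", "location"], ["birth", "place"]))
    print(semantic_jaccard(["birth", "location"], ["birth", "place"], synonyms))
```

The typo birzPlace is two edits away from birthPlace, so a purely syntactic measure already ranks it highly, whereas birthLocation only reaches a full token overlap with birthPlace once the synonym mapping is applied.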
3.1 Enhancing SPARQL Queries by Using Dbp Properties
4 Evaluation Example
As a complete evaluation would require more space, we only show an evaluation example to check our hypothesis that SPARQL query results can be improved by using dbp properties with the same semantics as the dbo properties used in a SPARQL query. Following the method described in Sect. 3, we use the dbo:birthPlace property for the analysis. Table 2 shows the candidate dbp properties that map to dbo:birthPlace for the three DBpedia instances analyzed, distinguishing between syntactic and semantic techniques as described in Sect. 3. Then, a simple query is used to compare the number of results returned when only dbo:birthPlace is used (similar to Listing 1.1) and when an enhanced query is used (similar to Listing 1.2). This enhancement, denoted \(\varDelta _1\) in the table, leads to 350 % improvement in the case of English DBpedia (3,940,073 results instead of 1,211,868), 221 % improvement in the case of Spanish DBpedia (765,633 results instead of 346,515), and 132 % improvement in the case of German DBpedia (1,319,892 results instead of 986,323). These results illustrate that enhancing the queries with the approach proposed in this paper yields more results. In the future, we plan to evaluate the correctness of the answers of the enhanced queries to assess whether there is an impact on the quality of the results.
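The kind of enhancement evaluated here can be sketched as a small query builder. This is an illustrative assumption about how such a rewrite might look, not a reproduction of the listings referred to above: the function name enhance_query is hypothetical, the rewrite uses a SPARQL 1.1 property path with alternatives (a UNION of triple patterns would be an equivalent option), and the dbp candidates in the usage example are the ones mentioned in Sect. 3.

```python
# Prefix declarations for the two DBpedia namespaces discussed in this paper.
PREFIXES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbp: <http://dbpedia.org/property/>\n"
)


def enhance_query(dbo_property, dbp_candidates):
    """Build a query matching the dbo property or any of its dbp candidates.

    With an empty candidate list the result is the plain dbo-only query.
    """
    # A property path alternative: (p1|p2|p3) matches any of the listed properties.
    path = "|".join([dbo_property] + list(dbp_candidates))
    return (
        PREFIXES
        + "SELECT DISTINCT ?subject ?value WHERE {\n"
        + "  ?subject (" + path + ") ?value .\n"
        + "}"
    )


if __name__ == "__main__":
    print(enhance_query("dbo:birthPlace", []))
    print(enhance_query("dbo:birthPlace", ["dbp:birthLocation", "dbp:cityOfBirth"]))
```

Because the enhanced query is a superset pattern of the original, it can only return the same or more bindings, and whether the additional bindings are correct depends on the quality of the candidate mappings.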
5 Related Work
Both Rahm and Bernstein, and Shvaiko and Euzenat provide surveys of schema matching approaches and classify them into categories. The approach proposed in this paper combines several linguistic techniques mentioned in those surveys, including both syntactic and semantic techniques. Rinser et al. propose a three-stage instance-based schema matching approach for mapping infoboxes from Wikipedias of different languages. That approach only addresses Wikipedia; however, it can be used to complement the property mappings proposed in this paper. Zhang et al. propose Statistical Knowledge Patterns for identifying synonymous relations in large linked datasets. The method presented in this paper uses a similar technique for property clustering, but also complements it with NLP techniques. Palmero Aprosio et al. emphasize the problem of non-mapped infoboxes in DBpedia and propose an approach for automatic mapping generation applied to the Italian chapter of DBpedia.
6 Conclusions and Future Work
Our work starts from the observation that DBpedia triples are composed not only of properties defined in the DBpedia ontology (dbo properties) but, to a large extent, of other properties (dbp properties). The DBpedia extraction process generates triples containing dbo properties when there is a mapping between a field in a Wikipedia infobox and a dbo property. But the extraction process also generates dbp properties for the fields in Wikipedia infoboxes that do not have such a mapping. In the case of the English DBpedia, almost 50 % of all triples contain a dbp property as their predicate. Therefore, queries containing only dbo properties cannot access large parts of the DBpedia dataset.
In order to check the underuse of dbp properties, we have analyzed a SPARQL query log repository containing SPARQL queries from several datasets, concluding that our hypothesis holds at least for the English DBpedia.
As an initial application of this work, we have sketched a method to find the most similar dbp properties for a given dbo property. This could be used to automatically enhance SPARQL queries in order to get more results and we have shown some simple usage examples.
The proposed method depends on many parameters, and we have applied it to three DBpedia instances (English, Spanish and German). Future work will explore the most adequate parameters for a wider set of local DBpedia instances. For instance, we should identify the most appropriate method and parameters for syntactic similarity. Overly restrictive similarity parameters would not provide much more data, and overly relaxed parameters could produce wrong results. Concerning semantic similarity, we have to find a similar balance. In both cases we have to test the results with real users by means of a testing tool. This tool will allow us to find the best parameters, for a given language, in order to provide the most similar dbp properties for a given dbo property.
But this method is only one example of the utility of dbp properties. We argue that dbp properties should be treated as first-class citizens, and linked data tools should allow users to exploit them. We point to LOUPE as an exploration tool that allows ‘property exploration’ for both dbo and dbp properties.
In summary, dbp properties are a good complement for dbo properties in SPARQL queries because they give us access to a richer DBpedia.
This work was funded by the JCI-2012-12719 contract, the BES-2014-068449 grant under the 4V project (TIN2013-46238-C4-2-R), JC2015-00028 and UNPM13-4E-1814.
- 2. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al.: DBpedia-a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 6(2), 167–195 (2015)
- 3. Mihindukulasooriya, N., Rico, M., García-Castro, R., Gómez-Pérez, A.: An analysis of the quality issues of the properties available in the Spanish DBpedia. In: Puerta, J.M., Gámez, J.A., Dorronsoro, B., Barrenechea, E., Troncoso, A., Baruque, B., Galar, M. (eds.) CAEPIA 2015. LNCS, vol. 9422, pp. 198–209. Springer International Publishing, Cham (2015)
- 5. Mihindukulasooriya, N., Villalon, M.P., García-Castro, R., Gómez-Pérez, A.: Loupe-An Online Tool for Inspecting Datasets in the Linked Data Cloud (2015)