DBkWik: extracting and integrating knowledge from thousands of Wikis

Hertling, Sven; Paulheim, Heiko

doi:10.1007/s10115-019-01415-5

Regular Paper
Published: 02 November 2019

Volume 62, pages 2169–2190, (2020)
Knowledge and Information Systems

Abstract

Popular cross-domain knowledge graphs, such as DBpedia and YAGO, are built from Wikipedia, and therefore similar in coverage. In contrast, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia. Furthermore, we discuss the potential use of DBkWik as a benchmark for knowledge graph matching.

Notes

Pronounced dee-bee-quick.
http://dbkwik.org/.
http://creativecommons.org/licenses/by-sa/3.0/.
https://github.com/dbpedia/extraction-framework.
https://www.mediawiki.org/wiki/MediaWiki.
https://meta.wikimedia.org/wiki/Data_dumps/Dump_format.
In the scope of this work, we restrict ourselves to Wikis created using the MediaWiki software, but this is a technical limitation, not a conceptual one: as long as a Wiki software can create reasonably structured Wikis, e.g., allows the creation of infoboxes, categories, etc., it could be used as a source for knowledge graph creation.
https://github.com/sven-h/extraction-framework.
http://fandom.wikia.com/.
http://www.alexa.com/topsites/category/Computers/Software/Groupware/Wiki/Wiki_Farms.
Hauptseite is German for main page.
The gold standard is available at https://github.com/sven-h/dbkwik.
The short abstract is generated by extracting the first text paragraph from a Wiki page, whereas the long abstract is all text before the first headline.
We restricted the workers to have a 95% approval rate and a minimum of 100 approved HITs (human intelligence tasks), following the recommendations by [25] and [16], and restricted their location to the USA to attract a large fraction of native speakers. We paid $0.40 for a HIT of finding matching pages for 10 pages in two Wikis. In total, the creation of the gold standard took 10 days. Details on the task design as well as the resulting gold standard are available online at https://github.com/sven-h/dbkwik.
In many cases, this can be explained by the notability criteria of Wikipedia, cf. http://en.wikipedia.org/wiki/Wikipedia:Notability.
Results for the request “site:darkscape.fandom.com” on Google (approximately 2660 results) and Bing (approximately 42,500 results), tested on 8/8/2019.
Request: “bear site:darkscape.fandom.com”. Google: (1) Penguin Hide and Seek/Spawn locations, (2) Spria, (3) Joseph. Bing: (1) Bear ribs, (2) Grizzly bear cub, (3) Grizzly bear.
https://duckduckgo.com.
https://duckduckgo.com/api.
The page https://memory-alpha.fandom.com/wiki/Jean-Luc_Picard#External_links contains links to Wikipedia, Memory Beta, etc.
The numbers are taken from [20] and [47]. Note that WebIsALOD does not distinguish classes and instances; therefore, we cannot count classes.
https://wikiapiary.com/wiki/Statistics.
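
The distinction between short and long abstracts described in the notes above (first text paragraph versus all text before the first headline) can be illustrated with a minimal sketch. This is not the extraction framework's actual code: the function names, the heading pattern, and the assumption that the input is plain text with MediaWiki-style "== Heading ==" lines and blank-line paragraph breaks are all hypothetical simplifications.

```python
import re

# Assumption: a headline is a line such as "== Career ==" (MediaWiki style).
HEADING = re.compile(r"^=+[^=\n].*?=+[ \t]*$", re.MULTILINE)


def long_abstract(text: str) -> str:
    """Return all text that appears before the first headline."""
    match = HEADING.search(text)
    lead = text[:match.start()] if match else text
    return lead.strip()


def short_abstract(text: str) -> str:
    """Return the first text paragraph of the page, ignoring headline lines."""
    without_headings = HEADING.sub("", text)
    # Paragraphs are assumed to be separated by at least one blank line.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", without_headings)]
    paragraphs = [p for p in paragraphs if p]
    return paragraphs[0] if paragraphs else ""


# Hypothetical page text, used only for illustration.
page = (
    "Picard is a captain.\nHe commands a ship.\n\n"
    "He was born in France.\n\n"
    "== Career ==\nLater text."
)
short = short_abstract(page)
long_ = long_abstract(page)
```

For a page whose lead section has several paragraphs, the short abstract is a prefix of the long one; for a page without any headline, the long abstract is the whole text.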

References

Algergawy A, Cheatham M, Faria D, Ferrara A, Fundulaki I, Harrow I, Hertling S, Jiménez-Ruiz E, Karam N, Khiat A, Lambrix P, Li H, Montanelli S, Paulheim H, Pesquita C, Saveta T, Schmidt D, Shvaiko P, Splendiani A, Thiéblin E, Trojahn C, Vataščinová J, Zamazal O, Zhou L (2018) Results of the ontology alignment evaluation initiative 2018. In: OM 2018-13th ISWC workshop on ontology matching
Alstott J, Bullmore E, Plenz D (2014) Powerlaw: a Python package for analysis of heavy-tailed distributions. PloS one 9(1):e85777
Bryl V, Bizer C (2014) Learning conflict resolution strategies for cross-language Wikipedia data fusion. In: Proceedings of the 23rd international conference on world wide web. ACM, pp 1129–1134
Carlson A, Betteridge J, Wang RC, Hruschka Jr ER, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on web search and data mining, pp 101–110
Clauset A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703
Dohrn H, Riehle D (2011) Design and implementation of the Sweble Wikitext parser: unlocking the structured data of Wikipedia. In: Proceedings of the 7th international symposium on wikis and open collaboration. ACM, pp 72–81
Dong X, Gabrilovich E, Heitz G, Horn W, Lao N, Murphy K, Strohmann T, Sun S, Zhang W (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 601–610
Endris KM, Giménez-García JM, Thakkar H, Demidova E, Zimmermann A, Lange C, Simperl E (2017) Dataset reuse: an analysis of references in community discussions, publications and data. Extraction 500:1
Erling O (2012) Virtuoso, a hybrid rdbms/graph column store. IEEE Data Eng Bull 35(1):3–8
Euzenat J, Meilicke C, Stuckenschmidt H, Shvaiko P, Trojahn C (2011) Ontology alignment evaluation initiative: six years of experience. J Data Semant XV:158–192
Faria D, Pesquita C, Balasubramani BS, Tervo T, Carriço D, Garrilha R, Couto FM, Cruz IF (2018) Results of AML participation in OAEI 2018. In: OM 2018-13th ISWC workshop on ontology matching
Fellbaum C (1998) WordNet—an electronic lexical database. MIT Press, Cambridge
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378
Galárraga L, Teflioudi C, Hose K, Suchanek FM (2015) Fast rule mining in ontological knowledge bases with AMIE++. VLDB J Int J Very Large Data Bases 24(6):707–730
Guzewicz P, Manolescu I (2018) Quotient RDF summaries based on type hierarchies. In: DESWeb 2018—data engineering meets the semantic web 2018
Hauser DJ, Schwarz N (2016) Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behav Res Methods 48(1):400–407. https://doi.org/10.3758/s13428-015-0578-z
Heath T, Bizer C (2011) Linked data: evolving the web into a global data space, vol 1, no 1. Synthesis lectures on the semantic web: theory and technology. Morgan & Claypool, San Rafael, pp 1–136
Heist N, Paulheim H (2019) Uncovering the semantics of Wikipedia categories. In: International semantic web conference
Heist N, Hertling S, Paulheim H (2018) Language-agnostic relation extraction from abstracts in Wikis. Information 9(4):75
Hertling S, Paulheim H (2017) WebIsALOD: providing hypernymy relations extracted from the web as linked open data. In: International semantic web conference. Springer, pp 111–119
Hertling S, Paulheim H (2018a) DBkWik: a consolidated knowledge graph from thousands of wikis. In: 2018 IEEE international conference on big knowledge (ICBK). IEEE, pp 17–24
Hertling S, Paulheim H (2018b) Dome results for OAEI 2018. In: OM 2018-13th ISWC workshop on ontology matching
Hofmann A, Perchani S, Portisch J, Hertling S, Paulheim H (2017) DBkWik: towards knowledge graph creation from thousands of wikis. In: International semantic web conference (posters and demos)
Jiménez-Ruiz E, Grau BC, Cross V (2018) LogMap family participation in the OAEI 2018. In: OM 2018-13th ISWC workshop on ontology matching
Kazai G (2011) In search of quality in crowdsourcing for search engine evaluation. Springer, Berlin, pp 165–176. https://doi.org/10.1007/978-3-642-20161-5_17
Kliegr T (2015) Linked hypernyms: enriching DBpedia with targeted hypernym discovery. Web Semant Sci Serv Agents World Wide Web 31:59–69
Laadhar A, Ghozzi F, Megdiche I, Ravat F, Teste O, Gargouri F (2018) OAEI 2018 results of POMap++. In: OM 2018-13th ISWC workshop on ontology matching
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning, pp 1188–1196
Lehmann J (2009) DL-Learner: learning concepts in description logics. J Mach Learn Res 10(Nov):2639–2642
Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C (2013) DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia. Semant Web J 6(2):278–286
Lenat DB (1995) CYC: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38
Mahdisoltani F, Biega J, Suchanek FM (2013) YAGO3: a knowledge base from multilingual Wikipedias. In: CIDR
Muñoz E, Hogan A, Mileo A (2014) Using linked data to mine RDF from Wikipedia’s tables. In: Proceedings of the 7th ACM international conference on web search and data mining. ACM, pp 533–542
Noia TD, Ostuni VC, Tomeo P, Sciascio ED (2016) Sprank: semantic path-based ranking for top-n recommendations using linked open data. ACM Trans Intell Syst Technol (TIST) 8(1):9
Nuzzolese AG, Gangemi A, Presutti V, Ciancarini P (2012) Type inference through the analysis of Wikipedia links. In: LDOW
Paulheim H (2016) Knowledge graph refinement: a survey of approaches and evaluation methods. Semant Web 8:489–508
Paulheim H (2017) Data-driven joint debugging of the DBpedia mappings and ontology. In: European semantic web conference. Springer, pp 404–418
Paulheim H (2018) How much is a triple? estimating the cost of knowledge graph creation. In: ISWC 2018 posters and demonstrations, industry and blue sky ideas tracks
Paulheim H, Bizer C (2013) Type inference on noisy RDF data. In: International semantic web conference. Springer, pp 510–525
Paulheim H, Bizer C (2014) Improving the quality of linked data using statistical distributions. Int J Semant Web Inf Syst (IJSWIS) 10(2):63–86
Paulheim H, Gangemi A (2015) Serving DBpedia with DOLCE—more than just adding a cherry on top. In: International semantic web conference. Springer, pp 180–196
Paulheim H, Ponzetto SP (2013) Extending DBpedia with Wikipedia list pages. In: NLP-DBPEDIA workshop
Paulheim H, Hertling S, Ritze D (2013) Towards evaluating interactive ontology matching tools. In: Extended semantic web conference. Springer, pp 31–45
Ponzetto SP, Strube M (2008) Wikitaxonomy: a large scale knowledge resource. In: ECAI, Citeseer, vol 178, pp 751–752
Rico M, Mihindukulasooriya N, Kontokostas D, Paulheim H, Hellmann S, Gómez-Pérez A (2018) Predicting incorrect mappings: a data-driven approach applied to DBpedia. In: Proceedings of the 33rd annual ACM symposium on applied computing, pp 323–330
Ringler D, Paulheim H (2017) One knowledge graph to rule them all? analyzing the differences between DBpedia, YAGO, Wikidata & co. In: Joint German/Austrian conference on artificial intelligence (Künstliche Intelligenz). Springer, pp 366–372
Roussille P, Megdiche I, Teste O, Trojahn C (2018) Holontology: results of the 2018 OAEI evaluation campaign. In: OM 2018-13th ISWC workshop on ontology matching
Schmachtenberg M, Bizer C, Paulheim H (2014) Adoption of the linked data best practices in different topical domains. In: International semantic web conference. Springer, pp 245–260
Seitner J, Bizer C, Eckert K, Faralli S, Meusel R, Paulheim H, Ponzetto SP (2016) A large database of hypernymy relations extracted from the web. In: LREC
Töpper G, Knuth M, Sack H (2012) DBpedia ontology enrichment for inconsistency detection. In: Proceedings of the 8th international conference on semantic systems. ACM, pp 33–40
Völker J, Niepert M (2011) Statistical schema induction. In: Extended semantic web conference. Springer, pp 124–138
Vrandečić D, Krötzsch M (2014) Wikidata: a free collaborative knowledge base. Commun ACM 57(10):78–85

Acknowledgements

We would like to thank Alexandra Hofmann, Samresh Perchani, and Jan Portisch, who helped develop the first prototype of DBkWik in the course of a student project.

Author information

Authors and Affiliations

Data and Web Science Group, University of Mannheim, Mannheim, Germany
Sven Hertling & Heiko Paulheim

Corresponding author

Correspondence to Sven Hertling.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hertling, S., Paulheim, H. DBkWik: extracting and integrating knowledge from thousands of Wikis. Knowl Inf Syst 62, 2169–2190 (2020). https://doi.org/10.1007/s10115-019-01415-5

Received: 02 January 2019
Revised: 03 October 2019
Accepted: 05 October 2019
Published: 02 November 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10115-019-01415-5
