Chaudron: Extending DBpedia with Measurement

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10249)

Abstract

Wikipedia is the largest collaborative encyclopedia and is the source for DBpedia, a central dataset of the LOD cloud. Because of its general scope, Wikipedia contains numerous numerical measurements of the entities it describes. The DBpedia Information Extraction Framework transforms semi-structured data from Wikipedia into structured RDF. However, this extraction framework offers only limited support for handling measurements in Wikipedia.

In this paper, we describe the automated process that enables the creation of the Chaudron dataset. As an alternative to the traditional approach of creating mappings from the Wikipedia dump, we propose an extraction that also uses the rendered HTML, which avoids the template transclusion issue.

This dataset extends DBpedia with more than 3.9 million triples and 949,000 measurements across every domain covered by DBpedia. We define a multi-level approach powered by a formal grammar that proves very robust for measurement extraction. An extensive evaluation against DBpedia and Wikidata shows that our approach substantially outperforms both for measurement extraction from Wikipedia infoboxes: Chaudron achieves an F1-score of 0.89, while DBpedia and Wikidata reach 0.38 and 0.10, respectively, on this task.
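To illustrate what grammar-driven measurement extraction can look like, the following is a minimal sketch in Python. It is not the grammar used by Chaudron: the toy grammar, the token classes, and the names `Measurement`, `tokenize`, and `parse_measurement` are all assumptions made for this example. It parses the text of a single rendered infobox cell into a numeric value and a unit, and rejects anything the grammar does not cover.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical toy grammar (an assumption, not Chaudron's actual grammar):
#   measurement ::= number unit
#   number      ::= sign? digits ("," 3digits)* ("." digits)?
#   unit        ::= symbol ("/" symbol)?
TOKEN = re.compile(
    r"""
    (?P<number>[+-]?\d{1,3}(?:,\d{3})+(?:\.\d+)?|[+-]?\d+(?:\.\d+)?)
  | (?P<symbol>°[A-Za-z]|[A-Za-zµ]+[23²³]?|%)
  | (?P<slash>/)
  | (?P<space>\s+)
    """,
    re.VERBOSE,
)


class Measurement(NamedTuple):
    value: float
    unit: str


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into (kind, lexeme) pairs; raise ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup != "space":  # whitespace only separates tokens
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_measurement(text: str) -> Optional[Measurement]:
    """Return a Measurement if text matches the toy grammar, else None."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return None
    # measurement ::= number unit
    if len(tokens) < 2 or tokens[0][0] != "number" or tokens[1][0] != "symbol":
        return None
    value = float(tokens[0][1].replace(",", ""))  # drop thousands separators
    unit = tokens[1][1]
    rest = tokens[2:]
    if rest:
        # unit ::= symbol "/" symbol
        if len(rest) != 2 or rest[0][0] != "slash" or rest[1][0] != "symbol":
            return None
        unit += "/" + rest[1][1]
    return Measurement(value, unit)
```

In this sketch, `parse_measurement("1,234.5 km")` yields a value of `1234.5` with unit `km`, while free text such as `"about 3 m"` is rejected and returns `None`. Rejecting input that falls outside the grammar, rather than guessing, is one way a grammar-based extractor can favor precision.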

Keywords

Wikipedia, Extraction, DBpedia, Measurement, RDF, Formal grammar

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Univ Lyon, UJM-Saint-Etienne, CNRS, Laboratoire Hubert Curien, UMR 5516, Saint-Etienne, France
