Web-Scale Extension of RDF Knowledge Bases from Templated Websites

  • Lorenz Bühmann
  • Ricardo Usbeck
  • Axel-Cyrille Ngonga Ngomo
  • Muhammad Saleem
  • Andreas Both
  • Valter Crescenzi
  • Paolo Merialdo
  • Disheng Qiu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8796)

Abstract

Only a small fraction of the information on the Web is represented as Linked Data. This lack of coverage is partly due to the paradigms followed so far to extract Linked Data. While converting structured data to RDF is well supported by tools, most approaches to extracting RDF from semi-structured data rely on ad-hoc extraction methods. In this paper, we present a holistic and open-source framework for the extraction of RDF from templated websites. We discuss the architecture of the framework and the initial implementation of each of its components. In particular, we present a novel wrapper induction technique that does not require any human supervision to detect wrappers for websites. Our framework also includes a consistency layer with which the data extracted by the wrappers can be checked for logical consistency. We evaluate the initial version of the framework, REX, on three different datasets. Our results clearly show the potential of using templated Web pages to extend the Linked Data Cloud. Moreover, our results indicate the weaknesses of our current implementations and how they can be extended.
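
As an illustration of the two ideas named in the abstract (a wrapper applied to templated pages, followed by a logical consistency check), here is a minimal sketch. It is not the REX implementation. The pages, the XPath expressions, the `ex:` names, and the disjointness table are hypothetical, and the wrapper is written by hand here, whereas the paper describes inducing wrappers without human supervision.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages generated from the same template
# (kept as well-formed XHTML so the standard-library parser can read them).
PAGES = [
    "<html><body><h1>Alien</h1><div class='info'>"
    "<span class='director'>Ridley Scott</span></div></body></html>",
    "<html><body><h1>Heat</h1><div class='info'>"
    "<span class='director'>Michael Mann</span></div></body></html>",
]

# Assumption: a wrapper is a pair of XPath expressions, one for the
# subject and one for the object of the extracted triple.
WRAPPER = {"subject": "./body/h1", "object": ".//span[@class='director']"}
PROPERTY = "ex:director"  # hypothetical property name

# Hypothetical background knowledge used by the consistency check.
TYPES = {
    "Alien": "ex:Film",
    "Heat": "ex:Film",
    "Ridley Scott": "ex:Person",
    "Michael Mann": "ex:Person",
}
DOMAIN, RANGE = "ex:Film", "ex:Person"
DISJOINT = {frozenset(("ex:Film", "ex:Person"))}


def extract(page, wrapper, prop):
    """Apply the wrapper to one page and return a (s, p, o) tuple, or None."""
    root = ET.fromstring(page)
    s = root.find(wrapper["subject"])
    o = root.find(wrapper["object"])
    if s is None or o is None or not s.text or not o.text:
        return None
    return (s.text.strip(), prop, o.text.strip())


def violates(entity_type, expected_type):
    """True if the known type is declared disjoint with the expected type."""
    if entity_type is None:
        return False  # unknown entities cannot be refuted
    return frozenset((entity_type, expected_type)) in DISJOINT


def is_consistent(triple):
    """Check the triple against the property's domain and range."""
    s, _, o = triple
    return not violates(TYPES.get(s), DOMAIN) and not violates(TYPES.get(o), RANGE)


triples = [t for t in (extract(p, WRAPPER, PROPERTY) for p in PAGES) if t]
consistent = [t for t in triples if is_consistent(t)]
```

Because both pages share one template, the same pair of XPath expressions yields one triple per page. The check only rejects a triple when a known type is explicitly disjoint with the expected domain or range, so entities with unknown types pass through.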

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Lorenz Bühmann (1)
  • Ricardo Usbeck (1, 2)
  • Axel-Cyrille Ngonga Ngomo (1)
  • Muhammad Saleem (1)
  • Andreas Both (2)
  • Valter Crescenzi (3)
  • Paolo Merialdo (3)
  • Disheng Qiu (3)

  1. Universität Leipzig, IFI/AKSW, Germany
  2. Unister GmbH, Leipzig, Germany
  3. Università Roma Tre, Italy
