Real-Time RDF Extraction from Unstructured Data Streams

  • Daniel Gerber
  • Sebastian Hellmann
  • Lorenz Bühmann
  • Tommaso Soru
  • Ricardo Usbeck
  • Axel-Cyrille Ngonga Ngomo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8218)

Abstract

The vision behind the Web of Data is to extend the current document-oriented Web with machine-readable facts and structured data, thus creating a representation of general knowledge. However, most of the Web of Data is limited to being a large compendium of encyclopedic knowledge describing entities. A huge challenge, the timely and massive extraction of RDF facts from unstructured data, has remained open so far. The availability of such knowledge on the Web of Data would provide significant benefits to manifold applications including news retrieval, sentiment analysis and business intelligence. In this paper, we address the problem of the timeliness of the Web of Data by presenting an approach that extracts RDF triples from unstructured data streams. We employ statistical methods in combination with deduplication, disambiguation and unsupervised as well as supervised machine learning techniques to create a knowledge base that reflects the content of the input streams. We evaluate a sample of the RDF we generate against a large corpus of news streams and show that we achieve a precision of more than 85%.
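
To make the idea of turning a text stream into RDF more concrete, the following is a minimal illustrative sketch, not the method of the paper. It assumes a toy pipeline with three steps named in the abstract: deduplication (here, hashing of normalized sentences), disambiguation (here, a hypothetical lookup table) and pattern-based extraction (here, a single hand-written surface pattern). All names, URIs under example.org and the pattern itself are assumptions made for illustration; the learned components described in the abstract are not modelled.

```python
import hashlib
import re

# Hypothetical lookup table standing in for a real disambiguation step.
ENTITY_URIS = {
    "Angela Merkel": "http://example.org/resource/Angela_Merkel",
    "Germany": "http://example.org/resource/Germany",
}

# Hypothetical surface pattern mapped to a made-up predicate URI.
PATTERNS = {
    "is the chancellor of": "http://example.org/ontology/chancellorOf",
}


def normalize(sentence):
    """Lowercase and collapse whitespace so near-identical sentences match."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def deduplicate(sentences):
    """Yield each sentence once, keyed by a hash of its normalized form."""
    seen = set()
    for sentence in sentences:
        key = hashlib.sha1(normalize(sentence).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield sentence


def extract_triples(sentence):
    """Return (subject, predicate, object) URI triples found in a sentence."""
    triples = []
    for pattern, predicate in PATTERNS.items():
        if pattern in sentence:
            left, right = sentence.split(pattern, 1)
            subj = ENTITY_URIS.get(left.strip())
            obj = ENTITY_URIS.get(right.strip(" ."))
            # Keep the triple only if both mentions resolve to a URI.
            if subj and obj:
                triples.append((subj, predicate, obj))
    return triples


def to_ntriples(triple):
    """Serialize a triple of URIs as one N-Triples line."""
    return "<%s> <%s> <%s> ." % triple


def process(stream):
    """Deduplicate the stream and return the extracted N-Triples lines."""
    lines = []
    for sentence in deduplicate(stream):
        for triple in extract_triples(sentence):
            lines.append(to_ntriples(triple))
    return lines


if __name__ == "__main__":
    stream = [
        "Angela Merkel is the chancellor of Germany.",
        "angela merkel  is the chancellor of germany.",  # near-duplicate
        "No known pattern here.",
    ]
    for line in process(stream):
        print(line)
```

In this sketch the second sentence is dropped as a duplicate of the first and the third matches no pattern, so a single triple is emitted. A real system would replace the lookup table and the fixed pattern with learned components.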

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Daniel Gerber (1)
  • Sebastian Hellmann (1)
  • Lorenz Bühmann (1)
  • Tommaso Soru (1)
  • Ricardo Usbeck (1)
  • Axel-Cyrille Ngonga Ngomo (1)

  1. Institut für Informatik, AKSW, Universität Leipzig, Leipzig, Germany