Data Extraction Using NLP Techniques and Its Transformation to Linked Data

  • Vincent Kríž
  • Barbora Hladká
  • Martin Nečaský
  • Tomáš Knap
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8856)

Abstract

We present a system that extracts a knowledge base from raw unstructured texts. The knowledge base is designed as a set of entities and their relations and is represented in an ontological framework. The extraction pipeline processes input texts with linguistically aware tools and extracts entities and relations from their syntactic representation. The extracted data is then represented according to the Linked Data principles. The system is designed to be both domain- and language-independent and provides users with data for more intelligent search than full-text search. We present our first case study on processing Czech legal texts.
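
To make the pipeline idea concrete, here is a minimal sketch of extracting relations from a syntactic representation and emitting them as Linked Data triples. It is an illustration under stated assumptions, not the authors' implementation: the `Token` structure, the dependency labels (`pred`, `subj`, `obj`), the `http://example.org/` namespace, and the English toy sentence are all hypothetical, and a real system would obtain the parse from a linguistic toolkit rather than from hand-written input.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # surface form of the word
    head: int    # index of the syntactic parent, 0 for the root
    deprel: str  # dependency label (hypothetical tagset: pred, subj, obj)


# Placeholder namespace, assumed for illustration only.
BASE = "http://example.org/"


def extract_triples(tokens):
    """Collect (subject, predicate, object) for every predicate node
    that has both a subject child and an object child."""
    triples = []
    for pred in (t for t in tokens if t.deprel == "pred"):
        children = [t for t in tokens if t.head == pred.idx]
        subjects = [t for t in children if t.deprel == "subj"]
        objects = [t for t in children if t.deprel == "obj"]
        for s in subjects:
            for o in objects:
                triples.append((s.form, pred.form, o.form))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples lines, percent-encoding each name
    so that it is safe to embed in an IRI."""
    lines = []
    for s, p, o in triples:
        lines.append(
            f"<{BASE}entity/{quote(s)}> "
            f"<{BASE}relation/{quote(p)}> "
            f"<{BASE}entity/{quote(o)}> ."
        )
    return "\n".join(lines)


# Hand-built dependency tree for the toy sentence "lessor notifies tenant".
sentence = [
    Token(1, "lessor", 2, "subj"),
    Token(2, "notifies", 0, "pred"),
    Token(3, "tenant", 2, "obj"),
]

triples = extract_triples(sentence)
print(to_ntriples(triples))
```

Keeping extraction (tree to triples) separate from serialization (triples to RDF) mirrors the two stages named in the abstract, so either stage could be replaced without touching the other.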

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Vincent Kríž (1)
  • Barbora Hladká (1)
  • Martin Nečaský (2)
  • Tomáš Knap (2)
  1. Institute of Formal and Applied Linguistics, Charles University in Prague, Praha 1, Czech Republic
  2. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Praha 1, Czech Republic
