Semantic Relation Extraction. Resources, Tools and Strategies

  • Marcos Garcia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9727)

Abstract

Relation extraction is a subtask of information extraction that aims to obtain instances of semantic relations present in texts. This information can be arranged in machine-readable formats, which are useful for applications that need structured semantic knowledge. The work presented in this paper explores different strategies to automate the extraction of semantic relations from texts in Portuguese, Galician and Spanish. Both machine learning (distant-supervised and supervised) and rule-based techniques are investigated, and the impact of different levels of linguistic knowledge is analyzed for each approach. Regarding domains, the experiments focus on the extraction of encyclopedic knowledge, through the development of biographical relation classifiers (in a closed domain) and the evaluation of an open information extraction tool. To implement the extraction systems, several natural language processing tools have been built for the three research languages, from sentence splitting and tokenization modules to part-of-speech taggers, named entity recognizers and coreference resolution systems. Furthermore, several lexica and corpora have been compiled and enriched with different levels of linguistic annotation, which are useful for both training and testing probabilistic and symbolic models. As a result of this work, new resources and tools are available for the automated processing of texts in Portuguese, Galician and Spanish.

Keywords

Information extraction, Natural language processing, Named entity recognition, Part-of-speech tagging, Coreference resolution

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Grupo LyS, Departamento de Galego-Português, Francês e Linguística, Faculdade de Filologia, Universidade da Coruña, Campus da Coruña, Coruña, Spain
