Abstract
It is estimated that more than 80% of the information on the Web is stored in textual form. As a result, it has become increasingly difficult for humans to sort through the daily influx of data and extract useful information from it. To automate this process, open information extraction (OIE) methods have been proposed, which extract facts from large text collections. Although most OIE methods were initially developed for English, recent literature increasingly recognizes the importance of developing methods for other languages, such as Portuguese. OIE methods based on hand-crafted rules and shallow syntactic analysis have performed well for English. Similar approaches for Portuguese, however, have not achieved equivalent success. We believe that the shallow syntactic patterns previously explored in the literature do not cover important aspects of Portuguese syntax. For this reason, we propose DptOIE, a method that combines a new set of syntax-based rules, applied over dependency parses with a depth-first search (DFS) algorithm, with a set of grammar-based rules that cover specific syntactic phenomena of the language. DptOIE was compared against state-of-the-art OIE systems for Portuguese and obtained favorable results both in our empirical evaluation and in the IberLEF evaluation track of OIE systems for Portuguese. Furthermore, we believe our method can be easily adapted to other Romance languages related to Portuguese.
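As a rough illustration of the general idea of traversing a dependency tree with a depth-first search to assemble a fact, the following minimal sketch extracts a (subject, relation, object) triple from a hand-annotated sentence. It is not the DptOIE rule set: the `Token` class, the helper functions, the single nsubj/obj rule, and the example parse are all assumptions made for illustration, with Universal Dependencies style labels written by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # surface form
    head: int    # index of the head token (0 marks the root)
    deprel: str  # dependency label, Universal Dependencies style


def children(tokens, head_idx):
    """Return the direct dependents of the token at head_idx."""
    return [t for t in tokens if t.head == head_idx]


def subtree(tokens, root_idx):
    """Collect every token dominated by root_idx using an iterative DFS."""
    by_idx = {t.idx: t for t in tokens}
    stack, visited = [root_idx], []
    while stack:
        idx = stack.pop()
        visited.append(by_idx[idx])
        stack.extend(c.idx for c in children(tokens, idx))
    # Restore surface order so the argument reads as in the sentence.
    return sorted(visited, key=lambda t: t.idx)


def text(tokens):
    return " ".join(t.form for t in tokens)


def extract(tokens):
    """Toy rule (an assumption): a head with both nsubj and obj yields a triple."""
    triples = []
    for head in tokens:
        kids = children(tokens, head.idx)
        subjects = [k for k in kids if k.deprel == "nsubj"]
        objects = [k for k in kids if k.deprel == "obj"]
        if subjects and objects:
            arg1 = text(subtree(tokens, subjects[0].idx))
            arg2 = text(subtree(tokens, objects[0].idx))
            triples.append((arg1, head.form, arg2))
    return triples


# Hand-written parse of "A empresa comprou uma fábrica"
# ("The company bought a factory"); the annotation is illustrative.
SENTENCE = [
    Token(1, "A", 2, "det"),
    Token(2, "empresa", 3, "nsubj"),
    Token(3, "comprou", 0, "root"),
    Token(4, "uma", 5, "det"),
    Token(5, "fábrica", 3, "obj"),
]

print(extract(SENTENCE))
```

In this sketch the DFS only gathers the words that belong to each argument; a real system would add rules for phenomena such as coordination, subordinate clauses, and omitted subjects, which a single nsubj/obj pattern does not handle.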
Notes
https://www.internetlivestats.com/ (accessed 03/24/2021).
http://www.internetworldstats.com/stats7.htm (accessed 03/24/2021).
ArgOE and DepOE identify clause-like structures in the sentence by applying manually crafted rules to the structure of the dependency trees. Their approach is more rigid in handling the rich variation in syntactic realizations of semantic relations and information structures.
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515 (accessed 03/24/2021).
https://github.com/citiususc/Linguakit (accessed 03/24/2021).
Available at: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515 (accessed 03/24/2021).
An improved version with a Portuguese tokenizer has been developed.
References
Akbik A, Broß J (2009) Wanderlust: extracting semantic relations from natural language text using dependency grammar patterns. In: SemSearch workshop day at World Wide Web conference (WWW2009), 2009, vol 48
Akbik A, Löser A (2012) KrakeN: N-ary facts in open information extraction. In: Proceedings of the joint workshop on automatic knowledge base construction and Web-scale knowledge extraction, 2012. Association for Computational Linguistics, pp 52–56
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the Web. IJCAI 7:2670–2676
Bassa A, Kroll M, Kern R (2018) GerIE—an open information extraction system for the German language. J Univers Comput Sci 24(1):2–24
Bast H, Haussmann E (2013) Open information extraction via contextual sentence decomposition. In: 2013 IEEE seventh international conference on semantic computing (ICSC), 2013. IEEE, pp 154–159
Bechara E (2012) Moderna gramática portuguesa. Nova Fronteira, Rio de Janeiro
Bender EM (2009) Linguistically naïve != language independent: why NLP needs linguistic typology. In: Proceedings of the EACL 2009 workshop on the interaction between linguistics and computational linguistics: virtuous, vicious or vacuous? 2009, pp 26–32
Buďa J (2017) A posição do adjetivo no sintagma nominal em português. Études romanes de Brno 38(1):219–238
Cabral B, Souza M, Claro DB (2020a) Explainable OpenIE classifier with morpho-syntactic rules. In: Proceedings of the workshop on hybrid intelligence for natural language processing tasks (HI4NLP 2020), 2020. CEUR-WS.org, pp 7–15
Cabral BS, Glauber R, Souza M, Claro DB (2020b) CrossOIE: cross-lingual classifier for open information extraction. In: International conference on computational processing of the Portuguese language, 2020. Springer, pp 368–378
Cimiano P, Wenderoth J (2005) Automatically learning Qualia structures from the Web. In: Proceedings of the ACL-SIGLEX workshop on deep lexical acquisition, 2005. Association for Computational Linguistics, pp 28–37
Claro DB, Souza M, Castellã Xavier C, Oliveira L (2019) Multilingual open information extraction: challenges and opportunities. Information 10(7):228. https://doi.org/10.3390/info10070228
Collovini S, Machado G, Vieira R (2016) Extracting and structuring open relations from Portuguese text. In: International conference on computational processing of the Portuguese language, 2016. Springer, pp 153–164
Collovini S, Neto JFS, Consoli BS, Terra J, Vieira R, Quaresma P, Souza M, Claro DB, Glauber R (2019) IberLEF 2019 Portuguese named entity recognition and relation extraction tasks. In: IberLEF@ SEPLN, 2019, pp 390–410
Cui L, Wei F, Zhou M (2018) Neural open information extraction. CoRR. arXiv:1805.04270
Damiano E, Minutolo A, Esposito M (2018) Open information extraction for Italian sentences. In: 2018 32nd International conference on advanced information networking and applications workshops (WAINA), 2018, pp 668–673. https://doi.org/10.1109/WAINA.2018.00165
Del Corro L, Gemulla R (2013) ClausIE: clause-based open information extraction. In: Proceedings of the 22nd international conference on World Wide Web, 2013. ACM, pp 355–366
Dryer MS, Haspelmath M (eds) (2013) WALS online. Max Planck Institute for Evolutionary Anthropology, Leipzig. https://wals.info/
Fader A, Soderland S, Etzioni O (2011) Identifying relations for open information extraction. In: Proceedings of the conference on empirical methods in natural language processing, 2011. Association for Computational Linguistics, pp 1535–1545
Faruqui M, Kumar S (2015) Multilingual open relation extraction using cross-lingual projection, pp 1351–1356. arXiv preprint. arXiv:1503.06450, http://www.aclweb.org/anthology/N15-1151
Gamallo P, Garcia M (2015) Multilingual open information extraction. In: Portuguese conference on artificial intelligence, 2015. Springer, pp 711–722
Gamallo P, Garcia M (2017) Linguakit: uma ferramenta multilingue para a análise linguística e a extração de informação. Linguamática 9(1):19–28
Gamallo P, Garcia M, Fernández-Lanza S (2012) Dependency-based open information extraction. In: Proceedings of the joint workshop on unsupervised and semi-supervised learning in NLP, 2012. Association for Computational Linguistics, pp 10–18
Garcia M, Gamallo P (2014) Entity-centric coreference resolution of person entities for open information extraction. Proces Leng Nat 53:25–32
Glauber R, Claro DB (2018) A systematic mapping study on open information extraction. Expert Syst Appl 112:372–387
Glauber R, de Oliveira LS, Sena CFL, Claro DB, Souza M (2018) Challenges of an annotation task for open information extraction in Portuguese. In: International conference on computational processing of the Portuguese language, 2018. Springer, pp 66–76
Guarasci R, Damiano E, Minutolo A, Esposito M, Pietro GD (2020) Lexicon-grammar based open information extraction from natural language sentences in Italian. Expert Syst Appl 143:112954. https://doi.org/10.1016/j.eswa.2019.112954
Jurafsky D, Martin JH (2017) Chapter 6: vector semantics. In: Jurafsky D, Martin JH (eds) Speech and language processing, 3rd edn. Prentice Hall, pp 101–130 (draft of 23 Sep 2018). https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
Kato MA (2000) A restrição de mono-argumentalidade da ordem vs no português do brasil. Fórum Linguíst 2(1):97–127
Kilgarriff A, Grefenstette G (2001) Web as corpus. In: Proceedings of corpus linguistics 2001, Corpus Linguistics. Readings in a widening discipline, 2001, pp 342–344
Léchelle W, Gotti F, Langlais P (2018) WiRe57: a fine-grained benchmark for open information extraction. arXiv preprint. arXiv:1809.08962
Leung H, Li CY, Li J, Li K, Ljubešić N, Loginova O, Lyashevskaya O, Lynn T, Macketanz V, Makazhanov A et al (2017) Universal dependencies 2.1
Lockard C, Shiralkar P, Dong XL (2019) OpenCeres: when open information extraction meets the semi-structured Web. In: Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: human language technologies: long and short papers, 2019, vol 1. Association for Computational Linguistics, Minneapolis, pp 3047–3056. https://doi.org/10.18653/v1/N19-1309
Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: ACL (system demonstrations), 2014, pp 55–60
Nivre J, Hall J, Nilsson J (2006) MaltParser: a data-driven parser-generator for dependency parsing. Proc LREC 6:2216–2219
Oliveira L, Glauber R, Claro DB (2017) DependentIE: an open information extraction system on Portuguese by a dependence analysis. In: ENIAC—2017 XIV Encontro Nacional de Inteligência Artificial e Computacional. http://comissoes.sbc.org.br/ce-ia/pg/historico/?file=ENIAC-2017|Anais-ENIAC-2017.pdf
Pereira V, Pinheiro V (2015) Report-um sistema de extração de informações aberta para língua portuguesa (report-an open information extraction system for Portuguese language). In: Proceedings of the 10th Brazilian symposium in information and human language technology, 2015, pp 191–200
Pilati E (2016) Sobre a ordem verbo-sujeito no português brasileiro: 30 anos em mirada crítica. Rev Linguíst 12(2):183–205. https://doi.org/10.31513/linguistica.2016.v12n2a5474
Ro Y, Lee Y, Kang P (2020) Multi²OIE: multilingual open information extraction based on multi-head attention with BERT. arXiv preprint. arXiv:2009.08128
Rodríguez JM, Merlino HD, Pesado P, García-Martínez R (2016) Performance evaluation of knowledge extraction methods. In: International conference on industrial engineering and other applications of applied intelligent systems, 2016. Springer, pp 16–22
Sacconi LA (2012) Gramática Para Todos os Cursos e Concursos - Teoria e Prática, 5th edn. Nova Geração
Santos D, Cardoso N (2007) Reconhecimento de entidades mencionadas em português: Documentação e atas do HAREM, a primeira avaliação conjunta na área. Linguateca, Lisboa
Schmitz M, Bart R, Soderland S, Etzioni O et al (2012) Open language learning for information extraction. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, 2012. Association for Computational Linguistics, pp 523–534
Sena CFL, Claro DB (2019) InferPortOIE: a Portuguese open information extraction system with inference. Nat Lang Eng 25:287–306. https://doi.org/10.1017/S135132491800044X
Sena CFL, Claro DB (2020) PragmaticOIE: a pragmatic open information extraction for Portuguese language. Knowl Inf Syst 62:3811–3836
Sena CFL, Glauber R, Claro DB (2017) Inference approach to enhance a Portuguese open information extraction. In: Proceedings of the 19th international conference on enterprise information systems (ICEIS), 2017, vol 1. INSTICC, ScitePress, pp 442–451. https://doi.org/10.5220/0006338204420451
Stanovsky G, Michael J, Zettlemoyer L, Dagan I (2018) Supervised open information extraction. In: Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: human language technologies: long papers, 2018, vol 1, pp 885–895
Teixeira RFA (1986) Zero Anaphora in Brazilian Portuguese subjects and objects: morphological and typological considerations (Brazil). University of California, Berkeley
Virtanen A, Kanerva J, Ilo R, Luoma J, Luotolahti J, Salakoski T, Ginter F, Pyysalo S (2019) Multilingual is not enough: BERT for Finnish. arXiv preprint. arXiv:1912.07076
Wu S, Dredze M (2020) Are all languages created equal in multilingual BERT? arXiv preprint. arXiv:2005.09093
Wu F, Weld DS (2010) Open information extraction using Wikipedia. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, 2010. Association for Computational Linguistics, pp 118–127
Xavier CC, de Lima VLS, Souza M (2013) Open information extraction based on lexical–syntactic patterns. In: 2013 Brazilian conference on intelligent systems (BRACIS), 2013. IEEE, pp 189–194
Xavier CC, de Lima VLS, Souza M (2015) Open information extraction based on lexical semantics. J Braz Comput Soc 21(1):4
Zeman D, Hajič J, Popel M, Potthast M, Straka M, Ginter F, Nivre J, Petrov S (2018) CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies. In: Proceedings of the CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies, 2018. Association for Computational Linguistics, Brussels, pp 1–21. http://www.aclweb.org/anthology/K18-2001
Acknowledgements
We would like to thank FAPESB and CAPES Finance Code 001 for their financial support.
About this article
Cite this article
Oliveira, L., Claro, D.B. & Souza, M. DptOIE: a Portuguese open information extraction based on dependency analysis. Artif Intell Rev 56, 7015–7046 (2023). https://doi.org/10.1007/s10462-022-10349-4
DOI: https://doi.org/10.1007/s10462-022-10349-4