Abstract
To avoid ambiguity and to ensure, as far as possible, a strict interpretation of the law, legal texts usually define the specific lexical terms used within their discourse by means of normative rules. Because a large number of rules is often in effect in a given domain, extracting these definitions manually would be costly. This paper presents an approach to this problem based on a variation of an automated natural language processing technique, applied to Brazilian Portuguese texts. For the sake of generality, the proposed solution was developed to address the broader problem of building a glossary from domain-specific texts that contain definitions. The solution was applied to a corpus of texts from the telecommunications regulation domain, and the results are reported. The usual natural language processing pipeline was followed: preprocessing, segmentation, and part-of-speech tagging. A set of feature extraction functions is specified and used, together with reference glossary information on whether or not a text fragment is a definition, to train an SVM classifier. Finally, the definitions are extracted from the texts and evaluated on a test corpus, which also carries the reference glossary annotations on definitions. The results are then discussed in light of other definition extraction techniques.
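As a rough illustration of the segmentation and feature-extraction steps named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the cue phrases, the feature names, and the functions `segment`, `tokenize`, `extract_features` and `to_vector` are all hypothetical, the sentence splitter is deliberately naive, and part-of-speech tagging and SVM training are left out.

```python
import re

# Hypothetical cue phrases that may signal a definition in Portuguese legal
# text; illustrative only, not the feature set used in the paper.
CUE_PATTERNS = [
    r"\bconsidera-se\b",
    r"\bentende-se por\b",
    r"\bdefine-se\b",
    r"\bdenomina-se\b",
]

FEATURE_KEYS = ("n_tokens", "has_cue", "has_colon", "has_copula")


def segment(text):
    """Naive segmentation: split on whitespace that follows '.', ';' or '?'.

    Abbreviations such as 'Art. 3' would be split incorrectly; a trained
    sentence-boundary detector would be needed in practice.
    """
    parts = re.split(r"(?<=[.;?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(fragment):
    """Split a fragment into word tokens (keeping hyphenated forms) and punctuation."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", fragment)


def extract_features(fragment):
    """Map a text fragment to a small dictionary of assumed surface features."""
    tokens = tokenize(fragment)
    lowered = fragment.lower()
    return {
        "n_tokens": len(tokens),
        "has_cue": int(any(re.search(p, lowered) for p in CUE_PATTERNS)),
        "has_colon": int(":" in tokens),
        "has_copula": int("é" in (t.lower() for t in tokens)),
    }


def to_vector(features, keys=FEATURE_KEYS):
    """Order the features into a numeric vector suitable for a classifier."""
    return [features[k] for k in keys]


if __name__ == "__main__":
    sample = (
        "Para os fins desta norma, considera-se rede o conjunto de "
        "equipamentos interligados. O prazo é de trinta dias."
    )
    for fragment in segment(sample):
        print(to_vector(extract_features(fragment)), fragment)
```

Vectors like these, paired with labels saying whether each fragment is a definition in the reference glossary, are the kind of input a classifier such as an SVM would be trained on.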
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferneda, E., do Prado, H.A., Batista, A.H., Pinheiro, M.S. (2012). Extracting Definitions from Brazilian Legal Texts. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2012. ICCSA 2012. Lecture Notes in Computer Science, vol 7335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31137-6_48
DOI: https://doi.org/10.1007/978-3-642-31137-6_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31136-9
Online ISBN: 978-3-642-31137-6
eBook Packages: Computer Science (R0)