Extracting Definitions from Brazilian Legal Texts

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7335)


To avoid ambiguity and to ensure, as far as possible, a strict interpretation of the law, legal texts usually define the specific lexical terms used within their discourse by means of normative rules. Given the often large number of rules in effect in a given domain, extracting these definitions manually would be a costly undertaking. This paper presents an approach to this problem based on a variation of an automated natural language processing technique applied to Brazilian Portuguese texts. For the sake of generality, the proposed solution was developed to address the broader problem of building a glossary from domain-specific texts that contain definitions. The solution was applied to a corpus of texts from the telecommunications regulation domain, and the results are reported. The usual natural language processing pipeline was followed: preprocessing, segmentation, and part-of-speech tagging. A set of feature extraction functions is specified and used, along with reference glossary information on whether or not a text fragment is a definition, to train an SVM classifier. Finally, the definitions are extracted from the texts and evaluated on a test corpus, which also contains the reference glossary annotations on definitions. The results are then discussed in light of other definition extraction techniques.
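As a rough illustration of the segmentation and feature extraction steps described above, the following minimal Python sketch splits a text into sentences and maps each sentence to a small numeric feature vector. It is not the authors' implementation: the cue phrases in `DEFINITION_CUES`, the function names `segment` and `extract_features`, and the chosen features are all assumptions made for the example, and part-of-speech tagging is omitted.

```python
import re

# Hypothetical cue phrases that often introduce definitions in Portuguese
# legal texts. This list is an assumption, not taken from the paper.
DEFINITION_CUES = ("é", "são", "considera-se", "entende-se por", "define-se")


def segment(text):
    """Naive sentence segmentation: split after '.' or ';' followed by whitespace."""
    parts = re.split(r"(?<=[.;])\s+", text.strip())
    return [p for p in parts if p]


def extract_features(sentence):
    """Map a sentence to a feature vector:
    [has definitional cue, contains a colon, starts with uppercase, token count]."""
    lowered = sentence.lower()
    tokens = re.findall(r"\w+(?:-\w+)*", lowered)
    # A cue only counts when it appears as a whole word or phrase.
    has_cue = any(
        re.search(r"(?<!\w)" + re.escape(cue) + r"(?!\w)", lowered)
        for cue in DEFINITION_CUES
    )
    return [
        int(has_cue),
        int(":" in sentence),
        int(sentence[:1].isupper()),
        len(tokens),
    ]


text = (
    "Considera-se estação o conjunto de equipamentos. "
    "A Agência publicará o regulamento."
)
sentences = segment(text)
vectors = [extract_features(s) for s in sentences]
for s, v in zip(sentences, vectors):
    print(v, s)
```

In a full pipeline, vectors like these would be paired with the reference glossary labels (definition or not) and passed to an SVM implementation for training. The features actually used in the paper are likely richer than the ones assumed here.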


Keywords: Information extraction, Definition extraction, Natural Language Processing





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Graduate Program on Knowledge and IT Management, Catholic University of Brasilia, Brasília, Brazil
  2. Embrapa - Management and Strategy Secretariat, Brasília, Brazil
  3. Logistics and Information Technology Secretariat, Ministry of Planning, Budget and Management, Brasília, Brazil
  4. COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
  5. Centro de Tecnologia, Cidade Universitária, Rio de Janeiro, Brazil
