Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary

  • Conference paper
Text, Speech, and Dialogue (TSD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Abstract

In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our method makes it possible to create tailor-made standardization dictionaries for historical Portuguese with optional period or author frequencies.
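As a rough illustration of the idea in the abstract, the following Python sketch applies a few ordered rewrite rules to historical spellings and then aligns each original form with its normalized form, counting occurrences per period. The rule set, the function names `normalize` and `build_dictionary`, and the period labels are assumptions made for this example; they are not the rules or interfaces of the system described in the paper, and accent restoration (e.g. farmacia vs. farmácia) is deliberately left out.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rewrite rules for illustration only; the paper's actual
# rule set is not reproduced here. Rules are applied in order.
RULES = [
    (re.compile(r"ph"), "f"),                         # pharmacia -> farmacia
    (re.compile(r"th"), "t"),                         # theatro -> teatro
    (re.compile(r"(?<=[aeiou])ll(?=[aeiou])"), "l"),  # ella -> ela
    (re.compile(r"y"), "i"),                          # ley -> lei
]


def normalize(word):
    """Lowercase a token and apply every rewrite rule in sequence."""
    out = word.lower()
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out


def build_dictionary(tokens_by_period):
    """Align original and normalized forms.

    Returns a nested mapping:
        normalized form -> original (lowercased) variant -> Counter of periods
    """
    table = defaultdict(lambda: defaultdict(Counter))
    for period, tokens in tokens_by_period.items():
        for token in tokens:
            table[normalize(token)][token.lower()][period] += 1
    return table


if __name__ == "__main__":
    # Toy input: period labels and tokens are invented for the example.
    corpus = {
        "18th century": ["ella", "Ella", "ela", "pharmacia"],
        "19th century": ["ela", "theatro"],
    }
    dictionary = build_dictionary(corpus)
    for standard, variants in sorted(dictionary.items()):
        for variant, periods in sorted(variants.items()):
            print(standard, variant, dict(periods))
```

Keeping the per-period counts in a `Counter` means the same table could be keyed by author instead of period without changing the code, which mirrors the "optional period or author frequencies" mentioned above.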

Notes

  1. (1) Original version: http://corporavm.uni-koeln.de/colonia; (2) With our annotation and normalized lemmas: http://corp.hum.sdu.dk/cqp.pt.html.

  2. http://www.usp.br/gmhp/CorpI.html.

  3. TreeTagger does not distinguish between common and proper nouns, but for the ‘unknown’ count, names were removed by inspection.

  4. At the time of writing it was not clear if this text had been subject to philological editing in its current form, which might explain its fairly modern orthography.

  5. Parts of fused tokens were counted individually in the statistics; the token count is therefore higher than it would be if the original text tokens were counted as-is.

  6. Note that the figures constitute a lower bound. In order to achieve a precision close to 100%, only chunks with at least 4 non-name words (3 for clearly Latin chunks) were treated, so individual loan words or mini-quotes are not included.

References

  1. Bick, E.: PALAVRAS, a constraint grammar-based parsing system for Portuguese. In: Working with Portuguese Corpora, pp. 279–302 (2014)

  2. Bick, E., Módolo, M.: Letters and editorials: a grammatically annotated corpus of 19th century Brazilian Portuguese. In: Proceedings of the 2nd Freiburg Workshop on Romance Corpus Linguistics, pp. 271–280 (2005)

  3. Britto, H., Finger, M., Galves, C.: Computational and linguistic aspects of the Tycho Brahe parsed corpus of historical Portuguese. In: Romance Corpus Linguistics: Corpora and Spoken Language, pp. 137–146 (2002)

  4. Davies, M.: Creating and using the corpus do Português and the frequency dictionary of Portuguese. In: Working with Portuguese Corpora, pp. 89–110 (2014)

  5. Galves, C., Faria, P.: Tycho Brahe Parsed Corpus of Historical Portuguese (2010). http://www.tycho.iel.unicamp.br/tycho/corpus/en/index.html

  6. Hendrickx, I., Marquilhas, R.: From old texts to modern spellings: an experiment in automatic normalisation. JLCL 26(2), 65–76 (2011)

  7. Hirohashi, A.: Aprendizado de Regras de Substituição para Normatização de Textos Históricos (2005)

  8. Junior, A.C., Aluísio, S.M.: Building a corpus-based historical Portuguese dictionary: challenges and opportunities. TAL 50(2), 73–102 (2009)

  9. Murakawa, C.D.A.A.: A Construção de um Dicionário Histórico: o Caso do Dicionário Histórico do Português do Brasil-séculos XVI, XVII e XVIII. Estudos de Lingüística Galega 6, 199–216 (2014)

  10. Nevins, A., Rodrigues, C., Tang, K.: The rise and fall of the L-shaped morphome: diachronic and experimental studies. Probus 27(1), 101–155 (2015)

  11. Niculae, V., Zampieri, M., Dinu, L.P., Ciobanu, A.M.: Temporal text ranking and automatic dating of texts. In: Proceedings of EACL, pp. 17–21 (2014)

  12. Rocio, V., Alves, M.A., Lopes, J.G., Xavier, M.F., Vicente, G.: Automated creation of a medieval Portuguese partial treebank. In: Abeillé, A. (ed.) Treebanks, pp. 211–227. Springer, Heidelberg (2003)

  13. Santos, D., Mota, C.: A Admiração à Luz dos Corpos. Oslo Stud. Lang. 7(1), 57–77 (2015)

  14. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of International Conference on New Methods in Language Processing, pp. 44–49 (1994)

  15. Silvestre, J.P., Villalva, A.: A morphological historical root dictionary for Portuguese, pp. 967–971 (2014)

  16. Zampieri, M., Becker, M.: Colonia: corpus of historical Portuguese. ZSM Studien, Special Volume on Non-standard Data Sources in Corpus-Based Research, pp. 77–84 (2013)

  17. Zampieri, M., Malmasi, S., Dras, M.: Modeling language change in historical corpora: the case of Portuguese. In: Proceedings of LREC, pp. 4098–4104 (2016)


Author information

Corresponding author

Correspondence to Marcos Zampieri.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bick, E., Zampieri, M. (2016). Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_1

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science, Computer Science (R0)
