Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary

Part of the Lecture Notes in Computer Science book series (LNAI, volume 9924)

Abstract

In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (the Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our method makes it possible to create tailor-made standardization dictionaries for historical Portuguese, optionally with frequencies per period or author.
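The general idea of rule-based normalization combined with form alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the paper: the three rewrite rules, the function names `normalize` and `build_dictionary`, and the period labels are all hypothetical, and the real pipeline hands the normalized text to PALAVRAS rather than stopping at a lookup table.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rewrite rules, for illustration only (not the paper's rule set).
RULES = [
    (re.compile(r"ph"), "f"),                          # philosophia -> filosofia
    (re.compile(r"^he$"), "é"),                        # he -> é
    (re.compile(r"([aeiou])ll([aeiou])"), r"\1l\2"),   # ella -> ela
]


def normalize(word):
    """Apply each rewrite rule in order to a lowercased word form."""
    form = word.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def build_dictionary(tokens_by_period):
    """Align original and normalized forms, counting variants per period.

    Returns a mapping: normalized form -> original variant -> Counter of periods.
    Tokens that the rules leave unchanged are not recorded.
    """
    entries = defaultdict(lambda: defaultdict(Counter))
    for period, tokens in tokens_by_period.items():
        for token in tokens:
            original = token.lower()
            standard = normalize(original)
            if standard != original:
                entries[standard][original][period] += 1
    return entries


if __name__ == "__main__":
    # Toy input with made-up period labels.
    sample = {
        "16th": ["Ella", "he", "casa"],
        "19th": ["ella", "philosophia"],
    }
    for standard, variants in build_dictionary(sample).items():
        for original, periods in variants.items():
            print(standard, original, dict(periods))
```

Keeping the period counts on each original variant, rather than on the normalized form alone, is what would let such a dictionary report when a given spelling was in use.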

Keywords

  • Historical corpus
  • Corpus annotation
  • Dictionary

Notes

  1. (1) Original version: http://corporavm.uni-koeln.de/colonia; (2) With our annotation and normalized lemmas: http://corp.hum.sdu.dk/cqp.pt.html.

  2. http://www.usp.br/gmhp/CorpI.html.

  3. TreeTagger does not distinguish between common and proper nouns, but for the ‘unknown’ count, names were removed by inspection.

  4. At the time of writing, it was not clear whether this text had been subject to philological editing in its current form, which might explain its fairly modern orthography.

  5. Parts of fused tokens were counted individually in the statistics, so the token count is higher than it would be if the original text tokens were counted as-is.

  6. Note that the figures constitute a lower bound. In order to achieve a precision close to 100 %, only chunks with at least 4 non-name words (3 for clear Latin) were treated, so individual loan words or mini-quotes are not included.

References

  1. Bick, E.: PALAVRAS, a constraint grammar-based parsing system for Portuguese. In: Working with Portuguese Corpora, pp. 279–302 (2014)

  2. Bick, E., Módolo, M.: Letters and editorials: a grammatically annotated corpus of 19th century Brazilian Portuguese. In: Proceedings of the 2nd Freiburg Workshop on Romance Corpus Linguistics, pp. 271–280 (2005)

  3. Britto, H., Finger, M., Galves, C.: Computational and linguistic aspects of the Tycho Brahe parsed corpus of historical Portuguese. In: Romance Corpus Linguistics: Corpora and Spoken Language, pp. 137–146 (2002)

  4. Davies, M.: Creating and using the corpus do Português and the frequency dictionary of Portuguese. In: Working with Portuguese Corpora, pp. 89–110 (2014)

  5. Galves, C., Faria, P.: Tycho Brahe Parsed Corpus of Historical Portuguese (2010). http://www.tycho.iel.unicamp.br/tycho/corpus/en/index.html

  6. Hendrickx, I., Marquilhas, R.: From old texts to modern spellings: an experiment in automatic normalisation. JLCL 26(2), 65–76 (2011)

  7. Hirohashi, A.: Aprendizado de Regras de Substituição para Normatização de Textos Históricos (2005)

  8. Junior, A.C., Aluísio, S.M.: Building a corpus-based historical Portuguese dictionary: challenges and opportunities. TAL 50(2), 73–102 (2009)

  9. Murakawa, C.D.A.A.: A Construção de um Dicionário Histórico: o Caso do Dicionário Histórico do Português do Brasil-séculos XVI, XVII e XVIII. Estudos de Lingüística Galega 6, 199–216 (2014)

  10. Nevins, A., Rodrigues, C., Tang, K.: The rise and fall of the L-shaped morphome: diachronic and experimental studies. Probus 27(1), 101–155 (2015)

  11. Niculae, V., Zampieri, M., Dinu, L.P., Ciobanu, A.M.: Temporal text ranking and automatic dating of texts. In: Proceedings of EACL, pp. 17–21 (2014)

  12. Rocio, V., Alves, M.A., Lopes, J.G., Xavier, M.F., Vicente, G.: Automated creation of a medieval Portuguese partial treebank. In: Abeillé, A. (ed.) Treebanks, pp. 211–227. Springer, Heidelberg (2003)

  13. Santos, D., Mota, C.: A Admiração à Luz dos Corpos. Oslo Stud. Lang. 7(1), 57–77 (2015)

  14. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of International Conference on New Methods in Language Processing, pp. 44–49 (1994)

  15. Silvestre, J.P., Villalva, A.: A morphological historical root dictionary for Portuguese, pp. 967–971 (2014)

  16. Zampieri, M., Becker, M.: Colonia: corpus of historical Portuguese. ZSM Studien, Special Volume on Non-standard Data Sources in Corpus-Based Research, pp. 77–84 (2013)

  17. Zampieri, M., Malmasi, S., Dras, M.: Modeling language change in historical corpora: the case of Portuguese. In: Proceedings of LREC, pp. 4098–4104 (2016)

Author information

Corresponding author

Correspondence to Marcos Zampieri.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bick, E., Zampieri, M. (2016). Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_1

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science, Computer Science (R0)