
Early Experiments on Automatic Annotation of Portuguese Medieval Texts

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2022)

Abstract

This paper presents the challenges encountered, and the solutions adopted, in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), paving the way for the automatic annotation of these Medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (\(\sim\)155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the same text, and 67.4% on a new, unseen text. A second model was then trained on the data from these three texts and applied to two further unseen texts, achieving a precision of 77.3% and 82.4%, respectively.
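As a reading aid for the figures above, the following is a minimal sketch of how a token-level tagging score of this kind can be computed against a manually annotated text. It assumes that precision here means the share of tokens whose predicted tag matches the gold tag; the abstract does not define the measure, so this is an assumption. The function name `tagging_precision`, the toy tokens, and the coarse tags are hypothetical and are not taken from the paper's tagset or corpus.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, PoS tag)


def tagging_precision(gold: List[Token], predicted: List[Token]) -> float:
    """Return the share of tokens whose predicted tag equals the gold tag.

    Assumes both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty text")
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical toy data with placeholder tags (not the paper's tagset).
gold = [("el", "DET"), ("rei", "N"), ("disse", "V"), ("que", "CONJ")]
predicted = [("el", "DET"), ("rei", "N"), ("disse", "V"), ("que", "PRON")]

score = tagging_precision(gold, predicted)
print(f"{score:.1%}")  # 75.0%
```

With three of the four toy tokens tagged correctly, the score is 75.0%. With a highly granular tagset, a prediction counts as correct under this sketch only if the full tag matches, which is one reason scores on unseen texts can drop sharply.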

Research for this paper was partially funded by public funds through Fundação para a Ciência e a Tecnologia: J. Baptista and F. Batista (INESC-ID Lisboa, proj. ref. UIDB/50021/2020), E. Cardeira (School of Arts and Humanities, Center of Linguistics, University of Lisbon, proj. ref. UIDP/00214/2020) and M.I. Bico through a Ph.D. grant (proj. ref. UI/BD/152806/2022).

Notes

  1. http://teitok.clul.ul.pt/teitok/cta/ (last access: 2022/09/05 01:11:19). All remaining URLs in this paper were checked on this date.

  2. http://www.clul.ulisboa.pt/grupo/filologia.

  3. http://teitok.clul.ul.pt/teitok/cta/index.php?action=criterios.

  4. Biblioteca Nacional (Portugal), Alc.198, fls. 1r–155r.

  5. Biblioteca do Museu de Aveiro, ms. 1 [33/CD], fls. 48a–110b.

  6. Biblioteca Mun. Porto, Safe n. 527 (Cat. n. 683), ff. 196v–208v.

  7. Arq. Mun. Alfredo Pimenta (Guimarães), Ms. da Colegiada 793, fls. 211r–236r.

  8. Arq. Nac. Torre do Tombo. Fragm., Cx. 21, n.26 (Casa Forte). Lorvão, Livro 10, fl. 13r. Fragm., Cx. 21, n.23a (Casa Forte).

  9. Lisboa, Valentim Fernandes, [1496?]. Biblioteca Nacional (Portugal), Inc. 571.

  10. Biblioteca do Museu de Aveiro, ms. 1 [33/CD], fls. 48a–110b.

  11. https://collatex.net/.

Author information

Corresponding author

Correspondence to Maria Inês Bico.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Bico, M.I., Baptista, J., Batista, F., Cardeira, E. (2022). Early Experiments on Automatic Annotation of Portuguese Medieval Texts. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_44

  • DOI: https://doi.org/10.1007/978-3-031-16802-4_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16801-7

  • Online ISBN: 978-3-031-16802-4

  • eBook Packages: Computer Science, Computer Science (R0)
