Syntactic annotation for Portuguese corpora: standards, parsers, and search interfaces

  • Original Paper
Language Resources and Evaluation

Abstract

Over the last two decades, four syntactically annotated corpora of Portuguese have been built along the lines initially defined for the Penn Parsed Historical Corpora (Santorini, 2016). They cover the old, middle, classical and modern periods of European Portuguese, as well as nineteenth- and twentieth-century Brazilian Portuguese, and include different textual genres and excerpts of oral discourse. Together they provide a fundamental resource for the study of variation and change in Portuguese. In recent years, an effort has been made to unify the annotation scheme applied to these corpora as far as possible, so that searches run on one corpus can be run in exactly the same manner on the others. This effort resulted in the Portuguese Syntactic Annotation Manual (Magro & Galves, 2019). In this paper, we present the syntactic annotation for the Portuguese Corpora. We describe the functioning of ParsPort, a rule-based parser which makes use of the revision mode of the query language CorpusSearch (Randall, 2005–2015). We argue that ParsPort is more efficient for our annotation efforts than the probabilistic parser developed by Bikel (2004), previously used for the syntactic annotation of the Portuguese Corpora. Finally, we mention recent advances towards more user-friendly tools for syntactic searches.

Notes

  1. The Portuguese corpora referred to in this paper correspond only to the four corpora that encode syntactic information using the Penn Parsed Corpora of Historical English model. In addition to these, it is worth mentioning other Portuguese treebanks, such as TreeBankPT (Branco et al., 2011) and Floresta Sintática (Freitas et al., 2008), which adopt systems also based on constituency relations, or CINTIL-UDep (Branco et al., 2022) and the UD Portuguese Bosque Treebank (Rademaker et al., 2017), which follow dependency-type annotation systems.

  2. The Manual is available at http://alfclul.clul.ul.pt/portuguesesyntacticannotation/home.html and https://www.tycho.iel.unicamp.br/manual

  3. Depending on the variety, contraction also occurs between prepositions and pronominal clitics, and between verbs and determiners. Note incidentally that, to represent the phenomenon of mesoclisis, in which the clitic is interpolated between the verbal root and the future inflection, an additional symbol (!) had to be created, e.g. fa-lo-ei/VB-R! CL.

  4. eDictor is freely downloadable from http://www.tycho.iel.unicamp.br/~tycho/apps/ or from http://edictor.net

  5. The examples are all taken from the Portuguese Corpora. In addition to the code of the sentence, which appears at the end of the tree together with the name of the Corpus to which it belongs, we cite the corresponding text in the way it is presented in its respective corpus: the name of the author, his birthdate, and the genre of the text in the TBC; the type of letter and its date of production in the Post Scriptum Corpus; the name of the text and its century in the WOChWEL Corpus; and the place and the year of the interview in the CORDIAL-SIN Corpus.

  6. Martins (1994), Galves et al. (2017), among others.

  7. cf. Manual: 8.5.

  8. Additionally, adjectives and adverbs can have complements, in which case they also head a phrase.

  9. By convention, traces are annotated in the first position in IP. (cf. Manual: 12.7).

  10. cf. Manual: 4.2; 4.3.

  11. cf. Manual: 13.1.13.

  12. cf. Manual: 13.1.5.

  13. cf. Manual: 13.1.4.2.2.

  14. cf. Manual: 19.

  15. This is a nice example of the treatment of non-standard structures. It should be emphasized that, in historical corpora, the very notion of (non-)standard structures has no real meaning, since what is expected is variation. There are indeed many morpho-syntactic differences among the four corpora that compose the Portuguese Corpora annotated in the way presented here, since they range over a period of eight centuries and include different social and geographical varieties. In most cases, this is not a problem for the annotation scheme. For instance, null objects in Brazilian Portuguese, which appear as early as the eighteenth century, can simply be treated like null subjects, i.e., annotated with the null category *pro*. However, some non-standard structures require special treatment when they are found in the texts. This is the case of (20), which is at the origin of the creation of the CP-D label, initially restricted to that construction (see Carrilho (2005) for evidence that, in this kind of construction, the CP level is activated). Interestingly, it later proved useful for the annotation of another, apparently unrelated, structure in which an adverb precedes a subordinating conjunction, as in (21).

  16. cf. Manual: 13.1.6.

  17. cf. Manual: 13.1.6.3.

  18. cf. Manual: 1.2.

  19. For a recent discussion of this issue, see Eckhoff (2022) and Meelen & Willis (2022).

  20. cf. Manual: 3.1.4.

  21. Section 3.1.4 of the Manual presents the different annotations of SE depending on its function in the sentence. In passive constructions, SE is co-indexed with the argument of the transitive verb, which is the syntactic subject of the clause (cf. 23). In indefinite constructions, SE is co-indexed with a null expletive subject, and if the verb takes an argument, it is treated as the object (cf. 26). As for the other kinds of SE, like anti-causatives and emotional middles, they are simply annotated as NP-SE, without co-indexation. Note that in cases in which SE can be argued to be dropped, which occurs in many contexts in Brazilian Portuguese, no null SE is inserted. SE is simply not there.

  22. Note, however, that since the aim of the annotation scheme is not to represent the right analysis but to provide linguists with a robust way of retrieving the syntactic information they need to ground their diachronic analyses empirically, what is crucial is that the chosen representation be explicitly formulated in the documentation of the annotation system. In the case of Portuguese SE, the Manual informs users that they will have to run different searches to find data analogous to (25) and (26) in the different Portuguese Corpora. This is not the ideal situation, but it has the great advantage of being clear.

  23. A similar semi-automated methodology was adopted for the semantic annotation of the AC/DC project materials. For details on this workflow, see the description of the corte-e-costura environment in Santos & Mota (2010).

  24. The version of ParsPort considered in this section is the original one, which runs locally and operates through CorpusSearch. An online implementation of ParsPort at the Tycho Brahe Platform (https://www.tycho.iel.unicamp.br) is currently undergoing testing and refinement and will be publicly available soon. This web-based version of ParsPort offers a graphical interface for executing the parser and viewing its output, and makes it possible to set breakpoints for analyzing intermediate stages of the syntactic generation process. The parser will be available on the platform as a feature integrated with the text editing and syntactic revision tools, and it will also be usable by other software through API services.

  25. For a complete inventory of the CorpusSearch revision functions, see the CorpusSearch Users Guide (http://corpussearch.sourceforge.net/CS-manual/Contents.html). Within the scope of this paper, only the revision functions used in the Portuguese annotation tool are presented.

  26. The examples presented in this section are intended to illustrate the sequential operation of ParsPort. Thus, in each case, the result of the execution of a given revision query (and of any queries that preceded it) is shown. The information introduced by subsequent operations is not included in the representations, which are therefore mostly incomplete.

  27. Depending on the constituent being formed and the revision instruction to be applied, the occurrence of punctuation at a certain point of the string can be a relevant indicator or, on the contrary, an aspect to be ignored. CorpusSearch allows adding the command "ignore_nodes" to the search conditions to deal with cases in which the presence of a comma, for example, must be ignored when executing the query.

  28. Notice that the manual correction of intermediate outputs is merely optional as it is intended to obtain an optimal final output. In its basic mode of execution, ParsPort runs in one fell swoop. The evaluation of its performance presented in Sect. 4 assumes this basic mode of execution.

  29. It is important to point out that these differences in annotation schema among the Portuguese Corpora only occur when research results show unequivocally that the data to be annotated are the manifestation of different grammars. In all other cases, the system handles ambiguous syntactic patterns by adopting default conventions or unmarked solutions: for example, ambiguous instances of the clitic SE (inherent, reflexive, middle) are dominated by a non-coindexed NP-SE.

  30. The design of the public version of ParsPort was based on the contemporary European Portuguese grammar. This version is documented and offers a description of how each of the 174 queries behaves. Its variants were created to be used by annotators of the Portuguese Corpora and are intended to deal with specificities of certain textual genres or texts from certain periods or regions (e.g., high frequency of null subject in dialogic texts; null complementizer in Classical texts; clitic doubling in certain dialectal varieties; etc.). External users of the tool will be able to refine the public version of ParsPort, by making the adaptations that best respond to the data to be annotated. These refinements may affect the queries themselves or their ordering.

  31. Materials used, including versions of the actual probabilistic parsers, can be downloaded from https://redu.unicamp.br/dataset.xhtml?persistentId=doi:10.25824/redu/IVKQQV

  32. In response to reviewers’ comments, a comparison between Bikel’s and Stanford’s PCFG parsing engines was conducted. A comparison with the Stanford Neural Parser was not conducted because it does dependency parsing, which is not directly relevant here. From the same set of test sentences used in the comparison reported in this section, we extracted the subset of 411 sentences with 40 or fewer words, because we were unable to get the Stanford parser to process longer sentences (memory issues). We also added the Portuguese language to the Stanford LexicalizedParser and trained both parsers on the same set of 150,000 sentences. Their outputs were compared, using evalb, to a gold standard fixed by a human annotator, and the performances obtained (f-scores) were very similar: 75.05 (Bikel’s) and 74.48 (Stanford). In this new round, Bikel’s performance is higher than that reported below, but still lower than RuleP’s performance. We attribute the higher performance of Bikel’s parser here to the new setting, in which training and testing use only sentences of 40 or fewer words. It may be the case that sentences in this subset are more homogeneous than the whole corpus. Although the performance of both parsers might possibly be improved, even then, as we argue in this section, both parsers would still be unable to annotate empty categories and co-indexation. This further result strengthens the support for our claim about the benefits of the rule-based solution.

  33. Some data preparation and preprocessing were necessary for the study on parsing presented in Sect. 4. First, we used a tool we developed to load the parsed corpora and to export the training and test sections without empty categories and co-indexations. We also removed special coding and sentence IDs, using regular expressions. This clean-up is necessary because we want Bikel’s parser to be trained on what POS files contain: words and punctuation, and their tags. We also replaced ‘%’ characters by ‘_’ in CORDIAL-SIN, for compatibility with Bikel’s parser. Finally, we replaced all ‘-’ (hyphens) used for subtags, as in IP-MAT (matrix inflectional phrase), by the ‘_’ symbol. With this change, Bikel’s parser works with both the phrase category and its syntactic function as one indivisible tag. Consequently, its parses can reproduce them. Our script for managing the parsing automatically restores all hyphens after Bikel’s parser outputs a file.

  34. The Tycho Brahe Parsed Corpus of Historical Portuguese is an electronic corpus of texts written in Portuguese by Brazilian and Portuguese authors born between 1380 and 1978, which has been developed at the Linguistics Department of the University of Campinas since 1998. It is currently composed of 95 texts in XML format (3,789,646 words), of which 33 are available with POS and syntactic annotation (1,574,957 words), and 18 with only POS tagging.

  35. The CORDIAL-SIN treebank is a collection of 177,596 syntactic parse trees of the Syntax-oriented Corpus of Portuguese Dialects (Martins, 1999–2022; Magro et al., 2020). CORDIAL-SIN is a corpus of spoken dialectal European Portuguese, developed at Centro de Linguística da Universidade de Lisboa (CLUL), that compiles excerpts of spontaneous and semi-directed speech selected from fieldwork interviews carried out in 42 locations within the continental territory of Portugal and the archipelagos of Madeira and the Azores. The materials for this corpus, which amounts to c. 650,000 words, were drawn from the recordings of dialect speech collected by the CLUL ATLAS team as fieldwork interviews for linguistic atlases between 1974 and 2004.

  36. It should be noted that the test section does not include any sentences from the Post Scriptum corpus. Since ParsPort/RuleP was built based on the Post Scriptum corpus (cf. Sect. 3.2), there is no possibility that the 174 rules of RuleP respond directly to the requirements of the test data.

  37. POS tagging for our Portuguese corpora is carried out by the tagger developed in Kepler (2005), which achieves 95.51% accuracy. Its output is further reviewed by a human annotator before parsing.

  38. In the adopted syntactic annotation system, main clauses are separated even when conjoined. Conjoined IP-MAT thus correspond to independent root nodes and the conjunction head is encoded immediately under the second (or subsequent) IP. This option allows the user to control the retrieval and quantification of phenomena typical of coordinated second members, such as the occurrence of null constituents or particular instances of word order, since search tools detail the number of "hits" (distinct constituents containing the structure), the number of matrix sentences containing hits, and the total number of matrix sentences in the file.

    Therefore, the split of independent coordinate clauses in the test set is not a strategy to tamper with it, but a necessary step in the annotation process given the intended final result.

  39. The fact that we needed to split some sentences in the POS file indicates that the training corpus might contain, due to annotation errors, some non-split independent coordination too. However, this is unlikely to affect results; on the contrary, given that there are several more complex structures in the training corpus, parsing split sentences should actually be made easier.

  40. Notice however that the comparative evaluation of the two parsers considers the results generated by the fully automated version of RuleP, that is, the output produced by running the 174 rules in one fell swoop without human correction of intermediate stages of the process.

  41. The annotator had prior experience with output from both parsers. One of the reviewers pointed out the importance of having at least two human annotators revise the outputs, since this would allow us to assess inter-annotator agreement, which is an important indicator of the clarity and consistency of annotation guidelines. A second human annotator would also allow us to use a second ordering of the partitions, starting with RuleP’s partition 1 instead. Although we agree and see the benefit of this suggestion, unfortunately the currently available annotators are among the authors of this paper, and this could bias the present evaluation.

  42. In a CorpusSearch investigation of the training section, we found 85 instances of NP immediately dominating VB-P among the 151,706 sentences. All of these are most likely POS tagging errors. For IP-GER as root nodes, 125 instances were found, with minor inconsistencies.

  43. The architecture and development of the Tycho Brahe Platform is due to the computer scientist and linguist Luiz Veronesi.

  44. The development of web-based syntactic search interfaces that do not require software installation or deep knowledge of tree representations and formal query languages is a long-standing goal, pursued in several projects over the past two decades. The results achieved are uneven, both in the features provided to the user and in search power. As good examples of search interfaces for querying constituency trees, see TigerSearch (Lezius, Biesinger & Gerstenberger, 2002), The Linguist’s Search Engine (Resnik & Elkiss, 2005), Fangorn (Ghodke & Bird, 2012), GrETEL (Augustinus et al., 2012) and PaCQL (Ingason, 2016). For Portuguese treebanks, in addition to TEITOK and WebSinC, the Milhafre Interface (Freitas & Rocha, 2008) and the CINTIL Treebank Searcher (Nunes & Branco, 2009) should also be mentioned.
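The label clean-up described in note 33 (hyphens in subtags replaced by ‘_’ before training, and restored after parsing) can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names `protect_labels` and `restore_labels` are hypothetical, the toy tree is made up, and the sketch assumes Penn-style bracketed trees in which every opening bracket is immediately followed by a label that contains no underscores of its own.

```python
import re

# A node label is the token that immediately follows an opening bracket.
LABEL = re.compile(r"\(([^\s()]+)")

def protect_labels(tree: str) -> str:
    """Replace '-' with '_' inside node labels only; words are left untouched."""
    return LABEL.sub(lambda m: "(" + m.group(1).replace("-", "_"), tree)

def restore_labels(tree: str) -> str:
    """Undo protect_labels on parser output by turning '_' back into '-'.

    Assumption: no label contained '_' originally, otherwise the
    round trip would not be lossless.
    """
    return LABEL.sub(lambda m: "(" + m.group(1).replace("_", "-"), tree)

# Toy tree (made up for illustration); the hyphen in the word must survive.
tree = "(IP-MAT (NP-SBJ (D o) (N guarda-chuva)) (VB-D caiu))"
protected = protect_labels(tree)
restored = restore_labels(protected)
```

Restricting the substitution to the token after each opening bracket is what keeps hyphenated words intact while category and function are fused into one tag.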
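The f-scores mentioned in note 32 are labeled-bracket scores of the kind evalb reports. As a rough sketch of what such a score measures, the code below compares the labeled constituent spans of a gold tree and a parser output. It is a simplification under stated assumptions: it does not reproduce evalb's parameter file (for instance, its handling of punctuation), the helper names are hypothetical, the toy trees are made up, and every opening bracket is assumed to be followed directly by a label.

```python
from collections import Counter

def labeled_spans(tree: str) -> Counter:
    """Return a multiset of (label, start, end) for every phrasal node."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []   # open nodes as [label, start position, has a child node]
    pos = 0      # number of words read so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] = True
            stack.append([tokens[i + 1], pos, False])
            i += 2
        elif tok == ")":
            label, start, has_child = stack.pop()
            if has_child:  # skip preterminals (a POS tag over a word)
                spans[(label, start, pos)] += 1
            i += 1
        else:
            pos += 1
            i += 1
    return spans

def f_score(gold: str, test: str) -> float:
    """Harmonic mean of labeled-bracket precision and recall."""
    g, t = labeled_spans(gold), labeled_spans(test)
    matched = sum((g & t).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(t.values())
    recall = matched / sum(g.values())
    return 2 * precision * recall / (precision + recall)

gold = "(IP-MAT (NP-SBJ (D o) (N gato)) (VB-D viu) (NP-ACC (D o) (N rato)))"
test = "(IP-MAT (NP-SBJ (D o) (N gato)) (VB-D viu) (NP-SBJ (D o) (N rato)))"
score = f_score(gold, test)  # two of three brackets match on each side
```

Because the label is part of each bracket, a constituent with the right boundaries but the wrong function tag counts as an error, which is why fusing category and function into one tag matters for the comparison.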
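The consistency check reported in note 42 (NP immediately dominating VB-P) was run with CorpusSearch. Purely as an illustration of what "immediately dominates" means on bracketed trees, a stand-alone sketch is given below. The function `count_idoms` is hypothetical, the use of shell-style wildcards such as `NP*` is an assumption of this sketch rather than a claim about the query actually used, and the toy tree is made up.

```python
from fnmatch import fnmatchcase

def count_idoms(tree: str, parent: str, child: str) -> int:
    """Count nodes matching `child` whose mother node matches `parent`.

    Patterns use shell-style wildcards, e.g. 'NP*' also matches 'NP-SBJ'.
    Assumes every opening bracket is directly followed by a label.
    """
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    stack = []   # labels of the currently open nodes, root first
    hits = 0
    for prev, tok in zip([None] + tokens, tokens):
        if prev == "(":                      # tok is a node label
            if stack and fnmatchcase(stack[-1], parent) and fnmatchcase(tok, child):
                hits += 1
            stack.append(tok)
        elif tok == ")":
            stack.pop()
    return hits

# Toy tree: the first VB-P sits directly under NP-SBJ, the second under IP-MAT.
tree = "(IP-MAT (NP-SBJ (D a) (VB-P canta)) (VB-P canta))"
hits = count_idoms(tree, "NP*", "VB-P")
```

Only the mother on top of the stack is inspected, so a VB-P that an NP dominates at a greater depth is not counted.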

References

  • Augustinus, L., Vandeghinste, V., & Van Eynde, F. (2012). Example-based treebank querying. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). pp. 3161–3167.

  • Beck, J., Ecay, A. & Ingason, A. K. (2011). Annotald. [Software for treebank annotation.] (http://github.com/janabeck/Annotald).

  • Beck, C., Booth, H., El-Assady, M., & Butt, M. (2020). Representation problems in linguistic annotations: ambiguity, variation, uncertainty, error and bias. The 14th Linguistic Annotation Workshop, pp. 60–73.

  • Bikel, D.M. (2004a). On the Parameter Space of Generative Lexicalized Statistical Parsing Models. PhD Dissertation, University of Pennsylvania.

  • Bikel, D. M. (2004b). Intricacies of Collins’ parsing model. Computational Linguistics, 30(4), 479–511. https://doi.org/10.1162/0891201042544929

  • Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M., Roukos, S., Santorini, B. & Strzalkowski, T. (1991). A procedure for quantitatively comparing syntactic coverage of English grammars. Proceedings of the 4th DARPA Speech & Natural Language Workshop. pp. 306–311.

  • Branco, A., Silva, J., Costa, F., & Castro, S. (2011). Cintil treebank handbook: Design options for the representation of syntactic constituency. Lisboa: Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa. http://docs.di.fc.ul.pt/jspui/handle/10455/6746

  • Branco, A., Silva, J., Gomes, L., & Rodrigues, J. (2022). Universal grammatical dependencies for Portuguese with CINTIL data, LX processing and CLARIN support. In: Proceedings of the 13th Conference on Language Resources and Evaluation. pp. 5617–5626.

  • Britto, H., Finger, M., & Galves, C. (2002). Computational and linguistic aspects of the construction of the Tycho Brahe Parsed Corpus of Historical Portuguese. Romance Corpus linguistics - Corpora and spoken language. Narr.

  • Carrilho, E. (2005). Expletive Ele in European Portuguese dialects. Unpublished PhD dissertation, University of Lisbon.

  • CLUL (Ed.). (2014). P.S. Post Scriptum. A Digital Archive of Ordinary Writing in Early Modern Portugal and Spain. Lisboa: Centro de Linguística da Universidade de Lisboa. URL: http://teitok.clul.ul.pt/postscriptum/

  • Collins, M. (1997). Three generative, lexicalized models for statistical parsing. In: Proceedings of ACL 97. pp. 16–23.

  • Collins, M. (1999). Head-Driven Statistical Models for Natural Language Processing. PhD Dissertation, University of Pennsylvania.

  • Costa, A. S. (2015). WebSinC: Uma Ferramenta Web para buscas sintáticas e morfossintáticas em corpora anotados - Estudo de Caso do Corpus DOViC – Bahia. Master Thesis, Universidade Estadual do Sudoeste da Bahia.

  • Costa, A. S. & Namiuti-Temponi, C. (forthcoming). WebSinC: Buscas online em corpora sintaticamente anotados. E-Book do Congresso de Humanidades Digitais em Portugal: Construir pontes e quebrar barreiras na era digital – 2015. Lisboa: Universidade Nova de Lisboa. Available at: https://cristianenamiutisite.files.wordpress.com/2017/05/2017-costa_namiuti-2017-no-prelo.pdf

  • Eckhoff, H. M. (2022). A long-haul change. Differential object marking in early Slavonic. Journal of Historical Syntax, 6, 4–11.

  • Faria, P., & Galves, C. (2016). Criando “bancos de árvores”: O sistema de anotação e o processamento automático. Cadernos De Estudos Linguísticos, 58(2), 299–315.

  • Finger, M. (2000). Técnicas de Otimização da Precisão Empregadas no Etiquetador Tycho Brahe. Anais do V Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR2000).

  • Freitas, C., & Rocha, P. (2008). Primeiros vôos com o MILHAFRE (tutorial da ferramenta de busca em árvores sintácticas Milhafre). https://www.linguateca.pt/superb/busca_publ.pl?idi=1250585324.

  • Freitas, C., Rocha, P., & Bick, E. (2008). Um mundo novo na Floresta Sintá(c)tica – o treebank do Português. Calidoscópio, 6(3), 142–148. https://doi.org/10.4013/cld.20083.03

  • Galves, C., Andrade, A., & Faria, P. (2017). Tycho Brahe Parsed Corpus of Historical Portuguese, phase III, University of Campinas, Brazil. URL: http://www.tycho.iel.unicamp.br/corpus/index.html

  • Galves, C., Namiuti, C., & Paixão de Sousa, M. C. (2017a). The position of the verb in the history of Portuguese: Subject position, clitic placement and prosody. Language, 93(3), 152–180.

  • Galves, C., Sandalo, F., Sena, T., & Veronesi, L. (2017b). Annotating a polysynthetic language: From Portuguese to Kadiweu. Cadernos De Estudos Linguísticos, 59(3), 631–648.

  • Ghodke, S., & Bird, S. (2012). Fangorn: A System for Querying very large Treebanks. In: Proceedings of International Conference on Computational Linguistics 2012: Demonstration Papers. pp. 175–182.

  • Ingason, A. K. (2016). PaCQL: A new type of treebank search for the digital humanities. Italian Journal of Computational Linguistics (Special Issue: Digital Humanities and Computational Linguistics), 2(2), 51–66. https://doi.org/10.4000/ijcol.391

  • Janssen, M. (2014). TEITOK – The tokenized TEI environment. Lisboa: Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/teitok/site/index.php?action=home.

  • Kepler, F. (2005). Um etiquetador baseado em cadeias de Markov de alcance variável. Master thesis, University of São Paulo.

  • Klein, D., & Manning, C. D. (2003). Fast exact inference with a factored model for natural language parsing. Advances in Neural Information Processing Systems, 15, 3–10.

  • Kroch, A., Taylor, A., & Santorini, B. (2000). The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, release 4 (http://www.ling.upenn.edu/ppche/ppche-release-2016/PPCME2-RELEASE-4).

  • Kroch, A., Santorini, B., & Delfs, L. (2004). The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, release 3 (http://www.ling.upenn.edu/ppche/ppche-release-2016/PPCEME-RELEASE-3).

  • Kroch, A., Santorini, B., & Diertani, C. E. A. (2016). The Penn Parsed Corpus of Modern British English (PPCMBE2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, release 1 (http://www.ling.upenn.edu/ppche/ppche-release-2016/PPCMBE2-RELEASE-1).

  • Lezius, W., Biesinger, H., & Gerstenberger, C. (2002). TIGERSearch Manual. IMS, University of Stuttgart. Available at: https://www.researchgate.net/publication/2486273_TIGERSearch_Manual.

  • Magro, C. (2017). ParsPort. Revision queries for parsing Portuguese. Lisboa: Centro de Linguística da Universidade de Lisboa. http://parsport.sourceforge.net

  • Magro, C. (2018). QETch. Queries for Tree Searching. Lisboa: Centro de Linguística da Universidade de Lisboa. http://ps.clul.ul.pt/index.php?action=treequeries

  • Magro, C., & Galves, C. (2019). Portuguese Syntactic Annotation Manual. Lisboa/Campinas: Centro de Linguística da Universidade de Lisboa/Instituto de Ciências da Linguagem. http://alfclul.clul.ul.pt/portuguesesyntacticannotation/home.html

  • Magro, C., Carrilho, E., & Martins, A. M. (2020). CORDIAL-SIN Treebank. Lisboa: Centro de Linguística da Universidade de Lisboa. http://teitok.clul.ul.pt/synapse/index.php?action=downloads [anotação realizada por Catarina Magro, Márcia Bolrinha, Mélanie Pereira, Sandra Pereira]

  • Magro, C., & Vaamonde, G. (coords.) (2022). SynAPse – The Syntactic Atlas of European Portuguese. Lisboa: Centro de Linguística da Universidade de Lisboa. http://teitok.clul.ul.pt/synapse/index.php?action=home

  • Martineau, F. (2008). Modéliser le changement: Les voies du français (MCVF). University of Ottawa.

  • Martins, A. M. (1994). Clíticos na história do português. Unpublished PhD Dissertation, University of Lisbon.

  • Martins, A. M. (coord.) (1999–2022). CORDIAL-SIN: Corpus Dialetal para o Estudo da Sintaxe / Syntax-oriented Corpus of Portuguese Dialects. Lisboa, Centro de Linguística da Universidade de Lisboa. https://www.clul.ulisboa.pt/projeto/cordial-sin-corpus-dialectal-para-o-estudo-da-sintaxe

  • Martins, A. M., Pereira, S., & Cardoso, A. (2014–2015). Parsed Demanda do Santo Graal. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/wochwel/oldtexts.html

  • Martins, A. M., Pereira, S., & Cardoso, A. (2015). Parsed Legal Documents. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/wochwel/oldtexts.html

  • Martins, A. M., Pereira, S., & Cardoso, A. (2013–2015). Parsed José de Arimateia. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/wochwel/oldtexts.html

  • Meelen, M., & Willis, D. (2022). Towards a historical treebank of Middle and Modern Welsh: Syntactic parsing. Journal of Historical Syntax, 6, 4–11.

  • Nunes, P., & Branco, A. (2009). CINTIL-Treebank Searcher. In: Proceedings of the I Joint SIG-IL microsoft workshop on speech and language technologies for Iberian Languages. Porto Salvo, Portugal. p. 107.

  • Paixão de Sousa, M. C., Kepler, F., & Faria, P. (2013). eDictor. URL: https://edictor.net/download

  • Pardeshi, P. (2017). NINJAL Parsed Corpus of Modern Japanese (NPCMJ). Tokyo, National Institute of Japanese Language and Linguistics

  • Pereira, S. (2017). Parsed Crónica Geral de Espanha. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/wochwel/oldtexts.html

  • Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 44th ACL, 433–440.

  • Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., & de Paiva, V. (2017). Universal dependencies for Portuguese. In: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017). Pisa, Italy. pp. 197–206.

  • Randall, B. (2005–2015). CorpusSearch 2. [http://corpussearch.sourceforge.net]

  • Resnik, P. & Elkiss, A. (2005). The Linguist's Search Engine: An Overview. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. University of Michigan.

  • Santorini, B. (2016). Annotation manual for the Penn Historical Corpora and the York-Helsinki Corpus of Early English Correspondence. Philadelphia: Department of Linguistics, University of Pennsylvania. https://www.ling.upenn.edu/histcorpora/annotation/index.html

  • Santos, D., & Mota, C. (2010). Experiments in human-computer cooperation for the semantic annotation of Portuguese corpora. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M. & Tapias, D. (eds.), Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010). pp. 1437–1444.

  • Sekine, S. & Collins, M. (1997). evalb software. Latest version at http://nlp.cs.nyu.edu/evalb

  • Silva, J., Branco, A., Castro, S., & Reis, R. (2010). Out-of-the-box robust parsing of Portuguese. Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language (PROPOR 2010). pp. 5–85.

  • Wallenberg, J. C., Ingason, A. K., Sigurðsson, E. F., & Rögnvaldsson, E. (2011). Icelandic Parsed Historical Corpus (IcePaHC). Version 0.9. http://www.linguist.is/icelandic_treebank

  • Willis, D., & Mittendorf, I. (2004). A Historical Corpus of the Welsh Language, 1500–1850. http://www.celticstudies.net

Author information

Corresponding author

Correspondence to Charlotte Galves.

Additional information

We dedicate this paper to the memory of Anthony Kroch.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Faria, P., Galves, C. & Magro, C. Syntactic annotation for Portuguese corpora: standards, parsers, and search interfaces. Lang Resources & Evaluation 58, 301–346 (2024). https://doi.org/10.1007/s10579-023-09699-4

  • DOI: https://doi.org/10.1007/s10579-023-09699-4
